Citation
Bayesian Semiparametric Regression and Related Applications

Material Information

Title:
Bayesian Semiparametric Regression and Related Applications
Creator:
Bhadra, Dhiman
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (145 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Statistics
Committee Chair:
Ghosh, Malay
Committee Co-Chair:
Daniels, Michael J.
Committee Members:
Agresti, Alan G.
Andresen, Elena M.
Graduation Date:
8/7/2010

Subjects

Subjects / Keywords:
Case control studies ( jstor )
Diseases ( jstor )
Income estimates ( jstor )
Median income ( jstor )
Modeling ( jstor )
School dropouts ( jstor )
Semiparametric modeling ( jstor )
Statistical estimation ( jstor )
Statistics ( jstor )
Trajectories ( jstor )
Statistics -- Dissertations, Academic -- UF
bayesian, case, current, mcmc, odds, penalized, random, semiparametric
Genre:
Electronic Thesis or Dissertation
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
Statistics thesis, Ph.D.

Notes

Abstract:
Case-control studies and small area estimation are two distinct areas of modern statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease, with the aim of capturing the disease-exposure association, especially for rare diseases. The latter is concerned with the measurement of characteristics of small domains: regions whose sample size is so small that the usual survey-based estimation procedures cannot be applied in the inferential routines. Both areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health-related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation. The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e., case-control studies for which time-varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status.
This enables us to analyze how the present disease status of a subject is influenced by his/her past exposure conditions conditional on the current ones. Analysis is carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) algorithms. The proposed methodology is motivated by, and applied to, a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time-varying income pattern and also a state-specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model-based estimates of household median income of small areas. In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time-varying pattern in the income observations.
Our estimates tend to perform better than those provided by the Census Bureau and perform comparably to some established methodologies, especially those involving time series modeling techniques. Based on our findings in parts two and three, we conclude that semiparametric and nonparametric regression models can be an attractive alternative to the more traditional modeling frameworks, especially in situations where information on different characteristics of small areas is available at multiple time points in the past. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2010.
Local:
Adviser: Ghosh, Malay.
Local:
Co-adviser: Daniels, Michael J.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31
Statement of Responsibility:
by Dhiman Bhadra.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
8/31/2011
Resource Identifier:
004979600 ( ALEPH )
769020145 ( OCLC )
Classification:
LD1780 2010 ( lcc )


Full Text





estimates of four-person families for 1989 using 1979 as the base year. They compared

their estimates with the CPS median income estimates and Bureau of Census estimates

by treating the decennial census values as the "gold standard". They used both univariate

and bivariate model formulations. In all the cases, the time series model with the

adjusted census median income as covariates performed better than the ones with

either the base year census median as covariates or both the base year and adjusted

census medians as covariates. In all the cases, the time series model performed better

than the non-time series one which only utilized the census median income figures for

1979, the CPS median income estimates for 1989 and the per capita incomes

for 1979 and 1989. Last but not least, the bivariate time series model using the

median incomes of four and five person families performed the best and outperformed

both the CPS and Bureau of Census estimates of median income.

Semiparametric regression methods have not been used in small area estimation

contexts until recently. This was mainly due to methodological difficulties in combining

the different smoothing techniques with the estimation tools generally used in small

area estimation. The pioneering contribution in this regard is the work by Opsomer

et al. (2008) in which they combined small area random effects with a smooth,

non-parametrically specified trend using penalized splines (Eilers and Marx, 1996). In

doing so, they expressed the non-parametric small area estimation problem as a mixed

effects regression model and analyzed it using restricted maximum likelihood. They also

presented theoretical results on the prediction mean squared error and likelihood ratio

tests for random effects. Inference was based on a simple non-parametric bootstrap

approach. They applied their model to a non-longitudinal, spatial dataset concerning the

estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states

of the U.S.









Datta, G., Ghosh, M., Nangia, N., and Natarajan, K. (1993). Estimation of median
income of four-person families : A Bayesian approach, in W.A. Berry, K.M. Chaloner
and J.K. Geweke (Eds), Bayesian Analysis in Statistics and Econometrics, pages
129-140.

Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal
of the Royal Statistical Society, Series B 60, 333-350.

Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal
data, 2nd Edition. New York : Oxford University Press.

Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched
case-control data. Biostatistics 1, 89-109.

DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free knot
splines. Biometrika 88, 1055-1071.

Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject
specific curves for longitudinal data. Statistics in Medicine 00, 1-24.

Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties.
Statistical Science 11, 89-121.

Ericksen, E. and Kadane, J. (1985). Estimating the population in census year : 1980 and
beyond (with discussion). Journal of the American Statistical Association 80, 98-131.

Escobar, M. and West, M. (1995). Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association 90, 577-588.

Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the
time dimension in receiver operating characteristic curves : A case study of prostate
cancer. Medical Decision Making 19, 242-251.

Eubank, R. (1988). Spline smoothing and nonparametric regression. New York : Marcel
Dekker.

Eubank, R. (1999). Nonparametric regression and spline smoothing. New York : Marcel
Dekker.

Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman
and Hall.

Fay, R. (1987). Application of multivariate regression to small domain estimation, in R.
Platek, J.N.K. Rao, C.E. Särndal, and M.P. Singh (Eds). Small Area Statistics.

Fay, R. and Herriot, R. (1979). Estimation of income from small places : an application
of James-Stein procedures to census data. Journal of the American Statistical
Association 74, 269-277.











patterns which may result in unstable parameter estimates in those patterns since

some of the parameters may be unidentifiable. There are different ways to get around

this problem. Hogan and Laird (1998) suggested sharing parameters across

patterns. Hogan et al. (2004) suggested ways to group the T dropout times into m < T

groups in an ad hoc fashion. Roy (2003) proposed an automated mechanism to do

the above grouping using a latent variable approach within the context of normal

models for continuous data. This approach assumes the existence of a discrete latent

variable that explains the dependence between the response vector and the dropout

time and allows incorporation of uncertainty about the groupings, conditional on a

fixed number of groups. Roy and Daniels (2008) extended the above approach by

incorporating uncertainty in the number of classes through approximate Bayesian model

averaging. In their approach, the marginal mean is assumed to follow a generalized

linear model, while the mean conditional on the latent class and random effects is

specified separately. Since the dimension of the parameter vector of interest (the

marginal regression coefficients) does not depend on the assumed number of latent

classes, they treat the number of latent classes as a random variable. A prior distribution

is assumed for the number of classes and approximate posterior model probabilities

are calculated. In order to avoid the complications with implementing a fully Bayesian

model, they propose a simple approximation to these posterior probabilities. Lastly, they

apply their methodology to a dataset dealing with the longitudinal study of depression in
HIV-infected women.

Heagerty (1999) proposed marginally specified logistic normal models for

longitudinal binary data. In doing so, he proposed an alternative parametrization of

the logistic normal random effects model and studied both likelihood and estimation

equation approaches to parameter estimation. A notable feature of his approach

was that the marginal regression parameters still permit individual level predictions

or contrasts. Heagerty (2002) also proposed a general parametric class of serial












M-y (Zzu 6zs7Z"

MVY = 6Z> Zy
U


and


Zu q 1(06
- E zi-/
';


X'/ b v,)


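5. $[v_1 \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_2, X, Z] \sim N(M_1^v, \Sigma_1^v)$ where

$\Sigma_1^v = (m\Psi_1^{-1} + 2\Sigma_v^{-1})^{-1}$ and

$M_1^v = \Sigma_1^v \left( \Psi_1^{-1} \sum_{i=1}^{m} (\theta_{i1} - X_{i1}'\beta - Z_{i1}'\gamma - b_i) + \Sigma_v^{-1} v_2 \right)$.

6. $[v_j \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_{j-1}, v_{j+1}, X, Z] \sim N(M_j^v, \Sigma_j^v)$ $(j = 2, \ldots, t - 1)$ where

$\Sigma_j^v = (m\Psi_j^{-1} + 2\Sigma_v^{-1})^{-1}$ and

$M_j^v = \Sigma_j^v \left( \Psi_j^{-1} \sum_{i=1}^{m} (\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i) + \Sigma_v^{-1} (v_{j-1} + v_{j+1}) \right)$.

7. $[v_t \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_{t-1}, X, Z] \sim N(M_t^v, \Sigma_t^v)$ where

$\Sigma_t^v = (m\Psi_t^{-1} + \Sigma_v^{-1})^{-1}$ and

$M_t^v = \Sigma_t^v \left( \Psi_t^{-1} \sum_{i=1}^{m} (\theta_{it} - X_{it}'\beta - Z_{it}'\gamma - b_i) + \Sigma_v^{-1} v_{t-1} \right)$.

8. $[\Psi_j \mid \beta, \gamma, \theta, b, v_j, X, Z] \sim IW(A_j, d_j + m)$ $(j = 1, \ldots, t)$ where

$A_j = S_j + \sum_{i=1}^{m} (\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i - v_j)(\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i - v_j)'$.

9. $[\Sigma_v \mid v] \sim IW\left(S_v + \sum_{j=1}^{t} (v_j - v_{j-1})(v_j - v_{j-1})', d_v + t\right)$, assuming $v_0 = 0$.

10. $[\Sigma_0 \mid b] \sim IW\left(S_0 + \sum_{i=1}^{m} b_i b_i', d_0 + m\right)$.

11. $[\Sigma_\gamma \mid \gamma] \sim IW(S_\gamma + \gamma\gamma', d_\gamma + 1)$.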









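Suppose $n_{ab}$ is the number of subjects for whom $(D = a, \hat{D} = b;\ a = 0, 1;\ b = 0, 1)$,

$D$ and $\hat{D}$ being the observed and predicted disease status for a particular subject. Then,

$$\kappa = \frac{\frac{n_{00} + n_{11}}{n} - \left(\frac{n_{10} + n_{11}}{n}\right)\left(\frac{n_{01} + n_{11}}{n}\right) - \left(\frac{n_{00} + n_{01}}{n}\right)\left(\frac{n_{00} + n_{10}}{n}\right)}{1 - \left(\frac{n_{10} + n_{11}}{n}\right)\left(\frac{n_{01} + n_{11}}{n}\right) - \left(\frac{n_{00} + n_{01}}{n}\right)\left(\frac{n_{00} + n_{10}}{n}\right)}$$

where $n = n_{00} + n_{01} + n_{10} + n_{11}$.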

The observed disease status (vis-a-vis case or control status) of a subject is

obtained from the dataset while the predicted disease status is calculated from the

posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can

calculate the quantity $p_i^{(n)} = P^{(n)}(D_i = 1 \mid X_i(t + a_i), t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \phi + b_i' Q_i \phi)$

where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8

degrees of freedom). Based on the value of $p_i^{(n)}$, we can assign

$\hat{D}_i^{(n)} = 1$ if $p_i^{(n)} > 0.5$

$\hat{D}_i^{(n)} = 0$ if $p_i^{(n)} < 0.5$

Based on the values of $\{(D_i, \hat{D}_i^{(n)});\ i = 1, \ldots, N\}$, we can form a 2 x 2 table, and

hence can calculate a value of kappa, say, K(n) at iteration n of the Gibbs sampler. The

posterior means and 95% credible intervals of K provide a measure of the amount of

agreement that our model provides.
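The agreement calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the dissertation's code: the function names `cohen_kappa` and `kappa_draws` are hypothetical, the input `prob_draws` is assumed to hold the predicted disease probabilities for every subject at each saved Gibbs iteration, and a probability exactly equal to the cutoff is (arbitrarily) classified as 0.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for two binary (0/1) sequences of equal length."""
    n = len(observed)
    # counts[(a, b)] = number of subjects with observed status a, predicted status b
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for d, d_hat in zip(observed, predicted):
        counts[(d, d_hat)] += 1
    p_observed = (counts[(0, 0)] + counts[(1, 1)]) / n
    obs_case = (counts[(1, 0)] + counts[(1, 1)]) / n    # share observed as cases
    pred_case = (counts[(0, 1)] + counts[(1, 1)]) / n   # share predicted as cases
    p_chance = obs_case * pred_case + (1 - obs_case) * (1 - pred_case)
    # Undefined (division by zero) when chance agreement equals 1.
    return (p_observed - p_chance) / (1 - p_chance)


def kappa_draws(observed, prob_draws, cutoff=0.5):
    """One kappa value per iteration; prob_draws[n][i] is subject i's
    predicted disease probability at iteration n."""
    draws = []
    for probs in prob_draws:
        predicted = [1 if p > cutoff else 0 for p in probs]
        draws.append(cohen_kappa(observed, predicted))
    return draws
```

The posterior mean and a 95% credible interval would then be read off the list returned by `kappa_draws`, for example from its average and its 2.5% and 97.5% sample quantiles.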

2.5.3 Case Influence Analysis

Case influence (or case deletion) diagnostics are often used as a tool for model

assessment in various statistical problems. The procedure hinges on the idea that the

influence of a particular observation on a parameter can be measured by the difference

in the parameter estimate based on the full data and the data with that observation

deleted (Hampel et al., 1987). These diagnostics can be used to detect observations

with an unusual effect on the fitted model and thus may lead to identification of data

or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in









5 CONCLUSION AND FUTURE RESEARCH

5.1 Adaptive Knot Selection
5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using
Latent Class and Transitional Modelling
5.2.1 Introduction and Brief Literature Review
5.2.2 Modeling Framework
5.2.3 Likelihood, Priors and Posteriors
5.2.4 Specification of Priors

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS

B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS

B.1 Univariate Small Area Model
B.2 Bivariate Small Area Model

C FULL CONDITIONAL DISTRIBUTIONS

C.1 Semiparametric Case Control Model
C.2 Semiparametric Small Area Models
C.2.1 Semiparametric Univariate Small Area Model
C.2.2 Univariate Random Walk Model
C.2.3 Bivariate Random Walk Model

REFERENCES

BIOGRAPHICAL SKETCH









that ud and ed are mutually independent with u ~- N(0,
are the sampling standard deviations corresponding to the CPS direct median income

estimates obtained using the "generalized variance function" technique mentioned in

Section 3.1.1. In the datasets provided by the Census Bureau, these estimates are

given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually

placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

From (3-1) and (3-2), we have

$\theta_{ij} = f(x_{ij}) + b_i + u_{ij}$
which reflects our basic assumption that the true unknown household median income

may have an unspecified variational pattern with the IRS mean (or median) income.

Thus, the covariate effect is expressed by the unspecified nonparametric function $f(x_{ij})$

which reflects the possible nonlinear effect of $x_{ij}$ on $\theta_{ij}$.
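As a small illustration of the knot placement mentioned above, the following Python sketch puts K knots at equally spaced sample quantiles of a covariate. It is a sketch under stated assumptions rather than the procedure actually used: the function name `quantile_knots` is hypothetical, the quantile levels are taken to be k/(K+1) for k = 1, ..., K, and quantiles are obtained by linear interpolation between order statistics.

```python
def quantile_knots(x, n_knots):
    """Knots at the k/(n_knots + 1) sample quantiles of x, k = 1..n_knots,
    using linear interpolation between order statistics."""
    xs = sorted(x)
    knots = []
    for k in range(1, n_knots + 1):
        # Fractional position of the k-th quantile in the sorted sample.
        pos = (len(xs) - 1) * k / (n_knots + 1)
        lower = int(pos)
        upper = min(lower + 1, len(xs) - 1)
        frac = pos - lower
        knots.append(xs[lower] + frac * (xs[upper] - xs[lower]))
    return knots
```

For instance, `quantile_knots(list(range(11)), 4)` places the four knots at the 20%, 40%, 60% and 80% sample quantiles of the values 0 through 10.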

3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)

Since, for each state, the response and the covariates are collected over time,

there may be a definite trend in their behavior. Thus, we added a time specific random

component to (3-1) and modeled it as a random walk as follows

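$Y_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij}$

$= \theta_{ij} + e_{ij}$ (3-3)

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Here, $v_j$ denotes the time specific random component. We assume that $(v_j \mid v_{j-1}, \sigma_v^2) \sim$

$N(v_{j-1}, \sigma_v^2)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \sim N(0, \sigma_v^2)$.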

This is the so-called random walk model and is similar to the systems equations used in

dynamic linear models.
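To make the random walk structure concrete, here is a minimal Python simulation of the time specific component, started at zero and built up from independent normal increments. It is only an illustration: the function name `simulate_random_walk`, the seed and the parameter values are assumptions, and `sigma_v` is the standard deviation (not the variance) of each increment.

```python
import random


def simulate_random_walk(t, sigma_v, seed=0):
    """Simulate v_1, ..., v_t with v_0 = 0 and v_j = v_{j-1} + w_j,
    where the w_j are independent N(0, sigma_v^2) increments."""
    rng = random.Random(seed)
    path = []
    previous = 0.0  # v_0 = 0
    for _ in range(t):
        increment = rng.gauss(0.0, sigma_v)  # w_j, standard deviation sigma_v
        previous = previous + increment
        path.append(previous)
    return path
```

Because each value is the running sum of the increments, the variance of the j-th value grows linearly in j, which is what lets the component absorb a slowly drifting time trend.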

Before proceeding to the next section, we may note that unlike the models of Ghosh

et al. (1996), the models given in (3-2) and (3-3) incorporate state specific random

effects (bi). This rectifies a limitation of the former as pointed out in Rao (2003).









3.4.3 Analytical Results

Data on CPS median income and IRS mean incomes were available for 50 states

and the District of Columbia for the time span 1995-2004. CPS median income ranged

from $24,879.68 to $52,778.94 with a mean of $36,868.48 and standard deviation of

$5954.94 while IRS mean annual income ranged from $27,910 to $72,769.38 with a

mean of $41,133.45 and standard deviation of $7196.56.

We fitted Model I (SPM) with all possible knot choices from 0 to 40 but the best

results were achieved with 5 knots. The estimates (with 5 knots) improved significantly

over the CPS estimates based on all the four comparison measures. Addition of more

knots seemed to degrade the fit of the model. This may happen as pointed out in

Ruppert (2002). On the other hand, the SAIPE model based estimates were slightly

superior to the SPM estimates.

Next, we fitted the semiparametric random walk model (SPRWM) to our data.

Overall, the random walk structure led to some improvement in the performance of

the estimates. However, for the model with 5 knots, the performance of the estimates

remained nearly the same. This may be because 5 knots are sufficient to capture the

underlying pattern in the income trajectory and the random walk component does not

lead to any further improvement. Last but not least, the random walk model

estimates, although generally better than those of the basic semiparametric model,

still cannot claim to be superior to the SAIPE estimates for all the comparison measures.

Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the

SPRWM with 5 knots.

It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the

significance of the first, fourth and fifth knots. This is indicative of the relevance of knots

in the penalized spline fit on the CPS median income observations. The same is true for

the coefficients of SPM.








In the most general case, $\gamma(t + a_i)$ can also be modeled as a P-spline, i.e.,

$$\gamma(t + a_i) = \phi_0 + \phi_1(t + a_i) + \ldots + \phi_r(t + a_i)^r + \sum_{k=1}^{K^*} \phi_{k+r}(t + a_i - \xi_k)_+^r = \Phi_c(t + a_i)'\phi \quad \text{(2-4)}$$

where $\Phi_c(t + a_i) = [1, (t + a_i), \ldots, (t + a_i)^r, (t + a_i - \xi_1)_+^r, \ldots, (t + a_i - \xi_{K^*})_+^r]'$,
$\phi = (\phi_0, \ldots, \phi_{K^*+r})'$ and $(\xi_1, \ldots, \xi_{K^*})$ are the knots.
As special cases of (2-4), we may consider $\gamma(t + a_i) = \phi_0$, in which case the
covariate is the area under the PSA process $\{X_i(t + a_i), -c \le t \le 0\}$ and $\phi_0$ is its effect
on the disease probability (or logit of the disease probability). We can also assume
$\gamma(t + a_i) = \phi_0 + \phi_1(t + a_i)$ which signifies a linear pattern of the effect of the exposure
trajectory on the disease probability. In the above models, the knots can be chosen on a
grid of equally spaced quantiles of the ages.
Replacing (2-2) and (2-4) in the R.H.S of (2-3), we have

$$P(D_i = 1 \mid X_i(t + a_i), -c \le t \le 0) = L\left(\alpha + \int_{-c}^{0} \left(\Phi_p(t + a_i)'\beta + \Phi_q(t + a_i)'b_i\right)\left(\Phi_c(t + a_i)'\phi\right) dt\right)$$

$$= L(\alpha + \beta' M_i \phi + b_i' Q_i \phi) \quad \text{(2-5)}$$

where $M_i = \int_{-c}^{0} \Phi_p(t + a_i)\Phi_c(t + a_i)'\,dt$ and $Q_i = \int_{-c}^{0} \Phi_q(t + a_i)\Phi_c(t + a_i)'\,dt$.
For pre-chosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are
matrices and are available in closed forms. We assume normal distributional forms for
the spline coefficients in (2-2) and (2-4) in order to penalize the jumps of the spline at
the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$
and $\phi_{k+r} \sim N(0, \sigma_\phi^2)$ $(k = 1, \ldots, K^*)$. Finally, the random subject specific deviation
function $g_i(a_{ij})$ is modeled as $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\ j = 0, \ldots, q)$.
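The integrated basis matrices that enter the disease model can be illustrated numerically. The Python sketch below builds a truncated polynomial basis and approximates the integral of the outer product of two basis vectors over the exposure window by a midpoint rule. It is an illustration under stated assumptions, not the closed-form computation referred to in the text: the function names, the number of quadrature steps and the example window are hypothetical, and the basis degree is assumed to be at least 1.

```python
def truncated_power_basis(x, degree, knots):
    """Return [1, x, ..., x^degree, (x - k_1)_+^degree, ..., (x - k_K)_+^degree].

    Assumes degree >= 1.
    """
    powers = [x ** d for d in range(degree + 1)]
    truncated = [max(x - k, 0.0) ** degree for k in knots]
    return powers + truncated


def integrated_outer_product(basis_a, basis_b, age, c, steps=2000):
    """Midpoint-rule approximation of the matrix obtained by integrating
    basis_a(t + age) * basis_b(t + age)' over t in [-c, 0]."""
    h = c / steps
    rows = len(basis_a(age))
    cols = len(basis_b(age))
    total = [[0.0] * cols for _ in range(rows)]
    for s in range(steps):
        t = -c + (s + 0.5) * h  # midpoint of the s-th subinterval
        a = basis_a(t + age)
        b = basis_b(t + age)
        for r in range(rows):
            for k in range(cols):
                total[r][k] += a[r] * b[k] * h
    return total
```

With a linear basis and no knots, a window of length 1 and age 0, the result is close to the exact matrix with entries 1, -1/2, -1/2 and 1/3, which gives a quick check on any closed-form implementation.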












Figure 3-5. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior
distribution of the basic semiparametric and semiparametric random walk
models. The X-axis depicts the expected order statistics from a $\chi^2$
distribution with 9 degrees of freedom. Panel A: Basic Semiparametric Model.
Panel B: Semiparametric RW Model. X-axis label in both panels: Theoretical
Quantiles of Chi-Square (9).


second assumption naturally holds in our case. Regarding the first one, since we have

multiple observations over time for every state, there may be within-state dependence

between those. Thus, instead of taking all the observations (i.e., the CPS median

income values), we decided to use the last observation for each state. For the basic

semiparametric model (SPM), the above summary measures were respectively 0.049

and 0.5 while for the random walk model (SPRWM), these were 0.047 and 0.51. These

measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A

and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples

of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement

between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Johnson points out that the Bayesian chi-square test statistic is also a useful tool

for code verification. If the posterior distribution of RB deviates significantly from its

null distribution, it may imply that the model is incorrectly specified or there are coding

errors. Since the summary measures are quite close to the corresponding null values,









Although a great amount of work has been done in the frequentist domain,

Bayesian modeling for case-control studies did not really start until the late 1980's.

The development of Markov chain Monte Carlo techniques led to rapid progress on

this front. Altham (1971) is probably the first Bayesian work which considered several

2 x 2 contingency tables with a common odds ratio and performed a Bayesian test of

association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen

and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of

a case control model with a single binary exposure. These works dealt with inference

from the posterior distribution of summary statistics like the log odds ratio, risk ratio

and risk difference. Ashby et al. (1993) analyzed a case control study from a Bayesian

perspective and used it as a source of prior information for a second study. Their paper

emphasized the practical relevance of the Bayesian perspective in an epidemiological

study as a natural framework for integrating and updating knowledge available at each

stage.

Muller and Roeder (1997) introduced a novel aspect to Bayesian treatment of

case-control studies by considering continuous exposure with measurement error. Their

approach is based on a nonparametric model for the retrospective likelihood of the

covariates and the imprecisely measured exposure. They chose the non-parametric

distribution to be a class of flexible mixture distributions, obtained by using a mixture

of normal models with a Dirichlet process prior on the mixing measure (Escobar and

West, 1995). The prospective disease model relating disease to exposure is assumed

to have a logistic form characterized by a vector of log odds ratio parameters $\beta$. This

paper pioneered the use of continuous covariates, measurement error and flexible

non-parametric modeling of exposures in a Bayesian setting and brought to light

the tremendous possibility of modern Bayesian computational techniques in solving

complex data scenarios in case-control studies. Seaman and Richardson (2001)

extended the binary exposure model of Zelen and Parker to any number of categorical









APPENDIX A
PROOF OF BAYESIAN EQUIVALENCE RESULTS

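Proof of Theorem 1. Let $Y_{dj}$ $(d = 0, 1;\ j = 1, \ldots, J)$ be independently distributed as

Poisson($\lambda_{dj}$) where

$$\log\lambda_{dj} = \log\mu + d\log\theta + \log\delta_j + d\,\phi' \int_{-c}^{0} Z_j(t)\Psi(t)\,dt \quad \text{(A-1)}$$

Thus, the likelihood will be

$$L(\mu, \theta, \delta, \phi) \propto \prod_{d=0}^{1}\prod_{j=1}^{J} \lambda_{dj}^{y_{dj}} \exp(-\lambda_{dj})$$

and hence the log likelihood will be, up to an additive constant,

$$l(\mu, \theta, \delta, \phi) = \sum_{d=0}^{1}\sum_{j=1}^{J} \{y_{dj}\log(\lambda_{dj}) - \lambda_{dj}\}$$

Now, replacing the expression of $\log\lambda_{dj}$ from (A-1) we have

$$l(\mu, \theta, \delta, \phi) = \sum_{d=0}^{1}\sum_{j=1}^{J} y_{dj}\left(\log\mu + d\log\theta + \log\delta_j + d\,\phi' \int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right) - \sum_{d=0}^{1}\sum_{j=1}^{J} \mu\,\theta^d\,\delta_j \exp\left(d\,\phi' \int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right) \quad \text{(A-2)}$$

Differentiating (A-2) w.r.t $\mu$ and $\theta$ and solving the resulting equations we have

$$\hat{\mu} = \frac{\sum_{j} y_{0j}}{\sum_{j} \delta_j} \quad \text{and} \quad \hat{\theta} = \frac{\sum_{j} y_{1j} \sum_{j} \delta_j}{\sum_{j} y_{0j} \sum_{j} \delta_j \exp\left(\phi' \int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}$$

Replacing the above expressions in (A-2) and then exponentiating, we obtain the

expression of $L(\delta, \phi)$ in (2-8).

Again, differentiating (A-2) w.r.t $\delta_j$, we have

$$\hat{\delta}_j = \frac{\sum_{d} y_{dj}}{\mu \sum_{d} \theta^d \exp\left(d\,\phi' \int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}, \quad j = 1, \ldots, J \quad \text{(A-3)}$$

It is easy to show that if we replace (A-3) in (A-2) and then exponentiate, we get the

expression for $L(\theta, \phi)$ in (2-9). Since the order of maximization is immaterial, it follows

that $L(\delta, \phi)$ and $L(\theta, \phi)$, once maximized over the nuisance parameters ($\theta$ and $\delta$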











families remains interesting nevertheless. Now, we will briefly discuss the estimation

procedure that the U.S. Census Bureau used to follow towards that end.

In estimating the median income of four-person families, the U.S. Census Bureau

relied on data from three sources. The basic source was the annual demographic

supplement to the March sample of the Current Population Survey (CPS) which used to

provide the state specific median income estimates for different family sizes. The second

source was the decennial census estimates for the year preceding the census year, i.e.,

1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates

of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the

U.S. Department of Commerce. Each of the above data sources (and the resulting

estimates) had some disadvantages which necessitated an estimation procedure that

used a combination of all three to produce the final median income estimates. The

CPS estimates were based on small samples which resulted in substantial variability.

On the other hand, decennial census estimates, although having negligible standard

errors, were only available every 10 years. Due to this lag in the release of successive

census estimates, there was a significant loss of information concerning fluctuations in

the economic situation of the country in general and small areas in particular. Lastly, the

per capita income estimates didn't have associated sampling errors since they were not

obtained using the usual sampling techniques. The details of the estimation procedure

appear in Fay et al. (1993).

The Census Bureau based their estimation procedure on a bivariate regression

model suggested by Fay (1987). In doing so, they used median income observations

for three and five person families in addition to those of four person families. The basic

dataset for each state was a bivariate random vector with one component being the CPS

median income estimates of four person families and the other component being the

weighted average of CPS median incomes of three and five person families, with

weights 0.75 and 0.25 respectively. Both the regression equations used the base year









expressed using truncated polynomial basis functions with varying degrees and

number of knots, although other types of basis functions like B-splines or thin plate

splines can also be used. We have worked with two types of models viz. a regular

semiparametric model and a semiparametric random walk model. For each of these

models, analysis has been carried out using a hierarchical Bayesian approach. Since

we chose non-informative improper priors for the regression parameters, propriety of

the posterior has been proved before proceeding with the computations. Markov chain

Monte Carlo methodologies, specifically Gibbs sampling (Gelfand and Ghosh, 1998),

have been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for

1999 with the corresponding decennial census values in order to test for their accuracy.

In doing so, we observed that the semiparametric model estimates improve upon

both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had

significant influence on the results as will be discussed later on. We want to mention

here that the SAIPE model had a considerable advantage over ours in that they used the

census estimates of the median income for 1999 as a predictor. In small area estimation

problems, the census estimates are regarded as the "gold standard" since these are

the most accurate estimates available with virtually negligible standard errors. So,

using those as explanatory variables was an added advantage of the SAIPE state level

models. The fact that our estimates still improve on the SAIPE model based estimates is

a testament to the flexibility and strength of the semiparametric methodology, especially

when observations are collected over time. It also indicates that it may be worthwhile

to take into account the longitudinal income patterns in estimating the current income

conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two

types of semiparametric models we have used. Section 3.3 goes over the hierarchical

Bayesian analysis we performed. In Section 3.4, we describe the results of the data









population (or cohort) over time is often impractical. Thus, case control studies are

generally retrospective in nature.

Case-control studies have consistently attracted the attention of statisticians, and

as a result, a rich and voluminous body of work has developed over the years. Notable

work in the frequentist domain includes Cornfield (1951), who pioneered the logistic

model for the probability of disease given exposure. He was the first to demonstrate

that the exposure odds ratio for cases versus controls equals the disease odds ratio

for exposed versus unexposed and that the latter in turn approximates the ratio of the

disease rates if the disease is rare. Let D and E be dichotomous factors respectively

characterizing the disease and exposure status of individuals in a population. A common

measure of association between D and E is the (disease) odds ratio

$$\frac{P(D = 1 \mid E = 1)/P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0)/P(D = 0 \mid E = 0)}$$

By applying the Bayes theorem, the above expression can be rewritten as

$$= \frac{P(E = 1 \mid D = 1)/P(E = 0 \mid D = 1)}{P(E = 1 \mid D = 0)/P(E = 0 \mid D = 0)} \tag{1-2}$$

which is the exposure odds ratio. Another well known measure of association is the

relative risk (RR) of disease for different exposure values given by $P(D = 1 \mid E = 1)/P(D = 1 \mid E = 0)$.

For rare diseases, both $P(D = 0 \mid E = 0)$ and $P(D = 0 \mid E = 1)$

are close to one and the disease odds ratio is approximately equal to the relative risk

of disease. The classic paper by Mantel and Haenszel (1959) further clarified the

relationship between a retrospective case-control study and a prospective cohort study.

They considered a series of 2 x 2 tables as in Table 1-1

Table 1-1. A typical 2 x 2 table
Disease Status   Exposed    Not Exposed   Total
Case             n_{11i}    n_{10i}       n_{1i}
Control          n_{01i}    n_{00i}       n_{0i}
Total            e_{1i}     e_{0i}        N_i
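
As a small numerical illustration of the quantities above, the following sketch computes the cross-product odds ratio of a single 2 x 2 table and compares the disease odds ratio with the relative risk for a rare disease. The function names and the numbers used are hypothetical and are not taken from the dissertation.

```python
def odds_ratio(n11, n10, n01, n00):
    """Cross-product odds ratio for one 2 x 2 table.

    n11: exposed cases,    n10: unexposed cases,
    n01: exposed controls, n00: unexposed controls.
    """
    return (n11 * n00) / (n10 * n01)


def relative_risk(p_exposed, p_unexposed):
    """Ratio P(D=1|E=1) / P(D=1|E=0) of the two disease probabilities."""
    return p_exposed / p_unexposed


def disease_odds_ratio(p_exposed, p_unexposed):
    """Disease odds ratio computed from the same two disease probabilities."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Illustrative (made-up) rare-disease probabilities: the two measures nearly agree.
rr = relative_risk(0.002, 0.001)
dor = disease_odds_ratio(0.002, 0.001)
```

With disease probabilities this small, the two measures differ only in the third decimal place, which is the rare-disease approximation described in the text.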









3.5 Model Assessment

To examine the goodness-of-fit of the semiparametric models, we used a Bayesian

Chi-square goodness-of-fit statistic (Johnson, 2004). This is essentially an extension of
the classical Chi-square goodness-of-fit test where the statistic is calculated at every

iteration of the Gibbs sampler as a function of the parameter values drawn from the
respective posterior distribution. Thus, a posterior distribution of the statistic is obtained
which can be used for constructing global goodness-of-fit diagnostics.

To construct this statistic, we form 10 equally spaced bins $((k-1)/10, k/10)$,
$k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the

bin counts $m_k(\theta)$ to be random, where $\theta$ denotes a posterior sample of the parameters.

At each iteration of the Gibbs sampler, bin allocation is made based on the conditional
distribution of each observation given the generated parameter values, i.e., $Y_{ij}$ would be

allocated to the $k$th bin if $F(Y_{ij} \mid \theta) \in ((k-1)/10, k/10)$, $k = 1, \ldots, 10$. The Bayesian
chi-square statistic is then calculated as

$$R^B(\theta) = \sum_{k=1}^{10} \left[ \frac{m_k(\theta) - n p_k}{\sqrt{n p_k}} \right]^2$$

For the purpose of model assessment, two summary measures can be used, both
derived from the posterior distribution of $R^B(\theta)$. The first is the proportion of times the
generated values of $R^B$ exceed the 0.95 quantile of a $\chi^2_9$ distribution. Values quite close

to 0.05 would suggest a good fit. The second diagnostic is the probability that $R^B(\theta)$
exceeds a $\chi^2_9$ deviate, i.e.,

$$A = P(R^B(\theta) > X), \quad X \sim \chi^2_9$$

Since the nominal value of this probability is 0.5, values close to 0.5 would suggest a

good fit.
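
The two diagnostics can be sketched in a few lines of code. The sketch below assumes, purely for illustration, a normal sampling distribution with a common standard deviation, so that $F$ is a normal CDF; the function names and the seed are hypothetical, and the constant 16.919 is the 0.95 quantile of a chi-square distribution with 9 degrees of freedom.

```python
import math
import random

CHI2_9_Q95 = 16.919  # 0.95 quantile of a chi-square distribution with 9 df
DF = 9               # 10 bins minus 1


def normal_cdf(y, mean, sd):
    """CDF of N(mean, sd^2) at y (assumed sampling model, for illustration only)."""
    return 0.5 * (1.0 + math.erf((y - mean) / (sd * math.sqrt(2.0))))


def bayes_chisq(y, means, sd, n_bins=10):
    """Statistic R^B for ONE posterior draw (means, sd) of the parameters."""
    n = len(y)
    counts = [0] * n_bins
    for yi, mi in zip(y, means):
        u = normal_cdf(yi, mi, sd)                      # F(y | theta)
        counts[min(int(u * n_bins), n_bins - 1)] += 1   # bin ((k-1)/10, k/10)
    expected = n / n_bins                               # n * p_k
    return sum((m - expected) ** 2 / expected for m in counts)


def summarize(rb_draws, seed=0):
    """Return (share of draws above the 0.95 quantile, estimate of A)."""
    rng = random.Random(seed)
    n = len(rb_draws)
    prop_exceed = sum(r > CHI2_9_Q95 for r in rb_draws) / n
    hits = 0
    for r in rb_draws:
        # A chi-square(9) deviate is a sum of 9 squared standard normals.
        deviate = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(DF))
        hits += r > deviate
    return prop_exceed, hits / n
```

In practice `bayes_chisq` would be called once per retained Gibbs iteration and the resulting list passed to `summarize`.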
The only assumptions for this statistic to work are that the observations should be

conditionally independent and the parameter vector should be finite dimensional. The









We assume that conditional on the past observations, $Y_{it}$ depends only on the

previous $p$ observations, i.e., $(Y_{i,t-1}, Y_{i,t-2}, \ldots, Y_{i,t-p})$. Here we have to deal with the

following three types of dependence structures:

1. Dependence between response and dropout time modeled by the latent classes.

2. Short range (serial) dependence between $Y_{it}$ and $(Y_{i,t-1}, \ldots, Y_{i,t-p})$ modeled by a
MTM(p).

3. Long range or non-diminishing dependence among the $Y_{it}$'s modeled by the
subject specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as


$$\mu_{it} = E(Y_{it} \mid X_{it}, \beta) = g^{-1}(X_{it}'\beta) \tag{5-5}$$

The above model marginalizes over the subject specific random effects and over the

latent class distribution (implicitly over the dropout distribution) as well. In order to

fully specify the association due to repeated measurements and nonignorability in the

missingness process, we specify a conditional model in addition to the marginal model.

By conditional, we mean conditioned over the random effects and latent classes. We

assume that the relevant information in the dropout times is captured by the latent

variable $S$; this is reasonable because the specific latent class a subject would belong to

would solely depend on his/her dropout time. Thus, we specify a mixture distribution

over these latent classes, as opposed to over D itself.

Before delving into the model, it is important to note that the conditional model

parameters are not of main interest, and in fact will be viewed as nuisance parameters.

This is because we are not interested in estimating either subject-specific effects (i.e.

effects conditional on the random effects) or class-specific covariate effects (i.e. effects

of covariates on Y given a particular dropout class). Moreover, the conditional model

should be so specified that it is compatible with the marginal model (5-5). As we will see

below, this leads to a somewhat complicated model. Specifying this conditional model











priors on the inverse of the variance components $(\psi_1^2, \ldots, \psi_t^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$. The prior

distributions are assumed to be mutually independent. We choose small values (0.001)

for the gamma shape and rate parameters to make the priors diffuse in nature so that

inference is mainly controlled by the data distribution.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$,

$(\sigma_b^2)^{-1} \sim G(c_b, d_b)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$

denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the

expression $f(x) \propto x^{a-1}\exp(-bx)$, $x > 0$. Since we have chosen an improper prior for $\beta$,

propriety of the full posterior has to be shown. We have the following theorem.
Theorem 1. Let $\psi_{\max}^2 = \max(\psi_1^2, \ldots, \psi_t^2) = \psi_k^2$, say, for some $k \in \{1, \ldots, t\}$. Then,

posterior propriety holds if the following conditions are satisfied:

1. $(m - p - 5)/2 + c_k > 0$ and $d_k > 0$

2. $m/2 + c_j - 2 > 0$ and $d_j > 0$, $j = 1, \ldots, t$; $j \neq k$

3.3.3 Posterior Distribution and Inference

The full posterior of the parameters given the data is obtained in the usual way by

combining the likelihood and the prior distribution as follows
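$$p(\Omega \mid Y, X, Z) \propto \prod_{i=1}^{m} L(Y_i, X_i, Z_i \mid \Omega)\, \pi(\beta)\, \pi(\sigma_b^2)\, \pi(\sigma_\gamma^2) \prod_{j=1}^{t} \pi(\psi_j^2) \tag{3-6}$$

For the random walk model, there will be an additional term $\pi(\sigma_v^2)$. By the conditional

independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2] \times [\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_b^2]\,[\sigma_\gamma^2] \prod_{j=1}^{t} [\psi_j^2]$$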

Our target of inference is $\{\theta_{ij}, i = 1, \ldots, m; j = 1, \ldots, t\}$, the true median household

income of all the states. Since the marginal posterior distribution of $\theta_{ij}$ is analytically

intractable, high dimensional integration needs to be carried out in a theoretical








Now, integrating (A-5) w.r.t 0 we obtain


J NJ
p(e, O, rly) ox p(-) J y,


j 1 c
x exp (' yi Zt(t)(dt 0 Y (A-6)
j 1 J -c j= 1
Integration of (A-6) w.r.t b yields (2-12) after some minor manipulation.

(iii) The order in which p(O, 6, |0y) is integrated w.r.t the parameters does not make
any difference in the marginal posterior density of p(0). Thus, integration of p(w, 01y)
w.r.t w or p(O, 01y) w.r.t 0 will yield the same marginal posterior density p(0|y) of 0.
Remarks :

1. As in Seaman and Richardson (2004), the assumption of existence and finiteness
of E (04' J Zq(t)W(t)dt and E 4' Z,(t)V(t)dt is automatically satisfied
provided the prior density p(O) ensures that E(O) exists and is finite.

2. The posterior propriety of p(O, 6, 0 y) in (A -10) can be shown in a similar way
to that in Seaman and Richardson (2001).

3. The prior distribution p(O) of 0 induces a prior distribution on the "influence
function" {1(t), -c < t < 0} in the logistic case-control model in (2 -3) since
7(t) = O'(t), -c < t < 0.

Proof of Theorem 3. Let $D$ denote the disease status with $r + 1$ categories. As
before, let $\{X(t), -c < t < 0\}$ be the exposure trajectory with support $S =
\{Z_1(t), \ldots, Z_K(t), -c < t < 0\}$, the set of all exposure trajectories.
Let $P(D = d \mid X(t) = Z_k(t), -c < t < 0) = p_{dk}$ $(d = 0, 1, \ldots, r; k = 1, \ldots, K)$ and
$P(X(t) = Z_k(t), -c < t < 0 \mid D = 0) = \delta_k / \sum_{l=1}^{K} \delta_l$. Let $n_{dk}$ be the number of individuals
with $D = d$ and $\{X(t) = Z_k(t), -c < t < 0\}$. It can be shown that

$$P(X(t) = Z_k(t), -c < t < 0 \mid D = d) = \frac{\delta_k\, p_{dk}/p_{0k}}{\sum_{l=1}^{K} \delta_l\, p_{dl}/p_{0l}}$$









Here By = (01, 0y2), uy = (u6i, U2)', ey = (edl, e2)', bi = (bil, bi2)', =

(/01, ... /pl, 02, q2 = (711, ... 7K1i, 712 ... 7K22),

x1 ... x 0 0 ...
0 0 ... 0 1 Xi2 ... XU

and


Z ((X Tr )P ... (X TK11)p 0 ... 0
0 ... 0 (X2 712)q ... (X2 TK22)

Analogous to the univariate case, we assume $b_i \overset{ind}{\sim} N(0, \Sigma_0)$, and $\gamma \sim N(0, \Sigma_\gamma)$.

$e_{ij}$ and $u_{ij}$ are mutually independent with $e_{ij} \overset{ind}{\sim} N(0, \Sigma_{ij})$ and $u_{ij} \overset{ind}{\sim} N(0, \Psi_j)$. For
simplification purposes, we assume that $\Sigma_0 = \mathrm{diag}(\sigma_{01}^2, \sigma_{02}^2)$ and $\Sigma_\gamma = \mathrm{diag}(\sigma_{\gamma 1}^2, \sigma_{\gamma 2}^2)$,

where $\Sigma_{ij}$ is assumed to be known and is estimated from the data as in the univariate
framework. The above bivariate model can easily be generalized to a multivariate
framework if the need arises.

4.2.2.2 Bivariate random walk model

In order to model any conspicuous trend in the income observations for a specific
family size and/or a specific state, we add a time specific random component to the

simple bivariate model (4-2) as follows

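$$Y_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \tag{4-3}$$

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$.

Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{i.i.d.}{\sim} N(0, \Sigma_v)$.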
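
To make the random walk component concrete, the sketch below simulates a bivariate path starting from zero, where each step adds an independent normal increment. It assumes, for simplicity only, a diagonal increment covariance matrix; the function name, arguments, and seed are hypothetical.

```python
import random


def simulate_random_walk(t, sd1, sd2, seed=0):
    """Simulate v_1, ..., v_t with v_j = v_{j-1} + w_j and v_0 = (0, 0).

    Assumption for illustration: the increment covariance is diagonal,
    diag(sd1^2, sd2^2), so the two components of w_j are drawn independently.
    """
    rng = random.Random(seed)
    v = (0.0, 0.0)  # v_0 = 0
    path = []
    for _ in range(t):
        w = (rng.gauss(0.0, sd1), rng.gauss(0.0, sd2))  # increment w_j
        v = (v[0] + w[0], v[1] + w[1])                  # v_j = v_{j-1} + w_j
        path.append(v)
    return path


# Five time points, as an illustrative choice for the number of years.
path = simulate_random_walk(5, 1.0, 2.0)
```

A non-diagonal covariance would require drawing correlated increments, for example through a Cholesky factor, which is omitted here.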









Table 4-4. Percentage improvements of bivariate non-random
walk estimates over Census Bureau estimates
Estimate          ARB       ASRB      AAB       ASD
GNK.TS(4,3)       -0.48%    -2.52%    1.03%     -2.01%
GNK.NTS(4,3)      -8.99%    -22.45%   -8.77%    -21.33%
BSPM(1)(4,3)      7.43%     0.00%     8.81%     -1.46%
BSPM(2)(4,3)      3.38%     15.38%    4.42%     12.61%
GNK.TS(4,5)       22.19%    30.52%    21.23%    24.79%
GNK.NTS(4,5)      0.31%     -0.18%    0.33%     -3.04%
BSPM(4,5)         13.85%    23.08%    12.74%    13.57%
GNK.TS(4,3+5)     2.94%     3.56%     2.84%     1.61%
GNK.NTS(4,3+5)    -9.36%    -17.18%   -9.56%    -17.64%
BSPM(1)(4,3+5)    8.45%     7.69%     8.90%     1.05%
BSPM(2)(4,3+5)    2.37%     7.69%     4.37%     14.54%

Now let us consider the bivariate random walk model. For the case with 4 and

3 person families, the lowest comparison measures were obtained for three models

with degrees of freedom and number of knots (3, 6), (5, 6) and (9, 1) respectively. We

denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively.

Each of these models significantly improves upon the CPS and Census Bureau

estimates and is also superior to the bivariate time series and non-time series models

proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to

improve marginally over those corresponding to the non-random walk semiparametric

model. When we consider the median income estimates of 4 and 5 person families,

the random walk model with degrees of freedom 5 and 1 knot in the trajectory seems

to perform the best. The comparison measures are significantly better than the CPS,

Bureau and the non-time series model of GNK. However, they fall marginally short of

the time series estimates but fare better than the corresponding estimates obtained

from the non-random walk model (BSPM(4, 5)). We denote this model as BRWM(4,

5). Lastly, for the model with median incomes of 4 person families and the weighted

average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response

vectors, the best results were obtained for the model with 5 degrees of freedom and 1

knot in the trajectory. The comparison measures were significantly better than the CPS,








where $W$ is finite if $(m - p - 5)/2 + c_k > 0$, $d_k > 0$, $m/2 + c_j - 2 > 0$ and $d_j > 0$ for
$j = 1, \ldots, t$; $j \neq k$.
Combining (B-1) and (B-5), we have

/ < W I ... Jf {L{(Y, ,)L(ba) } L(i) ~) ()di2* (B-6)

where f* = (0 3 b). Since all the components of the integrand in (B-5) have proper
distributions, the above integral would be finite thus proving posterior propriety.
For the random walk model, the integrand in (B-1) will have an additional likelihood
term $\prod_{j} L(v_j \mid v_{j-1}, \sigma_v^2)$ and a prior term $\pi(\sigma_v^2)$. The derivation would then proceed
exactly as above and the integrand in (B-5) will also contain these additional terms. But
since both of these are proper distributions (normal and inverse gamma respectively), $I$
will still be finite under the conditions stated in the theorem.
B.2 Bivariate Small Area Model
The proof of posterior propriety for the bivariate semiparametric model is outlined
below.

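Proof of Theorem: Here, the parameter space is $\Omega^* = (\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\})$.
Here also, due to the same logic as in the univariate case, we just need to show

$$\int p(\beta)\, p(\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\})\, d\beta < \infty$$

or,
$$\int \exp\Big(-\frac{1}{2} \sum_{i,j} (\theta_{ij} - X_{ij}\beta - Z_{ij}\gamma - b_i)'\, \Psi_j^{-1}\, (\theta_{ij} - X_{ij}\beta - Z_{ij}\gamma - b_i)\Big)\, d\beta < \infty \tag{B-7}$$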

in order to prove posterior propriety.
Using the same type of algebraic manipulations as in the univariate case, the L.H.S
of (B-7) can be shown to be

| X
-X WX/- exp W./WW (B-8)
2J /J











Thus, the likelihood function for the ith state, (4-4) will have an extra component

corresponding to $v_j$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$ which has a normal distribution with mean

$v_{j-1}$ and covariance matrix $\Sigma_v$.

4.3.2 Prior Specification

To complete the Bayesian specification of our model, we need to assign prior

distributions to the unknown parameters. We assume noninformative improper uniform

prior for the polynomial coefficients (or fixed effects) $\beta$ and proper conjugate inverse

Wishart priors on the variance-covariance matrices $(\{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$. The prior

distributions are assumed to be mutually independent. We choose the inverse Wishart

parameters in such a way that the priors are diffuse in nature so that inference is mainly

controlled by the data distribution.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$,

$\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$

denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$

having the expression $f(X) \propto |X|^{-(b+p+1)/2} \exp(-\mathrm{tr}(AX^{-1})/2)$, $p$ being the order of $A$.

4.3.3 Posterior Distribution and Inference

The full posterior of the parameters given the data is obtained in the usual way by

combining the likelihood and the prior distribution as follows
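$$p(\Omega \mid Y, U, Z) \propto \prod_{i=1}^{m} L(Y_i, U_i, Z_i \mid \Omega)\, \pi(\beta)\, \pi(\Sigma_0)\, \pi(\Sigma_\gamma) \prod_{j=1}^{t} \pi(\Psi_j) \tag{4-5}$$

For the random walk model there will be an additional term $\pi(\Sigma_v)$. By conditional

independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, U, Z] \times [b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_0]\,[\Sigma_\gamma] \prod_{j=1}^{t} [\Psi_j]$$

Our target of inference is $\{\theta_{ij1}, i = 1, \ldots, m; j = 1, \ldots, t\}$, the true median income

of four-person families for all the states. Since the marginal posterior distribution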
































To my mother and to the memory of my father









comparable performances to some established methodologies, especially those involving

time series modeling techniques. Based on our findings in parts two and three, we come

to the conclusion that semiparametric and nonparametric regression models can be an

attractive alternative to the more traditional modeling frameworks, especially in situations
where information on different characteristics of small areas is available at multiple

time points in the past.









REFERENCES


Agresti, A. (2002). Categorical data analysis. Wiley.

Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response
data. Journal of the American Statistical Association 88, 669-679.

Altham, P. (1971). The analysis of matched proportions. Biometrika 58, 561-576.

Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for
case-controlled studies in cancer epidemiology. Statistician 42, 385-389.

Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction
of county crop areas using survey and satellite data. Journal of the American
Statistical Association 83, 28-36.

Bell, W. (1999). Accounting for uncertainty about variances in small area estimation.
Bulletin of the International Statistical Institute.

Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting.
Computational Statistics and Data Analysis 52, 5100-5120.

Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference.
Journal of Computational and Graphical Statistics 6, 314-331.

Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1.
International Agency for Research on Cancer, Lyon.

Breslow, N. E., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978).
Estimation of multiple relative risk functions in matched case-control studies.
American Journal of Epidemiology 108, 299-307.

Breslow, N. (1996). Statistics in epidemiology : The case-control study. Journal of the
American Statistical Association 91, 14-28.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case
control studies. Journal of the American Statistical Association 90, 157-169.

Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage
of free prostate-specific antigen to enhance differentiation of prostate cancer from
benign prostatic disease : A prospective multicenter clinical trial. Journal of the
American Medical Association 279, 1542-1547.

Cornfield, J. (1951). A method of estimating comparative rates from clinical data:
applications to cancer of the lung, breast, and cervix. Journal of the National Cancer
Institute 11, 1269-1275.

Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for
experimentally uncontrolled variables. Bulletin of the International Statistical Institute
38, 97-115.











when available, may lead to more precise estimates of the odds ratios of disease. We
use semiparametric regression procedures to model the exposure profiles of the cases

and controls and also the influence pattern of the exposure profile on the disease status.

This enables us to analyze how the present disease status of a subject is influenced

by his/her past exposure conditions conditional on the current ones. Analysis is carried

out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC)

algorithms. The proposed methodology is motivated by, and applied to a case-control

study of prostate cancer where longitudinal biomarker information is available for the

cases and controls.

The second and third parts of my dissertation deal with univariate and multivariate

semiparametric procedures for estimating characteristics of small areas across

the United States. In the second part, we put forward a semiparametric modeling

procedure for estimating the median household income for all the states of the U.S.

and the District of Columbia. Our models include a nonparametric functional part for

accommodating any unspecified time varying income pattern and also a state specific

random effect to account for the within-state correlation of the income observations.

Model fitting and parameter estimation are carried out in a hierarchical Bayesian
framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that

the semiparametric model estimates can be superior to both the direct estimates and

the Census Bureau estimates. Overall, our study indicates that proper modeling of the
underlying longitudinal income profiles can improve the performance of model based

estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric

modeling procedure for the estimation of median income of four-person families for the

different states of the U.S. and the District of Columbia while explicitly accounting

for the time varying pattern in the income observations. Our estimates tend to have

better performances than those provided by the Census Bureau and also have









marginal regression model. But this goal cannot be achieved using a non-linear link

function since it doesn't hold for the marginal covariate effects.

Heagerty (1999) proposed marginally specified logistic models which lead to direct

modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response

observation and the covariate vector corresponding to the $i$th individual at the $t$th time

point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is

specified as
$$\text{logit}\,[E(Y_{it} \mid X_{it}, \beta)] = X_{it}'\beta \tag{5-2}$$

The above structure is the marginal regression model. Now, in order to specify the

dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, the following conditional model is specified

$$\text{logit}\,[E(Y_{it} \mid X_{it}, b_i)] = \Delta_{it} + b_i \tag{5-3}$$

where $b_i \sim N(0, \theta)$. $\Delta_{it}$ can be computed by solving the following convolution equation


$$P(Y_{it} = 1) = \int P(Y_{it} = 1 \mid X_{it}, b_i)\, dF(b_i) \tag{5-4}$$

Thus $\Delta_{it}$ is a function of $\beta$ and $\theta$. In this study we will be proposing a model which will

marginalize over the random effects and the drop-out distribution to directly model

the marginal covariate effects of interest taking into account both the serial and

exchangeable dependence structure among the Yi's.
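
The convolution equation has no closed-form solution for the conditional intercept, but it is easy to solve numerically. The sketch below is a minimal illustration, not the method used in this work: it assumes a normal random intercept with standard deviation `sd`, approximates the integral with a trapezoid rule on a fixed grid, and finds the intercept by bisection. All function names, the grid, and the tolerance are hypothetical choices.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def marginal_prob(delta, sd, n_grid=201, width=8.0):
    """Approximate E[expit(delta + b)] for b ~ N(0, sd^2) by the trapezoid rule."""
    if sd == 0.0:
        return expit(delta)
    h = 2.0 * width / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -width + k * h                                  # standardized grid point
        weight = h * (0.5 if k in (0, n_grid - 1) else 1.0)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * expit(delta + sd * z) * density
    return total


def solve_delta(target, sd, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for the delta whose implied marginal probability equals target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_prob(mid, sd) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative values: marginal probability 0.2 and random-intercept sd 1.5.
delta = solve_delta(0.2, 1.5)
```

With a nonzero random-effect variance the solved intercept is larger in magnitude than the marginal logit, which reflects the usual attenuation of marginal effects relative to conditional ones.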

Let us briefly go over the necessary notations with respect to subject $i$. Let $Y_i =$

$(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped
into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class

$j$, $j = 1, \ldots, m$ $(m < T)$ such that

$$S_{ij} = \begin{cases} 1 & \text{if the } i\text{th subject is in class } j \\ 0 & \text{otherwise.} \end{cases}$$






































Figure 2-2. Sensitivity of the parameter estimates (panels A to C) and the disease probability estimates (panel D) to case-deletions. The horizontal axis of each panel is the deleted case (0 to 140).









trajectory on the binary disease outcome. Inference on these two models will be done

simultaneously and is described in Section 3.

Our modeling framework bears some resemblance to that of Zhang et al. (2007)

who used a two stage functional mixed model approach for modeling the effect of a

longitudinal covariate profile on a scalar outcome. They proposed a linear functional

mixed effects model for modeling the repeated measurements on the covariate. The

effect of the covariate profile on the scalar outcome was modeled using a partial

functional linear model. In doing so, they treated the unobserved true subject-specific

covariate time profile as a functional covariate. For fitting purposes, they developed

a two-stage nonparametric regression calibration method using smoothing splines.

Thus, estimation at both the stages was conveniently cast into a unified mixed model

framework by using the relation between smoothing splines and mixed models. The

key difference between their framework and ours is that we use Bayesian inferential

techniques to simultaneously estimate the parameters of the exposure and disease

models. Moreover, instead of a linear modeling framework, we use a combination of

linear and logistic models since our response is binary.

Exposure Trajectory Model

The exposure trajectory model is given by

$$W_{ij} = X_i(a_{ij}) + e_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij} \tag{2-1}$$

where $e_{ij} \sim N(0, \sigma^2)$, $f(a)$ is the population mean function modeling the overall

PSA trend as a function of age for all the subjects while gi(a) is the subject specific

deviation function reflecting the deviation of the ith subject specific profile from the mean

population profile.

The reason for modeling exposure as a function of age is that for a randomly

chosen subject with unknown disease status, the PSA value at a certain time point

should depend on the subject's age at that time point controlling for the time with respect









Thus, we have


| XV Y -1X' I > I A min XyUX |
=| Z> X' |-, < I(AmnZ xJ- x I

=11. -x, xx/ 1-1/2 < min pxq2
^ ^-V<(A) 2 YLXUXW


Since I W 1 = J Ajk, V 1... t
k=1

(m+d,-r-1) (m+d,-r-1)
1 1 2 I (A k) 2
k=1


Now, replacing (B-11) and (B-12)

S< xx | .('-n)-
/ | XUX'y | .. (A in


where T denotes "trace". Let Am"n

Then, I < /1 x 2I where


in the expression of I in (B-10), we have

p+q+2 t r (mdj-r-1) V1 J-1
2 H (,Ajk) 2 exp -T 2( d ...d2
j1( k=1
(B-13)


= Aim, / [1 ..., t]; m [1 ..., r].


i f (m +d-r-1) (m+dp-p-q-2)-r-
l1 = I> XyX 2 (Ak) 2 (AIn) 2
i, {k= 1,k m}
Sp q 2 (m+d -p-q-2)-r-1
= | XyX'- | n (A/,k) 2 1 |--1 2
ij {k=l,k m}
and


1exp -T(V) 2 ) d


exp -T (V2)] d1


t m .n dj -r-1
/2 { 2 -Td md(VJ Fr-1d-..}


2r 2-d 2 2
f= 1J7i}

which is finite.

Thus, in order to show posterior propriety, we have to prove that /2 < oo.




(B-11)


(B-12)


(B-14)









to represent the separate effect of matching in each matched set. Ghosh and Chen

(2002) developed general Bayesian inferential techniques for matched case-control

problems in the presence of one or more binary exposure variables. Their framework

was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000),

they based their analysis on unconditional rather than the conditional likelihood after

elimination of the nuisance parameters. Their framework included a wide variety of

links like complementary log links and some symmetric and skewed links in addition

to the usual logit and probit links. Recently Sinha et al. (2004) and Sinha et al. (2005)

proposed a unified Bayesian framework for matched case-control studies with missing

exposures. They also motivated a semiparametric alternative for modeling varying

stratum effects on the exposure distributions. The parameters were estimated in a

Bayesian framework by using a non-parametric Dirichlet process prior on the stratum

specific effects in the distribution of the exposure variable and parametric priors on all

other parameters. The interesting aspect of the Bayesian semiparametric methodology

is that it can capture unmeasured stratum heterogeneity in the distribution of the

exposure variable in a robust manner. They also extended the proposed method to

situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only

once for the cases and controls. However, some recent medical studies (Lewis et al.,

1996) have indicated that a longitudinal approach of incorporating the entire exposure

history, when available, may lead to a gain in information on the current disease status

of a subject vis-a-vis more precise estimation of the odds ratio of disease. It may also

provide insights on how the present disease status of a subject is being influenced by

past exposure conditions conditional on the current ones. Unfortunately,

rigorous statistical methods for incorporating longitudinally varying exposure information

into the case-control framework have not yet been developed. In this work,









Non-ignorable missingness can be handled by two distinct classes of models viz

pattern-mixture and selection models, first formulated by Little and Rubin (1987). These

approaches differ in the way they factor the joint distribution of the missing data and

the response. In the former approach, the population is first stratified by the pattern of

dropout resulting in a model for the whole population that is a mixture over the patterns.

On the other hand, the selection modelling approach first models the hypothetical

complete data and then a model for the missing data process (conditional on the

hypothetical complete data) is appended to the complete data model. In this study we

will focus on the Pattern mixture (PM) modeling approach.

Suppose our study consists of N subjects, each of whom can be measured at T

time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time

for the $i$th subject. $D_i$ is such that


$$D_i = \begin{cases} t & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ T + 1 & \text{if the } i\text{th subject is a completer.} \end{cases}$$

Here we assume that a subject is first measured at baseline ($t = 0$). Thus, there are

$T$ unique dropout times. In the PM approach, it is assumed that subjects with different

dropout times have different response distributions, i.e.,


$$f(y_i \mid D_i) \neq f(y_i) \implies f(y_i, D_i) \neq f(y_i)\, f(D_i) \tag{5-1}$$

So, for the $i$th subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus,

in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on

$$f(y) = \sum_{D} f(y \mid D)\, P(D).$$
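
The same averaging over dropout patterns applies to any functional of the response distribution. As a minimal sketch, the function below combines pattern-specific means into a marginal mean by weighting each pattern by its probability; the function name and the numbers are hypothetical.

```python
def marginal_mean(pattern_means, pattern_probs):
    """Marginal mean E[Y] = sum over D of E[Y | D] * P(D).

    pattern_means: dict mapping dropout time D to E[Y | D]
    pattern_probs: dict mapping dropout time D to P(D); must sum to 1
    """
    total_prob = sum(pattern_probs.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("pattern probabilities must sum to 1")
    return sum(pattern_means[d] * pattern_probs[d] for d in pattern_probs)


# Two illustrative dropout patterns with made-up means and probabilities.
overall = marginal_mean({1: 2.0, 2: 4.0}, {1: 0.25, 2: 0.75})
```

When a pattern contains very few subjects, its pattern-specific mean is poorly estimated, which is the sparsity problem that motivates grouping dropout times into classes.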
An important but realistic situation that may arise in longitudinal studies is that

the number of unique dropout times T (vis-a-vis, the number of times a subject is

measured) may be large. As a result, the number of subjects having a particular dropout

time may be quite small. Thus, stratification by dropout pattern may lead to sparse


Figure 3-4. Positions of 5 knots after realignment. The knots are the bold faced triangles
at the bottom. The region between the dashed and bold lines is the
additional coverage area gained from the realignment. (Horizontal axis: IRS mean income, 30000 to 70000.)


rearrangement. Based on the number of data points inside this region, it is clear that a

much larger proportion of observations has been captured with the knot realignment.

No knots are in the region beyond the bold vertical lines (i.e., beyond IRS mean = 56000),

possibly due to the very low density of the observations in that area. Overall, it seems

that the new knots can capture some of the underlying non-linear pattern in the dataset

which the old knots failed to achieve. We also experimented by placing all the knots in

the low density region (beyond IRS mean = 47000) but the results were not satisfactory.

This indicates that the knots should be uniformly placed throughout the range of the
independent variable to get an optimal fit.
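
For readers who want to experiment with knot placement, the sketch below builds a truncated polynomial spline design matrix for a given set of knots. The helper `quantile_knots`, which places knots at evenly spaced sample quantiles, is only one common convention and is an assumption made here for illustration; it is not claimed to be the realignment rule used in this chapter.

```python
def truncated_power_basis(x, knots, degree):
    """Rows [1, x, ..., x^degree, (x - k_1)_+^degree, ..., (x - k_K)_+^degree]."""
    rows = []
    for xi in x:
        poly = [xi ** d for d in range(degree + 1)]
        trunc = [max(xi - k, 0.0) ** degree for k in knots]
        rows.append(poly + trunc)
    return rows


def quantile_knots(x, n_knots):
    """Hypothetical helper: knots at evenly spaced order statistics of x."""
    xs = sorted(x)
    n = len(xs)
    return [
        xs[min(n - 1, int(round((k + 1) * (n - 1) / (n_knots + 1))))]
        for k in range(n_knots)
    ]


# Made-up covariate values; five knots as in the text.
income = [31000.0, 35000.0, 38000.0, 41000.0, 44000.0, 47000.0, 52000.0, 60000.0]
knots = quantile_knots(income, 5)
design = truncated_power_basis(income, knots, degree=1)
```

Refitting the model with several alternative knot vectors and comparing the resulting fit measures is a simple way to check the sensitivity to knot position that the text describes.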














We have worked with 5 knots because it performed consistently well for both

the SPM and SPRW models. On fitting the semiparametric models with the new
knot alignment, we did achieve some improvement in the results. Table 3-2 reports









like counties, cities and other substate areas. Due to the ten year lag in the release of

successive census values, there was a large gap in information concerning fluctuations

in the economic situation of the country in general and local areas in particular. The

establishment of the SAIPE program has largely mitigated this issue.

The current methodology of the SAIPE program is based on combining state

and county estimates of poverty and income obtained from the American Community

Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class

of models (Fay and Herriot, 1979). The indicators are generally the mean and median

adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly

known as Food Stamp Program data), the most recent decennial census, intercensal

population estimates, Supplemental Security Income Recipiency and other economic

data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have

been used since January 2005 on the recommendation of the National Academy of

Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income

and poverty estimates until 2004 were based on data from the Annual Social and

Economic Supplement (ASEC) of the Current Population Survey (CPS).

Apart from various poverty measures, the SAIPE program provides annual state

and county level estimates of median household income. At this point, direct ACS

estimates of median household income are only available for the period 2005-2008.

Thus, for illustration purposes, we have considered data from ASEC for the period

1995-1999 in order to estimate the state level median household income for 1999.

This is because the most recent census estimates correspond to the year 1999 and

these census values can be used for comparison purposes. The SAIPE regression

model for estimating the median household income for 1999 uses as covariates the

median adjusted gross income (AGI) derived from IRS tax returns and the median

household income estimate for 1999 obtained from the 2000 Census. The response

variable is the direct estimate of median household income for 1999 obtained from the









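Since $\operatorname{logit} P(S_{i1} + \cdots + S_{ik} = 1 \mid D_i) = \lambda_{0k} + \lambda_1 D_i$, we have,

\[
\begin{aligned}
P(S_{ij} = 1 \mid D_i) &= P(S_{i1} + \cdots + S_{ij} = 1 \mid D_i) - P(S_{i1} + \cdots + S_{i,j-1} = 1 \mid D_i) \\
&= \frac{e^{\lambda_1 D_i}\left(e^{\lambda_{0j}} - e^{\lambda_{0,j-1}}\right)}{\left(1 + e^{\lambda_{0j} + \lambda_1 D_i}\right)\left(1 + e^{\lambda_{0,j-1} + \lambda_1 D_i}\right)}
\end{aligned}
\tag{5-13}
\]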
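
As a numerical illustration of the cumulative-logit construction above, the following Python sketch computes class membership probabilities as successive differences of cumulative logistic probabilities. The function name, the argument layout (m - 1 increasing cut-points plus one common slope on the dropout time) and the example values are assumptions made for illustration only; they are not taken from the dissertation.

```python
import math


def class_probabilities(cutpoints, slope, dropout_time):
    """Return [P(S_i1 = 1 | D_i), ..., P(S_im = 1 | D_i)].

    cutpoints    : increasing intercepts, one per cumulative logit (assumed layout)
    slope        : common coefficient on the dropout time
    dropout_time : observed dropout time D_i
    """
    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(S_i1 + ... + S_ij = 1 | D_i); the last one is 1.
    cumulative = [expit(c + slope * dropout_time) for c in cutpoints] + [1.0]
    # Successive differences give the individual class probabilities.
    return [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]


if __name__ == "__main__":
    # Hypothetical values: three cut-points give four latent classes.
    print(class_probabilities([-1.0, 0.5, 2.0], slope=0.3, dropout_time=2))
```

Because the cut-points are increasing, every difference is positive and the probabilities sum to one.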

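Now, as mentioned earlier, $D_i$ is the dropout time for the $i$th subject. Also, there are
$T$ unique dropout times. Let, for $t = 1, 2, \ldots, T$,

\[
\psi_{it} =
\begin{cases}
1 & \text{if the $i$th subject drops out between the $(t-1)$th and $t$th observation times} \\
0 & \text{otherwise.}
\end{cases}
\]

Thus $\psi_i = (\psi_{i1}, \psi_{i2}, \ldots, \psi_{iT}) = (0, 0, \ldots, 0)$ would imply that the $i$th subject is a completer.
So, $D_i = t \Leftrightarrow \psi_{it} = 1$ and $D_i = T + 1 \Rightarrow (\psi_{i1}, \psi_{i2}, \ldots, \psi_{iT}) = (0, 0, \ldots, 0)$. Let $p_t$ denote
the probability of dropping out between times $t - 1$ and $t$, $t = 1, 2, \ldots, T$. So, for the $i$th
subject, the density of $D_i$ would be Multinomial, i.e.,

\[
P(D_i = d_i) = p_1^{\psi_{i1}} \cdots p_T^{\psi_{iT}} \left(1 - p_1 - \cdots - p_T\right)^{1 - \psi_{i1} - \cdots - \psi_{iT}}, \quad d_i = 1, 2, \ldots, T + 1 \tag{5-14}
\]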

5.2.4 Specification of Priors
We assume that the number of latent classes $m$ follows a truncated Poisson
distribution with rate parameter $\eta$, truncated at an integer $s$ between 1 and $T$ (the number
of unique dropout times), i.e.,

\[
p(m) \propto \frac{e^{-\eta} \eta^{m}}{m!}, \quad m = 0, 1, \ldots, s, \quad \text{where } 1 < s < T
\]
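
To make the truncated Poisson prior concrete, the sketch below normalises Poisson weights over a finite range of class counts. The function name and the example rate and truncation point are hypothetical; the lower limit defaults to 0 to follow the range printed above and can be changed if at least one class is required.

```python
import math


def truncated_poisson_prior(rate, upper, lower=0):
    """Normalised Poisson(rate) probabilities restricted to lower..upper."""
    weights = {
        m: math.exp(-rate) * rate ** m / math.factorial(m)
        for m in range(lower, upper + 1)
    }
    total = sum(weights.values())  # normalising constant lost by truncation
    return {m: w / total for m, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical choice: rate 2, at most 5 latent classes.
    for m, prob in truncated_poisson_prior(2.0, 5).items():
        print(m, round(prob, 4))
```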
For the other parameters, we assume the following priors

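1. Let $\beta \sim N_q(\beta_0, \Sigma_{\beta_0})$, assuming that $\forall i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$, $X_{it}$ is $q$
dimensional.

2. Let $\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(m)} \overset{iid}{\sim} N_r(\alpha_0, \Sigma_{\alpha_0})$, where $r < q$ since $Z_{it} \subset X_{it}$ $\forall i = 1, 2, \ldots, N$
and $t = 1, 2, \ldots, T$.

3. Let $\sigma_1, \sigma_2, \ldots, \sigma_m \overset{iid}{\sim} U(a, b)$ where $0 < a < b < \infty$.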











Posterior Sampling


The full posterior of the parameters is given by
m q
p(2Y, D, a) N x L(Yi, Di, ai, l )() ( )( ) (-)


where $\beta^{(1)} = (\beta_0, \ldots, \beta_p)$ and $\theta^{(1)} = (\theta_0, \ldots, \theta_r)$. The full posterior can be factorized as


[Q|2Y,D,a] oc [Y|3, b, a] [f / {A[Z Di, Ai,,/,a,f ,bi,][Ail,]}dzidA] [(2) ]
=1. "
N q q
] H\-}[b b^[bUlj2][0(1)][ 1 ]- ]- ] 3-e]
i lj O j=O

where $\Omega$ is the entire parameter space. Our main parameter of interest is $\theta$ in (2-5).

Since the marginal posterior distribution of $\theta$ is analytically intractable, we construct an

MCMC algorithm to sample from its full conditionals. In doing so, we use multiple chains

and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman

and Rubin, 1992).
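
The convergence check mentioned above can be sketched as follows. This is a minimal version of the potential scale reduction factor for one scalar parameter, assuming several chains of equal length; it omits the sampling-variability corrections discussed in the literature, and the function name is hypothetical rather than part of any package used in the dissertation.

```python
import statistics


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    # Two toy chains that disagree strongly give a value far above 1.
    print(potential_scale_reduction([[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]))
```

Values close to 1 suggest that the chains are sampling from the same distribution; values well above 1 suggest that longer runs are needed.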

2.4 Bayesian Equivalence

As mentioned in Section (1.2), Seaman and Richardson (2004) showed that for

certain choices of the priors on the log odds, posterior inference for the parameter of

interest based on a prospective logistic model can be shown to be equivalent to that

based on a retrospective one. As a result, a prospective modeling framework can be

used to analyze case-control data which are generally collected retrospectively. Here we

show that the Bayesian equivalence results of Seaman and Richardson (2004) can be

extended to the semiparametric framework we have proposed. This enables us to use

a prospective logistic framework (as described in Section (2.2.2)) to analyze the PSA

dataset.

Our modeling framework hinges on the idea that for every subject, instead of a

single exposure observation, a series of past exposure observations is available.

We use this "exposure trajectory" or "exposure profile" in analyzing the present









4. $\lambda \sim N_m(\lambda_0, \Sigma_{\lambda_0})$

5. $(p_1, p_2, \ldots, p_T) \sim \text{Dirichlet}(\tau_1, \tau_2, \ldots, \tau_T)$

6. 6, ..., 6 ii"d Nr(60, ) for the same reasons as in (3).

7. For the time being we keep the prior of $\phi$, $\pi(\phi)$, unspecified.

Now, combining (5-10)-(5-14) and the priors specified above, we can write down the

full posterior distribution of $m$ and $\omega$, $\pi(\omega, m \mid Y, X, D)$, up to a constant. Thus, we can get

the full conditional distribution of all the relevant parameters and proceed with sample

generation using MCMC.

The assumption of conditional independence between $Y_i$ and $D_i$ given $S_i$ and the

covariates can be verified by performing a likelihood ratio test (Frequentist) or using

Bayes factors (Bayesian). The null model is given by (5-6) and the alternative model

may be written as
m p
g{E(Yt Yk, k < t, b,, S,, Di)} = At + SUyZ + 7itkYt-k + b f(D,) (5-15)
j=1 k=1

where $f(D_i)$ may be a smooth but unspecified function of $D_i$. Thus, the null hypothesis

of conditional independence (between $Y_i$ and $D_i$ given $S_i$ and $X_i$) would be simply

f(Di) = 0. The test can be carried out by first fitting the null model (??). Then, the
posterior probability of class membership for each subject can be estimated by

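\[
P(S_{ij} = 1 \mid D_i, Y_i, X_i, \hat{\omega}) = \frac{\int L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \hat{\alpha}^{(j)}, \hat{\phi})\, p(S_{ij} = 1 \mid D_i; \hat{\lambda})\, p(D_i \mid \hat{p})\, dF(b_i \mid S_{ij}, \hat{\sigma}_j^2)}{L_i(D_i, Y_i, \hat{\omega})}
\]

where $\hat{\omega}$ is obtained by performing a full Bayesian analysis on the full conditionals of

$\omega$. The likelihood ratio test (LRT) is then performed by fitting models (??) and (5-14)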

using a weighted likelihood (the weights being the above posterior probability of class

membership). An alternative way of doing the above conditional independence tests

would be to use score tests based on smoothing splines as used in proportional hazards

models by Lin et al. (2006).











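Since the number of latent classes $m$ is treated as a random variable itself, we assume a
prior for $m$ along with $\omega$. Let the priors be respectively denoted by $\pi(m)$, $\pi(\beta)$, $\pi(\alpha)$,
$\{\pi(\sigma_l^2), l = 1, 2, \ldots, m\}$, $\pi(\lambda)$, $\pi(p)$, $\pi(\phi)$, and $\pi(\delta)$. So the full posterior of $m$ and $\omega$ is
given by

\[
\pi(\omega, m \mid Y, X, D) \propto \prod_{i=1}^{N} L_i(\omega \mid Y_i, X_i, D_i)\, \pi(m)\pi(\beta)\pi(\alpha)\pi(\lambda)\pi(p)\pi(\phi)\pi(\delta) \Big\{ \prod_{l=1}^{m} \pi(\sigma_l^2) \Big\} \tag{5-9}
\]

We can avoid the integral (w.r.t. $b_i$) in (5-8) if we also sample the $b_i$'s along with the other
parameters from the full posterior (5-9). In that case, the full posterior may be rewritten
as

\[
\pi(\omega, m \mid Y, X, D) \propto \prod_{i=1}^{N} L_i^{*}(\omega \mid Y_i, X_i, D_i)\, \pi(m)\pi(\beta)\pi(\alpha)\pi(\lambda)\pi(p)\pi(\phi)\pi(\delta) \Big\{ \prod_{l=1}^{m} \pi(\sigma_l^2) \Big\} \tag{5-10}
\]

where

\[
L_i^{*}(\omega \mid Y_i, X_i, D_i) = \sum_{j=1}^{m} L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, p(S_{ij} = 1 \mid D_i; \lambda)\, p(D_i \mid p)\, p(b_i \mid S_{ij} = 1, \sigma_j^2) \tag{5-11}
\]

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional
on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution

\[
L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \prod_{t} \mu_{it}^{Y_{it}} (1 - \mu_{it})^{1 - Y_{it}} \tag{5-12}
\]

where $\mu_{it} = E(Y_{it} \mid Y_{it-1}, Y_{it-2}, \ldots, Y_{it-p}, b_i, S_{ij} = 1) = g^{-1}\big(\Delta_{it} + b_i + \sum_{k=1}^{p} \gamma_{it,k} Y_{it-k}\big)$.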









out over what has been already done above. I will briefly go over some of the possible

extensions below. These extensions are independent of the specific area or setting

where they are applied, i.e., these equally apply to the case control and small area

scenarios we have mentioned before.

5.1 Adaptive Knot Selection

As mentioned before, we have used penalized splines to model the exposure and

influence profiles in the case control framework and the income trajectories in the small

area estimation problem. As explained in Section 1.4, selection and proper positioning

of knots is a vital aspect in any smoothing procedure involving splines. Traditionally,

knots are placed at equally spaced sample quantiles of the independent variables and

that is what we have done in both the case control and small area scenarios. But this

procedure has its fair share of drawbacks; this was evident in the univariate small area

problem where the original placement of the knots failed to account for the low density

region of the data pattern where the non-linearity was mostly concentrated. This was

probably because of the quantile dependent placement procedure of the knots.
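
The behaviour described above can be illustrated with a small Python sketch that places knots at equally spaced sample quantiles. The function name, the linear interpolation rule for quantiles and the toy data are assumptions for illustration; the exact quantile convention used in the analysis is not stated in this section.

```python
def quantile_knots(x, num_knots):
    """Knots at the k/(num_knots + 1) sample quantiles, k = 1..num_knots."""
    xs = sorted(x)
    n = len(xs)
    knots = []
    for k in range(1, num_knots + 1):
        position = k / (num_knots + 1) * (n - 1)  # fractional index into xs
        lower = int(position)
        upper = min(lower + 1, n - 1)
        weight = position - lower
        # Linear interpolation between the two neighbouring order statistics.
        knots.append(xs[lower] + weight * (xs[upper] - xs[lower]))
    return knots


if __name__ == "__main__":
    # Toy covariate that is dense at low values and sparse at high values.
    skewed = [i ** 2 for i in range(101)]
    print(quantile_knots(skewed, 4))
```

With a covariate that is dense at low values, every knot falls in the dense region and the sparse upper part of the range receives none, which is the drawback noted in the text.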

Recently, there has been some research on data-driven or "adaptive" knot

placement procedures in which the number and locations of the knots are controlled

by the data itself rather than being pre-specified. The advantage of this procedure is that

fewer knots would be required, and these would be placed in "optimal" locations

along the domain. Thus, the resulting spline fit will be flexible enough to capture any

underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches

have been proposed towards this end. Some Frequentist contributions include Friedman

(1991) and Stone et al. (1997) who used forward and backward knot selection schemes

until the "best" model is identified. Zhou and Shen (2001) used an alternative algorithm

which led to the addition of knots at locations which already possessed some knots.

Bayesian treatment of this problem revolves around the notion of treating the knot number

and knot locations as free parameters. Some notable Bayesian contributions include











Table 2-2. Posterior means and 95% credible intervals of odds ratio for
I = (-10, -5) for the linear influence model

                          Age at Diagnosis
           50              60             70             80
Mean       4.99            3.27           2.22           1.56
95% C.I.   (1.96, 10.41)   (1.91, 5.36)   (1.67, 2.98)   (1.10, 2.29)

2.6.3 Overall Model Comparison

For both the constant and linear influence models, we calculated the PPL criterion

(described in Section 5.1) corresponding to different trajectory intervals and number of

knots. These values are given in Table 2-3.

The PPL values for the linear model were smaller than those corresponding to the

constant influence model. Thus, we can conclude that for the prostate cancer data, the

class of linear influence models fits better than the class of constant influence models.

For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across

all trajectory lengths. For a given trajectory, the models tend to improve with an increase

in the number of knots until a certain number of knots is reached. A further increase

in the number of knots tends to worsen the fit; this agrees with the findings of Ruppert (2002). The

important point to note here is that the number of knots and the length of the exposure

trajectory seem to interact in their effect on model fit. The best fitting constant influence

model seems to be the one with exposure trajectory (-10, 0) and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer

exposure trajectories are taken into account. Thus, inclusion of past exposures results

in an improvement of model fit. This may be indicative of the fact that past exposure

observations contain a significant amount of information about the current disease status.

In addition, for the trajectory interval I = (-10, -5), the PPL criteria corresponding

to the linear and constant influence models are moderately small. Thus, exposure

observations recorded 5-10 years prior to diagnosis also provide a modest amount of

information toward predicting the current disease status, corroborating the conclusions









One of the major qualitative differences between the above model and our semiparametric

models is that the former doesn't have a state specific random effect. In fact, it would

also be interesting to compare the above model with the basic semiparametric model

(SPM) with 0 knots i.e

\[
Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_i + u_{ij} + e_{ij} \tag{3-8}
\]

where $b_i \overset{i.i.d.}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distribution as above. Clearly,

the only difference between (3-7) and (3-8) is that the former contains a time specific

random component while the latter contains an area specific random component. Ghosh

et al. (1996) showed that the estimates from the bivariate version of the GNK model

(3-7) perform much better than the census bureau estimates in estimating the median

household income of 4-person families in the United States. Table 3-6 depicts the

comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
GNK          0.0397   0.0025   1,709.58   5,229,869
SPM(0)       0.0337   0.0017   1,408.7    3,137,978
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010


It is clear that, although the estimates from the GNK model perform slightly better

than the CPS, those are quite inferior to the semiparametric and SAIPE estimates. This

may be because the state specific random effects in the semiparametric models can

account for the within-state correlations in the income values, something which the GNK

model fails to do. Since the comparison measures for SPM(0) are much lower than

those for the GNK model, we can also conclude that the area specific random effect is

much more critical than a time specific random component in this situation.
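
The four comparison measures in Table 3-6 are defined in a section that lies outside this excerpt. The sketch below assumes the definitions commonly used in this literature: average relative bias (ARB), average squared relative bias (ASRB), average absolute bias (AAB) and average squared deviation (ASD), each averaged over areas with the census value as the benchmark. The function name and the toy numbers are hypothetical.

```python
def comparison_measures(census, estimates):
    """Return (ARB, ASRB, AAB, ASD) under the assumed definitions.

    census    : benchmark values c_i, one per area
    estimates : model or survey estimates e_i for the same areas
    """
    n = len(census)
    pairs = list(zip(census, estimates))
    arb = sum(abs(c - e) / c for c, e in pairs) / n        # average relative bias
    asrb = sum(((c - e) / c) ** 2 for c, e in pairs) / n   # average squared relative bias
    aab = sum(abs(c - e) for c, e in pairs) / n            # average absolute bias
    asd = sum((c - e) ** 2 for c, e in pairs) / n          # average squared deviation
    return arb, asrb, aab, asd


if __name__ == "__main__":
    # Two hypothetical areas, not data from the study.
    print(comparison_measures([100.0, 200.0], [110.0, 180.0]))
```

Under these definitions, smaller values indicate estimates closer to the census benchmark, which matches the way the table is read in the text.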









we think that our models provide a satisfactory fit to the data set and also that there are

no coding errors.

3.6 Discussion

The proper estimation of median household income for different small areas is one

of the principal goals of the U.S. Census Bureau. These estimates are frequently used

by the Federal Government for the administration and maintenance of different federal

programs and also for the allotment of federal grants to local jurisdictions. Although

these estimates are available annually for every state, the U.S. Census Bureau generally

uses a non-longitudinal approach in their estimation procedure based on the Fay-Herriot

model (Fay and Herriot, 1979). In this study, we have proposed a semiparametric class

of models which exploit the longitudinal trend in the state-specific income observations.

In doing so, we have modeled the CPS median income observations as an "income

trajectory" using penalized splines (Eilers and Marx, 1996). We have also extended the

basic semiparametric model by adding a time series random walk component which can

explain any specific trend in the income levels over time. We have used as our covariate,

the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states.

Analysis has been carried out in a hierarchical Bayesian framework. Our target of

inference has been the median household incomes for all the states of the U.S. and the

District of Columbia for the year 1999. We have evaluated our estimates by comparing

them with the corresponding census estimates of 1999 using some commonly used

comparison measures.

Our analysis has shown that information on past median income levels of different

states does provide strength towards the estimation of state specific median incomes

for the current period. In fact, if there is an underlying non-linear pattern in the median

income levels, it may be worthwhile to capture that pattern as accurately as possible and

use that in the inferential procedure. In terms of modeling the underlying observational

pattern, the positioning of knots proved to be both important and interesting. The











3 ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A
  BAYESIAN SEMIPARAMETRIC APPROACH

  3.1 Introduction
    3.1.1 SAIPE Program and Related Methodology
    3.1.2 Related Research
    3.1.3 Motivation and Overview
  3.2 Model Specification
    3.2.1 General Notation
    3.2.2 Semiparametric Income Trajectory Models
      3.2.2.1 Model I : Basic Semiparametric Model (SPM)
      3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)
  3.3 Hierarchical Bayesian Inference
    3.3.1 Likelihood Function
    3.3.2 Prior Specification
    3.3.3 Posterior Distribution and Inference
  3.4 Data Analysis
    3.4.1 Comparison Measures and Knot Specification
    3.4.2 Computational Details
    3.4.3 Analytical Results
    3.4.4 Knot Realignment
    3.4.5 Comparison with an Alternate Model
  3.5 Model Assessment
  3.6 Discussion

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A
  MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH

  4.1 Introduction
    4.1.1 Census Bureau Methodology
    4.1.2 Related Literature
    4.1.3 Motivation and Overview
  4.2 Model Specification
    4.2.1 Notation
    4.2.2 Semiparametric Modeling Framework
      4.2.2.1 Simple bivariate model
      4.2.2.2 Bivariate random walk model
  4.3 Hierarchical Bayesian Analysis
    4.3.1 Likelihood Function
    4.3.2 Prior Specification
    4.3.3 Posterior Distribution and Inference
  4.4 Data Analysis
    4.4.1 Comparison Measures and Knot Specification
    4.4.2 Computational Details
    4.4.3 Analytical Results
  4.5 Conclusion and Discussion









BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS


By
DHIMAN BHADRA


















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2010









aim is to examine whether past exposure observations can contribute significantly

towards predicting the current disease status of a subject given his/her current exposure

information. In doing so, we will also test how differential lengths of the PSA trajectories

affect the current probability of disease for a particular individual.

For the purpose of our analysis, we have used a linear p-spline (p = 1) with a

subject specific slope parameter to model the exposure trajectory as follows
\[
Y_{ij} = \beta_0 + \beta_1 (t_{ij} + a_i) + \sum_{k=1}^{K} \beta_{1k} (t_{ij} + a_i - \tau_k)_{+} + b_i (t_{ij} + a_i) + \epsilon_{ij} \tag{2-15}
\]
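
A linear p-spline of this kind can be evaluated with a truncated linear basis. The Python sketch below is an illustration only: the function names, the argument layout (intercept, slope, then one coefficient per knot) and the numerical values are assumptions, not the fitted model from the analysis.

```python
def truncated_linear_basis(x, knots):
    """Basis [1, x, (x - knot_1)_+, ..., (x - knot_K)_+] for a linear p-spline."""
    return [1.0, x] + [max(x - knot, 0.0) for knot in knots]


def trajectory_mean(x, beta, knots, subject_slope=0.0):
    """Population spline at x plus a subject specific linear term subject_slope * x."""
    basis = truncated_linear_basis(x, knots)
    fixed_part = sum(coef * value for coef, value in zip(beta, basis))
    return fixed_part + subject_slope * x


if __name__ == "__main__":
    knots = [2.0, 4.0, 6.0]            # hypothetical knot locations
    beta = [1.0, 2.0, 0.5, -1.0, 3.0]  # hypothetical coefficients
    print(trajectory_mean(5.0, beta, knots, subject_slope=0.2))
```

Each truncated term is zero to the left of its knot, so the fitted curve is piecewise linear with a change of slope at every knot.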
For the prospective disease model (2-3), we considered two specific scenarios viz.

constant influence, $\gamma(t + a_i) = \theta_0$, and linear influence, $\gamma(t + a_i) = \theta_0 + \theta_1 (t + a_i)$. The

results for these two cases are summarized below.

2.6.1 Constant Influence Model

In this parametrization, the area under the PSA process, $\{X_i(t + a_i), -c \le t \le 0\}$,

acts as the covariate and $\theta_0$ signifies its effect on the disease probability. We have used

different values of "c" (time, in years, by which we go back in the past to record the

exposure history of a subject) to analyze the effect of differential areas under the PSA

process on the current disease state.

On fitting the above model, we observed that for all trajectory lengths, $\theta_0$ is

significant (its 95% credible interval does not contain 0). For any particular interval

(i.e., choice of c), the posterior means and 95% credible intervals of $\theta_0$ do not change

much with the number of knots (K). In addition, $\theta_0$ increases as the trajectory length

decreases, i.e., as we move closer to the point of diagnosis. This is likely related to the

scale of the area under the PSA process but it also seems to support the well known

medical fact that total PSA is a better discriminator of prostate cancer at times closer to

diagnosis than at times further off (Catalona et al., 1998). To assess the impact of only

the past PSA observations on the current disease state, we considered the exposure

interval I = (-10, -5) and 3 knots in the trajectory. The posterior mean of $\theta_0$ is 0.298









LIST OF FIGURES

Figure                                                                          page

2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column)
    and 3 randomly sampled controls (2nd column) plotted against age.             36

2-2 Sensitivity of $\beta$, $\theta_0$, $\theta_1$ and disease probability estimates to case-deletions.   56

3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean
    and median incomes. (1st column : IRS Mean Income; 2nd column : IRS Median
    Income).                                                                      63

3-2 Plots of CPS median income against IRS mean and median incomes for all
    the states of the U.S. from 1995 to 1999.                                     65

3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against
    IRS mean income. The knots are depicted as the bold faced triangles at the
    bottom.                                                                       75

3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles
    at the bottom. The region between the dashed and bold lines is the additional
    coverage area gained from the realignment.                                    76

3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution
    of the basic semiparametric and semiparametric random walk models. The
    X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees
    of freedom.                                                                   81









Nurminen, M. and Mutanen, P. (1987). Exact Bayesian analysis of two proportions.
Scandinavian Journal of Statistics 14, 67-77.

O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics
60, 739-746.

Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Non-parametric small
area estimation using penalized spline regression. Journal of the Royal Statistical
Society, Series B 70, 265-286.

Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing
covariates. Applied Statistics 49, 145-156.

Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies.
Biometrika 91, 321-330.

Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case control
studies. Biometrika 66, 403-411.

Rao, J. N. K. (2003). Small Area Estimation. Wiley-Interscience, New York.

Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched
case-control studies with missing covariate data. Biometrika 89, 905-916.

Robinson, G. (1991). That BLUP is a good thing : the estimation of random effects.
Statistical Science 6, 15-31.

Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to
case-control studies with errors in covariables. Journal of the American Statistical
Association 91, 722-732.

Roy, J. (2003). Modeling longitudinal data with non-ignorable dropouts using a latent
dropout class model. Biometrics 59, 829-836.

Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for
nonignorable dropouts with many possible dropout times. Biometrics 64, 538-545.

Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130-134.

Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of
Computational and Graphical Statistics 11, 735-757.

Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting.
Australian and New Zealand Journal of Statistics 42, 205-224.

Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge
University Press, Cambridge, U.K.

Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression
models with missing covariates. Biometrics 56, 384-388.











normal prior variances to make the priors diffuse in nature so that inference is mainly

controlled by the data distribution.

2.3.3 Posterior Computation

Likelihood Approximation

As mentioned in Section 3.1, we have used the data augmentation algorithm

of Albert and Chib (1993) to approximate the likelihood and thus simplify posterior

inference. They showed that a logistic regression model on binary outcomes can be

well approximated by an underlying mixture of normal regression structure on latent

continuous data. In doing so, it can be shown that a logit link is approximately equivalent

to a Student-t link with 8 degrees of freedom.

As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that

$D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let $Z_i$ be independently distributed from a t

distribution with location Hi = a O/'Mi, + b'Qif, scale parameter 1 and degrees of

freedom $\nu$. Equivalently, with the introduction of the additional random variable $\lambda_i$, the

distribution of $Z_i$ can be expressed as a scale mixture of normal distributions

\[
Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \quad \lambda_i \sim \text{Gamma}(\nu/2, 2/\nu)
\]
where the Gamma pdf is proportional to $\lambda_i^{\nu/2 - 1} \exp(-\nu \lambda_i / 2)$. Using this approximation,

we can replace the logit link by a mixture of normals and can rewrite (2-6) as


L(Yi,Di,ai|ji) oc f{p(YuS, Sa} A Ip(Z|Hi, 1/Ai) G(Ai\v/2, 2/v)dzidA
j= 1
N q
x p(j3(2)),o)p( (2)jaia)p(b(2) ,) 1 gP(b2 i )
i=1 j=0

where $p(U \mid a, b)$ denotes a normal density with mean a and variance b while $G(V \mid a, b)$

denotes a gamma density with shape a and rate b. Moreover, S, = p-,,(a,)'3 +

Cq,,(a,)'b, and A, = {/(Z, > 0)I(Di = 1)+ I(Z, < 0)I(Di = 0)}.
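
The scale mixture representation can be checked by simulation. The sketch below draws the mixing variable from a Gamma distribution with shape $\nu/2$ and scale $2/\nu$, which matches the pdf proportional to $\lambda^{\nu/2-1}\exp(-\nu\lambda/2)$ stated earlier, and then draws the latent variable from a normal with precision equal to that draw. Python's `random.gammavariate(alpha, beta)` takes a shape and a scale. The function name, the location value and the sample size are assumptions for illustration.

```python
import random


def draw_latent(location, nu, rng):
    """One draw of (Z, lambda) from the normal scale mixture behind a t link."""
    lam = rng.gammavariate(nu / 2.0, 2.0 / nu)    # shape nu/2, scale 2/nu
    z = rng.gauss(location, (1.0 / lam) ** 0.5)   # normal with variance 1/lambda
    return z, lam


if __name__ == "__main__":
    rng = random.Random(0)
    zs = [draw_latent(0.0, 8.0, rng)[0] for _ in range(100000)]
    mean = sum(zs) / len(zs)
    variance = sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)
    # For a t distribution with nu = 8 the variance is nu / (nu - 2) = 4/3.
    print(mean, variance)
```

Marginally over the mixing variable, the draws of Z follow a t distribution with $\nu$ degrees of freedom centred at the location, which is why the augmented model reproduces the t link.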









census median (b) and the adjusted census medians (c) corresponding to four person

families and the weighted average of three and five person families as covariates. The

base year census median denotes the median income estimate obtained from the most

recent decennial census while the adjusted census median (c) for the current year is

obtained by the relation

\[
\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)
\]
Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA

for the current and base years respectively. Thus, in the above expression, the current

year adjusted census median estimate is obtained by adjusting the base year census

median by the proportional growth in the PCI between the base year and the current

year. In the regression equation, the base year census median adjusts for any possible

overstatement of the effect of change in the PCI in estimating the current median

incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay

(1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median

income estimate and the estimates obtained from the regression equation.

4.1.2 Related Literature

The estimation of median incomes for small areas has received sustained attention

over the years. Datta et al. (1993) extended and refined the ideas of Fay (1987) and

proposed a more appealing empirical Bayesian procedure. They also performed a

univariate and multivariate hierarchical Bayesian analysis of the same problem and

showed that both the EB and HB procedures resulted in significant improvement over

the CPS median income estimates for the univariate and multivariate models. However,

the multivariate model resulted in considerably lower standard error and coefficient of

variation than the univariate model although the point estimates were similar. Later,

Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series

analysis of the same problem by exploiting the inherent repetitive nature of the CPS

median income estimates. In doing so, they estimated the statewide median income









exposures. They achieved this by replacing the usual binomial model by a multinomial

one and using an MCMC scheme to estimate the log odds ratio of disease at each

category with respect to the baseline category. As in Muller and Roeder, they assumed

a prospective logistic likelihood and a flexible prior for the exposure distribution and

derived the implied retrospective likelihood. Muller et al. (1999) considered any number

of continuous and binary exposures. However, in contrast to Seaman and Richardson,

they specified a retrospective likelihood and then derived the implied prospective

likelihood. They also addressed the problem of handling categorical and quantitative

exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework

by discretizing them into groups and little information is lost if the discretization is

sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in

exposure by approximating the imprecisely measured exposure by a discrete distribution

supported on a suitably chosen grid. In the absence of measurement error, the support

is chosen as the set of observed values of the exposure, a device that resembles the

Bayesian Bootstrap (Rubin, 1981). They assigned a Dirichlet(1, 1,..., 1) prior on the

probability vector corresponding to the grid points. Seaman and Richardson (2004)

proved equivalence between the prospective and retrospective likelihood in the Bayesian

context. Specifically, they showed that posterior distribution of the log-odds ratios based

on a prospective likelihood with a uniform prior distribution on the log odds (that an

individual with baseline exposure is diseased) is exactly equivalent to that based on a

retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in

the control group. Thus, Bayesian analysis of case-control studies can be carried out

using a logistic regression model under the assumption that the data were generated

prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies

when cases are individually matched to controls. They introduced nuisance parameters









Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis
for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age   (-3,0)              (-5,0)              (-8,0)               (-10,0)
50    3.96 (2.10, 7.63)   4.57 (2.32, 8.73)   5.26 (2.41, 11.01)   5.46 (2.46, 11.24)
55    3.34 (2.02, 5.78)   3.75 (2.15, 6.43)   4.19 (2.24, 7.77)    4.32 (2.29, 7.95)
60    2.83 (1.92, 4.39)   3.08 (2.00, 4.77)   3.36 (2.08, 5.46)    3.44 (2.11, 5.63)
65    2.41 (1.79, 3.35)   2.55 (1.83, 3.59)   2.70 (1.90, 3.91)    2.76 (1.92, 4.02)
70    2.06 (1.62, 2.70)   2.12 (1.64, 2.77)   2.19 (1.68, 2.89)    2.22 (1.69, 2.97)
75    1.78 (1.41, 2.32)   1.77 (1.41, 2.24)   1.79 (1.41, 2.31)    1.80 (1.41, 2.38)
80    1.54 (1.16, 2.12)   1.48 (1.13, 1.98)   1.46 (1.11, 1.99)    1.47 (1.09, 2.07)

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios

corresponding to different trajectory lengths and age at diagnosis when m = 0.5. For a

fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This

seems to support the notion that younger subjects tend to have a more aggressive

form of prostate cancer than older ones and thus are most likely to benefit from

early detection (Catalona et al., 1998). For most ages at diagnosis, the odds ratios

steadily increase as longer exposure trajectories are considered i.e as past exposure

observations are taken into account. However, the rate of increase is higher for lower

age at diagnosis. Thus, consideration of past exposure observations in addition to

recent ones results in a significant gain in information about the current disease status

of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios

decrease as longer exposure trajectories are considered. This may imply that for a

subject with very high age at diagnosis, his/her past exposure observations may not

contain significant amounts of information about the present disease status.

As before, we fitted the disease model on the interval I = (-10, -5). The posterior

mean and 95% credible interval of $\theta_0$ and $\theta_1$ are respectively 1.24 (0.29, 2.19) and

-0.015 (-0.029, 0.003) implying that exposure observations recorded 5-10 years prior

to diagnosis also have a significant effect on the current disease status. The posterior

means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the

above conclusion.









started to receive some attention. Park and Kim (2004) were among the first contributors
to this area. They proposed an ordinary logistic model to analyze longitudinal case-control
data but ignored the longitudinal nature of the cohort. They also showed that
ordinary generalized estimating equations (GEE) based on an independent correlation
structure fail in this framework.

2.1.1 Setting

Case-control study designs generally incorporate exposure information for a single
time point in the past. In some situations, however, an entire exposure history may be
available for the cases and controls, containing relevant exposure information collected
at multiple time points in the past. Proper and rigorous statistical methods
for incorporating longitudinally varying exposure information inside the case-control
framework have not yet been adequately developed. This may be due to the obvious
complications in properly handling a longitudinal exposure profile and integrating
it into an existing case-control framework. Once this is done, however, there may be significant
payoffs, notably the ability to learn how the present disease status of a subject is
influenced by his/her past exposure conditions conditional on the current ones. It can
also lead to valuable insights about differences in the exposure patterns between the
cases and controls over a long time span. In analyzing the effect of a longitudinally
varying exposure profile on a binary outcome variable (like disease status), some
of the possible challenges are: (1) the longitudinal exposure observations may be
unbalanced in nature, i.e., the number of observations and also the observation times may
differ from subject to subject; (2) the exposure trajectory may be highly nonlinear; (3)
the exposure observations may be subject to considerable measurement error; and (4)
the effect of the exposure profile on the disease outcome may itself be complex and can
even change over time.

In view of the above challenges, we propose to use functional data analytic
techniques, especially nonparametric regression methodology, to model both the time









The likelihood for the augmented model will be


LA =exp(- Adk hAdk/ndk1)
d=O k=l d=O k=l
K r r K
cx exp (- 6k 1(i +zd)dk HH( ,'/:5). ndk
k=1 dl d=0 k=1

exp k d (+ dldk (6k)E0 d ndk fJ(od)1 kndk dk)n
Sk=1 d=1 k=l d=l d=l k=l
The posterior based on the augmented likelihood will be


n) r K ( ')
, 6, Idn) N LA k = d 1 7r )
(d-1 kl-


(A-8)


r \
Noting that /exp (k ((1 Odldk) (k)2= Oc
Sd )
we have, by integrating out 6 in (A-8),


-ld6k OC (


r K K r Yd0 do
7(7, dn) x n (Oddk) ndk i ddk
dN wk1 k=1 d= 1


S(rd=l 1
Now, integrating out (1, ..., r,) from (A-8), we have


r 0o n
- Z rdrldk
d= 1


(A-9)


( K K r K r '1 n,
T(7q,6 n) oc exp Z j k H(k)E-O nd-1 H( )(dk) ndk f N 6kd
Sk=1 k=l d=lk=l d=l k=l


Next, we make the transformation 6k = Ok and o

the prior distribution in (A-7) becomes


K
6i having jacobian '-1. Hence
= 1




(A-1 0)


( r ( K
d=1 k=1









situation where exposure observations for cases and controls are collected at a single

time point in the past. Some medical studies, however, have suggested that it may be
worthwhile to take into account an entire exposure history, if available, in assessing the
disease-exposure relationship. Case-control studies involving longitudinal exposure
trajectories are a relatively unexplored area. At the same time, the area is a promising one given
the wide variety of longitudinal data analytic tools that are now available. Moreover,
recent developments in the area of semiparametric and nonparametric regression
analysis have added more flexibility in this direction, especially when exposure trajectories
have complicated and unknown functional forms.

In this work, we have applied semiparametric regression techniques to the analysis of
longitudinal case-control studies. We have used penalized regression splines to
model the exposure trajectories for the cases and the controls. Thus our framework
can be used even when exposure observations are collected at different time points
across subjects, i.e., when exposures are unbalanced in nature. The exposure trajectory
is used as the predictor in a prospective logistic model for the binary disease outcome.
We have also modeled the slope parameter of the disease model as a p-spline to
account for any time-varying influence pattern of the exposure trajectory on the current
disease status. In doing so, we have summarized the exposure history for the cases
and controls in a flexible way, which allowed us to consider differential lengths of the
exposure trajectory in analyzing its effect on the current disease status. In order
to simplify the analysis, we used the mixture-of-normals approximation to the logit link (Albert
and Chib, 1993). We showed that the Bayesian equivalence results of Seaman and
Richardson (2004) essentially hold for our framework, thus allowing us to use a
prospective logistic model having fewer nuisance parameters although the dataset was
collected retrospectively. Analyses have been carried out in a hierarchical Bayesian
framework. Parameter estimates and associated credible intervals are obtained using
MCMC samplers. We have applied our methodology to a longitudinal case control








and the 95% credible interval is (0.196, 0.421). Thus, even exposure observations
recorded as far back as 5-10 years prior to diagnosis seem to have a significant influence on
the current disease status of a subject. We formally compare the different models using
the PPL criterion in Section 2.6.3.
2.6.2 Linear Influence Model
We next fitted the model permitting a linear pattern of influence of the exposure
trajectory on the disease outcome. For all trajectory lengths, $\phi_0$ and $\phi_1$ were significant
since the 95% credible intervals excluded 0. To better understand the influence of
differential lengths of exposure trajectories on disease status, we calculated the odds
ratios for different ages at diagnosis and trajectory lengths. Suppose the exposure
trajectory for the ith subject changes from $\{X_i(t + a_i^d), -c < t < 0\}$ to $\{Z_i(t + a_i^d), -c < t < 0\}$.
The corresponding odds ratio of disease is given by

$$\frac{P(D_i = 1 \mid Z_i(t + a_i^d), -c < t < 0)}{P(D_i = 0 \mid Z_i(t + a_i^d), -c < t < 0)} \times \frac{P(D_i = 0 \mid X_i(t + a_i^d), -c < t < 0)}{P(D_i = 1 \mid X_i(t + a_i^d), -c < t < 0)} = \exp\left[\int_{-c}^{0} \{Z_i(t + a_i^d) - X_i(t + a_i^d)\}\,\gamma(t + a_i^d)\,dt\right]. \tag{2-16}$$

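Parameterizing $\{Z_i(t + a_i^d), -c < t < 0\}$ as $\Phi_{p,K}(t + a_i^d)'\eta + \Phi_{q,M}(t + a_i^d)'d_i$, as in (2-2),
we can rewrite (2-16) as

$$\exp\left[(\eta - \beta)' \int_{-c}^{0} \Phi_{p,K}(t + a_i^d)\,\gamma(t + a_i^d)\,dt\right] \times \exp\left[(d_i - b_i)' \int_{-c}^{0} \Phi_{q,M}(t + a_i^d)\,\gamma(t + a_i^d)\,dt\right].$$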

If there is a uniform increase of m in the trajectory, i.e., $Z_i(t + a_i^d) - X_i(t + a_i^d) = m$
for $t \in [-c, 0]$ (this can also be viewed as an upward vertical shift of the trajectory
by m), the above odds ratio simplifies to

$$\exp\left[m \int_{-c}^{0} \gamma(t + a_i^d)\,dt\right] = \exp\left[cm\left(\phi_0 + (a_i^d - c/2)\,\phi_1\right)\right]. \tag{2-17}$$
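As a quick numerical check of the simplification in (2-17), the following sketch evaluates the closed form and compares it with a direct midpoint-rule integration of a linear influence function. It assumes the linear form gamma(s) = phi0 + phi1 * s; the function names and the coefficient values are hypothetical and chosen only for illustration, not taken from the fitted model.

```python
import math


def odds_ratio_linear(phi0, phi1, age_dx, c, m):
    """Closed form of (2-17): exp[c * m * (phi0 + (age_dx - c/2) * phi1)]."""
    return math.exp(c * m * (phi0 + (age_dx - c / 2.0) * phi1))


def odds_ratio_numeric(phi0, phi1, age_dx, c, m, steps=10_000):
    """Midpoint-rule value of exp[m * integral_{-c}^{0} gamma(t + age_dx) dt],
    with the assumed linear influence gamma(s) = phi0 + phi1 * s."""
    h = c / steps
    total = 0.0
    for s in range(steps):
        t = -c + (s + 0.5) * h          # midpoint of the s-th subinterval
        total += (phi0 + phi1 * (t + age_dx)) * h
    return math.exp(m * total)


# Hypothetical coefficients, for illustration only.
phi0, phi1 = 1.2, -0.015
closed = odds_ratio_linear(phi0, phi1, age_dx=60.0, c=5.0, m=0.5)
numeric = odds_ratio_numeric(phi0, phi1, age_dx=60.0, c=5.0, m=0.5)
print(closed, numeric)
```

With a negative phi1, the closed form decreases in the age at diagnosis for fixed c and m, which matches the qualitative pattern described for the odds ratios.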








(i) Assuming w = logO, the posterior density of (w, 4) is


*0 A]
j {exp (w+ 4'c Zj(t)W(t)dt)}
p*,w, 0|1y) N p(O) H 0 -J+^ (2-11)
-1 + exp w+f Z/ (t)W(t)dt
J-
(ii) Assuming 0 = (60, ..., 0) and j = 6j/ 6k, the posterior density of (0, 4) is
k=1


J i1 Oieexp (d Zj(t)W4t))dt)
p(, 0y) N p() H j Yd+ (2-12)
j{1 d=o Y-exp df Zj(At)Wt)dt
j1 \ -c /
(iii) The marginal posterior densities of 4 obtainable from p(w, 0|y) and p(0, 0|y) are
the same.
The proof of the above theorem is similar in nature to those in Seaman and
Richardson (2004) and is given in Appendix A. Since we have considered near
uniform prior for a and our prior on 4 ensures the existence and finiteness of E(O), the
conditions of Theorem 2 are essentially satisfied for our framework.
Based on the above results, it can be concluded that the marginal posterior
distribution of $\phi$, the parameter of interest, will be the same regardless of whether
we fit a prospective or a retrospective model. Thus, we can analyze the PSA data using
the prospective semiparametric modeling framework described above. Bayesian
equivalence can also be shown in the more general multicategory case-control
setup, i.e., when there are multiple (> 2) disease states. We have the following result.
Theorem 3. Let $\{X(t), -c < t < 0\}$ be any exposure trajectory with support
$\{Z_1(t), \dots, Z_K(t), -c < t < 0\}$, the set of all observable exposure trajectories. Let there
be r + 1 disease categories. Suppose $Y_{dk}$ (d = 0, 1, ..., r; k = 1, ..., K) are independently









to diagnosis. In other words, the same exposure observation recorded at the same

time relative to diagnosis for two subjects with widely different age ranges should have

different significance.

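We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using p-splines as follows

$$f(a_{ij}) = \beta_0 + \beta_1 a_{ij} + \dots + \beta_p a_{ij}^p + \sum_{k=1}^{K} \beta_{p+k}\,(a_{ij} - \tau_k)_+^p = \Phi_{p,K}(a_{ij})'\beta,$$

$$g_i(a_{ij}) = b_{i0} + b_{i1} a_{ij} + \dots + b_{iq} a_{ij}^q + \sum_{m=1}^{M} b_{i,q+m}\,(a_{ij} - \kappa_m)_+^q = \Phi_{q,M}(a_{ij})'b_i, \tag{2-2}$$

where $\Phi_{p,K}(a_{ij}) = [1, a_{ij}, \dots, a_{ij}^p, (a_{ij} - \tau_1)_+^p, \dots, (a_{ij} - \tau_K)_+^p]'$ and
$\Phi_{q,M}(a_{ij}) = [1, a_{ij}, \dots, a_{ij}^q, (a_{ij} - \kappa_1)_+^q, \dots, (a_{ij} - \kappa_M)_+^q]'$ are truncated polynomial basis functions of degrees p and q with
knots $(\tau_1, \dots, \tau_K)$ and $(\kappa_1, \dots, \kappa_M)$ respectively (Durban et al., 2004). Generally, M < K.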
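The truncated polynomial basis in (2-2) can be sketched in a few lines. The helper names below are hypothetical, the knot locations and coefficients are made up for illustration, and the sketch assumes a degree of at least 1 (for degree 0 the expression `0.0 ** 0` would evaluate to 1 rather than to an indicator).

```python
def truncated_power_basis(a, degree, knots):
    """Return [1, a, ..., a^degree, (a - k_1)_+^degree, ..., (a - k_K)_+^degree]."""
    poly = [a ** j for j in range(degree + 1)]
    # (a - k)_+ is the positive part: zero to the left of the knot.
    trunc = [max(a - k, 0.0) ** degree for k in knots]
    return poly + trunc


def spline_value(a, coef, degree, knots):
    """Evaluate the spline as the inner product of the basis and its coefficients."""
    basis = truncated_power_basis(a, degree, knots)
    return sum(b * c for b, c in zip(basis, coef))


# Hypothetical linear spline (degree 1) with knots at ages 55 and 65.
basis_at_60 = truncated_power_basis(60.0, 1, [55.0, 65.0])
value_at_60 = spline_value(60.0, [1.0, 0.1, 0.5, 2.0], 1, [55.0, 65.0])
print(basis_at_60, value_at_60)
```

The population curve f and the subject-specific deviation g_i differ only in the degree, the knots and the coefficient vector passed to the same routine.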

Disease Model

The prospective disease model is given by

$$P(D_i = 1 \mid X_i(t + a_i^d), -c < t < 0) = L\left(\alpha + \int_{-c}^{0} X_i(t + a_i^d)\,\gamma(t + a_i^d)\,dt\right) \tag{2-3}$$

where L(.) is the logistic distribution function, $X_i(t + a_i^d)$ is the true, error-free, unobserved
subject-specific exposure profile modeled as $f(t + a_i^d) + g_i(t + a_i^d)$, and $\gamma(t + a_i^d)$ is an
unknown smooth function of age which reflects the time pattern of the effect of the PSA
trajectory on the current disease status for the ith subject. In (2-3), we use the relation
$a_{ij} = t + a_i^d$ to model the exposure trajectory $X_i(.)$ and the influence function $\gamma(.)$ as
functions of time with respect to diagnosis. In doing so, we can easily assess the effect
of the trajectory on the current disease state at any given point before diagnosis for a
particular subject. Here c is the time by which we go back in the past to record the exposure
history for the ith subject; e.g., c = 8 would imply that, for the ith subject, the exposure
observations recorded since eight years prior to diagnosis are being considered for
analysis. Thus, by changing the value of c, the effect of differential lengths of PSA
trajectories on the current disease status can be studied.
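To make the role of c in (2-3) concrete, the sketch below approximates the integral by the midpoint rule and passes the result through the logistic function. The exposure and influence functions are supplied as plain Python callables of age; the particular functions and the intercept used in the example are hypothetical and serve only to show how a longer look-back window changes the disease probability.

```python
import math


def disease_probability(alpha, exposure, influence, age_dx, c, steps=2000):
    """Approximate L(alpha + integral_{-c}^{0} X(t + a) * gamma(t + a) dt).

    `exposure` and `influence` are callables of age (assumed inputs);
    the integral is evaluated with the midpoint rule on `steps` subintervals.
    """
    h = c / steps
    integral = 0.0
    for s in range(steps):
        age = -c + (s + 0.5) * h + age_dx
        integral += exposure(age) * influence(age) * h
    return 1.0 / (1.0 + math.exp(-(alpha + integral)))


# Hypothetical flat exposure and constant influence, for illustration only.
p_short = disease_probability(0.0, lambda a: 1.0, lambda a: 0.1, age_dx=60.0, c=3.0)
p_long = disease_probability(0.0, lambda a: 1.0, lambda a: 0.1, age_dx=60.0, c=8.0)
print(p_short, p_long)
```

With a positive exposure and a positive influence function, lengthening the window from c = 3 to c = 8 accumulates more of the exposure history and raises the fitted probability.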









Fay, R., Nelson, C., and Litow, L. (1993). Estimation of median income of four-person
families by state, in Statistical Policy Working Paper 21, Indirect Estimators in Federal
Programs.

Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics
19, 1-141.

Gelfand, A. and Ghosh, S. (1998). Model choice : A minimum posterior predictive loss
approach. Biometrika 85, 1-11.

Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal
densities. Journal of the American Statistical Association 85, 398-409.

Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple
sequences (with discussion). Statistical Science 7, 457-511.

Ghosh, M. and Chen, M.-H. (2002). Bayesian inference for matched case control
studies. Sankhya, B 64, 107-127.

Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person
families : A Bayesian time series approach. Journal of the American Statistical
Association 91, 1423-1431.

Ghosh, M. and Rao, J. N. K. (1994). Small area estimation : An appraisal. Statistical
Science 9, 55-76.

Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating
equations. Biometrika 63, 277-284.

Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian
model determination. Biometrika 82, 711-732.

Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear
models : a roughness penalty approach. Chapman and Hall/CRC.

Gustafson, P., Le, N., and Valle, M. (2002). A Bayesian approach to case-control studies
with errors in covariables. Biostatistics 3, 229-243.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics : The
approach based on influence functions. Wiley.

Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting.
Australian and New Zealand Journal of Statistics 2, 205-224.

Heagerty, P. (1999). Marginally specified logistic normal models for longitudinal binary
data. Biometrics 55, 688-698.

Heagerty, P. (2002). Marginalized transition models and likelihood inference for
longitudinal categorical data. Biometrics 58, 342-351.









study dealing with the association between prostate specific antigen (PSA) and prostate

cancer.

We analyzed our model using differential lengths of exposure trajectories. In
doing so, we have concluded that past exposure observations do provide significant
information towards predicting the current disease status of a subject. Specifically,
we have shown that across all age-at-diagnosis groups, the odds of disease steadily
increase as past exposure observations are taken into account in addition to the recent
ones. We also observed that for a fixed trajectory length, the odds of disease steadily
decrease as the age at diagnosis increases, corroborating the medical fact that younger
subjects tend to have a more aggressive form of prostate cancer and thus are most likely
to benefit from early detection. We performed model comparison using posterior
predictive loss (Gelfand and Ghosh, 1998). This criterion indicated that models with
longer exposure trajectories tend to perform better than those with shorter trajectories.
Lastly, model assessment was performed on the optimal model using the kappa statistic
and case deletion diagnostics. Both of these tools suggested that our model fits the data
relatively well.
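For readers unfamiliar with the kappa statistic, the following sketch computes Cohen's kappa for a binary outcome, i.e., the agreement between observed and predicted disease status corrected for chance agreement. This is the standard textbook definition; how predicted statuses were formed from the posterior in the analysis is not shown here, so the inputs below are hypothetical.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for two binary (0/1) sequences of equal length."""
    n = len(observed)
    # Observed proportion of agreement.
    p_o = sum(o == p for o, p in zip(observed, predicted)) / n
    # Agreement expected by chance from the two marginal rates.
    rate_obs = sum(observed) / n
    rate_pred = sum(predicted) / n
    p_e = rate_obs * rate_pred + (1 - rate_obs) * (1 - rate_pred)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical disease indicators and model-based classifications.
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(observed, predicted)
print(kappa)
```

In a Bayesian fit, one natural use (an assumption about workflow, not a statement of the procedure used here) is to evaluate this function once per MCMC draw of the predicted statuses, which yields a posterior distribution for kappa.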

Several interesting extensions of our setup are possible. For richer datasets, it will

be interesting to model the subject specific deviation functions as p-splines. In addition,

we have only assumed constant and linear parameterizations of the influence function

of the prospective disease model. For a larger data set, a p-spline formulation can

also be used for the influence function which may bring out any underlying non-linear

pattern of influence of the exposure trajectory on the current disease status. Although

we have used a binary disease outcome, it will be interesting to extend our framework

to accommodate multi-category disease states. Our modeling framework can also be

generalized by incorporating a larger class of nonparametric distributional structures

(like Dirichlet processes or Polya trees) for the subject specific random effects.









yields the following well known mixed effects model representation:

$$y = X\beta + Z\gamma + \epsilon \tag{1-17}$$

where $\mathrm{Cov}(\epsilon) = \sigma_\epsilon^2 I$ and $\gamma$ and $\epsilon$ are independent.

Bayesian P-splines have recently become popular because they combine the
flexibility of non-parametric models and the exact inference provided by the Bayesian
inferential procedure. This is all the more so because of the seamless fusion of
penalized splines into the mixed model framework (Wand, 2003) as shown above. This
equivalence also carries over to the manner in which smoothing is done. Smoothing can
be achieved by imposing penalties on the spline coefficients $\gamma$, as shown in (1-14), or
by assuming a distributional form for $\gamma$, for example $\gamma \sim N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian
context, priors are placed on $\sigma_\gamma^2$ and the other parameters and the usual posterior sampling
is carried out. Since samples of the smoothing parameter are generated alongside
the other parameters, this method is also known as automatic scatterplot smoothing.

In all the problems tackled in this dissertation, we will be using Bayesian inferential

procedures on penalized splines as shown above.
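The penalized least squares criterion behind this representation has a closed-form minimizer, which the sketch below computes with NumPy for a fixed smoothing parameter. It is a frequentist illustration under stated assumptions: a truncated polynomial basis, the penalty applied only to the knot coefficients (identity penalty block), and a user-chosen lambda. The function name and the data are hypothetical, and the Bayesian approach described above would instead place priors on the variance components and sample them.

```python
import numpy as np


def fit_penalized_spline(x, y, degree, knots, lam):
    """Minimize ||y - X beta - Z gamma||^2 + lam * ||gamma||^2.

    X holds the polynomial columns 1, x, ..., x^degree; Z holds the
    truncated power columns (x - knot)_+^degree. Only gamma is penalized.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    C = np.hstack([X, Z])
    # Penalty matrix D: zero block for beta, identity block for gamma.
    D = np.diag([0.0] * (degree + 1) + [1.0] * len(knots))
    theta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return theta, C @ theta


# Hypothetical noise-free linear data: the unpenalized polynomial part
# should capture it exactly, leaving the knot coefficients at zero.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x
theta, fitted = fit_penalized_spline(x, y, degree=1, knots=[2.5, 5.0, 7.5], lam=10.0)
print(np.round(theta, 6))
```

Under the mixed model view, lam plays the role of the ratio of the error variance to the random effects variance, so a larger lam corresponds to a smaller variance for the knot coefficients and a smoother fit.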








respectively) yield the same profile likelihood of $\phi$. Thus, inferences about the parameter
of interest $\phi$ can be obtained using the prospective likelihood, which has fewer nuisance
parameters than the retrospective one.
Proof of Theorem 2.(i) The posterior density of (0, 6, 4) is


(A-4)


J 1 J
p(0'6, 6,y) o p(4)fj 6 i-1 If (AdJ)~exp(-A,)
j 1 d Oj 1
Replacing the expression of Adj from (2-10), we have


p(O, 6,41y) oc


x


P() {Oexp Zjy(t)M (t)dt)} y+a-1
j-1
exp ([I +exp (1 Z' Z,(t)(t))dt)) 6


Integrating out 6, from the above expression, we have


p(0, |y) oc o ) F(y+j + aj) exp Z Wt)
j1 1 + exp(0/ Z (t)x(t) dt) d j

Now, performing the transformation from 0 to w yields expression (2-11).
J
(ii) First, we perform the transformation from 6 to (0, b), where = yj. Thus,
j=1
6j = Ojy, j = 1,...,J. The jacobian of transformation will be J-1.
Using this transformation in (A-4) and after some manipulation, we have


J
p(iO, 0, 4,\ly) oc p(a)Y+++-1-yI'+-1I ojY


a'-1exp ('YJ ZJ(t)W(t)dt
j 1 -c


x exp[


(A-5)




j= 1


,'exp Z (t) (t)dt)
\ -c /









Here the area specific effects $v_i$ are assumed to be independently and identically
distributed (i.i.d.) with mean 0 and constant variance $\sigma_v^2$, $e_{ij} = k_{ij}\tilde{e}_{ij}$ where $k_{ij}$ is known, and the
$\tilde{e}_{ij}$'s are i.i.d. random variables independent of the $v_i$'s with mean 0 and constant variance
$\sigma_e^2$. Often normality of the $v_i$'s and $e_{ij}$'s is assumed. For these models, the parameters of
interest are the small area means $\bar{Y}_i$ or the totals $Y_i$. Battese et al. (1988) studied the
nested error regression model (1-10) in estimating the area under corn and soybeans
for counties in North-Central Iowa using sample survey data and satellite information. In
doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the
small area means.

Over the years, numerous extensions have been proposed for the above modeling
frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial
models and models with more complicated random-effects structures. Rao (2003)
presented a nice overview of the different estimation methods while Jiang and Lahiri
(2006) reviewed the development of mixed model estimation in the small area context.
A proper review of model based small area estimation would be incomplete without
explaining the EBLUP, EB and HB approaches that are widely used in this context.
As shown above, small area models are special cases of general linear mixed models
involving fixed and random effects such that small area parameters can be expressed
as linear combinations of these effects. Henderson (1950) derived the BLUP estimators
of small area parameters in the classical frequentist framework. These are so called
because they minimize the mean squared error among the class of linear unbiased
estimators and do not depend on normality. In this sense, they are similar to the best linear
unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper
account of the between area variation relative to the precision of the direct estimator.
An EBLUP estimator is obtained by replacing the unknown parameters with asymptotically
consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and
some applications. In an EB approach, the posterior distribution of the parameters of









Table 2-3. Posterior predictive losses (PPL) for the constant and linear
influence models for varying exposure trajectories and knots

Knots  Model     (-2,0)  (-5,0)  (-8,0)  (-10,0)  (-10,-5)
0      Constant  47.54   47.02   47.20   47.65    47.81
       Linear    43.61   43.32   43.17   43.33    43.82
1      Constant  46.61   46.65   46.77   46.57    45.29
       Linear    42.80   42.83   42.91   42.90    42.94
2      Constant  45.83   45.50   45.72   46.32    44.69
       Linear    43.20   43.01   42.74   42.66    43.33
3      Constant  45.47   45.23   45.24   44.82    45.17
       Linear    43.47   43.05   42.72   42.73    43.43
4      Constant  45.35   45.67   45.27   45.31    45.54
       Linear    43.70   43.13   42.56   42.61    43.47
5      Constant  46.67   46.06   45.42   45.75    46.01
       Linear    43.91   43.20   43.12   42.93    43.48


reached earlier. For the linear setup, the model with exposure trajectory I = (-8, 0)
and 4 knots performs the best (it has the lowest PPL criterion among all the models
considered).
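The posterior predictive loss criterion of Gelfand and Ghosh (1998) can be sketched as follows under squared error loss: a penalty term (the summed predictive variances) plus a weighted goodness-of-fit term (the summed squared distances between predictive means and observations). The squared error form and the weight k/(k + 1) follow the general criterion; whether the analysis above used this loss or another one suited to binary outcomes is not shown here, so treat the function and the replicated draws below as illustrative assumptions.

```python
import math


def posterior_predictive_loss(y_obs, y_rep_draws, k=float("inf")):
    """PPL = P + k/(k+1) * G under squared error loss.

    y_rep_draws[s][i] is the s-th posterior predictive draw for observation i.
    P sums the predictive variances; G sums (predictive mean - observed)^2.
    """
    n_draws = len(y_rep_draws)
    penalty = 0.0
    goodness = 0.0
    for i, y in enumerate(y_obs):
        draws = [draw[i] for draw in y_rep_draws]
        mean = sum(draws) / n_draws
        penalty += sum((v - mean) ** 2 for v in draws) / n_draws
        goodness += (mean - y) ** 2
    weight = 1.0 if math.isinf(k) else k / (k + 1.0)
    return penalty + weight * goodness


# Hypothetical binary outcomes and four replicated data sets.
y_obs = [1, 0]
y_rep_draws = [[1, 0], [1, 1], [0, 0], [1, 0]]
ppl_inf = posterior_predictive_loss(y_obs, y_rep_draws)
ppl_k1 = posterior_predictive_loss(y_obs, y_rep_draws, k=1.0)
print(ppl_inf, ppl_k1)
```

Lower values indicate a better trade-off between fit and predictive variability, which is how the entries of a table such as Table 2-3 are compared.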

2.6.4 Model Assessment

As mentioned before, the number of knots and the length of the exposure trajectory tend
to interact in influencing the fit of the constant and linear influence models. Thus, for
a fixed trajectory length, the optimal model can be selected as the one with the lowest
value of the PPL criterion across all the knot choices. For the linear influence model, the
lowest PPL value was recorded for I = (-8, 0) and 4 knots. So, we perform our model
assessment procedure on this model.

For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval
(0.535, 0.680), which indicates substantial agreement beyond what is expected by
chance. We next performed case deletion analysis. We deleted each subject (with all
of his/her observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows
the case deleted posterior means and 95% credible intervals for $\beta_1$, $\phi_0$ and $\phi_1$. (In








disease status of a subject. In the spirit of our dataset, we assume that the exposure
observations are continuous. Let the exposure profile for the ith subject be
$X_i(t) = \{X_{i1}, \dots, X_{in_i}\}$, $i = 1, \dots, N$; $-c < t < 0$, where $X_{ij}$ is the jth exposure observation
recorded for the ith subject. Let $X = \{X_{11}, \dots, X_{1n_1}, \dots, X_{N1}, \dots, X_{Nn_N}\}$ be the set of
all exposure observations. Since an exposure trajectory is composed of a finite set
of exposure observations, the discretizing mechanism proposed by Rubin (1981)
and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e.,
$\{X_i(t), -c < t < 0\}$ can be assumed to be a discrete random variable with support
$\{Z_1(t), \dots, Z_J(t), -c < t < 0\}$, the set of all observable exposure trajectories,
where $\{Z_j(t), -c < t < 0, j = 1, \dots, J\}$ is a finite collection of elements in the
support of the $X_i$'s. Let $Y_{0j}$ and $Y_{1j}$ be the number of controls and cases having
exposure profile $\{Z_j(t), -c < t < 0\}$. We denote the "null" or "baseline" trajectory
as $\{X(t) = 0, -c < t < 0\}$.
The odds ratio of disease corresponding to $\{Z_j(t), -c < t < 0\}$ with respect to
baseline exposure is $\exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)$. Assuming that a control has exposure
profile $\{Z_j(t), -c < t < 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that

$$P(X(t) = Z_j(t), -c < t < 0 \mid D = 1) = \frac{\delta_j \exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(\int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)}.$$

Thus, the retrospective likelihood is

$$L(\delta, \phi) = \prod_{d=0}^{1} \prod_{j=1}^{J} \left[\frac{\delta_j \exp\left(d \int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(d \int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)}\right]^{Y_{dj}}. \tag{2-7}$$
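The structure of the retrospective likelihood can be illustrated numerically. In the sketch below each observable trajectory Z_j is summarized by a single score s_j, standing for the integral of Z_j(t) times the influence function over (-c, 0); the exposure distribution given disease status d is then proportional to delta_j * exp(d * s_j). The function names, the delta weights, the scores and the counts are all hypothetical.

```python
import math


def exposure_probabilities(delta, scores, d):
    """P(trajectory j | disease status d), proportional to delta_j * exp(d * s_j)."""
    weights = [dj * math.exp(d * sj) for dj, sj in zip(delta, scores)]
    total = sum(weights)
    return [w / total for w in weights]


def retrospective_log_likelihood(delta, scores, counts):
    """Log of the retrospective likelihood; counts[d][j] is the number of
    subjects with disease status d (0 = control, 1 = case) and trajectory j."""
    loglik = 0.0
    for d in (0, 1):
        probs = exposure_probabilities(delta, scores, d)
        loglik += sum(y * math.log(p) for y, p in zip(counts[d], probs))
    return loglik


# Hypothetical example with three observable trajectories.
delta = [1.0, 1.0, 2.0]
scores = [0.0, math.log(2.0), 0.0]   # trajectory 2 doubles the odds of disease
p_control = exposure_probabilities(delta, scores, 0)
p_case = exposure_probabilities(delta, scores, 1)
loglik = retrospective_log_likelihood(delta, scores, [[5, 5, 10], [2, 4, 4]])
print(p_control, p_case, loglik)
```

Comparing trajectory 2 with trajectory 1 across cases and controls recovers the odds ratio exp(s_2 - s_1) = 2, which is the quantity the prospective model targets as well.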









APPENDIX C
FULL CONDITIONAL DISTRIBUTIONS

C.1 Semiparametric Case Control Model

The full conditional distributions of the parameters for the semiparametric case control
model are as follows :

1. [/pa, A, b, A,, a,, Y, D, a] ~ N(M3, V) where
/j + N n N -1
V- ( YZZ Pp,(a,)cp, (a +.)' I AIM 'M:) ,
e l j1 =i=1
N n, N
M3 = Vp ( Y ,,(au)(yd q(aP)b,) + AMc(Zi a bQ),
Je j=1 i= 1
and Zp is the p + K + 1 order prior variance-covariance matrix of 3.

2. [Zia, 4, bi, A,, Di] ~ N(a + O'MI + b'Qi, A, ) truncated at the left (right) by 0 if
Di = 1(Di = 0), i = 1,..., N.

3. [bil|, a, ,, A, ao, ao, Y, D, a] N(Mb, b)( 1,..., N) where

v = (zb + a (q,(a,).q,(a,)'+ AQ,/')Q ,
e =1

Mb = vb V q,,(a,)(y. _.(a)') + Ai (Z a 'Mi) ,

and Zb is the q + M + 1 order variance-covariance matrix of b.

4. [0*| b, A, ao, Y,D,a] ~ N(Ma,, V,) where* = (a, 4)',
/ N -1
V. = + (1, 0'M + b ,Q,)A,(1, 0'M, + b/Q,) ,
i 1
N
Mo* = A i (1, 'Mi + yQj)/Zi

and Z,* is the r + K* + 2 order variance-covariance matrix of (a, 0).

5. [Ail, a, b,Y, D,a] ~ G( +, v + (Z, a -'M -bQ ) where


6. [(r)-11, a, b, Y, D, a] G( 1, b) j= 0 q.
(i 12











March 2000 CPS. Bayesian techniques are used to weigh the contributions of the CPS

median income estimates and the regression predictions of the median income based

on their relative precision. The standard deviations of the error terms are estimated

by fitting a model to the estimates of sampling error covariance matrices of the CPS

median household income estimates for several years. The mean function in this model

is referred to as a "generalized variance function" (Bell, 1999). Noninformative prior

distributions are placed on the regression parameter corresponding to the IRS median

income since it was found to be statistically significant even in the presence of census

data, both in the 1989 and 1999 models.

3.1.2 Related Research

Estimation of median income for small areas contributes to the policy making

process of many Federal and State agencies. Before the establishment of the SAIPE

program, the estimation of median income for four-person families was of general

interest. The Census Bureau used the ideas suggested by Fay (1987) in this regard.

Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay

et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and

also put forward univariate and multivariate hierarchical Bayes (HB) models. The

estimates from their EB and HB procedures significantly improved over the CPS median

income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the

state-specific CPS median income estimates and proposed a Bayesian time series

modeling framework to estimate the statewide median income of four-person families

for 1989. In doing so, they used a time specific random component and modeled it as a
random walk. They concluded that the bivariate time series model utilizing the median
incomes of four and five person families performs the best and produces estimates
which are far superior to both the CPS and Census Bureau estimates. The
time series model consistently performed better than its non-time series counterpart.









Instead of the proportional odds model, we can also assume a proportional hazards
model, i.e.,

$$\log\left[-\log\{1 - P(S_i \le k \mid D_i)\}\right] = \lambda_{0k} + \lambda_1 D_i, \quad k = 1, \dots, m - 1.$$

The other option would be to assume an ordinal probit formulation for the probabilities of
the latent classes, given by

$$\Phi^{-1}\{P(S_i \le k \mid D_i)\} = \lambda_{0k} + \lambda_1 D_i, \quad k = 1, \dots, m - 1.$$


The predicted probabilities obtained from the ordinal probit model are similar to those
obtained from the proportional odds model. Moreover, an advantage of the former
model is that sampling from its posterior distribution is particularly efficient. For this
reason, the ordinal probit model is sometimes preferred when a Bayesian analysis is to
be performed.
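The three link choices discussed above can be compared numerically. The sketch below assumes the cumulative form P(S_i <= k | D_i) = F(lambda_0k + lambda_1 * D_i), with F the logistic distribution function (proportional odds), the complementary log-log inverse (proportional hazards) or the standard normal distribution function (ordinal probit), and converts the cumulative probabilities into class probabilities. The cutpoints, the slope and the drop-out time are hypothetical values, and the cutpoints are assumed to be increasing.

```python
import math


def cumulative_prob(eta, link):
    """Inverse link F(eta) for the cumulative class probability."""
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(eta))
    if link == "probit":
        return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
    raise ValueError("unknown link: " + link)


def class_probabilities(cutpoints, slope, dropout_time, link):
    """Class probabilities for m = len(cutpoints) + 1 ordered latent classes."""
    cum = [cumulative_prob(c + slope * dropout_time, link) for c in cutpoints]
    cum.append(1.0)  # P(S <= m) = 1
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


# Hypothetical cutpoints and slope, for illustration only.
cutpoints, slope, dropout_time = [-1.0, 0.0, 1.0], 0.2, 2.0
probs = {link: class_probabilities(cutpoints, slope, dropout_time, link)
         for link in ("logit", "cloglog", "probit")}
for link, p in probs.items():
    print(link, [round(v, 3) for v in p])
```

A positive slope shifts mass toward the lower classes as the drop-out time grows, under any of the three links; the links differ mainly in the tails.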
Lastly, the drop-out times $D_i$ are assumed to follow a multinomial distribution with

mass at each possible drop-out time, parameterized by p. Here we make the important
assumption that $Y_i$ is independent of $D_i$ given $S_i$. Our main targets of inference are the

covariate effects averaged over the classes i.e PM averaged over M. The intercept Ai, in

(5-6) is determined by the following relationship between the marginal and conditional
models

E(Ytl|) = Z p(SilDi)P(Di) J {E(Y |tyt-_ 1.... Yt-p, bi, S)p(yt_1,.... yt-plb, S)}
D S A
x p(bilSi)dbi

where A = {it_, ..., Yt-p}.
5.2.3 Likelihood, Priors and Posteriors

Let, the set of all parameters be denoted by w = (3, a, a ,..., o-, 6). We
partition the complete response data for subject i, Yf into observed (values of Yf prior

to dropout) components, denoted by Yi and missing (response observations after









Using the above transformation, (A-10) can be rewritten as



d= Ok=1 d= lk=1
r K k= ndkK

z [exp(_ )(5)Z H ) 'o1 ( i (]fio" K no(- 1) ( Hifio )k
Sr K r
kd 7Onk 1 \x nd
d=1 k=) I k=1 d=1k=
r K k n ndkK




r ld=1 \k=l \1




d LR ( 1 71(k ) (A-13)
Integrating out (/ from (A-11), we have










From (A-9) and (A-12), it is clear that posterior inference for the parameter
of interest, $\eta$, remains the same under either the prospective likelihood $L_P$ or the
retrospective likelihood $L_R$ as long as the posterior is proper. It can be shown that
the posterior will be proper for any proper prior for $\eta$ if $n_{0k} > 1$ for all $k = 1, \dots, K$.











an underlying non-linear relationship with the CPS median income (Figure 3-2A), and so

it is more suited to a semiparametric analysis.

3.4.1 Comparison Measures and Knot Specification

Our dataset originally contained the median household income of all the states of
the U.S. and the District of Columbia for the years 1995-2004. However, we only used
the information for the five year period 1995-1999 since our target of inference is the
set of state specific median household incomes for 1999. We evaluated the performance of
our estimates by comparing them to the corresponding census figures for 1999. This
is because, in small area estimation problems, the census estimates are often treated
as the "gold standard" against which all other estimates are compared. However, such a
comparison is only possible for those years which immediately precede a census year,
e.g., 1969, 1979, 1989 and 1999.

In order to check the performance of our estimates, we use four comparison
measures. These were originally recommended by the panel on small area estimates of
population and income set up by the Committee on National Statistics in July 1978 and
are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|}{c_i}$

* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|^2}{c_i^2}$

* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$

* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

Here $c_i$ and $e_i$ respectively denote the census and model based estimates of median
household income for the ith state (i = 1, ..., 51). Clearly, lower values of these measures
would imply a better model based estimate.
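The four comparison measures are simple to compute once the census figures and the model based estimates are available as two sequences. The sketch below follows the definitions above, with the number of areas taken from the input length (51 in the application); the function name and the two-area example values are hypothetical.

```python
def comparison_measures(census, estimates):
    """Return ARB, ASRB, AAB and ASD for paired census and model estimates."""
    n = len(census)
    diffs = [c - e for c, e in zip(census, estimates)]
    return {
        "ARB": sum(abs(d) / c for d, c in zip(diffs, census)) / n,
        "ASRB": sum((d / c) ** 2 for d, c in zip(diffs, census)) / n,
        "AAB": sum(abs(d) for d in diffs) / n,
        "ASD": sum(d ** 2 for d in diffs) / n,
    }


# Hypothetical census values and model estimates for two areas.
measures = comparison_measures([100.0, 200.0], [110.0, 190.0])
print(measures)
```

ARB and ASRB are scale free, so they are comparable across states with very different income levels, whereas AAB and ASD are in the units (and squared units) of income.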

The basic structure of our models would remain the same as in Section 3.2.2.

We have used truncated polynomial basis for the P-spline component in both the

models. Since Fig 2a doesn't indicate a high degree of non-linearity, we have restricted









a penalty function as shown in (1-14). A major difference between smoothing splines
and penalized splines is that, in the former, all the unique data points are used as knots,
whereas in the latter the number of knots is much smaller, resulting in more flexibility. In fact,
penalized splines can be seen as a generalization of regression and smoothing splines.
The wide applicability of penalized splines in diverse settings is mainly due to their
correspondence with linear mixed effects models. In fact, penalized splines can be
shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To
see this, we rewrite (1-14) as
$$S = \sum_{i=1}^{n} \{y_i - f(x_i; \beta, \gamma)\}^2 + \lambda\,\theta' D\,\theta \tag{1-15}$$

where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \dots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \dots, \gamma_K)'$ and D is a known positive
semi-definite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1} \end{pmatrix}.$$

Different types of penalties can be accommodated by specifying different forms of D.
For example, the penalty $\int \{f^{(2)}(x; \beta, \gamma)\}^2 dx$ used for smoothing splines can be achieved
with D being the sample second moment matrix of the second derivatives of the spline
basis functions. However, the above form of D only penalizes the spline coefficients
$(\gamma_1, \dots, \gamma_K)$. Specifically, the penalty in (1-14) corresponds to setting $\Sigma = I$.
Let X be the matrix with ith row $X_i = (1, x_i, \dots, x_i^p)$ and Z be the matrix with
ith row $Z_i = \{(x_i - \tau_1)_+^p, \dots, (x_i - \tau_K)_+^p\}$. Using this formulation in (1-15) with the basis
function in (1-12) and dividing by the error variance $\sigma_\epsilon^2$, we have

$$\frac{1}{\sigma_\epsilon^2} S = \frac{1}{\sigma_\epsilon^2}\,\|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_\epsilon^2}\,\|\gamma\|^2. \tag{1-16}$$

By assuming that $\gamma$ is a vector of random effects with $\mathrm{Cov}(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_\epsilon^2/\lambda$,
while $\beta$ is the set of fixed effects parameters, the above penalized spline framework








APPENDIX B
PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS
B.1 Univariate Small Area Model
The proof of posterior propriety for the basic univariate semiparametric model
(Model I) is outlined below. The necessary changes to the proof for the random walk
model are mentioned at the end.

Proof of Theorem : The basic parameter space is Q = (0, 0, 7, b, o-, o {,...,)
where 0 = (0'i,..., 0')' and b = (bl,..., b,)'. Let

S= ... / p(|Y, X, Z)dQ

= ... {L(Yi j i)L(0i j,-7, bi, d Xi, Zi)L(bi b )I L(, i7 (0)7(07b) ( 7) 7 ( j)df
i=l j=1
(B-1)

We have to show that $I < M$ where $M$ is some finite positive constant.
Integrating first with respect to $\beta$, we have

/ = w(/3) [L(O /3, b ,2, Xi, Zi)d/

= exp[- (Oi- Xi/3 Z -- bil)'W- (Oi- Xi3 Zi7 bil)] d
i

= X we Xi exp W'-Wi +
(B-2)

where Q = wlx ) xl 1X ) X(Z Xl -~ W Wi = 0 Z,7 b1l
and V-1 = diag(b-2, .b-2 .. 2).
Now, W -1'Wi, = W '-1/2-1/2Wi = S'S, where Si = X-1/2W,. Similarly,
W/ -1Xi = S'Ti, X'V-JWi = TtSi and X'V- Xi = T'T, where T, = X-1/2X,.











7. [(,7a)-11,, ,b,Y,D,a] G nI
i= 1
q,,(ay)'bi) .


N ni
1, (yui p,,(aij)'
i= 1 j=1


8. [(,72)-110, a, (K + 1 K )
8. [(V )- |4, b, Y, D, a] G 2-+1, Yr *
k=1
9_ 1M N q+M 2
9. [()- bY, D,a]G -+1,+ Y b .




i=1 j= q1
K'
10. [(r )-ll|, ,0, bY, D,a] G +-1,2-Y
k=1
Here, G(x, y) denotes a Gamma density with shape parameter x and rate parameter y.
C.2 Semiparametric Small Area Models
C.2.1 Semiparametric Univariate Small Area Model

The full conditional distributions of the parameters for the univariate semiparametric
small area model are as follows :

1. [O, 3 2,7 2,b, X, Z] ~ N(Mb, V) where

--( --1 1a 1 1 Y (X -- Z -- hi b )
V = + and Mo + 6 6

2. [bi /3, b, N(, 2a, X, Z] N(Mb, Vb) where

+ = and M = (0 X'..- Z' .) .
b j= 1 j= 1 j= 1

3. [3|7, 0, b, b2, X, Z] ~ N(M3, V3) where

V,3= ( and M,3= (- (m X-.))
i 1= 1 i=i 1 = 1 i= 1 j= 1

4. [y|/, b, b,2, 2 X, Z] ~ N(M7, V.) where

V= --+'/) and M. = -( ,Z 11 d-).











where Q = ( WId'X X ( XyWX. 'x ( XyJ--i VW, and W, = O, Z -
ij ij j
bi.
As before, the expression within the exponent in (B-8) can be rewritten as

K* = -\ ( STS-( U) (5 TTU) (5T/Su)
iJ j j i,j
S S' [I- T(T'T)-T']S < 0.
2

where S = (S' ..., S')', T = (T ,..., T')', S, = V1/2W, and T, = V/2X,.
Thus,

exp- W.' W -1 + < 1 (B-9)
/,J
So, in order to prove posterior propriety, we have to show
./ /. t (m d r 1) [ V trace J /)-1-l
/ = ... I| Xolx. -1/2 1 f'- 1 2 exp -trace 2 d 1...d 1
ij j=j 1
< oo (B-10)

Here r is the order of j,j = 1, 2,..., t. (r = 2 in our case).
Let A1, Aj2,..., Ajr be the distinct eigen values of .l,j = 1, 2,..., t. Since ,j is a
variance covariance matrix, it is positive definite and symmetric. Hence, W-1 also has
the same properties. Thus, Ajk > 0, Vk = 1, 2,... r.
Now, Vj = 1,2, ..., r,

1JX 1 > Aj- YIr where Aj'" = min(A, A2..., Ar).

y XuW JfX' > Y A7in X isrXje

yx,,J u l'X. > Am/in5X .X where Am1n = min(A'.

SX,,-,'X Amin Y XX is non-negative definite.
ij ij









step function, a spline of degree 1 is a piecewise linear function and so on. For example, $f(\cdot)$ can be represented as a linear combination of a $p$th degree truncated polynomial basis having $K$ knots, given by

\[
1, x, \ldots, x^p, (x - \tau_1)_+^p, \ldots, (x - \tau_K)_+^p. \tag{1-12}
\]

Here $(x - \tau_k)_+^p$ is the function $(x - \tau_k)^p I_{\{x > \tau_k\}}$. Using the above basis, a spline of degree $p$ can be expressed as

\[
f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x - \tau_k)_+^p \tag{1-13}
\]

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. Setting $p = 1, 2, 3$ corresponds to a linear, quadratic or cubic spline respectively. The above basis is one of the most commonly used, although other bases like radial bases or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.
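The truncated polynomial basis in (1-12) and the spline in (1-13) can be sketched in a few lines. This is an illustrative sketch only: the function names, the grid of `x` values and the choice of knots at sample quantiles are assumptions made for the example.

```python
import numpy as np

def truncated_power_basis(x, knots, p):
    """Design matrix with columns 1, x, ..., x^p, (x - tau_1)_+^p, ..., (x - tau_K)_+^p."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, N=p + 1, increasing=True)
    diff = x[:, None] - knots[None, :]
    # (x - tau)_+^p = (x - tau)^p when x > tau, and 0 otherwise
    trunc = np.where(diff > 0, diff ** p, 0.0)
    return np.hstack([poly, trunc])

def spline_value(x, beta, gamma, knots):
    """Evaluate f(x | beta, gamma) as in (1-13); the degree is len(beta) - 1."""
    p = len(beta) - 1
    basis = truncated_power_basis(x, knots, p)
    return basis @ np.concatenate([beta, gamma])

x = np.linspace(0, 10, 101)
knots = np.quantile(x, [0.25, 0.5, 0.75])   # illustrative knot locations
B = truncated_power_basis(x, knots, p=2)
print(B.shape)                              # one column per basis function

# A linear spline with a single knot at 5: the slope changes from 2 to 5 past the knot
vals = spline_value(np.array([4.0, 6.0]), np.array([1.0, 2.0]),
                    np.array([3.0]), np.array([5.0]))
print(vals)
```

Writing the truncated part with an explicit indicator keeps the degree 0 case correct, where each truncated term reduces to a step function at its knot.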

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as "sensors" in relaying information about the underlying "true" observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting through overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be placed uniformly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of x and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert, 2002). Recently, there have been interesting contributions on knot









four, three and five person families for the $i$th state and the $j$th year. $Y_{iju}$ is assumed to estimate the true unknown median income $\theta_{iju}$ $(u = 1, 2, 3)$. The corresponding adjusted census medians are denoted by $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$. The years correspond to 1979,...,1989.

For the univariate setup, the response and covariates are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a duplet with first component $Y_{ij1}$ and second component either $Y_{ij2}$, $Y_{ij3}$ or $0.75Y_{ij2} + 0.25Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our targets of inference are the state specific median incomes of four person families for 1989.

4.4.1 Comparison Measures and Knot Specification

In this study, our target of inference is the state specific median income corresponding

to four-person families for the year 1989. We judged our estimates by comparing those

to the corresponding census figures for 1989. In small area estimation problems, the

census estimates are often treated as "gold standard" against which all other estimates

are compared. However, such a comparison is only possible for those years which immediately precede the census year, i.e., 1969, 1979, 1989 and 1999.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

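* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} \dfrac{|c_i - e_i|}{c_i}$

* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} \dfrac{|c_i - e_i|^2}{c_i^2}$

* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$

* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$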
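The four measures are straightforward to compute once the census figures and the model estimates are available as arrays. The sketch below assumes that $c_i$ denotes the census figure and $e_i$ the corresponding estimate for the $i$th state; the function name and the two-area toy values are hypothetical, and the average is taken over however many areas are supplied rather than being fixed at 51.

```python
import numpy as np

def comparison_measures(census, estimate):
    """Return ARB, ASRB, AAB and ASD for census figures c_i and estimates e_i."""
    c = np.asarray(census, dtype=float)
    e = np.asarray(estimate, dtype=float)
    diff = c - e
    return {
        "ARB": np.mean(np.abs(diff) / c),    # average relative bias
        "ASRB": np.mean(diff ** 2 / c ** 2), # average squared relative bias
        "AAB": np.mean(np.abs(diff)),        # average absolute bias
        "ASD": np.mean(diff ** 2),           # average squared deviation
    }

# Toy values for two areas (not data from the study)
measures = comparison_measures([100.0, 200.0], [90.0, 220.0])
for name, value in measures.items():
    print(name, value)
```

Smaller values indicate estimates that are closer to the census figures on all four measures; ARB and ASRB are scale free, whereas AAB and ASD are in the units of income and squared income respectively.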









these figures, the solid and dashed horizontal lines respectively indicate the estimated posterior mean and 95% credible intervals of the respective parameters based on the full data posterior. The solid points denote the importance weighted case-deleted posterior mean while the vertical line segments are the 95% case-deleted posterior intervals). None of the subjects seems to be very influential on the parameter estimates. For every subject, we also looked at the difference in the predicted probability of disease based on the full data and with that subject deleted. Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences).

Surprisingly, the observation for case number 108 shows a marked departure from the rest. On analyzing this subject, it was found that it had a unique combination of very high age and very high values of PSA. In fact, it had the highest mean age in the sample, the highest age at diagnosis and the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease.

We also performed case deletion analysis of the intercept parameters of the

disease and trajectory models and the variance components. None of the subjects were

found to be influential on the posterior estimates of these parameters. Thus, based on

the above two measures, we may conclude that the semiparametric linear influence

model with trajectory I = (-8, 0) and 4 knots seems to fit the observed data relatively

well.

2.7 Conclusion and Discussion

Case control studies have witnessed a wide variety of research over the years.

Fundamental and far reaching contributions have been made both in the Frequentist

and Bayesian domains. Generally, the bulk of research has dealt with the standard









the comparison measures for the raw CPS estimates, SAIPE estimates and the

semiparametric estimates with the knot realignment while Table 3-3 depicts the

percentage improvement of the semiparametric estimates over the CPS and SAIPE

estimates. Here, SPM(5)* and SPRWM(5)* respectively denote the semiparametric

models with the realigned 5 knots.


Table 3-2. Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010


Table 3-3. Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates

Estimate   Model        ARB      ASRB     AAB      ASD
SAIPE      SPM(5)*      14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)*     9.51%   13.33%   11.78%   12.37%
CPS        SPM(5)*      32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)*    28.92%   51.85%   28.36%   48.17%
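The entries of Table 3-3 appear to be relative reductions of each comparison measure with respect to the benchmark estimate, that is, 100 × (benchmark − model) / benchmark. The text does not state this formula explicitly, so the sketch below treats it as an assumption and simply checks it against the SAIPE and SPM(5)* rows of Table 3-2.

```python
def percentage_improvement(benchmark, model):
    """Relative reduction of a comparison measure, in percent (assumed definition)."""
    return 100.0 * (benchmark - model) / benchmark

# Values copied from Table 3-2
saipe = {"ARB": 0.0326, "ASRB": 0.0015, "AAB": 1423.75, "ASD": 3134906}
spm5 = {"ARB": 0.028, "ASRB": 0.0012, "AAB": 1173.71, "ASD": 2334379}

improvement = {k: round(percentage_improvement(saipe[k], spm5[k]), 2) for k in saipe}
print(improvement)
```

Under this definition the computed values agree with the SPM(5)* over SAIPE row of Table 3-3 to two decimal places.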


It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that, with the realigned knots, the semiparametric model estimates perform better than the SAIPE estimates. This improvement is apparently due to the additional coverage of the observational pattern that is achieved by relocating the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment









Table 3-1. Parameter estimates of SPRWM with 5 knots

Parameter    Mean      Median    95% CI
$\beta_0$    4677.71   4660.08   (4633.31, 4758.7)
$\beta_1$    0.8156    0.816     (0.814, 0.817)
$\gamma_1$   -0.154    -0.154    (-0.158, -0.149)
$\gamma_2$   0.02      0.024     (-0.016, 0.040)
$\gamma_3$   -0.008    -0.016    (-0.056, 0.066)
$\gamma_4$   -0.093    -0.119    (-0.127, -0.037)
$\gamma_5$   -0.165    -0.173    (-0.187, -0.139)

3.4.4 Knot Realignment

As mentioned in Section 3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the possibility that the semiparametric models have room for improvement. In this section, we look for possible deficiencies in our models and try to come up with improvements, if there are any.

As mentioned in Section 3.4.1, selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.
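The difference between quantile-based knots and knots spread evenly over the range of the covariate can be seen on a small simulated example. This is a sketch under stated assumptions: the right-skewed covariate is drawn from a lognormal distribution purely for illustration (it is not the IRS mean income data), and the evenly spaced alternative is one simple form of realignment, not necessarily the one used in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical right-skewed covariate standing in for an income-type variable
x = rng.lognormal(mean=10.5, sigma=0.25, size=255)

K = 5
probs = np.arange(1, K + 1) / (K + 1)

# Knots on a grid of equally spaced sample quantiles: they follow the data density
quantile_knots = np.quantile(x, probs)

# Knots equally spaced over the observed range: they ignore the data density
uniform_knots = x.min() + probs * (x.max() - x.min())

quantile_span = quantile_knots[-1] - quantile_knots[0]
uniform_span = uniform_knots[-1] - uniform_knots[0]

print("quantile knots:", np.round(quantile_knots))
print("uniform knots: ", np.round(uniform_knots))
print("share of range covered by quantile knots:", quantile_span / (x.max() - x.min()))
```

With a skewed covariate the quantile knots cluster where observations are dense and leave the sparse upper tail without knots, whereas the evenly spaced knots cover a much wider part of the range.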

Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure. Thus, in both figures, the coverage area of the knots (i.e., the part of the observational pattern which is captured by the knots) is the









CHAPTER 3
ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A BAYESIAN
SEMIPARAMETRIC APPROACH

3.1 Introduction

Sample survey methodologies are widely used for collecting relevant information

about a population of interest over time. Apart from providing population level estimates,

surveys are also designed to estimate various features of subpopulations or domains.

Domains may be geographic areas like state or province, county, school district etc. or

can even be identified by a particular socio-demographic characteristic like a specific

age-sex group. Sometimes, the domain-specific sample size may be too small to yield

direct estimates of adequate precision. This led to the development of small area

estimation procedures which specifically deal with the estimation of various features

of small domains. Generally, observations on various characteristics of small areas

are collected over time, and thus, may possess a complicated underlying time-varying

pattern. It is likely that models which exploit the time varying pattern in the observations

may perform better than classical small area models which do not utilize this feature. In

this study, we present a semiparametric Bayesian framework for the analysis of small

area level data which explicitly accommodates the longitudinal pattern in the response

and the covariates.

3.1.1 SAIPE Program and Related Methodology

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S.

Census Bureau was established with the aim of providing annual estimates of income

and poverty statistics for all states, counties and school districts across the United

States. The resulting estimates are generally used for the administration of federal

programs and the allocation of federal funds to local jurisdictions. There are also many

state and local programs that depend on these estimates. Prior to the creation of the

SAIPE program, the decennial census was the only source of income and poverty

statistics for households, families and individuals related to small geographic areas









quality (in terms of their "closeness" to the census estimates) of the estimates tended

to improve as the knots were positioned more uniformly throughout the range of the

independent variable. It became apparent that the contribution of the knots towards

deciphering the underlying observational pattern improved substantially when those

were properly placed with an optimal coverage area. This in turn improved the

approximation of the curve vis-a-vis the true unknown observational pattern. This proved interesting because, as yet, there is no absolute rule that governs the positioning of knots. Our final estimates proved to be superior, not only to the raw CPS estimates, but also to the current U.S. Census Bureau (SAIPE) estimates. Although the basic semiparametric model performed much better than the semiparametric random walk model with 5 knots, more experiments need to be done with different knot positions and numbers before anything conclusive can be said about their relative performance as a whole. But it seems that, if adequate knots are used and if those are placed uniformly throughout the range of the independent variable, then a random walk component may not improve the fit any further, provided there is no strong trend in the income levels. The main advantage of our modeling procedure is that it can be used for any possible pattern in the response (income, poverty etc.) observations of small areas. In a subsequent work related to the estimation of median incomes of 4-person families, we have shown that the multivariate version of the basic semiparametric model performs quite well too and provides estimates which are consistently superior to the U.S. Census Bureau estimates.

The above models can be extended in various ways based on the nature of the

observational pattern and the quality (or richness) of the dataset. Some obvious

extensions are given as follows : (1) In the models considered above, the spline

structure f(xi) represents the population mean income trajectory for all the states

combined. The deviation of the ith state from the mean is modeled through the random

intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible









LIST OF TABLES


Table

1-1 A typical 2 x 2 table

2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

2-2 Posterior means and 95% confidence intervals of odds ratio for I = (-10, -5) for the linear influence model

2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots

3-1 Parameter estimates of SPRWM with 5 knots

3-2 Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment

3-3 Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates

3-4 Parameter estimates of SPM(5)*

3-5 Parameter estimates of SPRWM(5)*

3-6 Comparison measures for time series and other model estimates

4-1 Comparison measures for univariate estimates

4-2 Percentage improvements of univariate estimates over Census Bureau estimates

4-3 Comparison measures for bivariate non-random walk estimates

4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

4-5 Comparison measures for bivariate random walk model









CHAPTER 4
ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A MULTIVARIATE
BAYESIAN SEMIPARAMETRIC APPROACH

4.1 Introduction

Small area estimation techniques have been widely used for estimating various features of small domains, that is, domains for which the sample size is prohibitively small for the application of direct survey based estimation procedures. Small domains can be specific regions like a state, county or school district or can even be identified by a particular socio-demographic characteristic like a specific ethnic group.

The U.S. Census Bureau has always been concerned with the estimation of

income and poverty characteristics of small areas across the United States. These

estimates play a vital role towards the administration of federal programs and the

allocation of federal funds to local jurisdictions. For example, state level estimates of

median income for four-person families are needed by the U.S. Department of Health

and Human Services (HHS) in order to formulate its energy assistance program to low

income families. Since income characteristics for small areas are generally collected

over time, there may well be a time varying pattern in those observations. Neglecting

those patterns may lead to biased estimates which do not reflect the true picture. In

this study, we put forward a multivariate Bayesian semiparametric procedure for the

estimation of median income of four-person families for the different states of the U.S.

while explicitly accommodating for the time varying pattern in the observations.

4.1.1 Census Bureau Methodology

The estimation of median incomes for different family sizes used to be carried out

by the U.S. Census Bureau until a few years ago. More recently, they have established

the Small Area Income and Poverty Estimates (SAIPE) program which exclusively

deals with the estimation of median household income and poverty estimates for small

areas across the United States. But the estimation of the median income of four-person









variables is too complex to be expressible using a known functional form. One of the

main differences between parametric and nonparametric regression methodologies is

that, in the former, the true shape of the functional pattern is determined by the model

while in the latter, the shape is determined by the data itself.

Suppose, the response y and the covariate x are related as

\[
y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n \tag{1-11}
\]

where $f(x)$ is an unknown and unspecified smooth function of $x$ and $e_i \sim N(0, \sigma_e^2)$.

The basic problem of "nonparametric regression" is to estimate the function f(-)

using the data points (xi, yi). In doing so, it is typically assumed that beneath a rough

observational data pattern there is a smooth trajectory. This underlying smooth pattern

is estimated by various smoothing techniques. Broadly, there are four major classes of

smoothers used to estimate f(.) viz Local polynomial kernel smoothers (Fan and Gijbels

(1996); Wand and Jones (1995)), Regression splines (Eubank (1988), Eubank (1999)),

Smoothing splines (Wahba (1990); Green and Silverman (1994)) and Penalized splines

(Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths

and weaknesses. For example, local polynomial smoothers are computationally

advantageous for handling dense regions while smoothing splines may be better for

sparse regions. Here, we will briefly review the main characteristics of splines in general

and penalized splines in particular.

The basic idea behind splines is to express the unknown function f(x) using

piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points

in the range of x known as "knots". The knots, say, $(\tau_1, \ldots, \tau_K)$ partition the range of x into K distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of certain degree is defined. A polynomial spline of degree $p$ has $(p - 1)$ continuous derivatives and a discontinuous $p$th derivative at any interior knot. The $p$th derivative reflects the "jump" of the splines at the knots. Thus, a spline of degree 0 is a









the $j$th year. In that case, we may be interested in estimating $(\theta_{1u1}, \ldots, \theta_{mu1})'$, the median income of four-person families for all the states at time $u$. We may also want to estimate the difference in median incomes of four-person families at times $v$ and $u$, i.e., $(\theta_{1v1} - \theta_{1u1}, \ldots, \theta_{mv1} - \theta_{mu1})'$. Correspondingly, let $X_{ij} = (X_{ij1}, \ldots, X_{ijs})'$ be the predictors corresponding to the $i$th state and $j$th year.

4.2.2 Semiparametric Modeling Framework

We consider both univariate and bivariate income trajectory models for the

family-size dataset. The univariate modeling framework is exactly the same as explained

in Chapter 3. Here, we will explain the bivariate framework which is of two types viz a

simple bivariate model and a bivariate random walk model. These can also be seen as

extensions of the univariate models explained in Section 3.2.2.

4.2.2.1 Simple bivariate model

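The bivariate non-random walk model is given by

\[
\begin{aligned}
Y_{ij1} &= \beta_{01} + \beta_{11} X_{ij1} + \ldots + \beta_{p1} X_{ij1}^{p} + \sum_{k=1}^{K_1} \gamma_{k1} (X_{ij1} - \tau_{k1})_+^{p} + b_{i1} + u_{ij1} + e_{ij1} \\
Y_{ij2} &= \beta_{02} + \beta_{12} X_{ij2} + \ldots + \beta_{q2} X_{ij2}^{q} + \sum_{k=1}^{K_2} \gamma_{k2} (X_{ij2} - \tau_{k2})_+^{q} + b_{i2} + u_{ij2} + e_{ij2}
\end{aligned} \tag{4-1}
\]

This is the most general structure since the degrees of the splines as well as the number and position of the knots are different for the two models. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\tau_{k1} = \tau_{k2}$, $k = 1, 2, \ldots, K_1 (= K_2)$.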

Equation (4-1) can be rewritten as


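\[
\begin{aligned}
Y_{ij} &= X_{ij}\beta + Z_{ij}\gamma + b_i + u_{ij} + e_{ij} \\
       &= \theta_{ij} + e_{ij},
\end{aligned} \tag{4-2}
\]

where $\theta_{ij} = X_{ij}\beta + Z_{ij}\gamma + b_i + u_{ij}$.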









2.3 Posterior Inference


2.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ and $D_i$ be the exposure vector and disease status while $a_i = (a_{i1}, \ldots, a_{in_i})'$ and $t_i = (t_{i1}, \ldots, t_{in_i})'$ be the observed values of age and time with respect to diagnosis for the $i$th subject respectively. So, the response vector for the $i$th

subject will be the pair (Yi, Di). Let 02 = (c, 3, 2fa, b, ,, ac, a e ..., C}) be the

parameter space corresponding to the ith subject. Thus, the full parameter space will be

given by 0 = E2 u ,2 U ... U vN.

The likelihood for the ith subject, conditional on the random effects, is given by

L(Y,, Di, ail i) oc p(Yi,|, a,, bi, 0 )p(Dia, /, 4P )p(3(2)l )p( (2) )
N q
xp(b(2) o) l p(b 1iJ2) (2-6)
il j 0

where p(Yi,|/, a,, bi, o- ) is the probability distribution corresponding to the trajectory

model, p(Dla, 0, 4) denotes the logistic distribution corresponding to the disease model

while the rest deals with the distributional structures on the spline coefficients and

random effects.

Since the trajectory model (2-1) has a normal distributional structure while the

disease model (2-3) has a logistic structure, the likelihood function and hence the

posterior have a complicated form. To alleviate this problem, we approximate the logistic

distribution as a mixture of normals using a well known data augmentation algorithm

proposed by Albert and Chib (1993). This is briefly explained in Section 3.3.

2.3.2 Priors

To complete the Bayesian specification of our model, we need to assign prior

distributions to the unknown parameters. We assume diffuse normal priors for the

polynomial coefficients (/3, ..., /p) and (a, o,..., ,). For the variance components
(o, ao, oa o {o, ..., ao}), we assume uniform priors with large upper bounds. The prior

distributions are assumed to be mutually independent. We choose large values for the









is necessary, as we will see, in order to account for the three types of dependencies

mentioned above.

We assume that Y,, conditional on the random effects bi and latent class 5,, are from an

exponential family with distribution

f( Yt Ik, k < t, bi, Si) = exp [{ t it (Tit)}/(mi) + h( t, ( )]
where E(Y,t Yk, k < t, b,, Si) = g-l(it) = '(lit). Here Tlit is the linear predictor, b() is

a known function, 0 is a scale parameter and m, is the prior weight. We next specify the

conditional model as
m p
g{E(Yt Yk, k < t, bi, Si)} = Ait + SuZ() + 7t,kyit-k + b (5-6)
j=1 k=l

where, in the most general case, [bi, Sy = 1, X] ~ N(0, o(7(Xi)) and 7it,k(S = 1) =

V'tkk forj = 1, 2,..., m and k = 1, 2,..., p, where Vi, and Zit are both subsets of Xt.
Thus, the variance of bi may depend on the latent class and the covariate vector for

the ith subject. Moreover, 62, 6, .... 6) determines how the dependence between Y,

and Yt-k varies as a function of the covariates Vit,k conditional on the latent classes.

We also make the sum-to-zero constraint i.e at = Y1 a for the purpose of

identifiability. Lastly, in this conditional model, each subject has its own intercept, and

the effect of each covariate, is allowed to differ by dropout class via the regression

coefficients, aO).

The probabilities of the latent classes given the drop-out times are specified as a proportional odds model (Agresti, 2002) given by

\[
\mathrm{logit}\; P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big) = \lambda_{0,k} + \lambda_1 D_i, \quad k = 1, \ldots, M - 1, \tag{5-7}
\]

where $\lambda_{0,1} < \lambda_{0,2} < \ldots < \lambda_{0,M-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).











effects assumed to be independent and identically distributed with mean 0 and constant variance $\sigma_v^2$. Lastly, the $b_i$'s are known positive constants and $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients.

In order to infer about the small area means, $\bar{Y}_i$, direct estimators, $\hat{\bar{Y}}_i$, are assumed to be known and available. The linear model

\[
\hat{\theta}_i = g(\hat{\bar{Y}}_i) = \theta_i + e_i, \quad i = 1, \ldots, m \tag{1-8}
\]

is assumed where the sampling errors, $e_i$, are independent with

\[
E_p(e_i \mid \theta_i) = 0, \quad V_p(e_i \mid \theta_i) = \psi_i, \quad \psi_i \text{ known}
\]

which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in (1-7), we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining (1-7) and (1-8), we have

\[
\hat{\theta}_i = z_i'\beta + b_i v_i + e_i \tag{1-9}
\]

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent. Fay and Herriot (1979) studied the above area level model (1-9) in the context of estimating the per capita income (PCI) for small places in the United States and proposed an Empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model based county estimates of poor school age children in the United States.
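To make the role of the area level model (1-9) concrete, the following sketch computes the standard best predictor of $\theta_i$ when $\beta$ and $\sigma_v^2$ are treated as known: a weighted average of the direct estimate and the synthetic estimate $z_i'\beta$, with weight $\sigma_v^2 b_i^2 / (\psi_i + \sigma_v^2 b_i^2)$ on the direct estimate. This formula is a textbook result for the model and is not derived in the passage above; the function name and the two-area toy inputs are hypothetical.

```python
import numpy as np

def area_level_predictor(theta_hat, Z, beta, psi, sigma_v2, b=None):
    """Shrinkage predictor under theta_hat_i = z_i' beta + b_i v_i + e_i.

    Assumes beta and sigma_v2 are known; psi holds the known sampling variances.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    psi = np.asarray(psi, dtype=float)
    b = np.ones_like(theta_hat) if b is None else np.asarray(b, dtype=float)
    synthetic = np.asarray(Z, dtype=float) @ np.asarray(beta, dtype=float)
    # Weight on the direct estimate: large when the sampling variance is small
    weight = sigma_v2 * b ** 2 / (psi + sigma_v2 * b ** 2)
    prediction = weight * theta_hat + (1.0 - weight) * synthetic
    return prediction, weight

# Toy example with an intercept-only regression (values are illustrative)
pred, w = area_level_predictor(theta_hat=[10.0, 20.0],
                               Z=[[1.0], [1.0]],
                               beta=[12.0],
                               psi=[1.0, 4.0],
                               sigma_v2=4.0)
print(w, pred)
```

Areas with noisy direct estimates (large $\psi_i$) are pulled more strongly towards the synthetic estimate, which is the sense in which the model borrows strength across areas. Setting `sigma_v2` to zero reproduces the purely synthetic estimator mentioned in the text.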

In the unit level model, it is assumed that unit specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element $j$ in each small area $i$. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model

\[
y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad i = 1, \ldots, m;\; j = 1, \ldots, N_i \tag{1-10}
\]









Henderson, C. (1950). Estimation of genetic parameters (abstract). Annals of
Mathematical Statistics 21, 309-310.

Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated
measures and event times. Statistics in Medicine 16, 239-257.

Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics : Handling
drop-out in longitudinal studies. Statistics in Medicine 23, 1455-1497.

Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test
15, 1-96.

Johnson, V. (2004). A Bayesian X2 test for goodness-of-fit. Annals of Statistics 32,
2361-2384.

Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The
increased risk of venous thromboembolism and the use of third generation
progestagens : Role of bias in observational research. Contraception 54, 5-13.

Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for
proportional hazards models. Biometrics 62, 803-812.

Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of
Computational and Graphical Statistics 8, 333-352.

Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic
regression with missing covariates. Biometrics 54, 295-303.

Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley
& Sons.

MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models.
Journal of Computational and Graphical Statistics 2, 223-238.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from
retrospective studies of disease. Journal of the National Cancer Institute 22, 719-748.

Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7,
1223-1230.

Morris, C. (1983). Parametric empirical Bayes inference : theory and applications.
Journal of the American Statistical Association 78, 47-54.

Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian
hierarchical approach for combining case-control and prospective studies. Biometrics
55, 858-866.

Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control
studies with errors in variables. Biometrika 84, 523-537.











Semiparametric regression methods were not used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work of Opsomer et al. (2008), who combined small area random effects with a smooth, non-parametrically specified trend using penalized splines. In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and on likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial dataset concerning the estimation of the mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

3.1.3 Motivation and Overview

The motivation of our work also originates from the repetitive nature of the CPS median income estimates. But, in contrast to the approach of Ghosh et al. (1996), we have viewed the state specific annual household median income values as longitudinal profiles or "income trajectories". This view gained more ground because we used the state wide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying non-linear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx, 1996), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is











[Figure 3-2. Panel A: IRS mean income plot (horizontal axis: IRS Mean Income). Panel B: IRS median income plot (horizontal axis: IRS Median Income).]

Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all
the states of the U.S. from 1995 to 1999.


analysis with regard to the median household income dataset. In Section 3.5, we

discuss the Bayesian model assessment procedure we used to test the goodness-of-fit

of our models. We end with a discussion in Section 3.6. The appendix contains the

proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2 Model Specification

3.2.1 General Notation

Let $\boldsymbol{Y}_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\boldsymbol{\theta}_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the ith small area at the jth time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$). The target of inference is generally $\boldsymbol{\theta}_{ij}$ or some function of it. Specifically, in our context, $\boldsymbol{\theta}_{ij} = \theta_{ij}$, which denotes the median household income of the ith state at the jth year. We are interested in estimating $(\theta_{1u}, \ldots, \theta_{mu})'$, i.e. the median household income for all the states at time u. We may also want to estimate the difference in median household incomes at times v and u, i.e. $(\theta_{1v} - \theta_{1u}, \ldots, \theta_{mv} - \theta_{mu})'$. We denote by $X_{ij}$ the covariate corresponding to the ith state and jth year.











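distributed as Poisson($\lambda_{dk}$) where

\[
\log(\lambda_{dk}) = \log(\phi_d) + \log(\eta_{dk}) + \log(\delta_k), \qquad
\log(\lambda_{0k}) = \log(\delta_k),
\]

$\phi_d$ being the baseline odds for disease category d and $\eta$ being the parameter of interest. Assume independent improper priors $\pi(\phi_d) \propto \phi_d^{-1}$ and $\pi(\delta_k) \propto \delta_k^{-1}$ for $\phi$ and $\delta$, and a prior $\pi(\eta)$ for $\eta$ that is independent of $\phi$ and $\delta$ and proper, i.e. $E(\eta)$ exists and is finite. Let $n_{dk}$ be the number of individuals with $D = d$ and $\{X(t) = z_k(t),\ -c < t < 0\}$. Then the following statements hold:

(i) The posterior density of $(\eta, \phi)$ is

\[
\pi(\eta, \phi \mid n) \propto \prod_{d=1}^{r} \prod_{k=1}^{K} (\phi_d \eta_{dk})^{n_{dk}}
\prod_{k=1}^{K} \Big(1 + \sum_{d=1}^{r} \phi_d \eta_{dk}\Big)^{-\sum_{d=0}^{r} n_{dk}}
\prod_{d=1}^{r} \phi_d^{-1}\, \pi(\eta).
\]

(ii) Assuming $\theta = (\theta_1, \ldots, \theta_K)$ and $\theta_k = \delta_k / \sum_{j=1}^{K} \delta_j$, the posterior density of $(\theta, \eta)$ is

\[
\pi(\theta, \eta \mid n) \propto \prod_{k=1}^{K} \prod_{d=0}^{r}
\Big( \frac{\theta_k \eta_{dk}}{\sum_{j=1}^{K} \theta_j \eta_{dj}} \Big)^{n_{dk}}
\prod_{k=1}^{K} \theta_k^{-1}\, \pi(\eta), \qquad \eta_{0k} = 1.
\]

(iii) The marginal posterior densities of $\eta$ obtainable from $\pi(\eta, \phi \mid n)$ and $\pi(\eta, \theta \mid n)$ are the same.

The proof of the above theorem is given in Appendix A.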

2.5 Model Comparison and Assessment

2.5.1 Posterior Predictive Loss

We performed model comparison using the posterior predictive loss (PPL) criterion

proposed by Gelfand and Ghosh (1998). This criterion is based on the idea that an

optimal model should provide accurate prediction of a replicate of the observed data.
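As a rough illustration of the criterion described above, the following sketch computes the squared error loss version of the posterior predictive loss, D_k = P + (k/(k+1)) G, from a matrix of replicated data. The function name, the argument names and the layout of `y_rep` (one row per posterior draw) are assumptions made for this example and are not taken from the dissertation.

```python
import numpy as np

def posterior_predictive_loss(y_obs, y_rep, k=np.inf):
    """Squared error loss form of the Gelfand-Ghosh criterion.

    y_obs : observed data, shape (n_obs,)
    y_rep : replicated data drawn from the posterior predictive
            distribution, shape (n_draws, n_obs)  [assumed layout]
    k     : weight on the goodness-of-fit term; k = inf gives D = P + G
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    pred_mean = y_rep.mean(axis=0)          # predictive mean of each replicate
    pred_var = y_rep.var(axis=0, ddof=1)    # predictive variance of each replicate
    G = float(np.sum((pred_mean - y_obs) ** 2))  # goodness-of-fit term
    P = float(np.sum(pred_var))                  # penalty term
    weight = 1.0 if np.isinf(k) else k / (k + 1.0)
    return {"G": G, "P": P, "D": P + weight * G}
```

Under this criterion a smaller D is preferred; P tends to grow for models that are too diffuse or overfitted, while G grows for models that fit poorly.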









varying exposure profile and also the influence pattern of the exposure profile on

the binary outcome. Specifically, we model the underlying exposure trajectory using

penalized splines or p-splines (Eilers and Marx (1996); Ruppert et al. (2003)). We also

express the effect of the exposures on the current disease state as a penalized spline

to account for any possible time varying patterns of influence. Analysis is carried out

in a hierarchical Bayesian framework. Our modeling framework is quite flexible since

it can accommodate any possible non-linear time varying pattern in the exposure and

influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based

on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.
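The simplest intuition behind treating retrospective data prospectively can be seen in a single 2 x 2 table, where the disease odds ratio and the exposure odds ratio coincide. The sketch below is only this elementary identity, not the Prentice and Pyke or Seaman and Richardson results themselves; the function name and the cell counts used in the example are hypothetical.

```python
def odds_ratios(n11, n10, n01, n00):
    """Return (disease odds ratio, exposure odds ratio) for one 2 x 2 table.

    n11: exposed cases,    n10: unexposed cases,
    n01: exposed controls, n00: unexposed controls.
    """
    # Prospective view: odds of disease among exposed vs. unexposed subjects.
    disease_or = (n11 / n01) / (n10 / n00)
    # Retrospective view: odds of exposure among cases vs. controls.
    exposure_or = (n11 / n10) / (n01 / n00)
    return disease_or, exposure_or

# Hypothetical counts: both views reduce to n11 * n00 / (n10 * n01).
prospective, retrospective = odds_ratios(30, 70, 10, 90)
```

Both quantities reduce to the same cross-product ratio, which is why the odds ratio, unlike the relative risk, can be estimated from data sampled conditionally on disease status.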

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature.

We perform model checking based on the posterior predictive loss criterion (Gelfand and









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS

By

Dhiman Bhadra

August 2010

Chair: Malay Ghosh
Cochair: Michael J. Daniels
Major: Statistics

Case-control studies and small area estimation are two distinct areas of modern Statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease with the aim of capturing disease-exposure association, especially for rare diseases. The latter area is concerned with the measurement of characteristics of small domains, i.e. regions whose sample size is so small that the usual survey based estimation procedures cannot be applied in the inferential routines. Both these areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by Federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation.

The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e. case-control studies for which time varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history,










[Figure 3-3. Panel A: Positioning of 5 Knots. Panel B: Positioning of 7 Knots. Horizontal axis in both panels: IRS Mean Income.]

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against
IRS mean income. The knots are depicted as the bold faced triangles at the
bottom.


region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low density area of the plot, i.e. the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured.

As a natural solution to this issue, we decided to place half of the knots in the low density region of the graph and the other half in the high density region. The exact boundary line between the high density and low density regions is hard to determine. We tested different alternatives and settled on IRS mean = 47000 as a tentative boundary because it gave the best results. In both regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots.
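The realignment just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the default boundary of 47000 is the tentative value quoted in the text, and the way an odd number of knots is split between the two regions (`n_low_density`, defaulting to the smaller half) is a choice made here, since the text does not specify it.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Knots at equally spaced sample quantiles of x."""
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    return np.quantile(x, probs)

def realigned_knots(x, n_knots, boundary=47000.0, n_low_density=None):
    """Split the knots between the dense and the sparse part of x.

    x             : covariate values (e.g. IRS mean income)
    boundary      : tentative cut between high and low density regions
    n_low_density : knots placed to the right of the boundary
                    (assumed default: n_knots // 2)
    """
    x = np.asarray(x, dtype=float)
    if n_low_density is None:
        n_low_density = n_knots // 2
    dense = x[x <= boundary]    # high density region
    sparse = x[x > boundary]    # low density region
    knots = np.concatenate([
        quantile_knots(dense, n_knots - n_low_density),
        quantile_knots(sparse, n_low_density),
    ])
    return np.sort(knots)
```

Compared with quantiles of the pooled covariate, this forces some knots into the sparse right tail where the non-linear pattern is visible.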

It is clear from Figure 3-4 that the new knots are more dispersed throughout

the range of IRS mean than the old ones. The region between the bold and dashed

vertical lines denotes the additional coverage that has been achieved with the knot









the appendix). Markov chain Monte Carlo methodology, specifically Gibbs sampling (Gelfand and Smith, 1990), has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to assess their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states. Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time.

The rest of the chapter is organized as follows. In Section 4.2 we introduce the

bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical

Bayesian analysis we performed. In Section 4.4, we describe the results of the data

analysis with regard to the median household income dataset. Finally, we end with

a discussion and some references towards future work in Section 4.5. The appendix

contains the proofs of the posterior propriety and the expressions of the full conditional

distributions for our models.

4.2 Model Specification

4.2.1 Notation

Let $\boldsymbol{Y}_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\boldsymbol{\theta}_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the ith small area at the jth time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$). In this study, we are concerned with the estimation of $\boldsymbol{\theta}_{ij}$ or some function of it. For example, $\theta_{ij1}$ may be the median income of four-person families for the ith state at









ourselves to a linear spline (p = 1). The selection of knots is always a subjective but tricky issue in this kind of problem. Sometimes experience with the subject matter may be a guiding force in placing the knots at the "optimum" locations where a sharp change in the curve pattern can be expected. Too few or too many knots generally worsen the fit. If too few knots are used, the complete underlying pattern may not be captured properly, resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may even degrade its quality (Ruppert, 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).
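To make the knot convention concrete, the sketch below builds the design matrices for a linear spline (p = 1) with knots at equally spaced sample quantiles. The truncated linear basis $(x - \kappa_k)_+$ is assumed here as one common parameterization of a penalized spline; the function names are hypothetical and the covariate values in the usage lines are made up.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Knots at the k/(n_knots + 1) sample quantiles, k = 1, ..., n_knots."""
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    return np.quantile(x, probs)

def linear_spline_design(x, knots):
    """Return (X, Z) for a linear truncated-power spline.

    X : columns [1, x], the unpenalized polynomial part
    Z : columns (x - kappa_k)_+, one per knot, whose coefficients
        are shrunk (treated as random effects) in a P-spline fit
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    Z = np.maximum(x[:, None] - knots[None, :], 0.0)
    return X, Z

# Made-up covariate values, in thousands of dollars.
income = np.array([31.0, 35.5, 38.2, 41.0, 44.7, 47.3, 52.9, 61.4])
knots = quantile_knots(income, 3)
X, Z = linear_spline_design(income, knots)
```

Each column of Z is zero to the left of its knot and increases linearly to the right, so the fitted curve can change slope at every knot.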

3.4.2 Computational Details

We implemented and monitored the convergence of the Gibbs sampler following the general guidelines given in Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 after a burn-in of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chains from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.
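For reference, a basic version of the Gelman-Rubin diagnostic for a single scalar parameter can be computed as below. This is a sketch of the standard potential scale reduction factor without the degrees-of-freedom correction; the function name and the array layout (one row per chain, burn-in already discarded) are assumptions for this example.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains : array of shape (n_chains, n_draws), post burn-in
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))
```

Values close to 1 suggest that the chains have mixed; values well above 1 indicate that the chains started from overdispersed points have not yet forgotten their starting values.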









4.3 Hierarchical Bayesian Analysis

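In this section, the notation and expressions correspond to the bivariate setup. The expressions for the univariate setup are analogous and are given in detail in Chapter 3.

4.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $U_i = (U_{i1}, \ldots, U_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$ be the covariate vectors corresponding to the ith state. Here, $Y_{ij} = (Y_{ij1}, Y_{ij2})'$ and the expressions for $U_{ij}$ and $Z_{ij}$ are given above. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$ be the parameter space corresponding to the ith state where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$. Thus, the full parameter space will be given by $\Omega = \Omega_1 \times \cdots \times \Omega_m$. For the bivariate non-random walk model, the likelihood function for the ith state would be given by

\[
\begin{aligned}
L(Y_i, U_i, Z_i \mid \Omega_i) &\propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, U_i, Z_i)\, L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma) \\
&= \prod_{j=1}^{t} \big\{ L(Y_{ij} \mid \theta_{ij}, \Sigma_{ij})\, L(\theta_{ij} \mid U_{ij}'\beta + Z_{ij}'\gamma + b_i, \Psi_j) \big\}\, L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma)
\end{aligned}
\tag{4-4}
\]

Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance covariance matrix $\Sigma$.

For the bivariate random walk model, the parameter space for the ith state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$ where $v = (v_1, \ldots, v_t)'$ is the vector of time specific random effects. The hierarchical Bayesian framework is given by

1. $(Y_{ij} \mid \theta_{ij}) \sim N(\theta_{ij}, \Sigma_{ij})$

2. $(\theta_{ij} \mid \beta, \gamma, b_i, v_j, \Psi_j) \sim N(U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \Psi_j)$

3. $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$, assuming $v_0 = 0$

4. $(b_i \mid \Sigma_0) \sim N(0, \Sigma_0)$

5. $\gamma \sim N(0, \Sigma_\gamma)$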
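To see how the five levels listed above generate data, the following sketch simulates a scalar (univariate) analogue of the random walk hierarchy. Everything specific in it is an assumption made for illustration: the function name, the use of a single covariate with a truncated linear spline basis, and the scalar standard deviations standing in for the covariance matrices. It is not the bivariate model itself and says nothing about the priors used in the dissertation.

```python
import numpy as np

def simulate_scalar_random_walk_model(x, knots, beta, sd_gamma, sd_b,
                                      sd_v, sd_psi, sd_y, seed=0):
    """Simulate (y, theta, v) from a scalar analogue of levels 1 to 5.

    x     : covariate array of shape (m areas, t time points)
    knots : spline knots, shape (K,)
    beta  : (intercept, slope) of the linear part
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    m, t = x.shape
    gamma = rng.normal(0.0, sd_gamma, size=knots.size)   # level 5: spline coefficients
    b = rng.normal(0.0, sd_b, size=m)                    # level 4: area effects
    v = np.cumsum(rng.normal(0.0, sd_v, size=t))         # level 3: random walk, v_0 = 0
    spline = np.maximum(x[..., None] - knots, 0.0) @ gamma
    mean = beta[0] + beta[1] * x + spline + b[:, None] + v[None, :]
    theta = rng.normal(mean, sd_psi)                     # level 2: true area values
    y = rng.normal(theta, sd_y)                          # level 1: survey estimates
    return y, theta, v
```

Setting `sd_v` to zero removes the time specific effects and gives a scalar analogue of the non-random walk model.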









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

1.1 Overview of Dissertation
1.2 Review of Case-Control Studies
1.3 Review of Small Area Estimation
1.4 Non-Parametric Regression Methodology

2 BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES
WITH TIME VARYING EXPOSURES

2.1 Introduction
2.1.1 Setting
2.1.2 Motivating Dataset: Prostate Cancer Study
2.2 Model Specification
2.2.1 Notation
2.2.2 Model Framework
2.3 Posterior Inference
2.3.1 Likelihood Function
2.3.2 Priors
2.3.3 Posterior Computation
2.4 Bayesian Equivalence
2.5 Model Comparison and Assessment
2.5.1 Posterior Predictive Loss
2.5.2 Kappa statistic
2.5.3 Case Influence Analysis
2.6 Analysis of PSA Data
2.6.1 Constant Influence Model
2.6.2 Linear Influence Model
2.6.3 Overall Model Comparison
2.6.4 Model Assessment
2.7 Conclusion and Discussion




Full Text

ACKNOWLEDGMENTS

I had the good fortune to be a student at the Department of Statistics at University of Florida. It is here that I came in close contact with some of the preeminent statisticians of the day and learnt a lot from them. I deeply acknowledge the tremendous help, encouragement and endless support that I received from my advisor Prof. Malay Ghosh, my co-advisor Prof. Michael J. Daniels and Prof. Alan Agresti throughout the highs and lows of doing my research work. They not only taught me statistics or the art of writing papers or solving problems - they introduced me to the spirit of discovery and the joy of learning, something that will stay with me forever and would motivate me in ways I can never imagine. However the list doesn't end here since each and every member of the faculty opened up new doors for me through which knowledge flowed past and enriched me along the way. My endless gratitude to each and every one of them. I also wish to thank Prof. Bhramar Mukherjee (currently at the Department of Biostatistics at University of Michigan) for her help and inspiration over the years. Last but not the least, my unending gratitude to my mother whose sacrifice, unconditional love and blessing was always with me, guiding me along the way. I would end by conveying my deepest respect to the memory of my father - he was there with me always throughout this journey.


3 ...

3.1 Introduction
3.1.1 SAIPE Program and Related Methodology
3.1.2 Related Research
3.1.3 Motivation and Overview
3.2 Model Specification
3.2.1 General Notation
3.2.2 Semiparametric Income Trajectory Models
3.2.2.1 Model I: Basic Semiparametric Model (SPM)
3.2.2.2 Model II: Semiparametric Random Walk Model (SPRWM)
3.3 Hierarchical Bayesian Inference
3.3.1 Likelihood Function
3.3.2 Prior Specification
3.3.3 Posterior Distribution and Inference
3.4 Data Analysis
3.4.1 Comparison Measures and Knot Specification
3.4.2 Computational Details
3.4.3 Analytical Results
3.4.4 Knot Realignment
3.4.5 Comparison with an Alternate Model
3.5 Model Assessment
3.6 Discussion

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES: A
MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH

4.1 Introduction
4.1.1 Census Bureau Methodology
4.1.2 Related Literature
4.1.3 Motivation and Overview
4.2 Model Specification
4.2.1 Notation
4.2.2 Semiparametric Modeling Framework
4.2.2.1 Simple bivariate model
4.2.2.2 Bivariate random walk model
4.3 Hierarchical Bayesian Analysis
4.3.1 Likelihood Function
4.3.2 Prior Specification
4.3.3 Posterior Distribution and Inference
4.4 Data Analysis
4.4.1 Comparison Measures and Knot Specification
4.4.2 Computational Details
4.4.3 Analytical Results
4.5 Conclusion and Discussion

5 ...

5.1 Adaptive Knot Selection
5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using
Latent Class and Transitional Modelling
5.2.1 Introduction and Brief Literature Review
5.2.2 Modeling Framework
5.2.3 Likelihood, Priors and Posteriors
5.2.4 Specification of Priors

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS

B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS

B.1 Univariate Small Area Model
B.2 Bivariate Small Area Model

C FULL CONDITIONAL DISTRIBUTIONS

C.1 Semiparametric Case Control Model
C.2 Semiparametric Small Area Models
C.2.1 Semiparametric Univariate Small Area Model
C.2.2 Univariate Random Walk Model
C.2.3 Bivariate Random Walk Model

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1-1 A typical 2 × 2 table
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model
2-2 Posterior means and 95% confidence intervals of odds ratio for I=(10,5) for the linear influence model
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
3-1 Parameter estimates of SPRWM with 5 knots
3-2 Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment
3-3 Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates
3-4 Parameter estimates of SPM(5)
3-5 Parameter estimates of SPRWM(5)
3-6 Comparison measures for time series and other model estimates
4-1 Comparison measures for univariate estimates
4-2 Percentage improvements of univariate estimates over Census Bureau estimates
4-3 Comparison measures for bivariate non-random walk estimates
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
4-5 Comparison measures for bivariate random walk model

LIST OF FIGURES

Figure

2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.
2-2 Sensitivity of 1, 0, 1 and disease probability estimates to case-deletions.
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999.
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom.
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom.


The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also a state specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time varying pattern in the income observations. Our estimates tend to have better performances than those provided by the Census Bureau and also have

Eilers and Marx, 1996). In Chapter 2, I present an analysis of a case-control study when longitudinal, time varying exposure observations are available for the cases and controls. Semiparametric regression procedures are used to flexibly model the subject specific exposure profiles and also the influence pattern of the exposure profiles on the disease status. This enables us to analyze whether past exposure observations affect the current disease status of a subject conditional on his/her current exposure condition. The proposed methodology is motivated by and applied to a case control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. We also show the details of the hierarchical Bayesian implementation of our models and some equivalence results that have enabled us to use a prospective modeling framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for estimating the median household income of small areas when area-specific longitudinal income observations are available. Our models include a nonparametric functional

Chapter 4 deals with an extension of the methodology in Chapter 3 where a bivariate semiparametric procedure has been used to estimate the median income of families of varying sizes across small areas. This can also be seen as an extension of the time series modeling framework of Ghosh et al. (1996). We show that the semiparametric models generally have better performance than their time series counterparts and in a few situations, the performances are comparable. We want to convey the message that semiparametric regression methodology can provide an attractive alternative to the traditional modeling techniques, especially when time varying information is available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some interesting open problems and areas for future research that may be worth pursuing.

Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years. Notable work in the Frequentist domain includes Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let D and E be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between D and E is the (disease) odds ratio

\[
\frac{P(D=1 \mid E=1)/P(D=0 \mid E=1)}{P(D=1 \mid E=0)/P(D=0 \mid E=0)}.
\]

By applying the Bayes theorem, the above expression can be rewritten as

\[
\frac{P(E=1 \mid D=1)/P(E=0 \mid D=1)}{P(E=1 \mid D=0)/P(E=0 \mid D=0)},
\]

which is the exposure odds ratio. Another well known measure of association is the relative risk (RR) of disease for different exposure values given by $P(D=1 \mid E=1)/P(D=1 \mid E=0)$. For rare diseases, both $P(D=0 \mid E=0)$ and $P(D=0 \mid E=1)$ are close to one and the disease odds ratio is approximately equal to the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of 2 × 2 tables as in Table 1-1

Table 1-1. A typical 2 × 2 table

Disease Status   Exposed   Not Exposed   Total
Case             n11i      n10i          n1i
Control          n01i      n00i          n0i
Total            e1i       e0i           Ni


$$\frac{\sum_{i=1}^{I} n_{11i}\, n_{00i} / N_i}{\sum_{i=1}^{I} n_{01i}\, n_{10i} / N_i} \qquad (1)$$

It may be of interest to test for the equality of the odds ratios across the I tables, using a statistic which follows an approximate $\chi^2$ distribution with $I - 1$ degrees of freedom under the null hypothesis. The derivation of the variance of the above estimator initially posed some challenge but was eventually addressed in several subsequent papers (Breslow 1996). Breslow and Day (1980) marked the development of likelihood-based inference methods for the odds ratio. Methods to evaluate the simultaneous effects of multiple quantitative risk factors on disease rates were pioneered in the 1960s.

In a case-control study, the appropriate likelihood is the retrospective likelihood of exposure given the disease status. Cornfield et al. (1961) noted that if the exposure distributions in the case and control populations are normal with different means but a common covariance matrix, then the prospective probability of disease (D) given the exposure (X) has the logistic form, i.e.

$$P(D = 1 \mid X = x) = L(\alpha + \beta' x),$$

where $L(u) = 1/\{1 + \exp(-u)\}$. However, there is a conceptual complication in using a prospective likelihood based on $P(D \mid X)$ whereas a case-control sampling


Prentice and Pyke (1979), who showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from the retrospective likelihood are the same as those obtained from a prospective likelihood under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to the situation of missing data and measurement error in the exposure variables.

In a case-control set-up, matching is often used for selecting comparable controls to eliminate bias due to confounding. Statistical techniques for analyzing matched case-control data were first developed by Breslow et al. (1978). In the simplest setting, the data consist of m matched sets, say $S_1, \ldots, S_m$, with $M_i$ controls matched with a case in each set or stratum. A prospective stratified logistic disease incidence model given by

$$P(D = 1 \mid x, S_i) = L(\alpha_i + \beta' x)$$

is assumed. The $\alpha_i$'s are the stratum-specific intercept terms, treated as nuisance parameters, and are eliminated by conditioning on the number of cases in each stratum. The resulting conditional logistic likelihood yields the optimum estimating function (Godambe 1976) for estimating $\beta$. The classical methods for analyzing unmatched and matched studies suffer from loss of efficiency when the exposure variable is partially missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric method of estimation which took into account missing exposures in matched case-control studies. Satten and Kupper (1993), Paik and Sacco (2000) and Satten and Carroll (2000) addressed the problem of missing exposure from a full likelihood approach by assuming a distribution of the exposure variable in the control population.
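To illustrate how conditioning removes the stratum intercepts, the sketch below evaluates the conditional logistic log-likelihood for matched sets with one case each. It is a minimal illustration under stated assumptions: the function name, argument layout and toy data are hypothetical, and the stratum intercepts never appear because they cancel from each ratio.

```python
import numpy as np


def conditional_log_likelihood(beta, case_x, control_x):
    """Conditional logistic log-likelihood for 1:M_i matched sets.

    case_x[i]    : exposure vector (length p) of the case in stratum i
    control_x[i] : array of shape (M_i, p) with the matched controls' exposures
    Each stratum contributes exp(x_case'beta) / sum_j exp(x_j'beta), where the
    sum runs over the case and its controls.
    """
    beta = np.asarray(beta, dtype=float)
    total = 0.0
    for x_case, x_controls in zip(case_x, control_x):
        x_case = np.asarray(x_case, dtype=float)
        x_controls = np.asarray(x_controls, dtype=float)
        scores = np.concatenate(([x_case @ beta], x_controls @ beta))
        shift = scores.max()  # stabilise the log-sum-exp
        total += scores[0] - (shift + np.log(np.exp(scores - shift).sum()))
    return total


# Toy data (hypothetical): two strata, one binary exposure.
cases = [[1.0], [0.0]]
controls = [[[0.0]], [[1.0], [0.0]]]
print(conditional_log_likelihood([0.0], cases, controls))
print(conditional_log_likelihood([1.0], cases, controls))
```

At beta = 0 every member of a stratum is equally likely to be the case, so stratum i contributes -log(1 + M_i).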


Althman (1971) is probably the first Bayesian work which considered several 2 × 2 contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case-control model with a single binary exposure. These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case-control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating knowledge available at each stage.

Muller and Roeder (1997) introduced a novel aspect to the Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the nonparametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West 1995). The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters. This paper pioneered the use of continuous covariates, measurement error and flexible nonparametric modeling of exposures in a Bayesian setting, and brought to light the tremendous possibilities of modern Bayesian computational techniques in solving complex data scenarios in case-control studies. Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical


Muller et al. (1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin 1981). They assigned a Dirichlet(1, 1, ..., 1) prior on the probability vector corresponding to the grid points.

Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies when cases are individually matched to controls. They introduced nuisance parameters


Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000), they based their analysis on the unconditional rather than the conditional likelihood after elimination of the nuisance parameters. Their framework included a wide variety of links like complementary log links and some symmetric and skewed links in addition to the usual logit and probit links. Recently, Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a nonparametric Dirichlet process prior on the stratum-specific effects in the distribution of the exposure variable and parametric priors on all other parameters. The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and thus to more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. Unfortunately, rigorous statistical methods for incorporating longitudinally varying exposure information inside the case-control framework have not yet been properly developed. In this work,


Ghosh and Rao (1994) provide a nice review of the different types of estimators and inferential procedures used in survey sampling and small area estimation.

Since sample surveys are generally designed for large areas, the estimates of means or totals obtained thereof are reliable for large domains. Direct survey-based estimators for small domains often yield large standard errors due to the small sample size of the concerned area. This is due to the fact that the original survey was designed to provide accuracy at a much higher level of aggregation than for local areas. This makes it a necessity to borrow strength from adjacent or related areas to find indirect estimators that increase the effective sample size and thus increase the precision of the resulting estimate for a given small area. Broadly speaking, a small area model has a generalized linear form with a mean term, a random area-specific effect term and a measurement error term which reflects the noise for not sampling the entire domain.


During the last 10-15 years, model-based inference has been widely used in the small area context. This is mainly due to the wide range of functionalities that come with the linear mixed effects modeling framework. Some of the main advantages of this framework are:

(i) Random area-specific effects account for between-area variation above and beyond that explained by auxiliary variables in the model.
(ii) Different variations like non-linear mixed effects models, logistic regression models and generalized linear models can be entertained.
(iii) Area-specific measures of precision can be associated with each small area estimate, unlike the global measures.
(iv) Complex data structures like spatial dependence, time series structures and longitudinal measurements can be explored.
(v) Recent methodological developments for random effects models can be utilized to achieve accurate small area inferences.

Generally, there are two kinds of small area models depending on whether the response is observed at the area or the unit level.

1. Area (or aggregate) level models relate small area means to area-specific auxiliary variables.
2. Unit level models relate the unit values of the study variable to unit-specific auxiliary variables.

The basic area level model is given by

$$\theta_i = z_i'\beta + b_i v_i.$$

Here $\theta_i$ is often assumed to be a function of the population mean, $\bar{Y}_i$, of the ith small area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, $v_i$'s are area-specific random


In order to infer about the small area means, $\bar{Y}_i$, direct estimators, $\hat{\bar{Y}}_i$, are assumed to be known and available. The linear model

$$\hat{\theta}_i = \theta_i + e_i$$

is assumed, where the sampling errors $e_i$ are independent with

$$E_p(e_i \mid \theta_i) = 0, \quad V_p(e_i \mid \theta_i) = \psi_i, \quad \psi_i \text{ known},$$

which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in (1), we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining (1) and (1), we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i,$$

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent. Fay and Herriot (1979) studied the above area level model (1) in the context of estimating the per capita income (PCI) for small places in the United States and proposed an Empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model-based county estimates of poor school-age children in the United States.

In the unit level model, it is assumed that unit-specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element j in each small area i. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model


Battese et al. (1988) studied the nested error regression model (1) in estimating the area under corn and soybeans for counties in North-Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means.

Over the years, numerous extensions have been proposed for the above modeling frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random-effects structure. Rao (2003) presented a nice overview of the different estimation methods, while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context.

A proper review of model-based small area estimation will be incomplete without explaining the EBLUP, EB and HB approaches that are being widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects, such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. So, they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between-area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown parameters with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications. In an EB approach, the posterior distribution of the parameters of


Morris (1983). Last but not least, in the HB approach, a prior distribution is specified on the model parameters and the posterior distribution of the parameter of interest is obtained. Inferences about the parameters are based on the posterior distribution. The parameter of interest is estimated by its posterior mean, while its precision is estimated by its posterior variance. Recent advances in Markov chain Monte Carlo techniques, specifically Gibbs and Metropolis-Hastings samplers, have considerably simplified the computational aspect of HB procedures.

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. The SAIPE program also provides annual state and county level estimates of median household income. Generally, observations on various characteristics of small areas that are collected over time may possess a complicated underlying time-varying pattern. It is likely that models which take into account this longitudinal pattern in the observations may perform better than classical small area models which do not utilize this information. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal time-varying pattern in the response and the covariates.
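The idea of borrowing strength in the basic area level model can be made concrete with a small numerical sketch. The code below assumes the Fay-Herriot form with $b_i = 1$, known sampling variances and a known between-area variance; the function name `fay_herriot_blup`, the argument names and the toy numbers are illustrative assumptions, not the estimators fitted later in the dissertation.

```python
import numpy as np


def fay_herriot_blup(direct, Z, psi, sigma2_v):
    """BLUP of area parameters under the basic area level model with b_i = 1.

    direct   : direct survey estimates, one per area
    Z        : (areas x p) matrix of area level auxiliary variables
    psi      : known sampling variances of the direct estimates
    sigma2_v : between-area variance, treated as known in this sketch
    """
    direct = np.asarray(direct, dtype=float)
    Z = np.asarray(Z, dtype=float)
    psi = np.asarray(psi, dtype=float)
    w = 1.0 / (sigma2_v + psi)  # inverse marginal variances
    # Generalised least squares estimate of the regression coefficients.
    beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * direct))
    synthetic = Z @ beta
    gamma = sigma2_v / (sigma2_v + psi)  # weight given to the direct estimate
    return gamma * direct + (1.0 - gamma) * synthetic, gamma


# Toy example (hypothetical numbers): three areas, intercept-only model.
estimates, gamma = fay_herriot_blup(
    direct=[10.0, 20.0, 30.0], Z=np.ones((3, 1)), psi=[1.0, 1.0, 1.0], sigma2_v=1.0
)
print(estimates, gamma)
```

Each estimate is a weighted average of the direct estimate and the regression-synthetic estimate; areas with larger sampling variance are shrunk more strongly towards the synthetic value. An EBLUP or HB analysis would replace the fixed `sigma2_v` by an estimate or a prior.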


Suppose the response y and the covariate x are related as

$$y_i = f(x_i) + e_i,$$

where f(x) is an unknown and unspecified smooth function of x and $e_i \sim N(0, \sigma_e^2)$. The basic problem of nonparametric regression is to estimate the function f(·) using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate f(·), viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions, while smoothing splines may be better for sparse regions. Here, we will briefly review the main characteristics of splines in general and penalized splines in particular.

The basic idea behind splines is to express the unknown function f(x) using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of x known as knots. The knots, say $(\tau_1, \ldots, \tau_K)$, partition the range of x into K + 1 distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of a certain degree is defined. A polynomial spline of degree p has (p − 1) continuous derivatives and a discontinuous pth derivative at any interior knot. The pth derivative reflects the jump of the spline at the knots. Thus, a spline of degree 0 is a


Here $(x - \tau_k)^p_+$ is the function $(x - \tau_k)^p\, I\{x > \tau_k\}$. Using the above basis, a spline of degree p can be expressed as

$$f(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} b_k (x - \tau_k)^p_+ .$$

Here, $(\beta_0, \ldots, \beta_p)$ and $(b_1, \ldots, b_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated; p = 1, 2, 3 corresponds to a linear, quadratic or cubic spline respectively. The above basis is one of the most commonly used, although other bases like radial basis functions or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as sensors in relaying information about the underlying true observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting through overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be placed uniformly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of x, and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert 2002). Recently, there have been interesting contributions on knot


Friedman (1991); Stone et al. (1997); Denison et al. (1998); Lindstrom (1999); DiMatteo et al. (2001); Botts and Daniels (2008)). The flexibility and wide applicability of splines is due to the fact that, provided the knots are evenly spread out over the range of x, $f(x \mid \beta, b)$ can accurately estimate a very large class of smooth functions f(·) even if the degree of the spline is kept relatively low (say, 1 or 2).

The spline coefficients $(b_1, \ldots, b_K)$ in (1) correspond to the discontinuous pth derivative of the spline; thus, they measure the jumps of the spline at the knots $(\tau_1, \ldots, \tau_K)$ and contribute to the roughness of the resulting spline. In order to smooth out the fit, a roughness penalty is placed on these parameters. This is often done by minimizing the expression

$$\sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \sum_{k=1}^{K} b_k^2,$$

where $\lambda$ is known as the smoothing parameter. This is synonymous with minimizing the first part of (1) subject to a constraint on the size of the spline coefficients. $\lambda$ plays a crucial role in the smoothing process since it controls the trade-off between the goodness of fit and the roughness of the fitted model. As $\lambda$ decreases, the spline tends to overfit, becoming an interpolating curve as $\lambda \to 0$. As $\lambda$ increases, the spline becomes smoother and tends to the least squares fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, such as cross-validation, generalized cross-validation and Mallows' $C_p$ criterion.

Broadly speaking, there are three main types of splines: regression splines, smoothing splines and penalized splines (or P-splines). All of them are based on the same principle as detailed above but differ in the specific manner in which smoothing is done or the knots are selected. In regression splines, smoothing is achieved by the deletion of non-essential knots or, equivalently, by setting the jumps at those knots to zero while keeping the jumps at the other knots undisturbed. In smoothing and penalized splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using


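(1). A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots, but in the latter the number of knots is much smaller, resulting in more flexibility. In fact, penalized splines can be seen as a generalization of regression and smoothing splines.

The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite (1) as

$$\sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda\, \theta' D\, \theta,$$

where $\theta = (\beta', b')'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $b = (b_1, b_2, \ldots, b_K)'$ and D is a known positive semi-definite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1}_{K\times K} \end{pmatrix}.$$

(1) corresponds to setting $\Sigma = I$.

Let X be the matrix with ith row $X_i = (1, x_i, \ldots, x_i^p)$ and Z be the matrix with ith row $Z_i = \{(x_i - \tau_1)^p_+, \ldots, (x_i - \tau_K)^p_+\}$. Using this formulation in (1) with the basis function in (1) and dividing by the error variance $\sigma_e^2$, we have

$$\frac{1}{\sigma_e^2}\,\|y - X\beta - Zb\|^2 + \frac{\lambda}{\sigma_e^2}\,\|b\|^2. \qquad (1)$$

By assuming that b is a vector of random effects with $\mathrm{Cov}(b) = \sigma_b^2 I$, where $\sigma_b^2 = \sigma_e^2/\lambda$, while $\beta$ is the set of fixed effects parameters, the above penalized spline framework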


where $\mathrm{Cov}(e) = \sigma_e^2 I$ and b and e are independent.

Bayesian P-splines have recently become popular because they combine the flexibility of nonparametric models with the exact inference provided by the Bayesian inferential procedure. This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand 2003) as shown above. This equivalence also carries over to the manner in which smoothing is done. Smoothing can be achieved by imposing penalties on the spline coefficients, as shown in (1), or by assuming a distributional form for b, for example $b \sim N_K(0, \sigma_b^2 I_K)$. In the Bayesian context, priors are placed on $\sigma_b^2$ and the other parameters, and the usual posterior sampling is carried out. Since samples are generated from the smoothing parameter alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above.
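The penalized least squares view of a P-spline can be sketched in a few lines. The code below is an illustration only, assuming a truncated power basis of degree at least 1, an identity penalty on the spline coefficients and a fixed smoothing parameter `lam` (equal to the variance ratio $\sigma_e^2/\sigma_b^2$ in the mixed model reading); the function names and the simulated data are hypothetical. A Bayesian analysis would instead place priors on the variance components and sample them.

```python
import numpy as np


def truncated_power_basis(x, knots, degree):
    """Columns 1, x, ..., x^degree followed by (x - knot)_+^degree (degree >= 1)."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(x, degree + 1, increasing=True)
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])


def fit_penalized_spline(x, y, knots, degree=1, lam=1.0):
    """Minimise ||y - C theta||^2 + lam * theta' D theta.

    D penalises only the spline (jump) coefficients, not the polynomial part.
    """
    C = truncated_power_basis(x, knots, degree)
    D = np.diag([0.0] * (degree + 1) + [1.0] * len(knots))
    theta = np.linalg.solve(C.T @ C + lam * D, C.T @ np.asarray(y, dtype=float))
    return theta, C @ theta


# Noise-free toy curve, so that the effect of lam is easy to see.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
knots = np.linspace(0.1, 0.9, 9)
_, rough = fit_penalized_spline(x, y, knots, degree=1, lam=1e-6)
_, smooth = fit_penalized_spline(x, y, knots, degree=1, lam=1e8)
print(np.sum((y - rough) ** 2), np.sum((y - smooth) ** 2))
```

With a very small `lam` the fit follows the data closely; with a very large `lam` the jump coefficients are shrunk to zero and the fit collapses to the ordinary least squares line.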


Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951), Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and far-reaching contributions have been made in virtually every aspect of the field. Some of the notable ones are equivalence of prospective and retrospective likelihoods (Prentice and Pyke 1979), measurement error in exposures (Roeder et al. 1996) and matched case-control studies (Breslow et al. 1978). Important contributions in the Bayesian paradigm include binary exposures (Zelen and Parker 1986), continuous exposures (Muller and Roeder 1997), categorical exposures (Seaman and Richardson 2001), equivalence (Seaman and Richardson 2004) and matching (Diggle et al. (2000); Ghosh and Chen (2002)).

The analysis of complex data scenarios in a case-control framework is a relatively new area of research. Specifically, analysis of longitudinal case-control studies has only


Park and Kim (2004) were among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case-control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independent correlation structure fail in this framework.

In view of the above challenges, we propose to use functional data analytic techniques, especially nonparametric regression methodology, to model both the time


Eilers and Marx (1996); Ruppert et al. (2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time-varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible non-linear time-varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and


Ghosh, 1998). Once the optimal model is identified, model assessment is carried out using case deletion diagnostics (Bradlow and Zaslavsky 1997).

Etzioni et al. 1999). This dataset is based on a biomarker-based screening procedure for prostate cancer, designed to elucidate the association between prostate cancer and prostate-specific antigen (PSA). The effectiveness of biomarker-based screening procedures for prostate cancer is currently a topic of intense debate and investigation in the realms of health care practice, policy and research. Since the discovery of PSA and the observation that serum PSA levels may be significantly increased in prostate cancer patients, a lot of effort has been dedicated to identifying effective PSA-based testing programs with favorable diagnostic properties.

In this study, the levels of free and total PSA were measured in the sera of 71 prostate cancer cases and 70 controls. Participants in this study included men aged 50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or Beta Carotene and Retinol. The intervention had no noticeable effect on the incidence of prostate cancer, with a similar number of cases observed in the intervention and control arms. Several PSA measurements recorded for the cases were taken as long as 10 years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between September 1988 and September 1995 inclusive. The individuals deemed controls were selected among individuals not yet diagnosed as having cancer by the time of analysis. As the exposure variable, we use the natural logarithm of the total PSA (Ptotal), although the negative logarithm of the ratio of free to total PSA (Pratio) can also be considered. In addition to the above measurements, observations were collected on time (years) relative to prostate cancer diagnosis and age at blood draw for the cases


Figure 2-1 shows the PSA trajectory against age for some randomly chosen cases and controls. Etzioni et al. (1999) analyzed this dataset by modeling the receiver operating characteristic (ROC) curves associated with both the biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis. They observed that although the two markers performed similarly eight years prior to diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.

The rest of the chapter is organized as follows. In Section 2.2, we introduce the semiparametric modeling framework. Section 2.3 describes the details of posterior inference. In Section 2.4, we discuss relevant Bayesian equivalence results for our framework. Section 2.5 outlines the model comparison and model assessment procedures we performed. We describe the data analysis results based on the prostate cancer dataset in Section 2.6 and end with a discussion in Section 2.7.

2.2.1 Notation


Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.


Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two-stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for modeling the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key differences between their framework and ours are that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models and that, instead of a purely linear modeling framework, we use a combination of linear and logistic models since our response is binary.

$$Y_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij},$$

where $e_{ij} \sim N(0, \sigma_e^2)$, f(a) is the population mean function modeling the overall PSA trend as a function of age for all the subjects, while $g_i(a)$ is the subject-specific deviation function reflecting the deviation of the ith subject-specific profile from the mean population profile.

The reason for modeling exposure as a function of age is that for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point controlling for the time with respect


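We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using P-splines as follows:

$$f(a_{ij}) = \Phi_{p,\tau}(a_{ij})'\beta, \qquad g_i(a_{ij}) = \Phi_{q,\kappa}(a_{ij})'b_i,$$

where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij} - \tau_1)^p_+, \ldots, (a_{ij} - \tau_K)^p_+]'$ and $\Phi_{q,\kappa}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij} - \kappa_1)^q_+, \ldots, (a_{ij} - \kappa_M)^q_+]'$ are truncated polynomial basis functions of degrees p and q with knots $(\tau_1, \ldots, \tau_K)$ and $(\kappa_1, \ldots, \kappa_M)$ respectively (Durban et al. 2004). Generally, $M \le K$.

$$P(D_i = 1) = L\left\{\alpha + \int_{-c}^{0} X_i(t + a_{di})\, \gamma(t + a_{di})\, dt\right\},$$

where L(·) is the logistic distribution function, $X_i(t + a_{di})$ is the true, error-free, unobserved subject-specific exposure profile modeled as $f(t + a_{di}) + g_i(t + a_{di})$, while $\gamma(t + a_{di})$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the ith subject. In (2), we use the relation $a_{ij} = t_{ij} + a_{di}$ to model the exposure trajectory X(·) and the influence function $\gamma(\cdot)$ as a function of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. Here c is the time by which we go back in the past to record the exposure history for the ith subject; e.g. c = 8 would imply that, for the ith subject, the exposure observations recorded since eight years prior to diagnosis are being considered for analysis. Thus, by changing the value of c, the effect of different lengths of PSA trajectories on the current disease status can be studied.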


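$$\gamma(t + a_{di}) = \Phi_{r,\tau}(t + a_{di})'\phi,$$

where $\Phi_{r,\tau}(t + a_{di}) = [1, (t + a_{di}), \ldots, (t + a_{di})^r, (t + a_{di} - \tau_1)^r_+, \ldots, (t + a_{di} - \tau_K)^r_+]'$, $\phi = (\phi_0, \ldots, \phi_{K+r})'$ and $(\tau_1, \ldots, \tau_K)$ are the knots.

As special cases of (2), we may consider $\gamma(t + a_{di}) = \phi_0$, in which case the covariate is the area under the PSA process $\{X_i(t + a_{di}), -c \le t \le 0\}$ and $\phi_0$ is its effect on the disease probability (or the logit of the disease probability). We can also assume $\gamma(t + a_{di}) = \phi_0 + \phi_1 (t + a_{di})$, which signifies a linear pattern of the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages.

Replacing (2) and (2) in the R.H.S. of (2), we have

$$\operatorname{logit} P(D_i = 1) = \alpha + \beta' M_i \phi + b_i' Q_i \phi,$$

where $M_i = \int_{-c}^{0} \Phi_{p,\tau}(t + a_{di})\, \Phi_{r,\tau}(t + a_{di})'\, dt$ and $Q_i = \int_{-c}^{0} \Phi_{q,\kappa}(t + a_{di})\, \Phi_{r,\tau}(t + a_{di})'\, dt$.

For pre-chosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are matrices and are available in closed form. We assume normal distributional forms for the spline coefficients in (2) and (2) in order to penalize the jumps of the spline at the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ (k = 1, ..., K); $b_{i,q+m} \sim N(0, \sigma_b^2)$ (m = 1, ..., M) and $\phi_{k+r} \sim N(0, \sigma_\phi^2)$ (k = 1, ..., K). Finally, the random subject-specific deviation function $g_i(a_{ij})$ is modeled as $b_{ij} \sim N(0, \sigma_j^2)$ (i = 1, ..., N; j = 0, ..., q).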
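The matrices $M_i$ and $Q_i$ above are integrals of outer products of two spline bases over the exposure window. The text notes that closed forms exist; as a hedged illustration, the sketch below approximates such a matrix with a trapezoidal rule instead. The function names, the grid size and the example ages and knots are assumptions made for this sketch only.

```python
import numpy as np


def basis(a, knots, degree):
    """Truncated polynomial basis evaluated at ages a (degree >= 1)."""
    a = np.asarray(a, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(a, degree + 1, increasing=True)
    trunc = np.clip(a[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])


def integrated_basis_product(age_at_dx, c, knots_row, deg_row, knots_col, deg_col,
                             n_grid=2001):
    """Trapezoidal approximation of the integral over -c <= t <= 0 of
    basis_row(t + age_at_dx) basis_col(t + age_at_dx)' dt."""
    t = np.linspace(-c, 0.0, n_grid)
    ages = t + age_at_dx
    B_row = basis(ages, knots_row, deg_row)
    B_col = basis(ages, knots_col, deg_col)
    dt = t[1] - t[0]
    w = np.full(n_grid, dt)
    w[0] = w[-1] = dt / 2.0  # trapezoid end-point weights
    return np.einsum("g,gi,gj->ij", w, B_row, B_col)


# Hypothetical subject diagnosed at age 60, looking back c = 8 years, with
# linear bases and knots at ages 54, 56 and 58.
M_i = integrated_basis_product(60.0, 8.0, [54.0, 56.0, 58.0], 1, [54.0, 56.0, 58.0], 1)
print(M_i.shape, M_i[0, 0], M_i[1, 0])
```

The top-left entry is the integral of 1 over the window (the window length c), which gives a quick sanity check on the quadrature.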


2.3.1 Likelihood Function

The likelihood for the ith subject, conditional on the random effects, is a product of several components: $p(Y_i \mid \beta, a_i, b_i, \sigma_e^2)$ is the probability distribution corresponding to the trajectory model, $p(D_i \mid \alpha, \beta, \phi)$ denotes the logistic distribution corresponding to the disease model, while the rest deals with the distributional structures on the spline coefficients and random effects.

Since the trajectory model (2) has a normal distributional structure while the disease model (2) has a logistic structure, the likelihood function and hence the posterior have a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well-known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 3.3.
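The mixture-of-normals device can be illustrated by simulation. The sketch below draws a Student-t latent variable as a scale mixture of normals (a gamma-distributed precision followed by a normal draw) and compares the implied disease probability with the logistic one. The choice of 8 degrees of freedom follows the Student-t link commonly paired with this algorithm; the variance-matching rescaling of the linear predictor, the seed and the sample size are assumptions of this sketch, not values taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 8          # degrees of freedom of the Student-t link
n = 200_000     # Monte Carlo sample size (illustrative)


def draw_latent(location, size):
    """Draw Z ~ t_nu(location, 1) as a scale mixture of normals:
    lam ~ Gamma(shape = nu/2, rate = nu/2), then Z | lam ~ N(location, 1/lam)."""
    lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=size)
    return location + rng.standard_normal(size) / np.sqrt(lam)


# The mixture reproduces the t variance nu / (nu - 2).
z = draw_latent(0.0, n)
print(z.var(), nu / (nu - 2.0))

# Assumption: rescale the linear predictor so that the t_nu variance matches
# the logistic variance pi^2 / 3, then compare P(Z > 0) with the logistic CDF.
scale = np.sqrt((nu / (nu - 2.0)) / (np.pi ** 2 / 3.0))
eta = 1.0
p_t = np.mean(draw_latent(scale * eta, n) > 0.0)
p_logit = 1.0 / (1.0 + np.exp(-eta))
print(p_t, p_logit)
```

Conditional on the mixing variables, the latent responses are normal, which is what makes the full conditionals of the regression-type parameters tractable in a Gibbs sampler.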


Likelihood Approximation

Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture of normal regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.

As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that $D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let the $Z_i$ be independently distributed from a t distribution with location $H_i = \alpha + \beta' M_i \phi + b_i' Q_i \phi$, scale parameter 1 and $\nu$ degrees of freedom. Equivalently, with the introduction of the additional random variable $\lambda_i$, the distribution of $Z_i$ can be expressed as a scale mixture of normal distributions,

$$Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \qquad \lambda_i \sim \mathrm{Gamma}(\nu/2, \nu/2).$$

26) as


(2). Since the marginal posterior distribution of the parameters is analytically intractable, we construct an MCMC algorithm to sample from their full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin 1992).

1.2), Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model is equivalent to that based on a retrospective one. As a result, a prospective modeling framework can be used to analyze case-control data, which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section 2.2.2) to analyze the PSA dataset.

Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations is available. We use this exposure trajectory or exposure profile in analyzing the present

PAGE 43

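Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e., $\{X_i(t), -c \le t \le 0\}$ can be assumed to be a discrete random variable with support $\{Z_1(t), \ldots, Z_J(t), -c \le t \le 0\}$, the set of all observable exposure trajectories, where $\{Z_j(t), -c \le t \le 0, j = 1, \ldots, J\}$ is a finite collection of elements in the support of the $X_{ij}$'s. Let $Y_{0j}$ and $Y_{1j}$ be the number of controls and cases having exposure profile $\{Z_j(t), -c \le t \le 0\}$. We denote the null or baseline trajectory as $\{X(t) = 0, -c \le t \le 0\}$.

The odds ratio of disease corresponding to $\{Z_j(t), -c \le t \le 0\}$ with respect to baseline exposure is $\exp\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\}$, where $\gamma(t)$ denotes the influence function. Assuming that a control has exposure profile $\{Z_j(t), -c \le t \le 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that $P(X(t) = Z_j(t), -c \le t \le 0 \mid D = 1)$ is proportional to $\delta_j \exp\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\}$.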

PAGE 44

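since the influence function is linear in its coefficients by ( 2 ). We assume $\delta_1 = 1$ for identifiability. Here $d = 0$ and $1$ stand for controls and cases respectively. Assuming a separate parameter for the baseline odds of disease, the prospective likelihood is then specified.

Based on the above setup, we have the following equivalence results:

(i) The function of the parameter of interest obtained by maximizing the retrospective likelihood in ( 2 ) with respect to the exposure-profile weights is the same as that obtained by maximizing the prospective likelihood in ( 2 ) with respect to the baseline odds.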

PAGE 45

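(ii) Assuming $\pi = (\pi_1, \ldots, \pi_J)$ and $\pi_j = \delta_j / \sum_{k=1}^{J} \delta_k$, the posterior density of the model parameters is available in closed form up to a normalizing constant.

(iii) The marginal posterior densities of the parameter of interest obtainable from the retrospective and the prospective joint posteriors are the same.

The proofs of the above theorem are similar in nature to those in Seaman and Richardson (2004) and are given in Appendix A. Since we have considered a near-uniform prior for the baseline odds, and our prior on the exposure-profile weights ensures the existence and finiteness of the required expectation, the conditions of Theorem 2 are essentially satisfied for our framework.

Based on the above results, it can be concluded that the marginal posterior distribution of the parameter of interest will be the same regardless of whether we fit a prospective or a retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general case of a multi-category case-control setup, i.e., when there are multiple (> 2) disease states. We have the following result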

PAGE 46

The proof of the above theorem is given in Appendix A.

2.5.1 Posterior Predictive Loss

Gelfand and Ghosh (1998). This criterion is based on the idea that an optimal model should provide accurate prediction of a replicate of the observed data.

PAGE 47

Gelfand and Ghosh (1998) obtained this criterion by minimizing the posterior loss for a given model and then, for all models under consideration, selecting the one which minimizes this criterion. For a general loss function, this criterion can be expressed as a linear combination of two distinct parts, i.e., a goodness-of-fit part and a penalty part. For our framework, the posterior predictive loss can be written as

$$\mathrm{PPL} = \frac{k}{k+1} \sum_{i=1}^{N} (\hat{D}_i - D_i)^2 + \sum_{i=1}^{N} \mathrm{Var}(\hat{D}_i) \qquad (2)$$

where $\hat{D}_i = E(D_i^{rep} \mid y, D)$ and $\mathrm{Var}(\hat{D}_i) = \mathrm{Var}(D_i^{rep} \mid y, D) = E(D_i^{rep} \mid y, D) - (E(D_i^{rep} \mid y, D))^2$. For our framework, $D^{rep} = (D_1^{rep}, \ldots, D_N^{rep})$ is the replicated disease status vector for all the subjects. It is straightforward to calculate the expected value of the above criterion using the posterior samples obtained from the Gibbs sampler. Lower values of this criterion imply a better model fit. We assume $k = 1$ and obtain the values of posterior predictive loss for different lengths of exposure trajectories and different numbers of knots. The results are given in Table 2-3 and explained in Section 2.6. For the optimal model selected using the posterior predictive loss criterion, model assessment was performed using Kappa measures of agreement and case deletion diagnostics. The methodology is described below.

The kappa statistic (Agresti 2002) compares agreement against that which might be expected by chance. The value of $\kappa$ ranges from $-1$ to $1$; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies complete disagreement. A value of 0 indicates no agreement above and beyond that expected by chance.
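The two model-checking quantities described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's code: the goodness-of-fit term uses the squared-error form of the Gelfand and Ghosh criterion, the replicated statuses are assumed to be stored as one row per posterior draw, and the function names are hypothetical.

```python
import numpy as np

def posterior_predictive_loss(d_obs, d_rep, k=1.0):
    """Posterior predictive loss for binary outcomes.

    d_obs: length-N array of observed 0/1 disease statuses.
    d_rep: (S, N) array of replicated statuses, one row per posterior draw.
    """
    d_obs = np.asarray(d_obs, dtype=float)
    d_rep = np.asarray(d_rep, dtype=float)
    d_hat = d_rep.mean(axis=0)          # E(D_rep_i | data)
    var_hat = d_hat * (1.0 - d_hat)     # E(D_rep) - E(D_rep)^2 for a 0/1 variable
    goodness = np.sum((d_hat - d_obs) ** 2)
    penalty = np.sum(var_hat)
    return k / (k + 1.0) * goodness + penalty

def cohen_kappa(d_obs, d_pred):
    """Chance-corrected agreement between two 0/1 vectors.

    Undefined (division by zero) when chance agreement equals 1,
    e.g. when both vectors contain a single class only.
    """
    d_obs = np.asarray(d_obs)
    d_pred = np.asarray(d_pred)
    p_observed = np.mean(d_obs == d_pred)
    p_chance = (np.mean(d_obs == 1) * np.mean(d_pred == 1)
                + np.mean(d_obs == 0) * np.mean(d_pred == 0))
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy illustration with made-up predicted probabilities.
d_obs = np.array([1, 1, 0, 0])
p_hat = np.array([0.9, 0.6, 0.4, 0.2])
d_pred = (p_hat > 0.5).astype(int)      # classify as a case when p_hat > 0.5
print(cohen_kappa(d_obs, d_pred))
```

In a Gibbs run, `cohen_kappa` would be evaluated once per iteration on that iteration's predicted statuses, which yields a posterior sample of kappa rather than a single value.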

PAGE 48

The observed disease status (vis-a-vis case or control status) of a subject is obtained from the dataset, while the predicted disease status is calculated from the posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can calculate the quantity $\hat{p}_i^{(n)} = \hat{P}^{(n)}(D_i = 1 \mid X_i(t + a_{di}), t \in [-c, 0]) = L^{(n)}(H_i)$, where $H_i$ is the linear predictor of the disease model and $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign $\hat{D}_i^{(n)} = 1$ if $\hat{p}_i^{(n)} > 0.5$ and $\hat{D}_i^{(n)} = 0$ if $\hat{p}_i^{(n)} \le 0.5$.

Case deletion diagnostics (Hampel et al. 1987) can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in

PAGE 49

Let $H_i$ denote the linear predictor of the disease model and $S_{ij}$ the linear predictor of the trajectory model, i.e., the population-level spline term plus the subject-specific spline term in $b_i$, both evaluated at $a_{ij}$. Let $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ be the density function corresponding to the trajectory model, and $L(D_i \mid H_i)$ the one for the disease model. We worked with the following three types of weighting schemes based on those proposed by Bradlow and Zaslavsky (1997).

Here $n$ denotes the nth iteration of the Gibbs sampler, the subscript $i$ denotes the deletion of $y_i$ and the superscript denotes unnormalized weights. In the last weighting scheme, $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ and $L(D_i \mid H_i)$ are the usual likelihoods with the population-level parameters (including $\sigma_e^2$) replaced by the full data posterior medians. Here the full data posterior is the posterior distribution obtained from the complete dataset, i.e., the one having all the subjects.

We use the methodology of Section 2.2 to analyze the prostate cancer dataset described in Section 2.1.2. Multiple observations on free and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some subjects, observations were collected as far as 10 years prior to diagnosis. We use the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal

PAGE 50

For the purpose of our analysis, we have used a linear p-spline ($p = 1$) with a subject-specific slope parameter to model the exposure trajectory.

For the prospective disease model ( 2 ), we considered two specific scenarios, viz. constant influence, $\gamma(t + a_{di}) = \gamma_0$, and linear influence, $\gamma(t + a_{di}) = \gamma_0 + \gamma_1 (t + a_{di})$. The results for these two cases are summarized below.

On fitting the above model, we observed that for all trajectory lengths, $\gamma_0$ is significant (its 95% credible interval does not contain 0). For any particular interval (i.e., choice of $c$), the posterior means and 95% credible intervals of $\gamma_0$ do not change much with the number of knots ($K$). In addition, $\gamma_0$ increases as the trajectory length decreases, i.e., as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process, but it also seems to support the well-known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al. 1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval $I = (-10, -5)$ and 3 knots in the trajectory. The posterior mean of $\gamma_0$ is 0.298

PAGE 51

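2.6.3

Parameterizing $\{Z_i(t + a_{di}), -c \le t \le 0\}$ as the sum of a population-level spline term and a subject-specific spline term, as in ( 2 ), we can rewrite ( 2 ) as a product of two exponential factors: one in which the population-level spline basis is integrated against the basis of the influence function over $-c \le t \le 0$, and one in which the subject-specific basis (with coefficients $b_i$) is integrated against it in the same way.

For a vertical shift $m$ of the exposure trajectory under the linear influence model, the odds ratio is

$$\exp\left\{ m \int_{-c}^{0} \gamma(t + a_{di})\,dt \right\} = \exp\left\{ c\,m \left( \gamma_0 + (a_{di} - c/2)\,\gamma_1 \right) \right\}. \qquad (2)$$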
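The closed form for the shifted-trajectory odds ratio can be verified numerically. The sketch below assumes the linear influence function $\gamma(s) = \gamma_0 + \gamma_1 s$ evaluated at $s = t + a_{di}$; the argument names `g0`, `g1`, `age_dx` and the numerical values are hypothetical placeholders, not estimates from the dissertation.

```python
import numpy as np

def log_or_closed_form(m, c, age_dx, g0, g1):
    """Log odds ratio for a vertical shift m over the window [-c, 0]:
    c * m * (g0 + (age_dx - c/2) * g1)."""
    return c * m * (g0 + (age_dx - c / 2.0) * g1)

def log_or_numeric(m, c, age_dx, g0, g1, n_grid=2001):
    """Same quantity by integrating the linear influence function on a grid."""
    t = np.linspace(-c, 0.0, n_grid)
    influence = g0 + g1 * (t + age_dx)
    dt = t[1] - t[0]
    # Trapezoid rule written out by hand; it is exact for a linear integrand.
    integral = dt * (influence.sum() - 0.5 * (influence[0] + influence[-1]))
    return m * integral

# Hypothetical inputs chosen only to exercise the two functions.
args = dict(m=0.5, c=5.0, age_dx=60.0, g0=0.3, g1=-0.002)
print(np.exp(log_or_closed_form(**args)), np.exp(log_or_numeric(**args)))
```

Applying the closed form to each posterior draw of the two influence coefficients, and then summarizing the exponentiated values, gives posterior means and credible intervals of the odds ratio of the kind reported in the tables.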

PAGE 52

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and ages at diagnosis when $m = 0.5$. For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al. 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e., as past exposure observations are taken into account. However, the rate of increase is higher for lower age at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios decrease as longer exposure trajectories are considered. This may imply that for a subject with very high age at diagnosis, his/her past exposure observations may not contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age    (-3,0)    (-5,0)    (-8,0)    (-10,0)

As before, we fitted the disease model on the interval $I = (-10, -5)$. The posterior means and 95% credible intervals of $\gamma_0$ and $\gamma_1$ are respectively 1.24 (0.29, 2.19) and -0.015 (-0.029, 0.003), implying that exposure observations recorded 5-10 years prior to diagnosis also have a significant effect on the current disease status. The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.

PAGE 53

Table 2-2. Posterior means and 95% confidence intervals of odds ratio for I = (-10,-5) for the linear influence model

Age at Diagnosis    50              60             70             80
Mean                4.99            3.27           2.22           1.56
95% C.I             (1.96,10.41)    (1.91,5.36)    (1.67,2.98)    (1.10,2.29)

The PPL values (Table 2-3) for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across all trajectory lengths. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. Further increases in the number of knots tend to worsen the fit; this agrees with the findings of Ruppert (2002). The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. The best fitting constant influence model seems to be the one with exposure trajectory (-10, 0) and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures results in an improvement of model fit. This may be indicative of the fact that past exposure observations contain a significant amount of information about the current disease status. In addition, for the trajectory interval $I = (-10, -5)$, the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions

PAGE 54

Table 2-3. Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots

Knots    Model    (-2,0)    (-5,0)    (-8,0)    (-10,0)    (-10,-5)

reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (it has the lowest PPL criterion among all the models considered).

For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis. We deleted each subject (with all the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case-deleted posterior means and 95% credible intervals for three of the model parameters. (In

PAGE 55

Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case-deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences.) Surprisingly, the observation for case number 108 departs significantly from the rest. On analyzing this subject, it was found that it had the unique combination of very high age and very high values of PSA. In fact, it had the highest mean age in the sample, the highest age at diagnosis and the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease.

We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components. None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory $I = (-8, 0)$ and 4 knots seems to fit the observed data relatively well.

PAGE 56

Figure 2-2. Sensitivity of the three parameter estimates and the disease probability estimates to case deletions.

PAGE 57

In this work, we have applied semiparametric regression techniques in analyzing longitudinal case-control studies. We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e., when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a p-spline to account for any time-varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way, which allowed us to consider differential lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the logit-mixture of normal approximation (Albert and Chib 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the dataset was collected retrospectively. The analysis has been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case-control

PAGE 58

We analyzed our model using differential lengths of exposure trajectories. In doing so, we have concluded that past exposure observations do provide significant information towards predicting the current disease status of a subject. Specifically, we have shown that across all age-at-diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones. We also observed that for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh 1998). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case deletion diagnostics. Both these tools suggested that our model fits the data relatively well.

Some interesting extensions can be made to our setup. For richer datasets, it will be interesting to model the subject-specific deviation functions as p-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger dataset, a p-spline formulation can also be used for the influence function, which may bring out any underlying non-linear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multi-category disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

PAGE 59


PAGE 60

The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income Recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS).

Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999. This is because the most recent census estimates correspond to the year 1999 and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 uses as covariates the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the

PAGE 61

(Bell 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, both in the 1989 and 1999 models.

Fay (1987) in this regard. Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time-specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four- and five-person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates. In general, the time series model always performed better than its non-time-series counterpart.

PAGE 62

Opsomer et al. (2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines. In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have viewed the state-specific annual household median income values as longitudinal profiles or income trajectories. This gained more ground because we used the statewide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying non-linear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx 1996), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is

PAGE 63

Figure 3-1. Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).

PAGE 64

(Gelfand and Ghosh 1998) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on. We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the gold standard since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state level models. The fact that our estimates still improve on the SAIPE model based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data

PAGE 65

Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. (Panel B: IRS median income plot.)

analysis with regard to the median household income dataset. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness-of-fit of our models. We end with a discussion in Section 3.6. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2.1 General Notation

PAGE 66

where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate relationship.

We approximate $f(x_{ij})$ using a P-spline and rewrite ( 3 ) in terms of $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$, which is our target of inference.

Here $X_{ij} = (1, x_{ij}, \ldots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \ldots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \ldots, \beta_p)'$ is the vector of regression coefficients while $\gamma = (\gamma_1, \ldots, \gamma_K)'$ is the vector of spline coefficients. The above spline model with degree $p$ can adequately approximate any unspecified smooth function. Typically, linear ($p = 1$) or quadratic ($p = 2$) splines serve most practical purposes since they ensure adequate smoothness in the fitted curve. $m$ and $t$ respectively denote the number of small areas and the number of time points at which the response and covariates are measured. Thus, in our case, $m = 51$, for all the 50 states of the U.S. and the District of Columbia, and $t = 5$ for the years 1995-1999. $b_i$ is a state-specific random effect while $u_{ij}$ represents an interaction effect between the ith state and the jth year. We assume $b_i \sim$ i.i.d. $N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls the amount of smoothing of the underlying income trajectory. Moreover, it is assumed

PAGE 67

3.1.1. In the datasets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

( 3 ) and modeled it as a random walk, where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Before proceeding to the next section, we may note that unlike the models of Ghosh et al. (1996), the models given in ( 3 ) and ( 3 ) incorporate state-specific random effects ($b_i$). This rectifies a limitation of the former as pointed out in Rao (2003).
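The design matrices of the truncated polynomial P-spline described above can be sketched as follows. This is a minimal illustration, assuming knots at equally spaced sample quantiles of the covariate and a spline degree of at least 1; the function names are hypothetical and the random covariate values merely stand in for the real data.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Place n_knots knots at equally spaced sample quantiles of x."""
    probs = np.arange(1, n_knots + 1) / (n_knots + 1.0)
    return np.quantile(x, probs)

def truncated_power_design(x, knots, degree=1):
    """Return (X, Z) for a truncated polynomial spline of the given degree (>= 1).

    X has columns 1, x, ..., x^degree (fixed-effects part).
    Z has one column (x - knot)_+^degree per knot (penalised spline part).
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, N=degree + 1, increasing=True)
    Z = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    return X, Z

# Toy usage: 51 areas x 5 years = 255 covariate values (simulated placeholders).
rng = np.random.default_rng(0)
x = rng.uniform(30_000, 60_000, size=255)
knots = quantile_knots(x, n_knots=5)
X, Z = truncated_power_design(x, knots, degree=1)
print(X.shape, Z.shape)
```

Treating the coefficients of `Z` as normal random effects with a common variance is what turns the spline fit into a mixed model, with that variance acting as the smoothing parameter.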

PAGE 68

3.3.1 Likelihood Function

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal distributions with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.

For the random walk model, the parameter space for the ith state additionally contains $v = (v_1, \ldots, v_t)$, the vector of time-specific random effects, and its variance $\sigma_v^2$. Thus, the likelihood function for the ith state will have an extra component corresponding to $v$, in which $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal distribution with mean $v_{j-1}$ and variance $\sigma_v^2$, where $v_0 = 0$.
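The random walk component for the time-specific effects can be illustrated with a short sketch. It assumes only what is stated above (normal increments with a common variance and a starting value of zero); the function names are hypothetical.

```python
import numpy as np

def simulate_time_effects(t, sigma_v, seed=0):
    """Simulate v_1, ..., v_t with v_j | v_{j-1} ~ N(v_{j-1}, sigma_v^2), v_0 = 0."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, sigma_v, size=t)
    return np.cumsum(steps)  # cumulative sum of independent normal increments

def random_walk_logpdf(v, sigma_v):
    """Joint log density of (v_1, ..., v_t) under the random walk with v_0 = 0."""
    v = np.asarray(v, dtype=float)
    increments = np.diff(np.concatenate(([0.0], v)))
    n = increments.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma_v ** 2)
            - np.sum(increments ** 2) / (2.0 * sigma_v ** 2))

v = simulate_time_effects(t=5, sigma_v=0.3)
print(v, random_walk_logpdf(v, sigma_v=0.3))
```

The log density above is the extra factor that the random walk model contributes to the likelihood, one normal term per year.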

PAGE 69

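Thus, we have the following priors: $\beta \sim \mathrm{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$, $(\sigma_b^2)^{-1} \sim G(c, d)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the expression $f(x) \propto x^{a-1} \exp(-bx)$, $x \ge 0$. Since we have chosen an improper prior for $\beta$, propriety of the full posterior has to be shown. We have the following theorem

For the random walk model, there will be an additional term for the prior on $\sigma_v^2$. By the conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2]\,[\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_\gamma^2]\,[\sigma_b^2] \prod_{j=1}^{t} [\psi_j^2]$$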

PAGE 70

Gelman and Rubin (1992) and run $n$ ($\ge 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

3.2.2 to analyze the median household income dataset referred to in Section 3.1.3. The response variable $Y_{ij}$ and the covariates $X_{ij}$ respectively denote the CPS median household income estimate and the corresponding IRS mean (or median) income estimate for the ith state at the jth year ($i = 1, \ldots, 51$; $j = 1, \ldots, 5$). The state-specific mean or median income figures are obtained from IRS tax return data. The Census Bureau gets files of individual tax return data from the IRS for use in specifically approved projects such as SAIPE. For each state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI) across all the tax returns in that state. Like other SAIPE model covariates obtained from administrative records data, these variables do not exactly measure the median income across all households in the state. One of the reasons for this is that the AGI would not necessarily be the same as the exact income figure, and the tax return universe does not cover the entire population, i.e., some households do not need to file tax returns, and those that do not are likely to differ with regard to income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not necessarily to be the same thing. Specifically for this study, we have used IRS mean income as our covariate. This is because it seems to possess

PAGE 71

3-2A ), and so it is more suited to a semiparametric analysis.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 3.2.2. We have used a truncated polynomial basis for the P-spline component in both the models. Since Figure 3-2A does not indicate a high degree of non-linearity, we have restricted

PAGE 72

(Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and with a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots, and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.
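The convergence check mentioned above can be sketched for a single scalar parameter. The code below computes the basic potential scale reduction factor from several post-burn-in chains; it is an illustration of the general Gelman-Rubin idea rather than the exact variant used in the dissertation, and the function name is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: (m, n) array holding m parallel chains of n post-burn-in draws each.
    Values close to 1 suggest the chains have mixed.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)          # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()     # within-chain variance W
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return np.sqrt(pooled / within)

# Toy check: three well-mixed chains versus three chains stuck at different levels.
rng = np.random.default_rng(1)
mixed = rng.standard_normal((3, 5000))
stuck = mixed + np.array([[0.0], [3.0], [6.0]])
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

In practice the function would be applied to each monitored parameter separately, alongside the trace and autocorrelation plots.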

PAGE 73

We fitted Model I (SPM) with all possible knot choices from 0 to 40, but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all four comparison measures. Addition of more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model based estimates were slightly superior to the SPM estimates.

Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory and the random walk component does not lead to any further improvement. Last but not least, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots.

It is of interest that the 95% CIs for the spline coefficients of the first, fourth and fifth knots do not contain 0, indicating the significance of these knots. This is indicative of the relevance of knots in the penalized spline fit on the CPS median income observations. The same is true for the coefficients of SPM.
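The four comparison measures used to rank the estimates can be sketched as below. The formulas are an assumption: they follow the definitions customary in this literature (average relative bias, average squared relative bias, average absolute bias and average squared deviation of the estimates from the census values), since the list of definitions is not legible in this section. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def comparison_measures(census, estimate):
    """Assumed definitions of ARB, ASRB, AAB and ASD, averaged over areas."""
    c = np.asarray(census, dtype=float)
    e = np.asarray(estimate, dtype=float)
    diff = c - e
    return {
        "ARB": np.mean(np.abs(diff) / c),     # average relative bias
        "ASRB": np.mean(diff ** 2 / c ** 2),  # average squared relative bias
        "AAB": np.mean(np.abs(diff)),         # average absolute bias
        "ASD": np.mean(diff ** 2),            # average squared deviation
    }

# Toy illustration with two areas (made-up values, not census data).
print(comparison_measures(census=[100.0, 200.0], estimate=[110.0, 180.0]))
```

Lower values of all four measures indicate estimates closer to the census figures, which is how the models are ranked in the tables.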

PAGE 74

Table 3-1. Parameter estimates of SPRWM with 5 knots

Parameter    Mean    Median    95% CI

3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the fact that the semiparametric models have room for improvement. In this section, we will look for any possible deficiencies in our models and will try to come up with some improvements, if there are any.

As mentioned in Section 3.4.1, selection and proper positioning of knots plays a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.

Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which is a density-dependent measure. Thus, in both figures, the coverage area of knots (i.e., the part of the observational pattern which is captured by the knots) is the

PAGE 75

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the boldfaced triangles at the bottom. (Panel B: Positioning of 7 Knots.)

region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low density area of the plot, i.e., the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured.

As a natural solution to this issue, we decided to place half of the knots in the low density region of the graph and the other half in the high density region. The exact boundary line between the high density and low density regions is hard to determine. We tested different alternatives and came up with IRS mean = 47000 as a tentative boundary because it gave the best results. In both regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots.

It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot

PAGE 76

Figure 3-4. Positions of 5 knots after realignment. The knots are the boldfaced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.

rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e., beyond IRS mean 56000), possibly due to the very low density of the observations in that area. Overall, it seems that the new knots can capture some of the underlying non-linear pattern in the dataset, which the old knots failed to do. We also experimented by placing all the knots in the low density region (beyond IRS mean = 47000) but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.

We have worked with 5 knots because it performed consistently well for both the SPM and SPRW models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports

PAGE 77

3-3 depicts the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5) and SPRWM(5) respectively denote the semiparametric models with the realigned 5 knots.

Table 3-2. Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment

Estimate    ARB       ASRB      AAB         ASD
CPS         0.0415    0.0027    1,753.33    5,300,023
SAIPE       0.0326    0.0015    1,423.75    3,134,906
SPM(5)      0.0280    0.0012    1,173.71    2,334,379
SPRWM(5)    0.0295    0.0013    1,256.08    2,747,010

Table 3-3. Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates

Estimate    Model       ARB       ASRB      AAB       ASD
SAIPE       SPM(5)      14.11%    20.00%    17.56%    25.54%
SAIPE       SPRWM(5)    9.51%     13.33%    11.78%    12.37%
CPS         SPM(5)      32.53%    55.55%    33.06%    55.96%
CPS         SPRWM(5)    28.92%    51.85%    28.36%    48.17%

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that the semiparametric model estimates perform better than the SAIPE estimates with the realigned knots. This improvement is apparently due to the additional coverage of the observational pattern that is being achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment

PAGE 78

3-4 and Table 3-5 report the posterior mean, median and 95% CI for the parameters in SPM(5) and SPRWM(5) respectively.

Table 3-4. Parameter estimates of SPM(5)

Table 3-5. Parameter estimates of SPRWM(5)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e., the spline coefficients) are significant for both SPM and SPRWM. For the old configuration, some of the knot coefficients were not significant for the models. This corroborates the fact that, with the knot realignment, all five knots are significantly contributing to the curve fitting process in terms of capturing the true underlying non-linear pattern in the observations.

Ghosh et al. (1996), henceforth referred to as the GNK model. Their univariate model is as follows, where $(b_j \mid b_{j-1}) \sim N(0, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.

where $b_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distribution as above. Clearly, the only difference between ( 3 ) and ( 3 ) is that the former contains a time specific random component while the latter contains an area specific random component.

Ghosh et al. (1996) showed that the estimates from the bivariate version of the GNK model ( 3 ) perform much better than the Census Bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 depicts the comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0415   0.0027   1,753.33   5,300,023
SAIPE       0.0326   0.0015   1,423.75   3,134,906
GNK         0.0397   0.0025   1,709.58   5,229,869
SPM(0)      0.0337   0.0017   1,408.7    3,137,978
SPM(5)      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)    0.0295   0.0013   1,256.08   2,747,010

It is clear that, although the estimates from the GNK model perform slightly better than the CPS, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area specific random effect is much more critical than a time specific random component in this situation.
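The four comparison measures in Table 3-6 are named here only by their abbreviations. As a minimal sketch, assuming the definitions commonly used in this literature (average relative bias, average squared relative bias, average absolute bias and average squared deviation of the estimates from the census values), they could be computed as below. The function name and the three-area data are hypothetical and are not taken from the thesis.

```python
def comparison_measures(census, estimates):
    """Return (ARB, ASRB, AAB, ASD) for paired census values and estimates.

    Assumed definitions, averaging over the m small areas:
      ARB  = mean(|c_i - e_i| / c_i)
      ASRB = mean((c_i - e_i)^2 / c_i^2)
      AAB  = mean(|c_i - e_i|)
      ASD  = mean((c_i - e_i)^2)
    """
    if len(census) != len(estimates) or not census:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(census)
    diffs = [c - e for c, e in zip(census, estimates)]
    arb = sum(abs(d) / c for d, c in zip(diffs, census)) / m
    asrb = sum((d / c) ** 2 for d, c in zip(diffs, census)) / m
    aab = sum(abs(d) for d in diffs) / m
    asd = sum(d ** 2 for d in diffs) / m
    return arb, asrb, aab, asd


# Hypothetical incomes for three areas (not the thesis data).
census = [40000.0, 50000.0, 60000.0]
estimates = [42000.0, 49000.0, 60000.0]
arb, asrb, aab, asd = comparison_measures(census, estimates)
```

Under these assumed definitions, lower values of all four measures indicate estimates that are closer to the census benchmark, which is how the tables in this chapter are read.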

Johnson (2004). This is essentially an extension of the classical chi-square goodness-of-fit test where the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained which can be used for constructing global goodness-of-fit diagnostics.

To construct this statistic, we form 10 equally spaced bins $((k-1)/10, k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde{\theta})$ to be random, where $\tilde{\theta}$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e. $Y_{ij}$ would be allocated to the $k$th bin if $F(Y_{ij} \mid \tilde{\theta}) \in ((k-1)/10, k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

$$R^B(\tilde{\theta}) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde{\theta}) - n p_k}{\sqrt{n p_k}} \right]^2 .$$

The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between them. Thus, instead of taking all the observations (i.e. the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5 while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values,

Figure 3-5. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models (panel B: semiparametric RW model). The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom.
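The binning step behind $R^B$ can be sketched for a single posterior draw as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the input is taken to be the list of conditional CDF values $F(Y_{ij} \mid \tilde{\theta})$ already evaluated for that draw, and the bins are equal-width with $p_k = 1/10$ as in the text.

```python
def bayesian_chi_square(cdf_values, n_bins=10):
    """Compute R^B for one posterior draw from the values F(y | theta~)."""
    n = len(cdf_values)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = [0] * n_bins
    for u in cdf_values:
        if not 0.0 <= u <= 1.0:
            raise ValueError("CDF values must lie in [0, 1]")
        # Index k covers [k / n_bins, (k + 1) / n_bins); u == 1.0 goes to the last bin.
        k = min(int(u * n_bins), n_bins - 1)
        counts[k] += 1
    expected = n / n_bins  # n * p_k with p_k = 1 / n_bins
    # Sum of squared standardized deviations of the bin counts.
    return sum((m_k - expected) ** 2 / expected for m_k in counts)


# Ten CDF values spread evenly over the bins give no discrepancy at all.
even = [0.05 + 0.1 * k for k in range(10)]
r_even = bayesian_chi_square(even)

# Ten CDF values crowded into the first bin give a large discrepancy.
r_crowded = bayesian_chi_square([0.01] * 10)
```

In a Gibbs sampler this function would be called once per retained iteration, and the resulting collection of values compared with a chi-square distribution with 9 degrees of freedom, as in the quantile-quantile plots described above.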

(Fay and Herriot 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an income trajectory using penalized splines (Eilers and Marx 1996). We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. As our covariate, we have used the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. The analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures.

Our analysis has shown that information on past median income levels of different states does provide strength towards the estimation of state specific median incomes for the current period. In fact, if there is an underlying non-linear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use it in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting.

The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the dataset. Some obvious extensions are as follows:

(1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible

Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2} x$, and a non-linear one, $\sum_{k=1}^{K} w_{ik} (x - \kappa_k)_+$, thus allowing for more flexibility. Both these components are random with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly relevant in situations where the state-specific income trajectories are quite distinct from the population mean curve and thus need to be modeled explicitly. We plan to pursue this extension if we can procure a richer dataset with longer state specific income trajectories.

(2) Sometimes the function to be estimated (here the median income pattern) may have varying degrees of smoothness in different regions. In that case, a single smoothing parameter may not be adequate and a spatially adaptive smoothing procedure can be used (Ruppert and Carroll 2000).

(3) We used the truncated polynomial basis function to model the income trajectory but other types of bases like B-splines, radial basis functions etc. can also be used.

(4) Although we used a parametric normal distributional assumption for the random state and time specific effects, a broader class of distributions like the mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) may be tested.

Last but not least, we think that the semiparametric modeling approach holds a lot of promise for small domain problems, especially when observations for each domain are collected over time. The associated class of semiparametric models can well be an attractive alternative to the models generally employed by the U.S. Census Bureau.
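The state-specific deviation curve in extension (1), a linear part plus truncated linear basis terms, can be sketched as below. This is an illustrative evaluation only: the function names, the knot locations and the coefficient values are hypothetical, and in the proposed model the coefficients would be random effects sampled within the hierarchical Bayesian scheme rather than fixed numbers.

```python
def truncated_linear_basis(x, knots):
    """Return the truncated linear terms (x - knot)_+ for each knot."""
    return [max(x - k, 0.0) for k in knots]


def state_deviation(x, b_i1, b_i2, w_i, knots):
    """Evaluate g_i(x) = b_i1 + b_i2 * x + sum_k w_ik * (x - knot_k)_+."""
    if len(w_i) != len(knots):
        raise ValueError("one coefficient per knot is required")
    basis = truncated_linear_basis(x, knots)
    return b_i1 + b_i2 * x + sum(w * z for w, z in zip(w_i, basis))


# Hypothetical knots and coefficients for one state.
knots = [1.0, 2.0, 3.0]
value = state_deviation(2.5, b_i1=0.5, b_i2=1.0, w_i=[0.2, -0.4, 0.1], knots=knots)

# To the left of every knot only the linear part contributes.
left_value = state_deviation(0.5, b_i1=0.5, b_i2=1.0, w_i=[0.2, -0.4, 0.1], knots=knots)
```

Each knot changes the slope of the curve from that point onward, which is what lets the state-specific trajectories depart from the parallel shape implied by a random intercept alone.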

The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low income families. Since income characteristics for small areas are generally collected over time, there may well be a time varying pattern in those observations. Neglecting those patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median income of four-person families for the different states of the U.S. while explicitly accommodating the time varying pattern in the observations.

In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e. 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce. Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993).

The Census Bureau based their estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, they used median income observations for three and five person families in addition to those of four person families. The basic dataset for each state was a bivariate random vector with one component being the CPS median income estimate of four person families and the other component being the weighted average of CPS median incomes of three and five person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year

$$\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)$$

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay (1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure. They also performed univariate and multivariate hierarchical Bayesian (HB) analyses of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard errors and coefficients of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median income

Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines (Eilers and Marx 1996). In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. They applied their model to a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have treated the state specific median income observations as longitudinal profiles or income trajectories. As with any longitudinally varying observations, the income profiles (both state-specific and overall) may have a non-linear pattern over time. Moreover, the successive income observations may be unbalanced in nature. These features motivated us to use a semiparametric regression approach in our modeling framework. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and number of knots, although other types of basis functions like B-splines or thin plate splines can also be used. As covariates, we have used the adjusted census median incomes since these were found to be the most effective covariate by Ghosh et al. (1996).

We tested four different regression models, viz.:

(1) A univariate model with only the CPS median income of four-person families as the response variable;
(2) A bivariate model with the CPS median incomes of three and four person families as the response variables;
(3) A bivariate model with the CPS median incomes of four and five person families as the response variables; and lastly
(4) A bivariate model with the CPS median income of four person families and the weighted average of the CPS median incomes of three and five person families (with weights 0.75 and 0.25) as the response variables.

In all the cases, our primary objective has been the estimation of median incomes of four-person families of all the 50 U.S. states and the District of Columbia for 1989. For each of these models, analysis has been carried out using a hierarchical Bayesian approach. Since we chose non-informative improper priors for the regression parameters, propriety of the posterior has been rigorously proved before proceeding with the computations (see Theorem 3 in

(Gelfand and Smith 1990) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states. Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time.

The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some references towards future work in Section 4.5. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions for our models.

4.2.1 Notation

3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2.

This is the most general structure since the degrees of the spline as well as the number and position of the knots are different for the two models. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\kappa_{k1} = \kappa_{k2}$, $k = 1, 2, \ldots, K_1 (= K_2)$.

Equation ( 4 ) can be rewritten as

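( 4 ) as follows

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{\text{i.i.d.}}{\sim} N(0, \Sigma_v)$.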
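The random walk assumption on the time specific effects can be illustrated by simulating a bivariate path. This is a minimal sketch: the function names, the seed and the covariance matrix are hypothetical, and the covariance matrix is assumed to be positive definite. In the actual analysis the $v_j$ are sampled from their full conditionals rather than simulated forward.

```python
import random


def cholesky_2x2(cov):
    """Lower-triangular factor L with L L' = cov for a 2x2 positive definite matrix."""
    (a, b), (_, c) = cov
    l11 = a ** 0.5
    l21 = b / l11
    l22 = (c - l21 ** 2) ** 0.5
    return l11, l21, l22


def draw_innovations(t, cov, rng):
    """Draw w_1, ..., w_t i.i.d. from a bivariate N(0, cov)."""
    l11, l21, l22 = cholesky_2x2(cov)
    draws = []
    for _ in range(t):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Transform independent standard normals into correlated innovations.
        draws.append((l11 * z1, l21 * z1 + l22 * z2))
    return draws


def random_walk(innovations):
    """Accumulate v_j = v_{j-1} + w_j starting from v_0 = (0, 0)."""
    path, current = [], (0.0, 0.0)
    for w in innovations:
        current = (current[0] + w[0], current[1] + w[1])
        path.append(current)
    return path


rng = random.Random(2010)
sigma_v = ((1.0, 0.5), (0.5, 2.0))  # hypothetical covariance matrix
v = random_walk(draw_innovations(11, sigma_v, rng))
```

Because each $v_j$ is centered at $v_{j-1}$, neighbouring time effects are strongly dependent, which is the feature the random walk models use to capture a trend over the years.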

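Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance covariance matrix $\Sigma$.

For the bivariate random walk model, the parameter space for the $i$th state would be $(\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$ where $v = (v_1', \ldots, v_t')'$ is the vector of time specific random effects. The hierarchical Bayesian framework is given by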

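( 4 ) will have an extra component corresponding to $v$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which has a normal distribution with mean $v_{j-1}$ and covariance matrix $\Sigma_v$.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$, $\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$ denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$ having the expression $f(X) \propto |X|^{-(b+p+1)/2} \exp\{-\mathrm{tr}(A X^{-1})/2\}$, $p$ being the order of $A$.

For the random walk model there will be an additional term ($\Sigma_v$). By conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, X, Z]\,[b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_\gamma]\,[\Sigma_0] \prod_{j=1}^{t} [\Psi_j]$$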
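The inverse Wishart expression above is stated only up to proportionality. As a small sketch for the bivariate case ($p = 2$), its logarithm can be evaluated directly. The function name is hypothetical and the normalizing constant is deliberately omitted, so this is the log kernel and not a full log density.

```python
import math


def log_inverse_wishart_kernel(x, a, b):
    """Log of |X|^{-(b+p+1)/2} * exp(-tr(A X^{-1}) / 2) for 2x2 matrices (p = 2)."""
    p = 2
    (x11, x12), (x21, x22) = x
    det_x = x11 * x22 - x12 * x21
    if det_x <= 0.0:
        raise ValueError("X must be positive definite")
    # Inverse of a 2x2 matrix via the adjugate.
    inv = ((x22 / det_x, -x12 / det_x), (-x21 / det_x, x11 / det_x))
    (a11, a12), (a21, a22) = a
    # tr(A X^{-1}) = sum over i, j of A[i][j] * Xinv[j][i].
    trace = a11 * inv[0][0] + a12 * inv[1][0] + a21 * inv[0][1] + a22 * inv[1][1]
    return -0.5 * (b + p + 1) * math.log(det_x) - 0.5 * trace


identity = ((1.0, 0.0), (0.0, 1.0))
# With X = A = I the determinant term vanishes and tr(A X^{-1}) = 2.
at_identity = log_inverse_wishart_kernel(identity, identity, b=7)
```

The degrees of freedom $b$ enter only through the exponent of $|X|$, which is why the results later in the chapter are reported for several choices of this value.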

Gelman and Rubin (1992) and run $n$ ($\geq 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

Once posterior samples are generated from the full conditionals of the parameters, Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$

and

4.2.2 to analyze the median income dataset referred to in Section 4.1.3. The basic dataset for our problem is the triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance covariance matrix $\Sigma_{ij}$ $(i = 1, \ldots, 51;\ j = 1, \ldots, 11)$. Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of

For the univariate setup, the response and covariate are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a duplet with first component $Y_{ij1}$ and second component either $Y_{ij2}$, $Y_{ij3}$ or $0.75\,Y_{ij2} + 0.25\,Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our target of inference is the state specific median incomes of four person families for 1989.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 4.2.2. We have used linear truncated polynomial basis functions for the P-spline component in our models since the median income profiles did not exhibit a high degree of non-linearity. For highly non-linear profiles a quadratic or cubic polynomial basis function representation can be used. In non-parametric regression problems, the proper selection of knots plays a critical role. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable so that the underlying observational pattern is properly captured. Too few or too many knots generally degrade the quality of the fit. This is because, if too few knots are used, the complete underlying pattern may not be captured properly, thus resulting in a biased fit. On the other hand, once there are enough knots to fit important features of the data, a further increase in the number of knots has little effect on the fit and may lead to overparametrization (Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (adjusted census median income).

Gelman and Rubin (1992). We ran three parallel chains, with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate t-distributions with 2 df having the same location and scale matrices as the corresponding multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chain at overdispersed distributions. However, once initialized, the

We fitted both the univariate and bivariate models to the median income dataset. In doing so, we worked with all possible knot choices from 0 to 40. Here, we only show the results corresponding to the best performing model, i.e. the model with the lowest values of the comparison measures.

In the univariate framework, the model with 3 knots in the income trajectory performed the best. Table 4-1 reports the comparison measures for this model (denoted as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates (Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS) estimates. Table 4-2 reports the percentage improvement of the time series, non-time series and the semiparametric estimates over the Census Bureau estimates.

Table 4-1. Comparison measures for univariate estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0735   0.0084   2,928.82   13,811,122.39
Bureau      0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS      0.0338   0.0018   1,351.67    3,095,736.14
GNK.NTS     0.0363   0.0021   1,457.47    3,468,496.61
USPM(3)     0.0289   0.0014   1,169.74    2,549,698.26

Table 4-2. Percentage improvements of univariate estimates over Census Bureau estimates

Estimate    ARB       ASRB      AAB       ASD
GNK.TS      -14.19%   -38.46%   -14.17%   -43.90%
GNK.NTS     -22.64%   -61.54%   -23.11%   -61.22%
USPM(3)       2.37%    -7.69%     1.2%    -18.52%

From Table 4-1, it is clear that the semiparametric estimates significantly improve upon the CPS, time series and non-time series estimates with respect to all the comparison measures. In fact, the semiparametric estimates perform slightly better than the bivariate Census Bureau estimates too with respect to ARB and AAB. This is also reflected in Table 4-2 where the semiparametric estimates marginally improve upon the Bureau estimates for the above two comparison measures. Overall, the degree of dominance of the Bureau estimates over the time series and non-time series estimates is much larger than that over the semiparametric estimates. These results indicate that, in the univariate framework, the semiparametric model with 3 knots performs significantly better than the time series and non-time series models of Ghosh et al. (1996).

Now, we move on to the bivariate non-random walk setup. First, we consider the model whose response vector consists of the CPS median incomes of 4 and 3 person families, i.e. $(Y_{ij1}, Y_{ij2})$. The covariates are the corresponding adjusted census medians. Since we assumed inverse Wishart priors for the variance covariance matrices, the values of the comparison measures were dependent on the degrees of freedom of the Wishart distribution and the number of knots in the income trajectory. We worked with different combinations of the two in fitting these models. The best results (lowest comparison measures) were obtained for two models, both with 6 knots but with degrees of freedom 7 and 9 respectively. These models are denoted by BSPM(1)(4,3) and BSPM(2)(4,3) respectively. When we consider the median incomes of 4 and 5 person families, the lowest comparison measures were obtained for the model with 4 knots in the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5).

Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for two models, both with 6 knots and with degrees of freedom 7 and 9 respectively. We denote these models as BSPM(1)(4,3+5) and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these models along with those of CPS, Bureau, and the corresponding bivariate GNK time series and non-time series estimates. Table 4-4 reports the percentage improvement of the above estimates over the Census Bureau estimates.

Table 4-3. Comparison measures for bivariate non-random walk estimates

Estimate          ARB      ASRB     AAB        ASD
CPS               0.0735   0.0084   2,928.82   13,811,122.39
Bureau            0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS(4,3)       0.0295   0.0013   1,171.71    2,194,553.67
GNK.NTS(4,3)      0.0323   0.0016   1,287.78    2,610,249.94
BSPM(1)(4,3)      0.0274   0.0013   1,079.63    2,182,669.56
BSPM(2)(4,3)      0.0286   0.0011   1,131.61    1,880,089.29
GNK.TS(4,5)       0.0230   0.0009     932.51    1,618,025.33
GNK.NTS(4,5)      0.0295   0.0013   1,179.94    2,216,738.06
BSPM(4,5)         0.0255   0.0010   1,033.12    1,859,373.98
GNK.TS(4,3+5)     0.0287   0.0013   1,150.24    2,116,692.71
GNK.NTS(4,3+5)    0.0324   0.0015   1,297.12    2,530,938.06
BSPM(1)(4,3+5)    0.0271   0.0012   1,078.5     2,128,679.65
BSPM(2)(4,3+5)    0.0289   0.0012   1,132.10    1,838,598.30

From Table 4-3 and Table 4-4, it is clear that both BSPM(4,3) and BSPM(4,3+5) estimates improve upon the bivariate time series and non-time series estimates with respect to nearly all the four comparison measures. The semiparametric estimates also improve upon the Census Bureau estimates and the raw CPS estimates. For the model with median incomes of four and five person families as response, the semiparametric estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but significantly improve upon the CPS and Census Bureau estimates.
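The percentage improvements over the Census Bureau estimates can be recomputed from the comparison measures. The sketch below assumes that improvement is measured as the relative reduction of each measure with respect to the Bureau value, an assumption that is consistent with the reported figures; the function name is hypothetical and the two rows of numbers are copied from Table 4-3.

```python
def percentage_improvement(baseline, candidate):
    """Percentage reduction of a comparison measure relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline measure must be non-zero")
    return 100.0 * (baseline - candidate) / baseline


# (ARB, ASRB, AAB, ASD) rows copied from Table 4-3.
bureau = (0.0296, 0.0013, 1183.90, 2151350.18)
bspm1_43 = (0.0274, 0.0013, 1079.63, 2182669.56)

improvements = [
    round(percentage_improvement(b, c), 2) for b, c in zip(bureau, bspm1_43)
]
```

For this row the computed values agree with the BSPM(1)(4,3) entries of Table 4-4 (7.43%, 0.00%, 8.81%, -1.46%). A negative value means the candidate estimate is worse than the Bureau estimate on that measure. Other rows may differ slightly from the published percentages because the tabulated measures are rounded.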

Table 4-4. Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

Estimate          ARB       ASRB      AAB       ASD
GNK.TS(4,3)       -0.48%    -2.52%     1.03%    -2.01%
GNK.NTS(4,3)      -8.99%   -22.45%    -8.77%   -21.33%
BSPM(1)(4,3)       7.43%     0.00%     8.81%    -1.46%
BSPM(2)(4,3)       3.38%    15.38%     4.42%    12.61%
GNK.TS(4,5)       22.19%    30.52%    21.23%    24.79%
GNK.NTS(4,5)       0.31%    -0.18%     0.33%    -3.04%
BSPM(4,5)         13.85%    23.08%    12.74%    13.57%
GNK.TS(4,3+5)      2.94%     3.56%     2.84%     1.61%
GNK.NTS(4,3+5)    -9.36%   -17.18%    -9.56%   -17.64%
BSPM(1)(4,3+5)     8.45%     7.69%     8.90%     1.05%
BSPM(2)(4,3+5)     2.37%     7.69%     4.37%    14.54%

Now let us consider the bivariate random walk model. For the case with 4 and 3 person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3,6), (5,6) and (9,1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to improve marginally over those corresponding to the non-random walk semiparametric model.

When we consider the median income estimates of 4 and 5 person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. The comparison measures are significantly better than those of the CPS, Bureau and the non-time series model of GNK. However, they fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the non-random walk model (BSPM(4,5)). We denote this model as BRWM(4,5).

Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory. The comparison measures were significantly better than those of the CPS, Bureau and GNK (both time series and non-time series), while the model also improved upon the non-random walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Table 4-5. Comparison measures for bivariate random walk model

Estimate        ARB      ASRB     AAB        ASD
BRWM(1)(4,3)    0.0261   0.0011   1,043.33   1,902,416.1
BRWM(2)(4,3)    0.0274   0.0010   1,094.25   1,804,969.06
BRWM(3)(4,3)    0.0258   0.0012   1,037.03   2,114,599.65
BRWM(4,5)       0.0245   0.0010     978.12   1,672,183.6
BRWM(4,3+5)     0.0244   0.0011     990.50   1,941,833.29

Estimation of median incomes of four person families for different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3, 4 and 5 person families for all the states and the District of Columbia. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.

Ghosh et al. (1996). We also extended the basic semiparametric framework by incorporating a time series (random walk) component to account for the within state dependence in the successive income observations. The class of random walk models seemed to improve upon their non-random walk counterparts but more studies are required before reaching a definite conclusion about their relative performance. Overall, we strongly think that semiparametric procedures hold a lot of promise for small area estimation problems, specifically in situations where multiple time varying observations of some characteristic are available for the small areas.

In my dissertation, I have concentrated on the application of semiparametric methodologies in analyzing unorthodox data scenarios originating in diverse fields like case control studies and small area estimation. In the former scenario, I have used penalized splines to model longitudinal exposure profiles and their influence pattern on the current disease status for a group of cases and controls. In doing so, I have come to the conclusion that past exposure observations may have a significant effect on the present disease status. Our modeling framework is quite general and flexible in the sense that it can be used to model any possible pattern of exposure profiles and can also capture complex time varying patterns of influence of the exposure history on the current disease status. We applied our modeling framework to a nested case control study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA).

In the second scenario, we have used semiparametric procedures to model the income trajectories of different small areas and have used that information to estimate the median incomes of those small areas at a given time point in the future. Our model based estimates seemed to perform better than the usual Bureau of Census estimates, which are based on the income observations from a particular time point and hence are non-longitudinal in nature. We have also extended the semiparametric modeling framework to the bivariate scenario in estimating the median income of varying family sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above.

The flexibility and power of the nonparametric and semiparametric procedures immediately imply that a multitude of interesting and useful extensions can be carried

1.4, selection and proper positioning of knots is a vital aspect of any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variable and that is what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low density region of the data pattern where the non-linearity was mostly concentrated. This was probably because of the quantile dependent placement procedure of the knots.

Recently, there has been some research on data-driven or adaptive knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being pre-specified. The advantage of this procedure is that fewer knots would be required, and these would be placed in optimal locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the best model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatments of this problem revolve around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include

(1998) who placed priors on the number and locations of the knots. Then they sampled from the full posteriors of the parameters (including knot locations and numbers) using reversible jump MCMC methods (Green 1995). However, they restricted the knots to be located only at the design points of the independent variable. DiMatteo et al. (2001) followed the same basic procedure as Denison et al. (1998) but they did not restrict the knots to be located only at the design points of the experiment. They also penalized models with an unnecessarily large number of knots. Botts and Daniels (2008) proposed a flexible approach for fitting multiple curves to sparse functional data. In doing so, they treated the numbers and locations of knots of the population averaged and subject specific curves as distinct random variables and sampled from their posterior distributions using reversible jump MCMC methods. They used free-knot B-splines to model the population averaged and subject specific curves. In all the above contributions, Poisson priors are placed on the knot numbers while flat priors are placed on the knot positions. The usefulness and flexibility of the Bayesian approach lies in the fact that the number and locations of knots are automatically determined from the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive Regression Splines. However, the sampling procedure is quite intensive since the parameter dimension varies at every iteration. Botts and Daniels substantially reduced the computational burden by dealing with the approximate posterior distribution of only the number and positions of the knots, integrating out the other parameters by using Laplace transformations.

An immediate but worthwhile extension to what we have already done would be to incorporate an adaptive knot selection scheme into both the case control and small area modeling frameworks. For the former setup, this would correspond to deciphering the optimal number of knots for the population mean PSA trajectory and the influence function. So, depending on the particular study or the dataset at hand, any underlying pattern in the influence profile (of the exposure trajectory on the disease state) can be

Some other interesting extensions to our work can be

1. Incorporating informative (non-ignorable) missingness (Little and Rubin 1987) in the longitudinal exposure (case control) or income (small area) profiles.
2. Incorporating non-parametric distributional structures like mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) on the subject (or area) specific random effects.
3. Extending the semi-parametric case control modeling framework to situations involving multiple (> 2) or even categorical disease states.

Now, I briefly explain some work that we are currently engaged in.

5.2.1 Introduction and Brief Literature Review

(Little and Rubin 1987). Broadly these are of three types, viz.:

1.
2.
3.


Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response. In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns. The selection modelling approach, on the other hand, first models the hypothetical complete data; a model for the missing data process (conditional on the hypothetical complete data) is then appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach.

Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the $i$th subject. $D_i$ is such that

\[
D_i =
\begin{cases}
t & \text{if the $i$th subject drops out between the $(t-1)$th and $t$th observation times,} \\
T+1 & \text{if the $i$th subject is a completer.}
\end{cases}
\]

So, for the $i$th subject, $Y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on

\[
f(y) = \sum_{D} f(y \mid D)\, P(D).
\]

An important but realistic situation that may arise in longitudinal studies is that the number of unique dropout times $T$ (vis-a-vis the number of times a subject is measured) may be large. As a result, the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse


Hogan and Laird (1998) suggested parameters to be shared across patterns. Hogan et al. (2004) suggested ways to group the $T$ dropout times into $m$

Diggle et al. 2002) used to capture the serial dependence in the response process.

There exists another class of models, known as marginalized latent variable models, which take care of the exchangeable or non-diminishing dependence pattern among the repeated response observations using random intercepts. Schildcrout and Heagerty (2007) combined the marginalized transition and latent variable models by proposing a unifying model that takes into account both serial and long-range dependence among the response observations. Their model can be used in situations with a moderate to large number of repeated measurements per subject, where both serial (short-range) and exchangeable (long-range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002), Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new model which accounts for both serial (short-term) and long-range dependence among the response observations in situations where the number of unique dropout times is large. We group the dropout times using a latent variable approach, taking into account the uncertainty in the number of groups. We also model the marginal covariate effects of interest.


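Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is specified as

\[
\operatorname{logit} E(Y_{it} \mid X_{it}, \beta) = X_{it}'\beta .
\]

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, the following conditional model is specified

\[
\operatorname{logit} E(Y_{it} \mid b_i, X_i) = \Delta_{it} + b_{it},
\]

where $b_i \sim N(0, \Sigma)$. $\Delta_{it}$ can be computed by solving the following convolution equation

\[
E(Y_{it} \mid X_{it}, \beta) = \int E(Y_{it} \mid b_i, X_i)\, dF(b_i).
\]

Thus $\Delta_{it}$ is a function of $\beta$ and $\Sigma$. In this study we will be proposing a model which will marginalize over the random effects and the drop-out distribution to directly model the marginal covariate effects of interest, taking into account both the serial and exchangeable dependence structure among the $Y_{it}$'s.

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \ldots, m$ ($m < T$), with

\[
S_{ij} =
\begin{cases}
1 & \text{if the $i$th subject is in class $j$,} \\
0 & \text{otherwise.}
\end{cases}
\]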


1. Dependence between response and dropout time, modeled by the latent classes.
2. Short-range (serial) dependence between $Y_{it}$ and $(Y_{it-1}, \ldots, Y_{it-p})$, modelled by an MTM($p$).
3. Long-range or non-diminishing dependence among the $Y_{it}$'s, modelled by the subject-specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

The above model marginalizes over the subject-specific random effects and over the latent class distribution (implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S$; this is natural because the specific latent class a subject belongs to depends solely on his or her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D$ itself.

Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e. effects conditional on the random effects) or class-specific covariate effects (i.e. effects of covariates on $Y$ given a particular dropout class). Moreover, the conditional model should be specified so that it is compatible with the marginal model (5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model


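We assume that $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, are from an exponential family with distribution

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\alpha_{jk}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, p$, where $V_{it}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $(\alpha_{1k}, \alpha_{2k}, \ldots, \alpha_{mk})$ determines how the dependence between $Y_{it}$ and $Y_{it-k}$ varies as a function of the covariates $V_{it,k}$ conditional on the latent classes. We also impose the sum-to-zero constraint, i.e. $\alpha^{(m)} = -\sum_{j=1}^{m-1} \alpha^{(j)}$, for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\alpha^{(j)}$.

The probabilities of the latent classes given the drop-out times are specified as a proportional odds model (Agresti 2002) given by

\[
\operatorname{logit} P\Big(\sum_{l \le j} S_{il} = 1 \,\Big|\, D_i\Big) = \lambda_{0,j} + \lambda_1 D_i, \qquad j = 1, \ldots, M-1,
\]

where $\lambda_{0,1} \le \lambda_{0,2} \le \cdots \le \lambda_{0,M-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).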
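As a minimal sketch of the proportional odds specification just described, the Python snippet below turns a set of ordered cutpoints and a common slope on dropout time into class probabilities by differencing the cumulative probabilities. The function name `class_probabilities` and all numerical values are hypothetical and chosen purely for illustration; they are not estimates from this study.

```python
import numpy as np

def class_probabilities(cutpoints, slope, dropout_time):
    """Class probabilities under a proportional odds model.

    Assumes logit P(class <= j | D) = cutpoints[j-1] + slope * D for
    j = 1, ..., M-1, with nondecreasing cutpoints (illustrative setup).
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    if np.any(np.diff(cutpoints) < 0):
        raise ValueError("cutpoints must be nondecreasing")
    # Cumulative probabilities P(class <= j | D), j = 1, ..., M-1.
    cumulative = 1.0 / (1.0 + np.exp(-(cutpoints + slope * dropout_time)))
    # Pad with P(class <= 0) = 0 and P(class <= M) = 1, then difference.
    padded = np.concatenate(([0.0], cumulative, [1.0]))
    return np.diff(padded)

# Hypothetical values: M = 3 classes, dropout times 1 to 5.
for d in range(1, 6):
    print(d, np.round(class_probabilities([-1.0, 1.0], -0.4, d), 3))
```

With a negative slope, as in this made-up example, later dropout times shift probability mass toward the higher-numbered classes, which is the monotonicity the model imposes.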


Lastly, the drop-out times $D_i$ are assumed to follow a multinomial distribution with mass at each possible drop-out time, parameterized by $\varphi$. Here we make the important assumption that $Y_{it}$ is independent of $D_i$ given $S_i$. Our main target of inference is the covariate effects averaged over the $M$ classes. The intercept $\Delta_{it}$ in (5) is determined by the following relationship between the marginal and conditional models:

\[
E(Y_{it} \mid X_{it}) = \sum_{D} \sum_{S} p(S_i \mid D_i)\, P(D_i) \int \sum_{A} \big\{ E(Y_{it} \mid y_{it-1}, \ldots, y_{it-p}, b_i, S_i)\; p(y_{it-1}, \ldots, y_{it-p} \mid b_i, S_i) \big\}\, p(b_i \mid S_i)\, db_i .
\]


Proportionalityin( 5 )holdsbecauseweassumethatthemissingandobservedresponsesfromsubjectiareindependent,givenSiandbi(i.e.[YmijYi,bi,Si]=[Ymijbi,Si]).FollowingtheOPEFformulation,wehaveLi(YijYfig,Sij=1,bi,(j),)=expTXt=1yititTXt=1(it)=(mi)+TXt=1h(Yit,)


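We can avoid the integral (w.r.t. $b_i$) in (5) if we also sample the $b_i$'s along with the other parameters from the full posterior (5). In that case, the full posterior may be rewritten as

where

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution, i.e.

\[
[Y_{it} \mid y_{it-1}, \ldots, y_{it-p}, b_i, S_{ij} = 1] \sim \text{Bernoulli}(\mu^{c}_{it}),
\]

where

\[
\mu^{c}_{it} = E(Y_{it} \mid y_{it-1}, y_{it-2}, \ldots, y_{it-p}, b_i, S_{ij} = 1) = g^{-1}\Big( \Delta_{it} + b_i + \sum_{j=1}^{M} S_{ij} Z_{ij}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{it-k} \Big).
\]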
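To make the Bernoulli simplification concrete, the following sketch evaluates the conditional success probability for a single binary response from its linear predictor. It assumes a logit link for $g$; the function name `conditional_mean`, its argument names, and the numbers in the usage line are hypothetical illustrations, not part of the fitted model.

```python
import numpy as np

def conditional_mean(intercept, random_effect, class_effect,
                     lag_coefficients, past_responses):
    """Conditional success probability for one binary response.

    The linear predictor adds the intercept, the subject-specific random
    effect, the covariate effect of the subject's latent class, and a
    weighted sum of the p most recent responses. An inverse-logit link
    is assumed here purely for illustration.
    """
    lag_coefficients = np.asarray(lag_coefficients, dtype=float)
    past_responses = np.asarray(past_responses, dtype=float)
    if lag_coefficients.shape != past_responses.shape:
        raise ValueError("need one lag coefficient per past response")
    linear_predictor = (intercept + random_effect + class_effect
                        + float(lag_coefficients @ past_responses))
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical second-order dependence: previous two responses were 1 and 0.
print(conditional_mean(-0.5, 0.2, 0.1, [0.8, 0.3], [1, 0]))
```

A positive lag coefficient raises the conditional probability after a previous success, which is how the transition part of the model encodes short-range serial dependence.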


1+e0j+1Di1+e0j1+1Di Now,asmentionedearlier,Diisthedropouttimefortheithsubject.Also,thereareTuniquedropouttimes.Let,fort=1,2,...,Tit=8><>:1iftheithsubjectdropsoutbetweenthe(t1)thandtthobservationtimes0otherwise. 1. LetNq(0,0)assumingthat8i=1,2,...,Nandt=1,2,...,T,Xitisqdimensional. 2. Let(1),(2),...,(m)iidNr(0,0).whererqsinceZitXit8i=1,2,...,Nandt=1,2,...,T. 3. Let21,22,...,2miidU(a,b)where0

7. Forthetimebeingwekeepthepriorof,()unspecied. Now,combining( 5 5 )andthepriorsspeciedabove,wecanwritedownthefullposteriordistributionofmandw,(w,mjY,X,D)uptoaconstant.Thus,wecangetthefullconditionaldistributionofalltherelevantparametersandproceedwithsamplegenerationusingMCMC. TheassumptionofconditionalindependencebetweenYiandDigivenSiandthecovariatescanbeveriedbyperformingalikelihoodratiotest(Frequentist)orusingBayesfactors(Bayesian).Thenullmodelisgivenby( 5 )andthealternativemodelmaybewrittenas wheref(Di)maybeasmoothbutunspeciedfunctionofDi.Thus,thenullhypothesisofconditionalindependence(betweenYiandDigivenSiandXi)wouldbesimplyf(Di)=0.Thetestcanbecarriedoutbyrstttingthenullmodel(??).Then,theposteriorprobabilityofclassmembershipforeachsubjectcanbeestimatedby^P(Sij=1jDi,Yi,Xi,^w)=RLi(YijYfig,Sij=1,bi,^j,^)p(Sij=1jDi;^)p(Dij^)dF(bijSij,^2j) 5 )usingaweightedlikelihood(theweightsbeingtheaboveposteriorprobabilityofclassmembership).Analternativewayofdoingtheaboveconditionalindependencetestswouldbetousescoretestsbasedonsmoothingsplinesasusedinproportionalhazardsmodelsby Linetal. ( 2006 ). 118


5 )hasthemostgeneralform.Wecansimplifyitbyassumingalineareffectofdrop-outtimeinwhichcasethealternative(simpler)modelwouldbe whereeachhj()isaknownfunctionandthe'sareparameters.ThenullhypotheseswouldbeH0:1=...=J=0.Thelineardrop-outeffectwouldimplyJ=1andh(Di)=Di.TheLRTcanthenbeperformedasbeforebyttingmodels( 5 )and( 5 )usingthesameweightsgivenabove.WecanalsouseBayesfactorsforcarryingouttheseanalysis. Heagerty ( 1999 )proposedMarginallySpeciedLogisticNormalmodelsforlongitudinalbinarydata.Heproposedtwomodels:therstonewasamarginallogisticregressionmodelwhichlinkstheaverageresponsetothecovariatesbythefollowingequation: HereYijandXijrespectivelydenotethebinaryresponseandtheexogenouscovariatevectorrecordedattimejfortheithsubject,i=1,2,...,N;j=1,2,...,ni.Thesecondmodelisaconditionalmodelwhichexplainsthewithin-subjectdependenceamong 119


An important assumption that is made is that, conditional on $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, the components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$, where $\Sigma_i$ models the dependence among the components of $b_i$ (and thus, indirectly, among those of $Y_i$) and can be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \ldots, t_{in_i})$ and a parameter vector $\alpha$. Heagerty (1999) referred to the models given in (5) and (5) as the marginally specified logistic normal models.

Under the above modelling framework, the parameter $\Delta_{ij}$ can be expressed as a function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the following convolution equation:

\[
h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz,
\]

where $h(\cdot)$ is the inverse of the logit link and $\phi(\cdot)$ is the standard normal density function. Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and Newton-Raphson iteration.

5) will be a function of the marginal mean parameters $\beta$ and the random effects covariance parameters $\alpha$, and should be computed for both the maximum likelihood and estimating equation methodology (Heagerty 1999). For maximum likelihood estimation, the contribution of the $i$th subject to the observed data likelihood is ascertained by first assuming a linear transformation of the form $b_i = C_i z_i$, where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. The above transformation effectively links $b_i$ to a lower dimensional random effect $z_i$. The contribution of the $i$th subject (to the observed data likelihood) can now be expressed as a mixture over the random


where $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since $L_i(\beta, \alpha)$ cannot be evaluated analytically, numerical procedures are required to find its value. Heagerty (2002) used Gauss-Hermite quadrature to perform the calculation but assumed $q = 1$. With increasing values of $q$, the computational burden increases exponentially and quickly becomes infeasible. We are currently trying to develop alternative and less computationally intensive methodologies to accomplish the above objectives. We are working with multivariate logistic and multivariate t distributions within a Bayesian framework, as in O'Brien and Dunson (2004). We hope that this methodology will provide a better alternative to the arduous numerical methods mentioned above.
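The one-dimensional computation described earlier (solving the convolution equation for the conditional intercept given the marginal linear predictor and the random-effect standard deviation) can be sketched in a few lines. The code below is an illustrative sketch, not the implementation used in the cited papers: it assumes a scalar normal random effect, approximates the normal expectation with Gauss-Hermite quadrature via NumPy's `hermegauss`, and applies Newton-Raphson. The function names and the default of 40 quadrature nodes are arbitrary choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-x))

def _normal_rule(n_nodes):
    # hermegauss integrates against exp(-z^2 / 2); dividing the weights by
    # sqrt(2 * pi) turns the rule into an expectation under N(0, 1).
    nodes, weights = hermegauss(n_nodes)
    return nodes, weights / np.sqrt(2.0 * np.pi)

def marginal_mean(delta, sigma, n_nodes=40):
    """Approximate E[expit(delta + sigma * z)] for z ~ N(0, 1)."""
    nodes, weights = _normal_rule(n_nodes)
    return float(np.sum(weights * expit(delta + sigma * nodes)))

def solve_delta(eta, sigma, n_nodes=40, tol=1e-10, max_iter=50):
    """Solve expit(eta) = E[expit(delta + sigma * z)] for delta."""
    nodes, weights = _normal_rule(n_nodes)
    target = expit(eta)
    delta = float(eta)  # exact solution when sigma = 0
    for _ in range(max_iter):
        p = expit(delta + sigma * nodes)
        residual = np.sum(weights * p) - target
        derivative = np.sum(weights * p * (1.0 - p))
        step = residual / derivative
        delta -= step
        if abs(step) < tol:
            return float(delta)
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical inputs: marginal linear predictor 1.0, random-effect SD 2.0.
delta = solve_delta(1.0, 2.0)
print(delta, marginal_mean(delta, 2.0), expit(1.0))
```

For a positive marginal linear predictor the solved intercept is larger in magnitude than the linear predictor itself, reflecting the familiar attenuation of marginal effects relative to conditional ones. With $q$-dimensional random effects the same rule would need a tensor grid of nodes, which is the exponential growth in cost noted above.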


logdj=log+dlog#+logj+d0Z0cZj(t)(t)dt Thus,thelikelihoodwillbe A )wehave Differentiating( A )w.r.tand#andsolvingtheresultingequationswehave A )andthenexponentiating,weobtaintheexpressionofL(,)in( 2 ). Again,differentiating( A )w.r.tj,wehave Itiseasytoshowthatifwereplace( A )in( A )andthenexponentiate,wegettheexpressionforL(#,)in( 2 ).Sincetheorderofmaximizationisimmaterial,itfollowsthat,L(,)andL(#,),oncemaximizedoverthenuisanceparameters(#and


Replacingtheexpressionofdjfrom( 2 ),wehave 2 ). (ii)First,weperformthetransformationfromto(,),where=JXj=1j.Thus,j=j,j=1,...,J.ThejacobianoftransformationwillbeJ1. Usingthistransformationin( A )andaftersomemanipulation,wehave 123


A )w.r.t#weobtain Integrationof( A )w.r.tyields( 2 )aftersomeminormanipulation. (iii)Theorderinwhichp(#,,jy)isintegratedw.r.ttheparametersdoesnotmakeanydifferenceinthemarginalposteriordensityofp().Thus,integrationofp(w,jy)w.r.tworp(,jy)w.r.twillyieldthesamemarginalposteriordensityp(jy)of. 1. AsinSeamanandRichardson(2004),theassumptionofexistenceandnitenessofE0Z0cZq(t)(t)dtandE0Z0cZr(t)(t)dtisautomaticallysatisedprovidedthepriordensityp()ensuresthatE()existsandisnite. 2. Theposteriorproprietyofp(#,,jy)in( )canbeshowninasimilarwaytothatinSeamanandRichardson(2001). 3. Thepriordistributionp()ofinducesapriordistributionontheinuencefunctionf(t),ct0ginthelogisticcase-controlmodelin( 23 )since(t)=0(t),ct0. LetP(D=djX(t)=Zk(t),ct0)=pdk,(d=0,1,...,r;k=1,...,K)andP(X(t)=Zk(t),ct0jD=0)=k=PKl=1l.LetndkbethenumberofindividualswithD=dandfX(t)=Zk(t),ct0g.ItcanbeshownthatP(X(t)=Zk(t),ct0jD=d)=kpdk=p0k KXl=1lpdl=p0l


KXl=1lpdl=p0l1CCCCCAndk. KXl=1ldl1CCCCCAndk. TheaugmentedmodelisgivenbyZdkjdkpoisson(dk)wherelog(dk)=log(#d)+log(dk)+log(k),log(0k)=log(k),d=1,...,r;k=1,...,K. 125


NotingthatZ10expk(1+rXd=1#ddk)!(k)Prd=0ndk1dk/1+rXd=1#ddk!Prd=0ndk,wehave,byintegratingoutin( A ), Now,integratingout(#1,...,#r)from( A ),wehave Next,wemakethetransformationk='kand'=KXl=1lhavingjacobian'1.Hencethepriordistributionin( A )becomes(,#,',)/rYd=1#1d!'1KYk=11k!().


A )canberewrittenas Integratingout'from( A ),wehave KXl=1ldl1CCCCCAndkKYk=11k!() From( A )and( A ),itisclearthatposteriorinferencefortheparameterofinterest,remainsthesameundereithertheprospectivelikelihoodLportheretrospectivelikelihoodLRaslongastheposteriorisproper.Itcanbeshownthattheposteriorwillbeproperforanyproperpriorforifn0k18k=1,...,K. 127


WehavetoshowthatIMwhereMisanynitepositiveconstant. Integratingrstw.r.t,wehave 2Xi(iXiZibi1)01(iXiZibi1)d=jXiX0i1Xij1=2exp1 2XiW0i1Wi+Q 2PiW0i1XiPiX0i1Xi1PiX0i1Wi,Wi=iZibi1and1=diag(21,22,...,2t). Now,W0i1Wi=W0i1=21=2Wi=S0iSiwhereSi=1=2Wi.Similarly,W0i1Xi=S0iTi,X0i1Wi=T0iSiandX0i1Xi=T0iTiwhereTi=1=2Xi. 128


B )becomes1 2XiS0iSiXiS0iTiXiT0iTi1XiT0iSi=1 2S0SS0T(T0T)1T0S=1 2S0IT(T0T)1T0S=Q,say whereS=(S01,...,S0m)0andT=(T01,...,T0m)0.Since(IT(T0T)1T0)isidempotent,S0IT(T0T)1T0Sisnon-negative,implyingQ0andthusexp(Q)1. Next,weconsiderintegrationw.r.t2i.e Assumingmax=max(1,...,t),wehave,8j=1,...,t,2j2max)Xij2jX0ijXij2maxX0ij)Pi,jXij2jX0ij2maxPi,jXijX0ijandthus Combining( B )and( B ),wehaveIjXi,jXijX0ijj1=2Z...Z(2max)(p+1)=2tYj=1(2j)m=2cj+1exp(dj=2j)d21...d2t 2expdk (B) 129


Combining( B )and( B ),wehave where=().Sinceallthecomponentsoftheintegrandin( B )haveproperdistributions,theaboveintegralwouldbenitethusprovingposteriorpropriety. Fortherandomwalkmodel,theintegrandin( B )willhaveanadditionallikelihoodtermQtj=1L(vjjvj1,2v)andapriorterm(2v).Thederivationwouldthenproceedexactlyasaboveandtheintegrandin( B )willalsocontaintheseadditionalterms.Butsincebothoftheseareproperdistributions(normalandinversegammarespectively),Iwillstillbeniteundertheconditionsstatedinthetheorem. 2Xi,j(ijX0ijZ0ijbi)01j(ijX0ijZ0ijbi)d<1(B) inordertoproveposteriorpropriety. Usingthesametypeofalgebraicmanipulationsasintheunivariatecase,theL.H.Sof( B )canbeshowntobe 2Xi,jW0ij1jWij+1 2Q 130


Asbefore,theexpressionwithintheexponentin( B )canberewrittenasK=1 2Xi,jS0ijSijXi,jS0ijTijXi,jT0ijTijXi,jT0ijSij=1 2S0IT(T0T)1T0S0. Thus, exp1 2Xi,jW0ij1jWij+1 2Q1 So,inordertoproveposteriorpropriety,wehavetoshow Hereristheorderofj,j=1,2,...,t.(r=2inourcase). Letj1,j2,...,jrbethedistincteigenvaluesof1j,j=1,2,...,t.Sincejisavariancecovariancematrix,itispositivedeniteandsymmetric.Hence,1jalsohasthesameproperties.Thus,jk>0,8k=1,2,...,r. Now,8j=1,2,...,r,


2jXi,jXijX0ijj1 2 Sincej1jj=rYk=1jk,8j=1,...,t, 2=rYk=1(jk)(m+djr1) 2 Now,replacing( B )and( B )intheexpressionofIin( B ),wehave 2Z..Z(min)p+q+2 2tYj=1rYk=1(jk)(m+djr1) 2exp"TV1j1j whereTdenotestrace.Letmin=lm,l2[1,...,t];m2[1,...,r]. Then,II1I2where 2ZrYfk=1,k6=mg(lk)(m+dlr1) 2(lm)(m+dlpq2)r1 2expTV1l1l 2ZrYfk=1,k6=mg(lk)p+q+2 2j1lj(m+dlpq2)r1 2expTV1l1l 2exp"TV1j1j 2jVjjm+dj whichisnite. Thus,inordertoshowposteriorpropriety,wehavetoprovethatI2<1. 132


2j1lj(m+dlpq2)r1 2expTV1l1l BytheAM-GMinequality,wehave, 21 2 21 2=1 2=1 2 where(l)kkdenotesthekthdiagonalelementof1l. Since1lhasaWishartdistribution,(l)kkkk2dl,(k=1,...,r)implyingthatPrk=1(l)kk<1. Combining( B )and( B ),wehave,IZ1 2j1lj(m+dlpq2)r1 2expTV1l1l 2ZrXk=1(l)kk(r1)(p+q+2) 2j1lj(m+dlpq2)r1 2expTV1l1l 2whereC=1 2


Now, 2(r1)(p+q+2) 2r1rXk=1((l)kk)(r1)(p+q+2) 2 2(r1)(p+q+2) 2r1ErXk=1((l)kk)(r1)(p+q+2) 2 whichisnitebecause 2<18k=1,...,r)rXk=1((l)kk)(r1)(p+q+2) 2<1)ErXk=1((l)kk)(r1)(p+q+2) 2<1)ErXk=1(l)kk(r1)(p+q+2) 2<1 ThusIisniteimplyingposteriorpropriety. 134


1. andisthep+K+1orderpriorvariance-covariancematrixof. 2. 3. andbistheq+M+1ordervariance-covariancematrixofb. 4. andisther+K+2ordervariance-covariancematrixof(,). 5. 2,+(Zi0Mib0iQi) 2where 6. 2NXi=1b2ij,j=0,...,q. 135


2NXi=1ni+1,1 2NXi=1niXj=1yijp,(aij)0q,(aij)0bi2. 8. 2KXk=12p+k. 9. 2NXi=1q+MXj=q+1b2ij. 10. 2KXk=12r+k. Here,G(x,y)denotesaGammadensitywithshapeparameterxandrateparameteryrespectively. C.2.1SemiparametricUnivariateSmallAreaModel 1.


20+d 2mXi=1(ijX0ijZ0ijbi)2+dj 2mXi=1b2i+d 1. 137


10.


Agresti, A. (2002). Categorical data analysis. Wiley.
Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669.
Althman, P. (1971). The analysis of matched proportions. Biometrika 58, 561.
Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for case-controlled studies in cancer epidemiology. Statistician 42, 385.
Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83, 28.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. Bulletin of the International Statistical Institute.
Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting. Computational Statistics and Data Analysis 52, 5100.
Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics 6, 314.
Breslow, E. T. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1. International Agency for Research on Cancer, Lyon.
Breslow, E. T., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299.
Breslow, N. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14.
Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case-control studies. Journal of the American Statistical Association 90, 157.
Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: A prospective multicenter clinical trial. Journal of the American Medical Association 19, 1542.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11, 1269.
Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for experimentally uncontrolled variables. Bulletin of the International Statistical Institute 38, 97.


Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333.
Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal data, 2nd Edition. New York: Oxford University Press.
Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched case-control data. Biostatistics 1, 89.
DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free-knot splines. Biometrika 88, 1055.
Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine 00, 1.
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science 11, 89.
Ericksen, E. and Kadane, J. (1985). Estimating the population in census year: 1980 and beyond (with discussion). Journal of the American Statistical Association 80, 98.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242.
Eubank, R. (1988). Spline smoothing and nonparametric regression. New York: Marcel Dekker.
Eubank, R. (1999). Nonparametric regression and spline smoothing. New York: Marcel Dekker.
Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman and Hall.
Fay, R. (1987). Application of multivariate regression to small domain estimation, in R. Platek, J. N. K. Rao, C. E. Särndal, and M. P. Singh (Eds). Small Area Statistics.
Fay, R. and Herriot, R. (1979). Estimation of income from small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269.


Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1.
Gelfand, A. and Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1.
Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7, 457.
Ghosh, M. and Chen, M.-H. (2002). Bayesian inference for matched case-control studies. Sankhya, B 64, 107.
Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association 91, 1423.
Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9, 55.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277.
Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711.
Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman and Hall/CRC.
Gustafson, P., Le, N., and Valle, M. (2002). A Bayesian approach to case-control studies with errors in covariables. Biostatistics 3, 229.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics: The approach based on influence functions. Wiley.
Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Heagerty, P. (1999). Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55, 688.
Heagerty, P. (2002). Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 58, 342.


Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16, 239.
Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics: Handling drop-out in longitudinal studies. Statistics in Medicine 23, 1455.
Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test 15, 1.
Johnson, V. (2004). A Bayesian χ² test for goodness-of-fit. Annals of Statistics 32, 2361.
Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The increased risk of venous thromboembolism and the use of third generation progestagens: Role of bias in observational research. Contraception 54, 5.
Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for proportional hazards models. Biometrics 62, 803.
Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of Computational and Graphical Statistics 8, 333.
Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic regression with missing covariates. Biometrics 54, 295.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.
MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 2, 223.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719.
Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7, 1223-1230.
Morris, C. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78, 47.
Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55, 858.
Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika 84, 523.


O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics 60, 739.
Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Non-parametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265.
Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing covariates. Applied Statistics 49, 145.
Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies. Biometrika 91, 321.
Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403.
Rao, J. N. K. (2003). Small Area Estimation. Wiley InterScience, New York.
Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 89, 905.
Robinson, G. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15.
Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722.
Roy, J. (2003). Modeling longitudinal data with non-ignorable dropouts using a latent dropout class model. Statistics in Medicine 59, 829.
Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for nonignorable dropouts with many possible dropout times. Biometrics 64, 538.
Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735.
Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge, U.K.
Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, 384.


Schildcrout, J. and Heagerty, P. (2007). Marginalized models for moderate to long series of longitudinal binary response data. Biometrics 63, 322.
Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies with categorical covariates. Biometrika 88, 1073.
Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies. Biometrika 91, 15.
Sinha, S., Mukherjee, B., and Ghosh, M. (2004). Bayesian semiparametric modeling for matched case-control studies with multiple disease states. Biometrics 60, 41.
Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B., and Carroll, R. (2005). Semiparametric Bayesian analysis of matched case-control studies with missing exposure. Journal of the American Statistical Association 100, 591.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics 25, 1371.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics.
Wand, M. (2003). Smoothing and mixed models. Computational Statistics 18, 223.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall.
Zelen, M. and Parker, R. (1986). Case control studies and Bayesian inference. Statistics in Medicine 5, 261-269.
Zhang, D., Lin, X., and Sowers, M. (2007). Two stage functional mixed models for evaluating the effect of longitudinal covariate profiles on a scalar outcome. Biometrics 63, 351.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247.


Dhiman Bhadra received his Bachelor of Science in statistics from Presidency College, Calcutta (India) in 2002 and his Master of Science in statistics from Calcutta University in 2004. He joined the Department of Statistics at the University of Florida in January 2005 to pursue a PhD in statistics. He plans to graduate in August 2010.