Bayesian Semiparametric Regression and Related Applications

Permanent Link: http://ufdc.ufl.edu/UFE0041832/00001

Material Information

Title: Bayesian Semiparametric Regression and Related Applications
Physical Description: 1 online resource (145 p.)
Language: English
Creator: Bhadra, Dhiman
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: bayesian, case, current, mcmc, odds, penalized, random, semiparametric
Statistics -- Dissertations, Academic -- UF
Genre: Statistics thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Case-control studies and small area estimation are two distinct areas of modern statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease, with the aim of capturing disease-exposure association, especially for rare diseases. The latter is concerned with measuring characteristics of small domains, that is, regions whose sample size is so small that the usual survey-based estimation procedures cannot be applied. Both areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in health-related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation. The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e., case-control studies for which time-varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status.
This enables us to analyze how the present disease status of a subject is influenced by his/her past exposure conditions, conditional on the current ones. Analysis is carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) algorithms. The proposed methodology is motivated by, and applied to, a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time-varying income pattern and also a state-specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model-based estimates of household median income of small areas. In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia, while explicitly accommodating the time-varying pattern in the income observations.
Our estimates tend to perform better than those provided by the Census Bureau and perform comparably to some established methodologies, especially those involving time series modeling techniques. Based on our findings in parts two and three, we conclude that semiparametric and nonparametric regression models can be an attractive alternative to the more traditional modeling frameworks, especially in situations where information on different characteristics of small areas is available at multiple time points in the past.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Dhiman Bhadra.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Ghosh, Malay.
Local: Co-adviser: Daniels, Michael J.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0041832:00001







BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS


By
DHIMAN BHADRA


















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2010































© 2010 Dhiman Bhadra
































To my mother and to the memory of my father









ACKNOWLEDGMENTS

I had the good fortune to be a student at the Department of Statistics at the University
of Florida. It is here that I came in close contact with some of the preeminent statisticians
of the day and learnt a lot from them. I deeply acknowledge the tremendous help,
encouragement and endless support that I received from my advisor Prof. Malay Ghosh,
my co-advisor Prof. Michael J. Daniels and Prof. Alan Agresti throughout the highs and
lows of doing my research work. They not only taught me statistics, the art of writing
papers and solving problems; they also introduced me to the spirit of discovery and the joy of
learning, something that will stay with me forever and will motivate me in ways I can
never imagine. However, the list doesn't end here, since each and every member of the
faculty opened up new doors for me through which knowledge flowed and enriched
me along the way. My endless gratitude to each and every one of them. I also wish to
thank Prof. Bhramar Mukherjee (currently at the Department of Biostatistics at the University
of Michigan) for her help and inspiration over the years.

Last but not least, my unending gratitude to my mother, whose sacrifice,
unconditional love and blessings were always with me, guiding me along the way. I
would end by conveying my deepest respect to the memory of my father; he was there
with me always throughout this journey.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

  1.1 Overview of Dissertation
  1.2 Review of Case-Control Studies
  1.3 Review of Small Area Estimation
  1.4 Non-Parametric Regression Methodology

2 BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES
  WITH TIME VARYING EXPOSURES

  2.1 Introduction
    2.1.1 Setting
    2.1.2 Motivating Dataset : Prostate Cancer Study
  2.2 Model Specification
    2.2.1 Notation
    2.2.2 Model Framework
  2.3 Posterior Inference
    2.3.1 Likelihood Function
    2.3.2 Priors
    2.3.3 Posterior Computation
  2.4 Bayesian Equivalence
  2.5 Model Comparison and Assessment
    2.5.1 Posterior Predictive Loss
    2.5.2 Kappa statistic
    2.5.3 Case Influence Analysis
  2.6 Analysis of PSA Data
    2.6.1 Constant Influence Model
    2.6.2 Linear Influence Model
    2.6.3 Overall Model Comparison
    2.6.4 Model Assessment
  2.7 Conclusion and Discussion

3 ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A
  BAYESIAN SEMIPARAMETRIC APPROACH

  3.1 Introduction
    3.1.1 SAIPE Program and Related Methodology
    3.1.2 Related Research
    3.1.3 Motivation and Overview
  3.2 Model Specification
    3.2.1 General Notation
    3.2.2 Semiparametric Income Trajectory Models
      3.2.2.1 Model I : Basic Semiparametric Model (SPM)
      3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)
  3.3 Hierarchical Bayesian Inference
    3.3.1 Likelihood Function
    3.3.2 Prior Specification
    3.3.3 Posterior Distribution and Inference
  3.4 Data Analysis
    3.4.1 Comparison Measures and Knot Specification
    3.4.2 Computational Details
    3.4.3 Analytical Results
    3.4.4 Knot Realignment
    3.4.5 Comparison with an Alternate Model
  3.5 Model Assessment
  3.6 Discussion

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A
  MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH

  4.1 Introduction
    4.1.1 Census Bureau Methodology
    4.1.2 Related Literature
    4.1.3 Motivation and Overview
  4.2 Model Specification
    4.2.1 Notation
    4.2.2 Semiparametric Modeling Framework
      4.2.2.1 Simple bivariate model
      4.2.2.2 Bivariate random walk model
  4.3 Hierarchical Bayesian Analysis
    4.3.1 Likelihood Function
    4.3.2 Prior Specification
    4.3.3 Posterior Distribution and Inference
  4.4 Data Analysis
    4.4.1 Comparison Measures and Knot Specification
    4.4.2 Computational Details
    4.4.3 Analytical Results
  4.5 Conclusion and Discussion

5 CONCLUSION AND FUTURE RESEARCH

  5.1 Adaptive Knot Selection
  5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using
      Latent Class and Transitional Modelling
    5.2.1 Introduction and Brief Literature Review
    5.2.2 Modeling Framework
    5.2.3 Likelihood, Priors and Posteriors
    5.2.4 Specification of Priors

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS

B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS

  B.1 Univariate Small Area Model
  B.2 Bivariate Small Area Model

C FULL CONDITIONAL DISTRIBUTIONS

  C.1 Semiparametric Case Control Model
  C.2 Semiparametric Small Area Models
    C.2.1 Semiparametric Univariate Small Area Model
    C.2.2 Univariate Random Walk Model
    C.2.3 Bivariate Random Walk Model

REFERENCES

BIOGRAPHICAL SKETCH









LIST OF TABLES

Table

1-1 A typical 2 x 2 table

2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis
    for a 0.5 vertical shift of the exposure trajectory for the linear influence model

2-2 Posterior means and 95% confidence intervals of odds ratio for
    / = (-10, -5) for the linear influence model

2-3 Posterior predictive losses (PPL) for the constant and linear
    influence models for varying exposure trajectories and knots

3-1 Parameter estimates of SPRWM with 5 knots

3-2 Comparison measures for SPM(5)* and
    SPRWM(5)* estimates with knot realignment

3-3 Percentage improvements of SPM(5)* and SPRWM(5)*
    estimates over SAIPE and CPS estimates

3-4 Parameter estimates of SPM(5)*

3-5 Parameter estimates of SPRWM(5)*

3-6 Comparison measures for time series and
    other model estimates

4-1 Comparison measures for univariate estimates

4-2 Percentage improvements of univariate
    estimates over Census Bureau estimates

4-3 Comparison measures for bivariate non-random
    walk estimates

4-4 Percentage improvements of bivariate non-random
    walk estimates over Census Bureau estimates

4-5 Comparison measures for bivariate random walk model









LIST OF FIGURES

Figure

2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column)
    and 3 randomly sampled controls (2nd column) plotted against age.

2-2 Sensitivity of /3, 0o, 1i and disease probability estimates to case-deletions.

3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean
    and median incomes. (1st column : IRS Mean Income; 2nd column : IRS Median
    Income).

3-2 Plots of CPS median income against IRS mean and median incomes for all
    the states of the U.S. from 1995 to 1999.

3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against
    IRS mean income. The knots are depicted as the bold faced triangles at the
    bottom.

3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles
    at the bottom. The region between the dashed and bold lines is the additional
    coverage area gained from the realignment.

3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution
    of the basic semiparametric and semiparametric random walk models. The
    X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees
    of freedom.









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS

By

Dhiman Bhadra

August 2010

Chair: Malay Ghosh
Cochair: Michael J. Daniels
Major: Statistics

Case-control studies and small area estimation are two distinct areas of modern
statistics. The former deals with the comparison of diseased and healthy subjects
with respect to risk factor(s) of a disease, with the aim of capturing disease-exposure
association, especially for rare diseases. The latter is concerned with measuring
characteristics of small domains, that is, regions whose sample size is so small that the
usual survey-based estimation procedures cannot be applied.
Both areas are important in their own right. Case-control studies form one of
the pillars of modern biostatistics and epidemiology and have diverse applications in
health-related issues, especially those involving rare diseases like cancer. On the
other hand, estimates of characteristics for small areas are widely used by federal and
local governments for formulating policies and decisions, in allocating federal funds to
local jurisdictions and in regional planning. My dissertation deals with the application of
Bayesian semiparametric procedures in modeling unorthodox data scenarios that may
arise in case-control studies and small area estimation.

The first part of the dissertation deals with an analysis of longitudinal case-control
studies, i.e., case-control studies for which time-varying exposure information is available
for both cases and controls. In a typical case-control study, the exposure information is
collected only once for the cases and controls. However, some recent medical studies
have indicated that a longitudinal approach of incorporating the entire exposure history,
when available, may lead to more precise estimates of the odds ratios of disease. We
use semiparametric regression procedures to model the exposure profiles of the cases
and controls and also the influence pattern of the exposure profile on the disease status.
This enables us to analyze how the present disease status of a subject is influenced
by his/her past exposure conditions, conditional on the current ones. Analysis is carried
out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC)
algorithms. The proposed methodology is motivated by, and applied to, a case-control
study of prostate cancer where longitudinal biomarker information is available for the
cases and controls.

The second and third parts of my dissertation deal with univariate and multivariate
semiparametric procedures for estimating characteristics of small areas across
the United States. In the second part, we put forward a semiparametric modeling
procedure for estimating the median household income for all the states of the U.S.
and the District of Columbia. Our models include a nonparametric functional part for
accommodating any unspecified time-varying income pattern and also a state-specific
random effect to account for the within-state correlation of the income observations.
Model fitting and parameter estimation are carried out in a hierarchical Bayesian
framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that
the semiparametric model estimates can be superior to both the direct estimates and
the Census Bureau estimates. Overall, our study indicates that proper modeling of the
underlying longitudinal income profiles can improve the performance of model-based
estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric
modeling procedure for the estimation of median income of four-person families for the
different states of the U.S. and the District of Columbia, while explicitly accommodating
the time-varying pattern in the income observations. Our estimates tend to
perform better than those provided by the Census Bureau and perform
comparably to some established methodologies, especially those involving
time series modeling techniques. Based on our findings in parts two and three, we
conclude that semiparametric and nonparametric regression models can be an
attractive alternative to the more traditional modeling frameworks, especially in situations
where information on different characteristics of small areas is available at multiple
time points in the past.









CHAPTER 1
INTRODUCTION

1.1 Overview of Dissertation

My dissertation primarily deals with the application of Bayesian semiparametric
methodologies in dealing with unorthodox data scenarios arising in case-control
studies and small area estimation. So, before going into the details of the specific
problems, I will introduce some of the basic principles and techniques that will provide
the necessary background to understand the key ideas. I will start by reviewing the
existing literature on case-control studies (Cornfield, 1951; Breslow et al., 1978, 1980;
Breslow, 1996) and small area estimation (Ghosh and Rao, 1994; Pfefferman, 1999;
Rao, 2003). I will then give a broad overview of non-parametric regression approaches
(Ruppert, Wand and Carroll, 2003; Wand, 2003) specifically related to penalized splines
(Eilers and Marx, 1996).

In Chapter 2, I present an analysis of a case-control study when longitudinal,
time-varying exposure observations are available for the cases and controls. Semiparametric
regression procedures are used to flexibly model the subject-specific exposure profiles
and also the influence pattern of the exposure profiles on the disease status. This
enables us to analyze whether past exposure observations affect the current disease
status of a subject conditional on his/her current exposure condition. The proposed
methodology is motivated by, and applied to, a case-control study of prostate cancer
where longitudinal biomarker information is available for the cases and controls.
We also show the details of the hierarchical Bayesian implementation of our models
and some equivalence results that have enabled us to use a prospective modeling
framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for
estimating the median household income of small areas when area-specific longitudinal
income observations are available. Our models include a nonparametric functional
part for accommodating any unspecified time-varying income pattern and also an
area-specific random effect to account for the dependence in the income observations
within each area. Model fitting and parameter estimation are carried out in a hierarchical
Bayesian framework using Markov chain Monte Carlo (MCMC) sampling schemes. We
apply our methodology to estimate the median household income of all fifty U.S. states
and the District of Columbia for a particular year. In doing so, we come to the conclusion
that proper modeling of the underlying longitudinal income profiles can improve the
performance of model-based estimates for small areas.

Chapter 4 deals with an extension of the methodology in Chapter 3 where a
bivariate semiparametric procedure has been used to estimate the median income of
families of varying sizes across small areas. This can also be seen as an extension
of the time series modeling framework of Ghosh et al. (1996). We show that the
semiparametric models generally have better performance than their time series
counterparts and, in a few situations, the performances are comparable. We want to
convey the message that semiparametric regression methodology can provide an
attractive alternative to the traditional modeling techniques, especially when time-varying
information is available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some
interesting open problems and areas for future research that may be worth pursuing.

1.2 Review of Case-Control Studies

The case-control study is one area of public health and epidemiology where statisticians
have made far-reaching contributions over the years. The fundamental principle of these
studies is the comparison of a group of diseased subjects (cases) and a group of
disease-free subjects (controls) with respect to one or more risk factors of the disease.
A primary goal of these studies is to analyze whether any or all of the risk factors are
associated with the disease in any way. Case-control studies are useful for detecting
disease-exposure association for rare diseases like cancer, where following a healthy
population (or cohort) over time is often impractical. Thus, case-control studies are
generally retrospective in nature.

Case-control studies have consistently attracted the attention of statisticians, and
as a result, a rich and voluminous body of work has developed over the years. Notable
work in the frequentist domain includes Cornfield (1951), who pioneered the logistic
model for the probability of disease given exposure. He was the first to demonstrate
that the exposure odds ratio for cases versus controls equals the disease odds ratio
for exposed versus unexposed and that the latter in turn approximates the ratio of the
disease rates if the disease is rare. Let D and E be dichotomous factors respectively
characterizing the disease and exposure status of individuals in a population. A common
measure of association between D and E is the (disease) odds ratio

$$\theta = \frac{P(D = 1 \mid E = 1)/P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0)/P(D = 0 \mid E = 0)}. \tag{1-1}$$

By applying Bayes' theorem, the above expression can be rewritten as

$$\theta = \frac{P(E = 1 \mid D = 1)/P(E = 0 \mid D = 1)}{P(E = 1 \mid D = 0)/P(E = 0 \mid D = 0)}, \tag{1-2}$$

which is the exposure odds ratio. Another well-known measure of association is the
relative risk (RR) of disease for different exposure values, given by
$P(D = 1 \mid E = 1)/P(D = 1 \mid E = 0)$. For rare diseases, both $P(D = 0 \mid E = 0)$ and
$P(D = 0 \mid E = 1)$ are close to one and the disease odds ratio is approximately equal to
the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further
clarified the relationship between a retrospective case-control study and a prospective
cohort study. They considered a series of 2 x 2 tables as in Table 1-1.

Table 1-1. A typical 2 x 2 table
Disease Status   Exposed     Not Exposed   Total
Case             $n_{11i}$   $n_{10i}$     $n_{1i}$
Control          $n_{01i}$   $n_{00i}$     $n_{0i}$
Total            $e_{1i}$    $e_{0i}$      $N_i$
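
As a rough numerical illustration of (1-1) and (1-2), the following Python sketch computes the disease odds ratio, the exposure odds ratio and the relative risk for a hypothetical population, together with the cross-product odds ratio of a single table laid out as in Table 1-1. The function names and all numbers are illustrative assumptions and are not taken from the dissertation.

```python
def sample_odds_ratio(n11, n10, n01, n00):
    """Cross-product odds ratio for one 2 x 2 table.

    Assumed cell layout, following Table 1-1: first index 1 = case,
    0 = control; second index 1 = exposed, 0 = not exposed.
    """
    return (n11 * n00) / (n10 * n01)


def population_measures(p_exposed, risk_exposed, risk_unexposed):
    """Return (disease OR, exposure OR, relative risk).

    p_exposed      = P(E = 1)
    risk_exposed   = P(D = 1 | E = 1)
    risk_unexposed = P(D = 1 | E = 0)
    """
    # Disease odds ratio, as in (1-1).
    disease_or = (risk_exposed / (1 - risk_exposed)) / (
        risk_unexposed / (1 - risk_unexposed)
    )

    # Joint probabilities P(D = d, E = e).
    p_d1_e1 = risk_exposed * p_exposed
    p_d0_e1 = (1 - risk_exposed) * p_exposed
    p_d1_e0 = risk_unexposed * (1 - p_exposed)
    p_d0_e0 = (1 - risk_unexposed) * (1 - p_exposed)

    # Exposure odds among cases divided by exposure odds among controls,
    # as in (1-2). P(D = d) cancels within each odds, so the joint
    # probabilities are enough.
    exposure_or = (p_d1_e1 / p_d1_e0) / (p_d0_e1 / p_d0_e0)

    relative_risk = risk_exposed / risk_unexposed
    return disease_or, exposure_or, relative_risk


if __name__ == "__main__":
    # Hypothetical rare disease: risks of 0.2% and 0.1%.
    d_or, e_or, rr = population_measures(0.3, 0.002, 0.001)
    print(f"disease OR = {d_or:.4f}, exposure OR = {e_or:.4f}, RR = {rr:.4f}")
```

With risks this small, the two odds ratios coincide and both sit very close to the relative risk, which is the rare-disease approximation described above.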









Let there be I tables of the above form. Then, the Mantel-Haenszel (MH) estimator

of the common odds ratio across the tables is given by
$$\hat{\theta}_{mh} = \frac{\sum_{i=1}^{I} n_{11i} n_{00i}/N_i}{\sum_{i=1}^{I} n_{01i} n_{10i}/N_i}$$    (1-3)

It may be of interest to test for the equality of the odds ratios across the I tables, i.e.,

$$H_0: \theta_1 = \cdots = \theta_I$$

The test statistic for the above hypothesis is given by the Mantel-Haenszel test statistic

$$T_{mh} = \sum_{i=1}^{I} \frac{\{n_{11i} - E(n_{11i} \mid \hat{\theta}_{mh})\}^2}{\mathrm{Var}(n_{11i} \mid \hat{\theta}_{mh})}$$    (1-4)

which follows an approximate $\chi^2$ distribution with $I - 1$ degrees of freedom under the null

hypothesis. The derivation of the variance of the Mantel-Haenszel estimator initially posed some

challenge but was eventually addressed in several subsequent papers (Breslow, 1996).

Breslow and Day (1980) marked the development of likelihood based inference methods

for the odds ratio. Methods to evaluate the simultaneous effects of multiple quantitative risk

factors on disease rates were pioneered in the 1960s.
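As a minimal sketch of the Mantel-Haenszel estimator in (1-3), the following Python code computes the common odds ratio from a stack of 2 x 2 tables laid out as in Table 1-1. The function name `mantel_haenszel_or` and the counts are hypothetical and serve only as an illustration; they are not data from this work.

```python
import numpy as np

def mantel_haenszel_or(tables):
    """Mantel-Haenszel estimate of a common odds ratio across I strata.

    tables has shape (I, 2, 2); tables[i] = [[n11, n10], [n01, n00]] with
    rows (case, control) and columns (exposed, not exposed).
    """
    t = np.asarray(tables, dtype=float)
    n11, n10 = t[:, 0, 0], t[:, 0, 1]
    n01, n00 = t[:, 1, 0], t[:, 1, 1]
    N = t.sum(axis=(1, 2))                      # stratum totals N_i
    return np.sum(n11 * n00 / N) / np.sum(n01 * n10 / N)

# Hypothetical counts for I = 3 strata (illustration only).
tables = [
    [[10, 20], [5, 40]],
    [[8, 12], [6, 30]],
    [[15, 25], [10, 50]],
]

# Stratum-specific odds ratios n11 * n00 / (n10 * n01).
per_table_or = [(a * d) / (b * c) for (a, b), (c, d) in tables]
theta_mh = mantel_haenszel_or(tables)
print(per_table_or, theta_mh)
```

Because the estimator is a weighted average of the stratum-specific odds ratios with weights n_{01i} n_{10i}/N_i, its value always lies between the smallest and the largest of them.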

In a case-control study, the appropriate likelihood is the "retrospective likelihood"

of exposure given the disease status. Cornfield et al. (1961) noted that if the exposure

distributions in the case and control populations are normal with different means but a

common covariance matrix, then the prospective probability of disease (D) given the

exposure (X) has the logistic form, i.e.,

$$P(D = 1 \mid X = x) = L(\alpha + \beta' x)$$    (1-5)

where $L(u) = 1/\{1 + \exp(-u)\}$. However, there is a conceptual complication in

using a prospective likelihood based on P(D | X) whereas a case-control sampling

design naturally leads to a retrospective likelihood, i.e., one involving P(X | D). This issue

was resolved by Prentice and Pyke (1979), who showed that the maximum likelihood

estimators and asymptotic covariance matrices of the log-odds ratios obtained from

the retrospective likelihood are the same as those obtained from a prospective likelihood

under a logistic formulation for the latter. Thus, case-control studies can be analyzed

using a prospective likelihood which generally involves fewer nuisance parameters than

a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to

the situation of missing data and measurement error in the exposure variables.
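The observation of Cornfield et al. (1961) that leads to (1-5) can be checked numerically in the simplest univariate case. The sketch below assumes a single exposure that is normal with means mu0 (controls) and mu1 (cases) and a common standard deviation sigma; all numerical values, and the names `pi1`, `mu0`, `mu1` and `sigma`, are hypothetical. Under these assumptions, Bayes' theorem gives a logistic form with slope (mu1 - mu0)/sigma^2.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def logistic(u):
    """L(u) = 1 / {1 + exp(-u)}."""
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical values: disease prevalence, exposure means, common sd.
pi1, mu0, mu1, sigma = 0.05, 0.0, 1.0, 1.5
x = np.linspace(-4.0, 5.0, 50)

# Prospective probability P(D = 1 | X = x) obtained from Bayes' theorem.
num = pi1 * normal_pdf(x, mu1, sigma)
p_bayes = num / (num + (1 - pi1) * normal_pdf(x, mu0, sigma))

# The same probability written in the logistic form L(alpha + beta * x).
beta = (mu1 - mu0) / sigma**2
alpha = np.log(pi1 / (1 - pi1)) - (mu1**2 - mu0**2) / (2 * sigma**2)
p_logistic = logistic(alpha + beta * x)

print(np.max(np.abs(p_bayes - p_logistic)))
```

The two curves agree up to floating point error, which is the univariate version of the result; the multivariate case replaces sigma^2 by the common covariance matrix.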

In a case-control set-up, matching is often used for selecting "comparable" controls

to eliminate bias due to confounding. Statistical techniques for analyzing matched

case-control data were first developed by Breslow et al. (1978). In the simplest setting, the

data consist of m matched sets, say, $S_1, \ldots, S_m$, with $M_i$ controls matched with a case in

each set or stratum. A prospective stratified logistic disease incidence model given by

$$P(D = 1 \mid z, S_i) = L(\alpha_i + \beta'(z - z_0))$$    (1-6)

is assumed. The $\alpha_i$'s are the stratum-specific intercept terms; they are treated as nuisance parameters

and are eliminated by conditioning on the number of cases in each stratum. The

resulting conditional logistic likelihood yields the optimum estimating function

(Godambe, 1976) for estimating $\beta$. The classical methods for analyzing unmatched

and matched studies suffer from loss of efficiency when the exposure variable is partially

missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing

exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric

method of estimation which took into account missing exposures in matched case

control studies. Satten and Kupper (1993), Paik and Sacco (2000) and Satten and

Carroll (2000) addressed the problem of missing exposure from a full likelihood

approach by assuming a distribution of the exposure variable in the control population.
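The conditional logistic likelihood mentioned above can be sketched as follows for the simplest setting of one case per matched set. Under model (1-6), conditioning on there being exactly one case in each set gives the contribution exp(beta'z_case) / sum_j exp(beta'z_j), where the sum runs over all members of the set, so the stratum intercepts cancel. The function name `conditional_loglik` and the covariate values are hypothetical.

```python
import numpy as np

def conditional_loglik(beta, matched_sets):
    """Conditional log-likelihood for 1:M matched sets.

    matched_sets is a list of (z_case, z_controls) pairs, where z_case has
    shape (p,) and z_controls has shape (M_i, p).
    """
    beta = np.asarray(beta, dtype=float)
    total = 0.0
    for z_case, z_controls in matched_sets:
        z_all = np.vstack([z_case, z_controls])   # row 0 is the case
        eta = z_all @ beta
        eta -= eta.max()                          # stabilise the log-sum-exp
        total += eta[0] - np.log(np.exp(eta).sum())
    return total

# Hypothetical matched sets with p = 2 exposures (illustration only).
matched_sets = [
    (np.array([1.0, 0.5]), np.array([[0.0, 0.2], [1.0, -0.3]])),
    (np.array([0.0, 1.2]), np.array([[1.0, 0.1]])),
]
beta = np.array([0.7, -0.4])
print(conditional_loglik(beta, matched_sets))
```

Adding the same constant to every covariate vector within a set leaves the value unchanged, which mirrors the way set-specific effects drop out of the conditional likelihood.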









Although a great amount of work has been done in the frequentist domain,

Bayesian modeling for case-control studies did not really start until the late 1980s.

The development of Markov chain Monte Carlo techniques led to rapid progress on

this front. Althman (1971) is probably the earliest Bayesian work; it considered several

2 x 2 contingency tables with a common odds ratio and performed a Bayesian test of

association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen

and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of

a case control model with a single binary exposure. These works dealt with inference

from the posterior distribution of summary statistics like the log odds ratio, risk ratio

and risk difference. Ashby et al. (1993) analyzed a case control study from a Bayesian

perspective and used it as a source of prior information for a second study. Their paper

emphasized the practical relevance of the Bayesian perspective in an epidemiological

study as a natural framework for integrating and updating knowledge available at each

stage.

Muller and Roeder (1997) introduced a novel aspect to Bayesian treatment of

case-control studies by considering continuous exposure with measurement error. Their

approach is based on a nonparametric model for the retrospective likelihood of the

covariates and the imprecisely measured exposure. They chose the non-parametric

distribution to be a class of flexible mixture distributions, obtained by using a mixture

of normal models with a Dirichlet process prior on the mixing measure (Escobar and

West, 1995). The prospective disease model relating disease to exposure is assumed

to have a logistic form characterized by a vector of log odds ratio parameters $\beta$. This

paper pioneered the use of continuous covariates, measurement error and flexible

non-parametric modeling of exposures in a Bayesian setting and brought to light

the tremendous potential of modern Bayesian computational techniques for handling

complex data scenarios in case-control studies. Seaman and Richardson (2001)

extended the binary exposure model of Zelen and Parker to any number of categorical

exposures. They achieved this by replacing the usual binomial model by a multinomial

one and using an MCMC scheme to estimate the log odds ratio of disease at each

category with respect to the baseline category. As in Muller and Roeder, they assumed

a prospective logistic likelihood and a flexible prior for the exposure distribution and

derived the implied retrospective likelihood. Muller et al. (1999) considered any number

of continuous and binary exposures. However, in contrast to Seaman and Richardson,

they specified a retrospective likelihood and then derived the implied prospective

likelihood. They also addressed the problem of handling categorical and quantitative

exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework

by discretizing them into groups and little information is lost if the discretization is

sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in

exposure by approximating the imprecisely measured exposure by a discrete distribution

supported on a suitably chosen grid. In the absence of measurement error, the support

is chosen as the set of observed values of the exposure, a device that resembles the

Bayesian Bootstrap (Rubin, 1981). They assigned a Dirichlet(1, 1,..., 1) prior on the

probability vector corresponding to the grid points. Seaman and Richardson (2004)

proved equivalence between the prospective and retrospective likelihood in the Bayesian

context. Specifically, they showed that the posterior distribution of the log-odds ratios based

on a prospective likelihood with a uniform prior distribution on the log odds (that an

individual with baseline exposure is diseased) is exactly equivalent to that based on a

retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in

the control group. Thus, Bayesian analysis of case-control studies can be carried out

using a logistic regression model under the assumption that the data were generated

prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies

when cases are individually matched to controls. They introduced nuisance parameters

to represent the separate effect of matching in each matched set. Ghosh and Chen

(2002) developed general Bayesian inferential techniques for matched case-control

problems in the presence of one or more binary exposure variables. Their framework

was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000),

they based their analysis on the unconditional rather than the conditional likelihood after

elimination of the nuisance parameters. Their framework included a wide variety of

links like complementary log links and some symmetric and skewed links in addition

to the usual logit and probit links. Recently Sinha et al. (2004) and Sinha et al. (2005)

proposed a unified Bayesian framework for matched case-control studies with missing

exposures. They also motivated a semiparametric alternative for modeling varying

stratum effects on the exposure distributions. The parameters were estimated in a

Bayesian framework by using a non-parametric Dirichlet process prior on the stratum

specific effects in the distribution of the exposure variable and parametric priors on all

other parameters. The interesting aspect of the Bayesian semiparametric methodology

is that it can capture unmeasured stratum heterogeneity in the distribution of the

exposure variable in a robust manner. They also extended the proposed method to

situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only

once for the cases and controls. However, some recent medical studies (Lewis et al.,

1996) have indicated that a longitudinal approach of incorporating the entire exposure

history, when available, may lead to a gain in information on the current disease status

of a subject and to more precise estimation of the odds ratio of disease. It may also

provide insights on how the present disease status of a subject is being influenced by

past exposure conditions conditional on the current ones. Unfortunately, proper and

rigorous statistical methods of incorporating longitudinally varying exposure information

inside the case control framework have not yet been adequately developed. In this work,

we present a Bayesian semiparametric approach for analyzing case control data when
longitudinal exposure information is available for both cases and controls.

1.3 Review of Small Area Estimation

Sample survey methodologies are widely used for collecting relevant information

about a population of interest. In many surveys it may be of interest to estimate

characteristics of small domains within the population of interest. Domains may be

geographic areas like state or province, county, school district etc. or can even be

identified by a particular socio-demographic characteristic like a specific age-sex-race

group within a large geographical area. Sometimes, the domain-specific sample size

may be too small to yield direct design-based estimates of adequate precision. This

led to the development of small area estimation procedures which provide accurate
model-based estimators of various features of small domains. In recent years, there has

been a growing demand for small area statistics both from the public and private sectors.
This is because these statistics are increasingly being used for formulating policies

and decisions, in allocating federal funds to local jurisdictions and in regional planning.

Ghosh and Rao (1994) provide a nice review of the different types of estimators and
inferential procedures used in survey sampling and small area estimation.

Since sample surveys are generally designed for large areas, the estimates of

means or totals obtained thereof are reliable for large domains. Direct survey based

estimators for small domains often yield large standard errors due to the small sample

size of the concerned area. This is due to the fact that the original survey was designed

to provide accuracy at a much higher level of aggregation than for local areas. This

makes it a necessity to "borrow strength" from adjacent or related areas to find indirect

estimators that increase the effective sample size and thus increase the precision of

the resulting estimate for a given small area. Broadly speaking, a small area model has

a generalized linear form with a mean term, a random area-specific effect term and a

measurement error term which reflects the noise from not sampling the entire domain.

Both the random effect and the noise are assumed to be independent realizations from

underlying distributions, usually assumed to be Gaussian. In the past decade, newer

methodologies, such as empirical Bayes (EB), hierarchical Bayes (HB) and empirical best

linear unbiased prediction (EBLUP), have been proposed for analyzing small area level data.

These have gone a long way in broadening the scope of application of small

area estimation techniques.

During the last 10-15 years, model based inference has been widely used in the

small area context. This is mainly due to the wide range of functionalities that come

with the linear mixed effects modeling framework. Some of the main advantages of this

framework are: (i) random area-specific effects account for between-area variation

above and beyond that explained by auxiliary variables in the model; (ii) different

variations like non-linear mixed effects models, logistic regression models and generalized

linear models can be entertained; (iii) area-specific measures of precision can be

associated with each small area estimate, unlike the global measures; (iv) complex data

structures like spatial dependence, time series structures and longitudinal measurements

can be explored; and (v) recent methodological developments for random effects models

can be utilized to achieve accurate small area inferences. Generally, there are two kinds

of small area models depending on whether the response is observed at the area or the

unit level.

1. Area (or aggregate) level models relate small area means to area specific auxiliary
variables.

2. Unit level models relate the unit values of the study variable to unit-specific
auxiliary variables.

The basic area level model is given by

$$\theta_i = z_i'\beta + b_i v_i, \quad i = 1, \ldots, m$$    (1-7)

Here $\theta_i$ is often assumed to be a function of the population mean, $\bar{Y}_i$, of the ith small

area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, and the $v_i$'s are area-specific random

effects assumed to be independent and identically distributed with mean 0 and constant

variance $\sigma_v^2$. Lastly, the $b_i$'s are known positive constants and $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of

regression coefficients.

In order to infer about the small area means, $\bar{Y}_i$, direct estimators, $\hat{\bar{Y}}_i$, are assumed

to be known and available. The linear model

$$\hat{\theta}_i = g(\hat{\bar{Y}}_i) = \theta_i + e_i, \quad i = 1, \ldots, m$$    (1-8)

is assumed, where the sampling errors, $e_i$, are independent with

$$E_p(e_i \mid \theta_i) = 0, \quad V_p(e_i \mid \theta_i) = \psi_i, \quad \psi_i \text{ known},$$

which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in (1-7), we have $\theta_i = z_i'\beta$,

which leads to synthetic estimators that do not account for local variation above and

beyond that reflected in the auxiliary variables $z_i$. Combining (1-7) and (1-8), we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i$$    (1-9)

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be

independent. Fay and Herriot (1979) studied the above area level model (1-9) in the

context of estimating the per capita income (PCI) for small places in the United States

and proposed an empirical Bayes estimator for that case. Ericksen and Kadane (1985)

used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the

U.S. decennial census. The area level model has also been used recently to produce

model based county estimates of poor school age children in the United States.
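To illustrate how the area level model (1-9) "borrows strength", the sketch below computes the best linear unbiased predictor of each small area parameter when the random effect variance is treated as known. The predictor is a weighted average of the direct estimate and the regression-synthetic estimate, with a weight that shrinks toward the synthetic part as the sampling variance grows. The function name `fay_herriot_blup`, the symbol `psi` for the known sampling variances and all numbers are assumptions made for this illustration.

```python
import numpy as np

def fay_herriot_blup(theta_hat, Z, psi, sigma_v2, b=None):
    """BLUP under theta_hat_i = z_i'beta + b_i v_i + e_i with known variances.

    theta_hat: direct estimates (m,); Z: auxiliary data (m, p);
    psi: known sampling variances (m,); sigma_v2: random effect variance.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    psi = np.asarray(psi, dtype=float)
    b = np.ones_like(theta_hat) if b is None else np.asarray(b, dtype=float)

    total_var = psi + sigma_v2 * b**2            # Var(theta_hat_i)
    w = 1.0 / total_var
    # Generalized least squares estimate of beta.
    beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * theta_hat))
    shrink = sigma_v2 * b**2 / total_var         # weight on the direct estimate
    synthetic = Z @ beta
    blup = shrink * theta_hat + (1.0 - shrink) * synthetic
    return blup, synthetic, shrink

# Hypothetical data for m = 5 small areas (illustration only).
theta_hat = np.array([10.2, 8.1, 12.5, 9.0, 11.3])
Z = np.column_stack([np.ones(5), [0.5, 0.1, 0.9, 0.3, 0.7]])
psi = np.array([0.5, 2.0, 1.0, 4.0, 0.8])

blup, synthetic, shrink = fay_herriot_blup(theta_hat, Z, psi, sigma_v2=1.0)
print(np.round(blup, 3), np.round(shrink, 3))
```

Setting `sigma_v2=0` reproduces the synthetic estimator described above, while areas with small sampling variance keep estimates close to their direct values. In practice the variance would be estimated, which gives an EBLUP or EB estimator rather than the BLUP shown here.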

In the unit level model, it is assumed that unit-specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$

are available for each population element j in each small area i. Moreover,
it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested

error linear regression model

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, N_i$$    (1-10)









Here the area-specific effects $v_i$ are assumed to be independently and identically

distributed (i.i.d.) with mean 0 and constant variance $\sigma_v^2$, and $e_{ij} = k_{ij}\tilde{e}_{ij}$, where the $k_{ij}$ are known and the

$\tilde{e}_{ij}$'s are i.i.d. random variables independent of the $v_i$'s with mean 0 and constant variance

$\sigma_e^2$. Normality of the $v_i$'s and $e_{ij}$'s is often assumed. For these models, the parameters of

interest are the small area means $\bar{Y}_i$ or the totals $Y_i$. Battese et al. (1988) studied the

nested error regression model (1-10) in estimating the area under corn and soybeans

for counties in North-Central Iowa using sample survey data and satellite information. In

doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the

small area means.

Over the years, numerous extensions have been proposed for the above modeling

frameworks including multivariate Fay-Herriot models, generalized linear models, spatial

models and models with more complicated random-effects structures. Rao (2003)

presented a nice overview of the different estimation methods while Jiang and Lahiri

(2006) reviewed the development of mixed model estimation in the small area context.

A proper review of model based small area estimation would be incomplete without

explaining the EBLUP, EB and HB approaches that are being widely used in this context.

As shown above, small area models are special cases of general linear mixed models

involving fixed and random effects such that small area parameters can be expressed

as linear combinations of these effects. Henderson (1950) derived the BLUP estimators

of small area parameters in the classical frequentist framework. These are so called

because they minimize the mean squared error among the class of linear unbiased

estimators and do not depend on normality. So, they are similar to the best linear

unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper

account of the between area variation relative to the precision of the direct estimator.

An EBLUP estimator is obtained by replacing the unknown variance parameters in the BLUP with

asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and

some applications. In an EB approach, the posterior distribution of the parameters of

interest given the data is first obtained assuming that the model parameters are known.

The model parameters are estimated from the marginal distribution of the data and

inferences are based on the estimated posterior distribution. An excellent account of the

EB approach along with its important applications is given in Morris (1983). Finally,

in the HB approach, a prior distribution is specified on the model parameters

and the posterior distribution of the parameter of interest is obtained. Inferences about

the parameters are based on the posterior distribution. The parameter of interest is

estimated by its posterior mean while its precision is estimated by its posterior variance.

Recent advances in Markov chain Monte Carlo techniques, specifically Gibbs and

Metropolis-Hastings samplers, have considerably simplified the computational aspects of

HB procedures.

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S.

Census Bureau was established with the aim of providing annual estimates of income

and poverty statistics for all states, counties and school districts across the United

States. The resulting estimates are generally used for the administration of federal

programs and the allocation of federal funds to local jurisdictions. The SAIPE program

also provides annual state and county level estimates of median household income.

Generally, observations on various characteristics of small areas that are collected over

time may possess a complicated underlying time-varying pattern. It is likely that models

which take into account this longitudinal pattern in the observations may perform better

than classical small area models which do not utilize this information. In this study, we

present a semiparametric Bayesian framework for the analysis of small area level data

which explicitly accommodates the longitudinal time varying pattern in the response

and the covariates.

1.4 Non-Parametric Regression Methodology

Non-parametric regression methods provide a powerful and flexible alternative

to parametric approaches when the relationship between response and predictor

variables is too complex to be expressible using a known functional form. One of the

main differences between parametric and nonparametric regression methodologies is

that, in the former, the true shape of the functional pattern is determined by the model

while in the latter, the shape is determined by the data itself.

Suppose the response y and the covariate x are related as

$$y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n$$    (1-11)

where f(x) is an unknown and unspecified smooth function of x and $e_i \sim N(0, \sigma_e^2)$.

The basic problem of "nonparametric regression" is to estimate the function $f(\cdot)$

using the data points (xi, yi). In doing so, it is typically assumed that beneath a rough

observational data pattern there is a smooth trajectory. This underlying smooth pattern

is estimated by various smoothing techniques. Broadly, there are four major classes of

smoothers used to estimate f(.), viz., local polynomial kernel smoothers (Fan and Gijbels

(1996); Wand and Jones (1995)), Regression splines (Eubank (1988), Eubank (1999)),

Smoothing splines (Wahba (1990); Green and Silverman (1994)) and Penalized splines

(Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths

and weaknesses. For example, local polynomial smoothers are computationally

advantageous for handling dense regions while smoothing splines may be better for

sparse regions. Here, we will briefly review the main characteristics of splines in general

and penalized splines in particular.

The basic idea behind splines is to express the unknown function f(x) using

piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points

in the range of x known as "knots". The knots, say, $(\tau_1, \ldots, \tau_K)$, partition the range of

x into K + 1 distinct subintervals (or neighborhoods). Within each such neighborhood, a

polynomial of certain degree is defined. A polynomial spline of degree p has (p - 1)

continuous derivatives and a discontinuous pth derivative at any interior knot. The pth

derivative reflects the "jump" of the splines at the knots. Thus, a spline of degree 0 is a

step function, a spline of degree 1 is a piecewise linear function and so on. For example,

$f(\cdot)$ can be represented as a linear combination of a pth degree truncated polynomial

basis having K knots, given by

$$1, x, \ldots, x^p, (x - \tau_1)_+^p, \ldots, (x - \tau_K)_+^p$$    (1-12)

Here $(x - \tau_k)_+^p$ is the function $(x - \tau_k)^p I\{x > \tau_k\}$. Using the above basis, a spline of degree

p can be expressed as

$$f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x - \tau_k)_+^p$$    (1-13)

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline

portions of the above structure and must be estimated. The choices p = 1, 2, 3 correspond to a

linear, quadratic or cubic spline respectively. The above basis constitutes one of the

most commonly used basis functions while other bases like radial basis or B-splines can

also be used. It can be shown that there exists a very rich class of spline-generating

functions which in turn greatly increases the scope and applicability of splines in various

modeling frameworks. Moreover, the very structure of the splines makes them extremely

good at capturing local variations in a pattern of observations, something which cannot

be achieved using Fourier or Polynomial bases.
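The truncated polynomial basis in (1-12) is straightforward to construct. The sketch below returns the polynomial columns and the truncated power columns separately; the function name `truncated_power_basis` and the example inputs are hypothetical.

```python
import numpy as np

def truncated_power_basis(x, knots, degree):
    """Return (X, Z) for the degree-p truncated polynomial basis.

    X has columns 1, x, ..., x^p and Z has columns (x - tau_k)_+^p,
    where (u)_+^p equals u^p for u > 0 and 0 otherwise.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    diff = x[:, None] - knots[None, :]
    Z = np.where(diff > 0, diff ** degree, 0.0)
    return X, Z

# Small example: a linear spline basis (p = 1) with K = 2 knots.
X, Z = truncated_power_basis([0.0, 0.5, 1.0], knots=[0.25, 0.75], degree=1)
print(X)
print(Z)
```

Each column of Z is zero to the left of its knot, which is why the associated coefficient only affects the fit locally, to the right of that knot.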

One of the most important aspects of smoothing is the proper selection and

positioning of the knots. This is because the knots act as "sensors" in relaying

information about the underlying "true" observational pattern. Too few knots often

lead to a biased fit while an excessive number of knots leads to overfitting through

overparametrization and may even worsen the resulting fit. Thus, a sufficient number

of knots should be used and they should be placed uniformly throughout the range of

the independent variable. Generally, the knots are placed on a grid of equally spaced

sample quantiles of x and a maximum of 35 to 40 knots suffices for any practical

problem (Ruppert, 2002). Recently, there have been interesting contributions on knot

selection which are more "data driven" or "adaptive" in nature both in the frequentist

and Bayesian domains (Friedman (1991); Stone et al. (1997); Denison et al. (1998);

Lindstrom (1999); DiMatteo et al. (2001); Botts and Daniels (2008)). The flexibility and

wide applicability of splines is due to the fact that provided the knots are evenly spread

out over the range of x, $f(x \mid \beta, \gamma)$ can accurately estimate a very large class of smooth

functions f(.) even if the degree of the spline is kept relatively low (say, 1 or 2).
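Placing knots at equally spaced sample quantiles, as described above, can be sketched in a few lines. The helper names `quantile_knots` and `default_num_knots`, and the specific rule of one knot per four unique observations capped at 35, are assumptions for this illustration and not a prescription from this work.

```python
import numpy as np

def default_num_knots(x, cap=35):
    """Hypothetical rule of thumb: one knot per four unique x values, capped."""
    n_unique = len(np.unique(x))
    return max(1, min(n_unique // 4, cap))

def quantile_knots(x, num_knots):
    """Knots at the k/(K+1) sample quantiles of the unique x values, k = 1..K."""
    ux = np.unique(x)
    probs = np.arange(1, num_knots + 1) / (num_knots + 1)
    return np.quantile(ux, probs)

# Example on a regular grid of 101 points.
x = np.arange(101.0)
knots = quantile_knots(x, num_knots=3)
print(default_num_knots(x), knots)
```

Using quantiles of the unique values puts more knots where the covariate is dense and fewer where it is sparse, so every knot has data nearby.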

The spline coefficients $(\gamma_1, \ldots, \gamma_K)$ in (1-13) correspond to the discontinuous

pth derivative of the spline; thus, they measure the jumps of the spline at the knots

$(\tau_1, \ldots, \tau_K)$ and contribute to the roughness of the resulting spline. In order to
"smooth-out" the fit, a "roughness penalty" is placed on these parameters. This is often

done by minimizing the expression

$$S = \sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda \sum_{k=1}^{K} \gamma_k^2$$    (1-14)

where $\lambda$ is known as the smoothing parameter. This is equivalent to minimizing

the first part of (1-14) subject to a constraint of the form $\gamma'\gamma \le C$, where the bound C is determined by $\lambda$. $\lambda$ plays a crucial role in the

smoothing process since it controls the trade-off between goodness of fit and roughness of the fitted

model. As $\lambda$ decreases, the spline tends to overfit, becoming an interpolating curve

as $\lambda \to 0$. As $\lambda$ increases, the spline becomes smoother and tends to the least

squares polynomial fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, like

cross-validation, generalized cross-validation, Mallows' $C_p$ criterion etc.
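The penalized criterion in (1-14) has a closed-form minimizer. Writing C for the design matrix with the polynomial and truncated power columns side by side, and D for the diagonal matrix that is zero for the polynomial coefficients and one for the spline coefficients, the solution is (C'C + lambda D)^{-1} C'y. The sketch below assumes simulated data and a linear spline; the function name `fit_pspline` and all settings are hypothetical.

```python
import numpy as np

def fit_pspline(x, y, knots, degree, lam):
    """Minimise ||y - C theta||^2 + lam * sum_k gamma_k^2 in closed form."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    diff = x[:, None] - np.asarray(knots, dtype=float)[None, :]
    Z = np.where(diff > 0, diff ** degree, 0.0)
    C = np.hstack([X, Z])
    # Only the K spline coefficients are penalised.
    D = np.diag(np.r_[np.zeros(degree + 1), np.ones(Z.shape[1])])
    theta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return theta, C @ theta

# Simulated data (illustration only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
knots = np.linspace(0.05, 0.95, 20)

theta_small, fit_small = fit_pspline(x, y, knots, degree=1, lam=0.01)
theta_big, fit_big = fit_pspline(x, y, knots, degree=1, lam=1e8)

rss_small = np.sum((y - fit_small) ** 2)
rss_big = np.sum((y - fit_big) ** 2)
rough_small = np.sum(theta_small[2:] ** 2)
rough_big = np.sum(theta_big[2:] ** 2)
print(rss_small, rss_big, rough_small, rough_big)
```

A small smoothing parameter gives a lower residual sum of squares but larger jumps at the knots, while a very large one shrinks the jumps to nearly zero and essentially returns the ordinary least squares line.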

Broadly speaking, there are three main types of splines : Regression splines,

Smoothing splines and Penalized splines (or P-splines). All of them are based on the

same principle as detailed above but differ in the specific manner in which smoothing

is done or the knots are selected. In regression splines, smoothing is achieved by the

deletion of non-essential knots or equivalently, by setting the jumps at those knots to

zero keeping the jumps at the other knots undisturbed. In smoothing and penalized

splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using

a penalty function as shown in (1-14). A major difference between smoothing splines

and penalized splines is that, in the former, all the unique data points are used as knots

but in the latter the number of knots is much smaller, resulting in more flexibility. In fact,

penalized splines can be seen as a generalization of regression and smoothing splines.

The wide applicability of penalized splines in diverse settings is mainly due to their

correspondence with linear mixed effects models. In fact, penalized splines can be

shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To

see this, we rewrite (1-14) as
$$S = \sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda \theta' D \theta$$    (1-15)

where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)'$ and D is a known positive

semi-definite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1) \times (p+1)} & 0_{(p+1) \times K} \\ 0_{K \times (p+1)} & \Sigma^{-1} \end{pmatrix}$$

Different types of penalties can be accommodated by specifying different forms of D.

For example, the penalty $\int \{f^{(2)}(x \mid \beta, \gamma)\}^2 \, dx$ used for smoothing splines can be achieved

with D being the sample second moment matrix of the second derivatives of the spline

basis functions. However, the above form of D only penalizes the spline coefficients

$(\gamma_1, \ldots, \gamma_K)$. Specifically, the penalty in (1-14) corresponds to setting $\Sigma = I$.
Let X be the matrix with ith row $X_i = (1, x_i, \ldots, x_i^p)$ and Z be the matrix with

ith row $Z_i = \{(x_i - \tau_1)_+^p, \ldots, (x_i - \tau_K)_+^p\}$. Using this formulation in (1-15) with the basis

functions in (1-12) and dividing by the error variance $\sigma_e^2$, we have

$$\frac{1}{\sigma_e^2} S = \frac{1}{\sigma_e^2} \|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_e^2} \|\gamma\|^2$$    (1-16)

By assuming that $\gamma$ is a vector of random effects with $\mathrm{Cov}(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_e^2/\lambda$,

and treating $\beta$ as the set of fixed effects parameters, the above penalized spline framework

yields the following well known mixed effects model representation:

$$y = X\beta + Z\gamma + e$$    (1-17)

where $\mathrm{Cov}(e) = \sigma_e^2 I$ and $\gamma$ and e are independent.
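The correspondence between the penalized fit and the mixed model can be verified numerically. The sketch below treats both variance components as known (hypothetical values), sets the smoothing parameter equal to their ratio, and compares the penalized least squares solution with the BLUP computed from the marginal covariance of y. The simulated data and all variable names are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, p = 60, 8, 1

# Simulated data and a linear truncated power basis (illustration only).
x = np.sort(rng.uniform(0.0, 1.0, n))
knots = np.linspace(0.1, 0.9, K)
X = np.vander(x, p + 1, increasing=True)
diff = x[:, None] - knots[None, :]
Z = np.where(diff > 0, diff ** p, 0.0)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

# Variance components treated as known; lambda is their ratio.
sigma_e2, sigma_g2 = 0.04, 0.5
lam = sigma_e2 / sigma_g2

# Route 1: penalised least squares.
C = np.hstack([X, Z])
D = np.diag(np.r_[np.zeros(p + 1), np.ones(K)])
theta_pen = np.linalg.solve(C.T @ C + lam * D, C.T @ y)

# Route 2: BLUP from the mixed model y = X beta + Z gamma + e.
V = sigma_g2 * Z @ Z.T + sigma_e2 * np.eye(n)     # marginal Cov(y)
Vinv_X = np.linalg.solve(V, X)
Vinv_y = np.linalg.solve(V, y)
beta_blup = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)        # GLS estimate
gamma_blup = sigma_g2 * Z.T @ np.linalg.solve(V, y - X @ beta_blup)

theta_blup = np.r_[beta_blup, gamma_blup]
print(np.max(np.abs(theta_pen - theta_blup)))
```

The two coefficient vectors coincide up to numerical error, so fitting the penalized spline with a given smoothing parameter is the same as computing the BLUP with the matching variance ratio.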

Bayesian P-splines have recently become popular because they combine the

flexibility of non-parametric models and the exact inference provided by the Bayesian

inferential procedure. This is even more true because of the seamless fusion of

penalized splines into the mixed model framework (Wand, 2003) as shown above. This

equivalence also carries over to the manner in which smoothing is done. Smoothing can

be achieved by imposing penalties on the spline coefficients $\gamma$, as shown in (1-14), or

by assuming a distributional form for $\gamma$, for example $\gamma \sim N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian

context, priors are placed on $\sigma_\gamma^2$ and the other parameters and the usual posterior sampling

is carried out. Since samples of the smoothing parameter are generated alongside

the other parameters, this method is also known as automatic scatterplot smoothing.

In all the problems tackled in this dissertation, we will be using Bayesian inferential

procedures on penalized splines as shown above.
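A minimal Gibbs sampler for a Bayesian penalized spline is sketched below. It is not the sampler used in later chapters; it assumes a flat prior on the polynomial coefficients, a normal prior on the spline coefficients and inverse gamma priors with hypothetical hyperparameters a = b = 0.001 on both variances, which makes all full conditionals available in closed form. The function name `gibbs_pspline` and the simulated data are likewise assumptions.

```python
import numpy as np

def gibbs_pspline(y, X, Z, n_iter=1500, burn=500, a=0.001, b=0.001, seed=0):
    """Gibbs sampler for y = X beta + Z gamma + e with conjugate priors."""
    rng = np.random.default_rng(seed)
    n, K = Z.shape
    q = X.shape[1]
    C = np.hstack([X, Z])
    D = np.diag(np.r_[np.zeros(q), np.ones(K)])
    CtC, Cty = C.T @ C, C.T @ y
    sig_e2, sig_g2 = 1.0, 1.0                     # starting values
    draws = []
    for it in range(n_iter):
        # (beta, gamma) | rest is normal with precision Q and mean Q^{-1} C'y / sig_e2.
        Q = CtC / sig_e2 + D / sig_g2
        L = np.linalg.cholesky(Q)
        mean = np.linalg.solve(Q, Cty / sig_e2)
        theta = mean + np.linalg.solve(L.T, rng.standard_normal(q + K))
        gamma = theta[q:]
        resid = y - C @ theta
        # Variances | rest are inverse gamma; draw as 1 / Gamma(shape, scale = 1 / rate).
        sig_g2 = 1.0 / rng.gamma(a + K / 2, 1.0 / (b + gamma @ gamma / 2))
        sig_e2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + resid @ resid / 2))
        if it >= burn:
            draws.append((C @ theta, sig_e2, sig_e2 / sig_g2))
    fits, s_e2, lam = map(np.array, zip(*draws))
    return fits.mean(axis=0), s_e2.mean(), lam.mean()

# Simulated data and a linear truncated power basis (illustration only).
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 1.0, 150))
truth = np.sin(2 * np.pi * x)
y = truth + rng.normal(0.0, 0.3, x.size)
knots = np.linspace(0.05, 0.95, 15)
X = np.vander(x, 2, increasing=True)
diff = x[:, None] - knots[None, :]
Z = np.where(diff > 0, diff, 0.0)

post_fit, post_sig_e2, post_lam = gibbs_pspline(y, X, Z)
rmse = np.sqrt(np.mean((post_fit - truth) ** 2))
print(rmse, post_sig_e2, post_lam)
```

The ratio of the two sampled variances plays the role of the smoothing parameter, so the amount of smoothing is learned from the data within the same sampler rather than being chosen by a separate criterion.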









CHAPTER 2
BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES WITH TIME
VARYING EXPOSURES

2.1 Introduction

The fundamental problem of case control studies is the comparison of a group

of subjects having a particular disease (cases) to a group of disease-free subjects

(controls) with respect to some potential risk factors (or exposures) of the disease.

Typically, in a case control study, the exposure information is collected only once for

the cases and controls. However, some recent medical studies (Lewis et al., 1996)

have indicated that a longitudinal approach of incorporating the entire exposure history,

when available, may lead to a gain in information on the current disease status of a

subject and more precise estimation of the odds ratio of disease. It may also provide

insights on how the present disease status of a subject is being influenced by past

exposure conditions conditional on the current ones. In this work, we present a Bayesian

semiparametric approach for analyzing case control data when longitudinal exposure

information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951),

Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and

far reaching contributions have been made in virtually every aspect of the field. Some of

the notable ones are equivalence of prospective and retrospective likelihood (Prentice

and Pyke, 1979), measurement error in exposures (Roeder et al., 1996) and matched

case-control studies (Breslow et al., 1978). Important contributions in the Bayesian

paradigm include binary exposures (Zelen and Parker, 1986), continuous exposures

(Muller and Roeder, 1997), categorical exposures (Seaman and Richardson, 2001),

equivalence (Seaman and Richardson, 2004) and matching (Diggle et al. (2000); Ghosh

and Chen (2002)).

The analysis of complex data scenarios in a case control framework is a relatively

new area of research. Specifically, analysis of longitudinal case control studies has only









started to receive some attention. Park and Kim (2004) are one of the first contributors

to this area. They proposed an ordinary logistic model to analyze longitudinal case

control data but ignored the longitudinal nature of the cohort. They also showed that

ordinary generalized estimating equations (GEE) based on an independence correlation

structure fail in this framework.

2.1.1 Setting

Case-control study designs generally incorporate exposure information for a single

time point in the past. In some situations however, an entire exposure history may be

available for the cases and controls containing relevant exposure information collected

at multiple time points in the past. However, proper and rigorous statistical methods

of incorporating longitudinally varying exposure information inside the case control

framework have not yet been adequately developed. This may be due to the obvious

complications in properly handling a longitudinal exposure profile and thereby integrating

it in an existing case control framework. But once done, there may be significant

payoffs, notably the ability to learn how the present disease status of a subject is being

influenced by his/her past exposure conditions conditional on the current ones. It can

also lead to valuable insights about differences in the exposure patterns between the

cases and controls over a long time span. In analyzing the effect of a longitudinally

varying exposure profile on a binary outcome variable (like disease status), some

of the possible challenges are: (1) The longitudinal exposure observations may be

unbalanced in nature, i.e., the number of observations and also the observation times may

differ from subject to subject; (2) The exposure trajectory may be highly nonlinear; (3)

The exposure observations may be subject to considerable measurement error and (4)

The effect of the exposure profile on the disease outcome may itself be complex and can

even change over time.

In view of the above challenges, we propose to use functional data analytic

techniques, especially nonparametric regression methodology, to model both the time

varying exposure profile and also the influence pattern of the exposure profile on

the binary outcome. Specifically, we model the underlying exposure trajectory using

penalized splines or p-splines (Eilers and Marx (1996); Ruppert et al. (2003)). We also

express the effect of the exposures on the current disease state as a penalized spline

to account for any possible time varying patterns of influence. Analysis is carried out

in a hierarchical Bayesian framework. Our modeling framework is quite flexible since

it can accommodate any possible non-linear time varying pattern in the exposure and

influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based

on the probability of exposure given the disease status. Prentice and Pyke (1979)

showed that the maximum likelihood estimators and asymptotic covariance matrices

of the log-odds ratios obtained from a retrospective likelihood are the same as those

obtained from a prospective likelihood (based on the probability of disease given

exposure) under a logistic formulation for the latter. Thus, case-control studies can

be analyzed using a prospective likelihood which generally involves fewer nuisance

parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a

similar result in the Bayesian context. Specifically, they showed that the posterior distribution

of the log-odds ratios based on a prospective likelihood with a uniform prior distribution

on the log odds (that an individual with baseline exposure is diseased) is exactly

equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on

the exposure probabilities in the control group. Thus, Bayesian analysis of case-control

studies can be carried out using a logistic regression model under the assumption that

the data was generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the

proposed semiparametric framework thus enabling us to perform the analysis based

on a prospective likelihood even though a case control study is retrospective in nature.

We perform model checking based on the posterior predictive loss criterion (Gelfand and

Ghosh, 1998). Once the optimal model is identified, model assessment is carried out

using case deletion diagnostics (Bradlow and Zaslavsky, 1997).

2.1.2 Motivating Dataset: Prostate Cancer Study

We illustrate our methodology on a data set from the Beta Carotene and Retinol

Efficacy Trial conducted by the Fred Hutchinson Cancer Research Center in Seattle,

Washington (Etzioni et al., 1999). This data set is based on a biomarker based

screening procedure for prostate cancer to elucidate the association between prostate

cancer and prostate-specific antigen (PSA). The effectiveness of biomarker based

screening procedures for prostate cancer is currently a topic of intense debate and

investigation in the realms of health care practice, policy and research. Since the

discovery of prostate-specific antigen (PSA) and the observation that serum PSA

levels may be significantly increased in prostate cancer patients, a lot of effort has been

dedicated to identifying effective PSA based testing programs with favorable diagnostic

properties.

In this study, the levels of free and total PSA were measured in the sera of 71

prostate cancer cases and 70 controls. Participants in this study included men aged

50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or

Beta Carotene and Retinol. The intervention had no noticeable effect on the incidence

of prostate cancer, with similar numbers of cases observed in the intervention and control

arms. Several PSA measurements recorded for the cases were taken as long as 10

years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between

September 1988 and September 1995 inclusive. The individuals deemed "controls"

were selected among individuals not yet diagnosed as having cancer by the time of

analysis. As the exposure variable, we use the natural logarithm of the total PSA (Ptotal)

although the negative logarithm of the ratio of free to total PSA (Pratio) can also be

considered. In addition to the above measurements, observations were collected on

time (years) relative to prostate cancer diagnosis and age at blood draw for the cases

and controls at each of the time points. Figure 2-1 shows the PSA trajectory against age

for some randomly chosen cases and controls. Etzioni et al. (1999) analyzed this data

set by modeling the receiver operating characteristic (ROC) curves associated with both

the biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis.

They observed that although the two markers performed similarly eight years prior to

diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.

The rest of the chapter is organized as follows. In Section 2.2, we introduce the

semiparametric modeling framework. Section 2.3 describes the details of posterior

inference. In Section 2.4, we discuss relevant Bayesian equivalence results for

our framework. Section 2.5 outlines the model comparison and model assessment

procedures we performed. We describe the data analysis results based on the prostate

cancer data set in Section 2.6 and end with a discussion in Section 2.7.

2.2 Model Specification

2.2.1 Notation

Let $Y_{ij}$ be the jth exposure (PSA) observation recorded for the ith subject, $a_{ij}$ the
age of the ith subject when the jth PSA observation is collected, and $t_{ij}$ the
time of the jth PSA measurement relative to the time of diagnosis for the ith subject
($i = 1, \ldots, N$; $j = 1, \ldots, n_i$). For cases, time of diagnosis is the time when cancer was
detected and no PSA measurement is available at that time. For controls, time of
diagnosis is synonymous with the last observation time or the time of normal digital rectal
examination (DRE). Denoting the age at diagnosis of the ith subject by $a_i^d$, we have the
simple linear relation $a_{ij} = t_{ij} + a_i^d$. This relationship will be used to simplify notation
below.

2.2.2 Model Framework

Our framework is composed of two models: (1) a trajectory model for the
longitudinal exposure trajectory and (2) a disease model for the effect of the exposure
trajectory on the binary disease outcome. Inference on these two models is done
simultaneously and is described in Section 2.3.

Figure 2-1. Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st
column) and 3 randomly sampled controls (2nd column) plotted against age.

Our modeling framework bears some resemblance to that of Zhang et al. (2007)

who used a two stage functional mixed model approach for modeling the effect of a

longitudinal covariate profile on a scalar outcome. They proposed a linear functional

mixed effects model for modeling the repeated measurements on the covariate. The

effect of the covariate profile on the scalar outcome was modeled using a partial

functional linear model. In doing so, they treated the unobserved true subject-specific

covariate time profile as a functional covariate. For fitting purposes, they developed

a two-stage nonparametric regression calibration method using smoothing splines.

Thus, estimation at both the stages was conveniently cast into a unified mixed model

framework by using the relation between smoothing splines and mixed models. The

key difference between their framework and ours is that we use Bayesian inferential

techniques to simultaneously estimate the parameters of the exposure and disease

models. Moreover, instead of a linear modeling framework, we use a combination of

linear and logistic models since our response is binary.

Exposure Trajectory Model

The exposure trajectory model is given by

$$Y_{ij} = X_i(a_{ij}) + \epsilon_{ij} = f(a_{ij}) + g_i(a_{ij}) + \epsilon_{ij} \qquad (2\text{-}1)$$

where $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$, $f(a)$ is the population mean function modeling the overall
PSA trend as a function of age for all the subjects while $g_i(a)$ is the subject specific
deviation function reflecting the deviation of the ith subject specific profile from the mean
population profile.

The reason for modeling exposure as a function of age is that for a randomly

chosen subject with unknown disease status, the PSA value at a certain time point

should depend on the subject's age at that time point controlling for the time with respect

to diagnosis. In other words, the same exposure observation recorded at the same

time relative to diagnosis for two subjects with widely different age ranges should have

different significance.

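We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using p-splines as follows

$$f(a_{ij}) = \beta_0 + \beta_1 a_{ij} + \cdots + \beta_p a_{ij}^p + \sum_{k=1}^{K} \beta_{p+k} (a_{ij} - \tau_k)_+^p = \Phi_{p,\tau}(a_{ij})'\beta,$$

$$g_i(a_{ij}) = b_{i0} + b_{i1} a_{ij} + \cdots + b_{iq} a_{ij}^q + \sum_{m=1}^{M} b_{i,q+m} (a_{ij} - \kappa_m)_+^q = \Phi_{q,\kappa}(a_{ij})'b_i, \qquad (2\text{-}2)$$

where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij} - \tau_1)_+^p, \ldots, (a_{ij} - \tau_K)_+^p]'$ and
$\Phi_{q,\kappa}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij} - \kappa_1)_+^q, \ldots, (a_{ij} - \kappa_M)_+^q]'$
are truncated polynomial basis functions of degrees p and q with
knots $(\tau_1, \ldots, \tau_K)$ and $(\kappa_1, \ldots, \kappa_M)$ respectively (Durban et al., 2004). Generally, M < K.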
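
The truncated polynomial bases in (2-2) can be evaluated directly. The following is a minimal sketch for a single age value using plain Python lists; the function names `truncated_power_basis` and `spline_value`, the knot locations and the coefficient values are hypothetical illustrations and are not taken from the thesis.

```python
def truncated_power_basis(a, degree, knots):
    """Return [1, a, ..., a^degree, (a - k_1)_+^degree, ..., (a - k_K)_+^degree]."""
    if degree < 1:
        raise ValueError("this sketch assumes degree >= 1")
    a = float(a)
    poly = [a ** d for d in range(degree + 1)]
    trunc = [max(a - k, 0.0) ** degree for k in knots]
    return poly + trunc


def spline_value(a, degree, knots, coef):
    """Evaluate basis(a)' coef, i.e. the spline at a single age a."""
    basis = truncated_power_basis(a, degree, knots)
    if len(basis) != len(coef):
        raise ValueError("coefficient vector has the wrong length")
    return sum(b * c for b, c in zip(basis, coef))


# Hypothetical linear p-spline (p = 1) with knots at ages 55 and 60.
knots = [55.0, 60.0]
beta = [0.2, 0.01, 0.05, 0.10]  # made-up coefficients, not estimates from the data
print(truncated_power_basis(58.0, 1, knots))
print(spline_value(58.0, 1, knots, beta))
```

Only knots below the evaluation age contribute to the truncated terms, which is what allows the slope of a linear p-spline to change at each knot.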

Disease Model

The prospective disease model is given by

$$P(D_i = 1 \mid X_i(t + a_i^d), -c < t < 0) = L\left(\alpha + \int_{-c}^{0} X_i(t + a_i^d)\,\gamma(t + a_i^d)\,dt\right) \qquad (2\text{-}3)$$

where $L(\cdot)$ is the logistic distribution function, $X_i(t + a_i^d)$ is the true, error-free unobserved
subject-specific exposure profile modeled as $f(t + a_i^d) + g_i(t + a_i^d)$ while $\gamma(t + a_i^d)$ is an
unknown smooth function of age which reflects the time pattern of the effect of the PSA
trajectory on the current disease status for the ith subject. In (2-3), we use the relation
$a_{ij} = t_{ij} + a_i^d$ to model the exposure trajectory $X_i(\cdot)$ and the influence function $\gamma(\cdot)$ as a

function of time with respect to diagnosis. In doing so, we can easily assess the effect

of the trajectory on the current disease state at any given point before diagnosis for a

particular subject. "c" is the time by which we go back in the past to record the exposure

history for the ith subject; e.g. c = 8 would imply that, for the ith subject, the exposure

observations recorded since eight years prior to diagnosis are being considered for

analysis. Thus, by changing the value of c, the effect of differential lengths of PSA

trajectories on the current disease status can be studied.








In the most general case, $\gamma(t + a_i^d)$ can also be modeled as a p-spline, i.e.,

$$\gamma(t + a_i^d) = \phi_0 + \phi_1 (t + a_i^d) + \cdots + \phi_r (t + a_i^d)^r + \sum_{k=1}^{K^*} \phi_{r+k} (t + a_i^d - \zeta_k)_+^r = \Phi_{r,\zeta}(t + a_i^d)'\phi \qquad (2\text{-}4)$$

where $\Phi_{r,\zeta}(t + a_i^d) = [1, (t + a_i^d), \ldots, (t + a_i^d)^r, (t + a_i^d - \zeta_1)_+^r, \ldots, (t + a_i^d - \zeta_{K^*})_+^r]'$,
$\phi = (\phi_0, \ldots, \phi_{K^*+r})'$ and $(\zeta_1, \ldots, \zeta_{K^*})$ are the knots.

As special cases of (2-4), we may consider $\gamma(t + a_i^d) = \phi_0$, in which case the
covariate is the area under the PSA process $\{X_i(t + a_i^d), -c < t < 0\}$ and $\phi_0$ is its effect
on the disease probability (or logit of the disease probability). We can also assume
$\gamma(t + a_i^d) = \phi_0 + \phi_1 (t + a_i^d)$, which signifies a linear pattern of the effect of the exposure
trajectory on the disease probability. In the above models, the knots can be chosen on a
grid of equally spaced quantiles of the ages.
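Replacing (2-2) and (2-4) in the R.H.S. of (2-3), we have

$$P(D_i = 1 \mid X_i(t + a_i^d), -c < t < 0) = L\left(\alpha + \int_{-c}^{0} \left(\Phi_{p,\tau}(t + a_i^d)'\beta + \Phi_{q,\kappa}(t + a_i^d)'b_i\right) \Phi_{r,\zeta}(t + a_i^d)'\phi\,dt\right)$$

$$= L(\alpha + \beta' M_i \phi + b_i' Q_i \phi) \qquad (2\text{-}5)$$

where $M_i = \int_{-c}^{0} \Phi_{p,\tau}(t + a_i^d)\,\Phi_{r,\zeta}(t + a_i^d)'\,dt$ and $Q_i = \int_{-c}^{0} \Phi_{q,\kappa}(t + a_i^d)\,\Phi_{r,\zeta}(t + a_i^d)'\,dt$.
For pre-chosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are
matrices and are available in closed form. We assume normal distributional forms for
the spline coefficients in (2-2) and (2-4) in order to penalize the jumps of the spline at
the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$
and $\phi_{r+k} \sim N(0, \sigma_\phi^2)$ $(k = 1, \ldots, K^*)$. Finally, the random subject specific deviation
function $g_i(a_{ij})$ is modeled as $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\ j = 0, \ldots, q)$.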
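
The matrices $M_i$ and $Q_i$ in (2-5) are integrals of outer products of spline bases over the exposure window. The text notes that they are available in closed form; the sketch below instead approximates them with a midpoint rule, purely to illustrate how the disease probability is assembled. All function names, the knot location, the window length and the parameter values are hypothetical.

```python
import math


def basis(a, degree, knots):
    """Truncated power basis [1, a, ..., a^degree, (a - k)_+^degree for each knot]."""
    return [a ** d for d in range(degree + 1)] + [max(a - k, 0.0) ** degree for k in knots]


def cross_moment(basis_x, basis_g, a_d, c, n_grid=2000):
    """Midpoint-rule approximation of the integral over (-c, 0) of
    basis_x(t + a_d) basis_g(t + a_d)'."""
    h = c / n_grid
    rows, cols = len(basis_x(a_d)), len(basis_g(a_d))
    out = [[0.0] * cols for _ in range(rows)]
    for s in range(n_grid):
        age = a_d - c + (s + 0.5) * h
        bx, bg = basis_x(age), basis_g(age)
        for r in range(rows):
            for k in range(cols):
                out[r][k] += bx[r] * bg[k] * h
    return out


def bilinear(u, mat, v):
    """Compute u' mat v for plain Python lists."""
    return sum(u[r] * mat[r][k] * v[k] for r in range(len(u)) for k in range(len(v)))


def disease_probability(alpha, beta, b_i, phi, m_i, q_i):
    """Logistic probability L(alpha + beta' M_i phi + b_i' Q_i phi)."""
    eta = alpha + bilinear(beta, m_i, phi) + bilinear(b_i, q_i, phi)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical setup: linear trajectory spline with one knot at age 60, a random
# intercept and slope, and a constant influence function (r = 0, no knots).
def fixed_basis(a):
    return basis(a, 1, [60.0])


def random_basis(a):
    return basis(a, 1, [])


def influence_basis(a):
    return basis(a, 0, [])


a_d, c = 62.0, 8.0  # made-up age at diagnosis and window length
M_i = cross_moment(fixed_basis, influence_basis, a_d, c)
Q_i = cross_moment(random_basis, influence_basis, a_d, c)
prob = disease_probability(-5.0, [0.1, 0.005, 0.02], [0.0, 0.0], [0.3], M_i, Q_i)
print(M_i, Q_i, prob)
```

With a constant influence function, each entry of $M_i$ is simply the integral of one trajectory basis function over the window, so $\beta' M_i \phi$ reduces to $\phi_0$ times the area under the population mean curve.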









2.3 Posterior Inference


2.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ and $D_i$ be the exposure vector and disease status while
$a_i = (a_{i1}, \ldots, a_{in_i})'$ and $t_i = (t_{i1}, \ldots, t_{in_i})'$ be the observed values of age and time with
respect to diagnosis for the ith subject respectively. So, the response vector for the ith
subject will be the pair $(Y_i, D_i)$. Let
$\Omega_i = (\alpha, \beta, \phi, b_i, \sigma_\epsilon^2, \sigma_\beta^2, \sigma_b^2, \sigma_\phi^2, \{\sigma_0^2, \ldots, \sigma_q^2\})$ be the
parameter space corresponding to the ith subject. Thus, the full parameter space will be
given by $\Omega = \Omega_1 \cup \Omega_2 \cup \cdots \cup \Omega_N$.

The likelihood for the ith subject, conditional on the random effects, is given by

$$L(Y_i, D_i, a_i \mid \Omega_i) \propto p(Y_i \mid \beta, a_i, b_i, \sigma_\epsilon^2)\, p(D_i \mid \alpha, \beta, \phi)\, p(\beta^{(2)} \mid \sigma_\beta^2)\, p(\phi^{(2)} \mid \sigma_\phi^2)$$

$$\times\; p(b^{(2)} \mid \sigma_b^2) \prod_{i=1}^{N} \prod_{j=0}^{q} p(b_{ij} \mid \sigma_j^2) \qquad (2\text{-}6)$$

where $p(Y_i \mid \beta, a_i, b_i, \sigma_\epsilon^2)$ is the probability distribution corresponding to the trajectory
model, $p(D_i \mid \alpha, \beta, \phi)$ denotes the logistic distribution corresponding to the disease model
while the rest deals with the distributional structures on the spline coefficients and
random effects.

Since the trajectory model (2-1) has a normal distributional structure while the

disease model (2-3) has a logistic structure, the likelihood function and hence the

posterior have a complicated form. To alleviate this problem, we approximate the logistic

distribution as a mixture of normals using a well known data augmentation algorithm

proposed by Albert and Chib (1993). This is briefly explained in Section 2.3.3.

2.3.2 Priors

To complete the Bayesian specification of our model, we need to assign prior

distributions to the unknown parameters. We assume diffuse normal priors for the

polynomial coefficients $(\beta_0, \ldots, \beta_p)$ and $(\alpha, \phi_0, \ldots, \phi_r)$. For the variance components
$(\sigma_\epsilon^2, \sigma_\beta^2, \sigma_b^2, \sigma_\phi^2, \{\sigma_0^2, \ldots, \sigma_q^2\})$, we assume uniform priors with large upper bounds. The prior

distributions are assumed to be mutually independent. We choose large values for the

normal prior variances to make the priors diffuse in nature so that inference is mainly

controlled by the data distribution.

2.3.3 Posterior Computation

Likelihood Approximation

As mentioned in Section 2.3.1, we have used the data augmentation algorithm

of Albert and Chib (1993) to approximate the likelihood and thus simplify posterior

inference. They showed that a logistic regression model on binary outcomes can be

well approximated by an underlying mixture of normal regression structure on latent

continuous data. In doing so, it can be shown that a logit link is approximately equivalent

to a Student-t link with 8 degrees of freedom.

As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that
$D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let $Z_i$ be independently distributed from a t
distribution with location $H_i = \alpha + \beta' M_i \phi + b_i' Q_i \phi$, scale parameter 1 and degrees of
freedom $\nu$. Equivalently, with the introduction of the additional random variable $\lambda_i$, the
distribution of $Z_i$ can be expressed as a scale mixture of normal distributions

$$Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \qquad \lambda_i \sim \text{Gamma}(\nu/2, 2/\nu)$$

where the Gamma pdf is proportional to $\lambda_i^{\nu/2 - 1} \exp(-\nu\lambda_i/2)$. Using this approximation,
we can replace the logit link by a mixture of normals and can rewrite (2-6) as

$$L(Y_i, D_i, a_i \mid \Omega_i) \propto \left\{\prod_{j=1}^{n_i} p(Y_{ij} \mid S_{ij}, \sigma_\epsilon^2)\right\} \int\!\!\int A_i\, p(Z_i \mid H_i, 1/\lambda_i)\, G(\lambda_i \mid \nu/2, 2/\nu)\, dZ_i\, d\lambda_i$$

$$\times\; p(\beta^{(2)} \mid \sigma_\beta^2)\, p(\phi^{(2)} \mid \sigma_\phi^2)\, p(b^{(2)} \mid \sigma_b^2) \prod_{i=1}^{N} \prod_{j=0}^{q} p(b_{ij} \mid \sigma_j^2)$$

where $p(U \mid a, b)$ denotes a normal density with mean a and variance b while $G(V \mid a, b)$
denotes a gamma density with shape a and scale b. Moreover, $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\kappa}(a_{ij})'b_i$
and $A_i = \{I(Z_i > 0) I(D_i = 1) + I(Z_i \le 0) I(D_i = 0)\}$.
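
The scale-mixture representation can be checked by simulation. The sketch below draws $\lambda_i$ from a gamma distribution with shape $\nu/2$ and scale $2/\nu$ and then $Z_i$ from a normal with variance $1/\lambda_i$; the resulting draws should behave like a t variable with $\nu$ degrees of freedom, whose variance is $\nu/(\nu - 2)$. The seed, the value of the linear predictor and the function names are arbitrary choices made for illustration, and this is only the unconditional draw, not the full Gibbs step.

```python
import math
import random


def draw_latent(h_i, nu, rng):
    """Draw Z_i with location h_i, scale 1 and nu degrees of freedom via
    lambda_i ~ Gamma(shape nu/2, scale 2/nu) and Z_i | lambda_i ~ N(h_i, 1/lambda_i)."""
    lam = rng.gammavariate(nu / 2.0, 2.0 / nu)  # second argument is the scale
    return rng.gauss(h_i, 1.0 / math.sqrt(lam))


def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(2010)  # arbitrary seed, for reproducibility only
nu, h_i = 8.0, 0.5         # h_i is a made-up linear predictor
draws = [draw_latent(h_i, nu, rng) for _ in range(100000)]
print(sample_variance(draws), nu / (nu - 2.0))   # the two values should be close
print(sum(z > 0 for z in draws) / len(draws))    # Monte Carlo estimate of P(D_i = 1)
```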









Posterior Sampling


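The full posterior of the parameters is given by

$$p(\Omega \mid Y, D, a) \propto \prod_{i=1}^{N} L(Y_i, D_i, a_i \mid \Omega_i)\; p(\beta^{(1)})\, p(\phi^{(1)})\, p(\alpha) \prod_{j=0}^{q} p(\sigma_j^2)\; p(\sigma_\beta^2)\, p(\sigma_b^2)\, p(\sigma_\phi^2)\, p(\sigma_\epsilon^2)$$

where $\beta^{(1)} = (\beta_0, \ldots, \beta_p)$ and $\phi^{(1)} = (\phi_0, \ldots, \phi_r)$. The full posterior can be factorized as

$$[\Omega \mid Y, D, a] \propto [Y \mid \beta, b, \sigma_\epsilon^2] \left[\prod_{i=1}^{N} \int\!\!\int \{A_i\, [Z_i \mid D_i, \lambda_i, \alpha, \beta, \phi, b_i]\, [\lambda_i \mid \nu]\}\, dZ_i\, d\lambda_i\right] [\beta^{(2)} \mid \sigma_\beta^2]\, [\phi^{(2)} \mid \sigma_\phi^2]\, [b^{(2)} \mid \sigma_b^2]$$

$$\times \prod_{i=1}^{N} \prod_{j=0}^{q} [b_{ij} \mid \sigma_j^2]\; [\beta^{(1)}]\, [\phi^{(1)}]\, [\alpha] \prod_{j=0}^{q} [\sigma_j^2]\; [\sigma_\beta^2]\, [\sigma_b^2]\, [\sigma_\phi^2]\, [\sigma_\epsilon^2]$$

where $\Omega$ is the entire parameter space. Our main parameter of interest is $\phi$ in (2-5).

Since the marginal posterior distribution of $\phi$ is analytically intractable, we construct an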

MCMC algorithm to sample from its full conditionals. In doing so, we use multiple chains

and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman

and Rubin, 1992).
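
For reference, the basic Gelman and Rubin potential scale reduction factor compares between-chain and within-chain variability for a scalar parameter. The sketch below is one common form of that diagnostic (without the degrees-of-freedom correction), applied to simulated chains; the chains, the seed and the function name `gelman_rubin` are illustrative assumptions and do not reproduce the sampler used here.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    within = mean(variance(c) for c in chains)          # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)  # arbitrary seed
# Three simulated chains targeting the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Two simulated chains stuck around different values.
stuck = [[rng.gauss(0.0, 1.0) for _ in range(1000)],
         [rng.gauss(3.0, 1.0) for _ in range(1000)]]
r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print(r_mixed, r_stuck)
```

Values close to 1 are usually read as consistent with convergence, whereas values well above 1 indicate that the chains have not yet mixed.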

2.4 Bayesian Equivalence

As mentioned in Section 2.1.1, Seaman and Richardson (2004) showed that for

certain choices of the priors on the log odds, posterior inference for the parameter of

interest based on a prospective logistic model can be shown to be equivalent to that

based on a retrospective one. As a result, a prospective modeling framework can be

used to analyze case-control data which are generally collected retrospectively. Here we

show that the Bayesian equivalence results of Seaman and Richardson (2004) can be

extended to the semiparametric framework we have proposed. This enables us to use

a prospective logistic framework (as described in Section 2.2.2) to analyze the PSA

dataset.

Our modeling framework hinges on the idea that for every subject, instead of a

single exposure observation, a series of past exposure observations are available.

We use this "exposure trajectory" or "exposure profile" in analyzing the present
disease status of a subject. In the spirit of our dataset, we assume that the exposure
observations are continuous. Let the exposure profile for the ith subject be
$X_i(t) = \{X_{i1}, \ldots, X_{in_i}\}$ $(i = 1, \ldots, N;\ -c < t < 0)$ where $X_{ij}$ is the jth exposure observation
recorded for the ith subject. Let $X = \{X_{11}, \ldots, X_{1n_1}, \ldots, X_{N1}, \ldots, X_{Nn_N}\}$ be the set of
all exposure observations. Since an exposure trajectory is composed of a finite set
of exposure observations, the discretizing mechanism proposed by Rubin (1981)
and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e.,
$\{X_i(t), -c < t < 0\}$ can be assumed to be a discrete random variable with support
$\{Z_1(t), \ldots, Z_J(t), -c < t < 0\}$, the set of all observable exposure trajectories,
where $\{Z_j(t), -c < t < 0,\ j = 1, \ldots, J\}$ is a finite collection of elements in the
support of the $X_i$'s. Let $y_{0j}$ and $y_{1j}$ be the number of controls and cases having
exposure profile $\{Z_j(t), -c < t < 0\}$. We denote the "null" or "baseline" trajectory
as $\{X(t) = 0, -c < t < 0\}$.

The odds ratio of disease corresponding to $\{Z_j(t), -c < t < 0\}$ with respect to
baseline exposure is $\exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)$. Assuming that a control has exposure
profile $\{Z_j(t), -c < t < 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that

$$P(X(t) = Z_j(t), -c < t < 0 \mid D = 1) = \frac{\delta_j \exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(\int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)}.$$
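Thus, the retrospective likelihood is

$$L(\delta, \phi) = \prod_{d=0}^{1} \prod_{j=1}^{J} \left\{\frac{\delta_j \exp\left(d \int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(d \int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)}\right\}^{y_{dj}} \qquad (2\text{-}7)$$

$$= \prod_{d=0}^{1} \prod_{j=1}^{J} \left\{\frac{\delta_j \exp\left(d\,\phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(d\,\phi' \int_{-c}^{0} Z_k(t)\Phi(t)\,dt\right)}\right\}^{y_{dj}} \qquad (2\text{-}8)$$

since $\gamma(t) = \Phi(t)'\phi = \phi'\Phi(t)$ by (2-4). We assume $\delta_1 = 1$ for identifiability. Here d = 0
and 1 stand for controls and cases respectively. Assuming $\theta$ to be the baseline odds of
disease, the prospective likelihood is given by

$$L(\theta, \phi) = \prod_{d=0}^{1} \prod_{j=1}^{J} \left\{\frac{\theta^d \exp\left(d\,\phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)}{\sum_{k=0}^{1} \theta^k \exp\left(k\,\phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)}\right\}^{y_{dj}} \qquad (2\text{-}9)$$

Based on the above setup, we have the following equivalence results: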

Theorem 1. The profile likelihood of $\phi$ obtained by maximizing $L(\delta, \phi)$ in (2-8) with
respect to $\delta$ is the same as that obtained by maximizing $L(\theta, \phi)$ in (2-9) with respect to
$\theta$.

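Theorem 2. Let $Y_{dj}$ $(d = 0, 1;\ j = 1, \ldots, J)$ be independently distributed as Poisson($\lambda_{dj}$)
where

$$\log \lambda_{dj} = d \log\theta + \log\delta_j + d\,\phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt. \qquad (2\text{-}10)$$

We assume independent priors, $p(\theta) \propto \theta^{-1}$ and $p(\delta_j) \propto \delta_j^{a_j - 1}$, for $\theta$ and $\delta$. The prior for
$\phi$, $p(\phi)$, is chosen to be independent of $\theta$ and $\delta$ and such that, for some q and r with
$y_{1q} \ge 1$ and $y_{0r} \ge 1$, $E\left(\phi' \int_{-c}^{0} Z_q(t)\Phi(t)\,dt\right)$ and $E\left(\phi' \int_{-c}^{0} Z_r(t)\Phi(t)\,dt\right)$ exist and are
finite (i.e., $p(\phi)$ is such that $E(\phi)$ exists and is finite). Let $y_{+j} = y_{0j} + y_{1j}$ and $y_{d+} = \sum_{j=1}^{J} y_{dj}$.
Then the following statements hold:

(i) Assuming $\omega = \log\theta$, the posterior density of $(\omega, \phi)$ is

$$p(\omega, \phi \mid y) \propto p(\phi) \prod_{j=1}^{J} \frac{\left\{\exp\left(\omega + \phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)\right\}^{y_{1j}}}{\left\{1 + \exp\left(\omega + \phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)\right\}^{y_{+j} + a_j}} \qquad (2\text{-}11)$$

(ii) Assuming $\vartheta = (\vartheta_1, \ldots, \vartheta_J)$ and $\vartheta_j = \delta_j / \sum_{k=1}^{J} \delta_k$, the posterior density of $(\vartheta, \phi)$ is

$$p(\vartheta, \phi \mid y) \propto p(\phi) \prod_{j=1}^{J} \vartheta_j^{a_j - 1} \prod_{j=1}^{J} \prod_{d=0}^{1} \left\{\frac{\vartheta_j \exp\left(d\,\phi' \int_{-c}^{0} Z_j(t)\Phi(t)\,dt\right)}{\sum_{k=1}^{J} \vartheta_k \exp\left(d\,\phi' \int_{-c}^{0} Z_k(t)\Phi(t)\,dt\right)}\right\}^{y_{dj}} \qquad (2\text{-}12)$$

(iii) The marginal posterior densities of $\phi$ obtainable from $p(\omega, \phi \mid y)$ and $p(\vartheta, \phi \mid y)$ are
the same.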
The proofs of the above theorems are similar in nature to those in Seaman and
Richardson (2004) and are given in Appendix A. Since we have considered a near
uniform prior for $\alpha$ and our prior on $\phi$ ensures the existence and finiteness of $E(\phi)$, the
conditions of Theorem 2 are essentially satisfied for our framework.
Based on the above results, it can be concluded that the marginal posterior
distribution of $\phi$, the parameter of interest, will be the same regardless of whether
we fit a prospective or retrospective model. Thus, we can analyze the PSA data using
the prospective semiparametric modeling framework described above. Bayesian
equivalence can also be shown in the more general case of multicategory case control
setup, i.e when there are multiple (> 2) disease states. We have the following result
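Theorem 3. Let $\{X(t), -c < t < 0\}$ be any exposure trajectory with support
$\{Z_1(t), \ldots, Z_K(t), -c < t < 0\}$, the set of all observable exposure trajectories. Let there
be r + 1 disease categories. Suppose $Y_{dk}$ $(d = 0, 1, \ldots, r;\ k = 1, \ldots, K)$ are independently
distributed as Poisson($\lambda_{dk}$) where

$$\log(\lambda_{dk}) = \log(\theta_d) + \log(\eta_{dk}) + \log(\delta_k), \qquad \log(\lambda_{0k}) = \log(\delta_k),$$

$\theta_d$ being the baseline odds for disease category d and $\eta$ being the parameter of interest.
Assume independent improper priors, $\pi(\theta_d) \propto \theta_d^{-1}$ and $\pi(\delta_k) \propto \delta_k^{-1}$, for $\theta$ and $\delta$ and a prior
$\pi(\eta)$ for $\eta$ that is independent of $\theta$ and $\delta$ and proper, i.e., $E(\eta)$ exists and is finite. Let $n_{dk}$
be the number of individuals with D = d and $\{X(t) = Z_k(t), -c < t < 0\}$. Then the
following statements hold:

(i) The posterior density of $(\eta, \theta)$ is

$$\pi(\eta, \theta \mid n) \propto \prod_{d=1}^{r} \prod_{k=1}^{K} (\theta_d \eta_{dk})^{n_{dk}} \prod_{k=1}^{K} \left(1 + \sum_{d=1}^{r} \theta_d \eta_{dk}\right)^{-\sum_{d=0}^{r} n_{dk}} \prod_{d=1}^{r} \theta_d^{-1}\; \pi(\eta)$$

(ii) Assuming $\vartheta = (\vartheta_1, \ldots, \vartheta_K)$ and $\vartheta_k = \delta_k / \sum_{l=1}^{K} \delta_l$, the posterior density of $(\vartheta, \eta)$ is

$$\pi(\vartheta, \eta \mid n) \propto \prod_{k=1}^{K} \vartheta_k^{n_{0k}} \prod_{d=1}^{r} \prod_{k=1}^{K} \left(\frac{\vartheta_k \eta_{dk}}{\sum_{l=1}^{K} \vartheta_l \eta_{dl}}\right)^{n_{dk}} \prod_{k=1}^{K} \vartheta_k^{-1}\; \pi(\eta)$$

(iii) The marginal posterior densities of $\eta$ obtainable from $\pi(\eta, \theta \mid n)$ and $\pi(\vartheta, \eta \mid n)$ are
the same.

The proof of the above theorem is given in Appendix A.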

2.5 Model Comparison and Assessment

2.5.1 Posterior Predictive Loss

We performed model comparison using the posterior predictive loss (PPL) criterion

proposed by Gelfand and Ghosh (1998). This criterion is based on the idea that an

optimal model should provide accurate prediction of a replicate of the observed data.









Gelfand and Ghosh (1998) obtained this criterion by minimizing the posterior loss for

a given model and then, for all models under consideration, selecting the one which

minimizes this criterion. For a general loss function, this criterion can be expressed as a

linear combination of two distinct parts, i.e., a goodness-of-fit part and a penalty part. For

our framework, the posterior predictive loss can be written as

$$\text{PPL} = \sum_{i=1}^{N} (D_i - \bar{D}_i)^2 + \frac{k}{k+1} \sum_{i=1}^{N} \text{Var}(D_i) \qquad (2\text{-}13)$$

where $\bar{D}_i = E(D_i^{rep} \mid y, D)$ and $\text{Var}(D_i) = \text{Var}(D_i^{rep} \mid y, D) = E(D_i^{rep} \mid y, D) - (E(D_i^{rep} \mid y, D))^2$.
For our framework, $D^{rep} = (D_1^{rep}, \ldots, D_N^{rep})$ is the replicated disease status vector for all
the subjects. It is straightforward to calculate the expected value of the above criterion
using the posterior samples obtained from the Gibbs sampler. Lower values of this
criterion would imply a better model fit. We assume $k = \infty$ and obtain the values

of posterior predictive loss for different lengths of exposure trajectories and different

number of knots. The results are given in Table 2-3 and explained in Section 2.6. For the

optimal model selected using the posterior predictive loss criterion, model assessment

was performed using Kappa measures of agreement and case deletion diagnostics. The

methodology is described below.
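
With $k = \infty$, the criterion reduces to the sum of the goodness-of-fit term and the penalty term. The sketch below shows one way this could be computed from posterior draws of the fitted disease probabilities, using the fact that a binary replicate with predictive mean $\mu_i$ has variance $\mu_i(1 - \mu_i)$. The function name, the input layout and the toy numbers are hypothetical and are not results from the PSA analysis.

```python
def posterior_predictive_loss(d_obs, prob_draws):
    """Fit plus penalty for k = infinity.

    d_obs[i] is the observed disease status (0 or 1) of subject i and
    prob_draws[n][i] is the fitted P(D_i = 1) at posterior draw n.
    """
    n_draws = len(prob_draws)
    fit, penalty = 0.0, 0.0
    for i, d in enumerate(d_obs):
        mu = sum(draw[i] for draw in prob_draws) / n_draws  # E(D_i^rep | data)
        fit += (d - mu) ** 2
        penalty += mu * (1.0 - mu)                           # variance of a binary replicate
    return fit + penalty


# Toy input: two subjects and two posterior draws (made-up numbers).
toy_loss = posterior_predictive_loss([1, 0], [[0.8, 0.2], [0.6, 0.4]])
print(toy_loss)
```

Competing models (for example, different numbers of knots or trajectory lengths) would each yield one such value, and the smallest would be preferred.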

2.5.2 Kappa statistic

We formed 2 x 2 tables cross classifying the observed and predicted number of

cases and controls for different combinations of trajectory lengths and number of knots.

We summarized the agreement in these tables using the Kappa statistic ($\kappa$) (Agresti,

2002) which compares agreement against that which might be expected by chance. The

value of $\kappa$ ranges from -1 to 1; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies

complete disagreement. A value of 0 indicates no agreement above and beyond that

expected by chance.









Suppose $n_{ab}$ is the number of subjects for whom $(D = a, \hat{D} = b;\ a = 0, 1;\ b = 0, 1)$,
$D$ and $\hat{D}$ being the observed and predicted disease status for a particular subject. Then,

$$\kappa = \frac{\dfrac{n_{00} + n_{11}}{n} - \left\{\left(\dfrac{n_{00} + n_{01}}{n}\right)\left(\dfrac{n_{00} + n_{10}}{n}\right) + \left(\dfrac{n_{10} + n_{11}}{n}\right)\left(\dfrac{n_{01} + n_{11}}{n}\right)\right\}}{1 - \left\{\left(\dfrac{n_{00} + n_{01}}{n}\right)\left(\dfrac{n_{00} + n_{10}}{n}\right) + \left(\dfrac{n_{10} + n_{11}}{n}\right)\left(\dfrac{n_{01} + n_{11}}{n}\right)\right\}}$$

where $n = n_{00} + n_{01} + n_{10} + n_{11}$.

The observed disease status (i.e., case or control status) of a subject is
obtained from the dataset while the predicted disease status is calculated from the
posterior estimates of the parameters. At iteration n of the Gibbs sampler, we can
calculate the quantity $\hat{p}_i^{(n)} = P^{(n)}(D_i = 1 \mid X_i(t + a_i^d), t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \phi + b_i' Q_i \phi)$
where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8
degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign

$$\hat{D}_i^{(n)} = \begin{cases} 1 & \text{if } \hat{p}_i^{(n)} > 0.5 \\ 0 & \text{if } \hat{p}_i^{(n)} \le 0.5 \end{cases}$$

Based on the values of $\{(D_i, \hat{D}_i^{(n)});\ i = 1, \ldots, N\}$, we can form a 2 x 2 table, and
hence can calculate a value of kappa, say, $\kappa^{(n)}$ at iteration n of the Gibbs sampler. The
posterior means and 95% credible intervals of $\kappa$ provide a measure of the amount of
agreement that our model provides.
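
The per-iteration calculation can be sketched as follows: classify each subject using the 0.5 cutoff, tabulate observed against predicted status, and compute kappa from the four cell counts. The function names and the toy inputs are hypothetical; in practice the probabilities would come from one draw of the Gibbs sampler.

```python
def agreement_table(observed, probabilities, cutoff=0.5):
    """Return (n00, n01, n10, n11), where n_ab counts observed = a and predicted = b."""
    counts = [[0, 0], [0, 0]]
    for d, p in zip(observed, probabilities):
        counts[d][1 if p > cutoff else 0] += 1
    return counts[0][0], counts[0][1], counts[1][0], counts[1][1]


def cohen_kappa(n00, n01, n10, n11):
    """Kappa statistic for a 2 x 2 table of observed versus predicted status."""
    n = n00 + n01 + n10 + n11
    p_obs = (n00 + n11) / n
    p_exp = ((n00 + n01) * (n00 + n10) + (n10 + n11) * (n01 + n11)) / n ** 2
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy example: four subjects with made-up fitted probabilities at one iteration.
table = agreement_table([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7])
print(table, cohen_kappa(*table))
```

Repeating this at every retained iteration gives a sample of kappa values whose mean and quantiles summarize agreement.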

2.5.3 Case Influence Analysis

Case influence (or case deletion) diagnostics are often used as a tool for model

assessment in various statistical problems. The procedure hinges on the idea that the

influence of a particular observation on a parameter can be measured by the difference

in the parameter estimate based on the full data and the data with that observation

deleted (Hampel et al., 1987). These diagnostics can be used to detect observations

with an unusual effect on the fitted model and thus may lead to identification of data

or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in

Bayesian hierarchical modeling. The basic tenet is that samples obtained from the

full posterior of parameters, when importance weighted, can reflect the effect of deleting

a particular observation from the dataset. They presented an easily applicable graphical

technique for checking influential observations based on importance weighting. The

local dependence structure, often present in hierarchical models, makes the importance

weights inexpensive to calculate.

Let $H_i = \alpha + \beta' M_i \phi + b_i' Q_i \phi$ and $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\kappa}(a_{ij})'b_i$. Let $L(Y_{ij} \mid S_{ij}, \sigma_\epsilon^2)$
be the density function corresponding to the trajectory model, and $L(D_i \mid H_i)$ the one
for the disease model. We worked with the following three types of weighting schemes
based on those proposed by Bradlow and Zaslavsky (1997)

$$w^{(n),*}_{(y,d)i} = \prod_{j=1}^{n_i} L(Y_{ij} \mid S_{ij}, \sigma_\epsilon^2)^{-1}\, L(D_i \mid H_i)^{-1},$$

$$w^{(n),*}_{(y,d,\Omega)i} = w^{(n),*}_{(y,d)i}\, N(b_i \mid 0, \sigma_b^2)^{-1},$$

$$w^{(n),*}_{(y,d,\Omega^*)i} = w^{(n),*}_{(y,d,\Omega)i} \prod_{j=1}^{n_i} L^*(Y_{ij} \mid S_{ij}, \sigma_\epsilon^2)\, L^*(D_i \mid H_i). \qquad (2\text{-}14)$$

Here n denotes the nth iteration of the Gibbs sampler, the subscript i denotes the deletion
of $y_i$ and the superscript * denotes unnormalized weights. In the last weighting scheme,
$L^*(Y_{ij} \mid S_{ij}, \sigma_\epsilon^2)$ and $L^*(D_i \mid H_i)$ are the usual likelihoods with the population level parameters,
i.e., $(\alpha, \beta, \phi, \sigma_\epsilon^2)$, replaced by the full data posterior medians. Here "full data posterior" is
the posterior distribution obtained from the complete dataset, i.e., the one having all the
subjects.
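
As a rough illustration of the first weighting scheme, the unnormalized weight for subject i at a given draw is the reciprocal of that subject's likelihood contribution, and a case-deleted posterior mean can be estimated by a self-normalized weighted average of the full-data draws. The sketch below works on the log scale for numerical stability and uses the exact logit likelihood for the disease model; the function names and toy inputs are hypothetical, and the other two schemes are not shown.

```python
import math


def log_normal_density(y, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def log_logistic_likelihood(d, eta):
    """log P(D = d) under a logit link with linear predictor eta."""
    log1p_exp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))  # stable log(1 + e^eta)
    return d * eta - log1p_exp


def log_deletion_weight(y_i, s_i, sigma2, d_i, h_i):
    """Log of the first unnormalized weight: minus subject i's log-likelihood at one draw."""
    loglik = sum(log_normal_density(y, s, sigma2) for y, s in zip(y_i, s_i))
    loglik += log_logistic_likelihood(d_i, h_i)
    return -loglik


def weighted_posterior_mean(values, log_weights):
    """Self-normalized importance-sampling estimate of a case-deleted posterior mean."""
    top = max(log_weights)
    w = [math.exp(lw - top) for lw in log_weights]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


# Toy check with made-up draws: equal weights reproduce the ordinary mean.
print(weighted_posterior_mean([1.0, 2.0, 3.0], [5.0, 5.0, 5.0]))
```

A subject whose deletion shifts the weighted estimate far from the unweighted one would be flagged as influential.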

2.6 Analysis of PSA Data

We have used the semiparametric framework explained in Section 2.2 to analyze

the prostate cancer dataset described in Section 2.1.2. Multiple observations on free

and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some

subjects, observations were collected as far as 10 years prior to diagnosis. We use

the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal

aim is to examine whether past exposure observations can contribute significantly

towards predicting the current disease status of a subject given his/her current exposure

information. In doing so, we will also test how differential lengths of the PSA trajectories

affect the current probability of disease for a particular individual.

For the purpose of our analysis, we have used a linear p-spline (p = 1) with a

subject specific slope parameter to model the exposure trajectory as follows
$$Y_{ij} = \beta_0 + \beta_1 (t_{ij} + a_i^d) + \sum_{k=1}^{K} \beta_{1+k} (t_{ij} + a_i^d - \tau_k)_+ + b_i (t_{ij} + a_i^d) + \epsilon_{ij}. \qquad (2\text{-}15)$$

For the prospective disease model (2-3), we considered two specific scenarios, viz.
constant influence, $\gamma(t + a_i^d) = \phi_0$, and linear influence, $\gamma(t + a_i^d) = \phi_0 + \phi_1 (t + a_i^d)$. The
results for these two cases are summarized below.

2.6.1 Constant Influence Model

In this parametrization, the area under the PSA process, $\{X_i(t + a_i^d), -c < t < 0\}$,

acts as the covariate and $\phi_0$ signifies its effect on the disease probability. We have used

different values of "c" (time, in years, by which we go back in the past to record the

exposure history of a subject) to analyze the effect of differential areas under the PSA

process on the current disease state.

On fitting the above model, we observed that for all trajectory lengths, $\phi_0$ is

significant (its 95% credible interval does not contain 0). For any particular interval

(i.e., choice of c), the posterior means and 95% credible intervals of $\phi_0$ do not change

much with the number of knots (K). In addition, $\phi_0$ increases as the trajectory length

decreases i.e as we move closer to the point of diagnosis. This is likely related to the

scale of the area under the PSA process but it also seems to support the well known

medical fact that total PSA is a better discriminator of prostate cancer at times closer to

diagnosis than at times further off (Catalona et al., 1998). To assess the impact of only

the past PSA observations on the current disease state, we considered the exposure

interval I = (-10, -5) and 3 knots in the trajectory. The posterior mean of $\phi_0$ is 0.298








and the 95% credible interval is (0.196, 0.421). Thus, even exposure observations
recorded as far as 5-10 years prior to diagnosis seem to have a significant influence on
the current disease status of a subject. We formally compare the different models using
the PPL criterion in Section 2.6.3.
2.6.2 Linear Influence Model
We next fitted the model permitting a linear pattern of influence of the exposure
trajectory on the disease outcome. For all trajectory lengths, $\phi_0$ and $\phi_1$ were significant
since the 95% credible intervals excluded 0. To better understand the influence of
differential lengths of exposure trajectories on disease status, we calculated the odds
ratios for different ages at diagnosis and trajectory lengths. Suppose the exposure
trajectory for the ith subject changes from $\{X_i(t + a_i^d),\ -c \le t \le 0\}$ to
$\{Z_i(t + a_i^d),\ -c \le t \le 0\}$. The corresponding odds ratio of disease is given by

$$\frac{P(D_i = 1 \mid Z_i(t + a_i^d),\ -c \le t \le 0)}{P(D_i = 0 \mid Z_i(t + a_i^d),\ -c \le t \le 0)} \times \frac{P(D_i = 0 \mid X_i(t + a_i^d),\ -c \le t \le 0)}{P(D_i = 1 \mid X_i(t + a_i^d),\ -c \le t \le 0)}$$

$$= \exp\left[\int_{-c}^{0} \{Z_i(t + a_i^d) - X_i(t + a_i^d)\}\,\gamma(t + a_i^d)\,dt\right]. \qquad (2\text{-}16)$$

Parameterizing {Z,(t + af), -c < t < 0} as p,r,(t + af)'l q+ (q,(t + a)'d,, as in (2-2),
we can rewrite (2-16) as

exp [( )' ( p,(t + afd),c(t + ad)'dt)
(0-c
xexp [(d, b,)' ( q, (t + ad r,)((dt + ad'dt .

If there is a uniform increase of "m" in the trajectory, i.e., $\{Z_i(t + a_i^d) - X_i(t + a_i^d) = m,\ t \in [-c, 0]\}$
(this can also be looked upon as a vertical shift of the trajectory upwards
by "m"), the above odds ratio simplifies to

$$\exp\left[m \int_{-c}^{0} \gamma(t + a_i^d)\,dt\right] = \exp\left[c\,m\,\{\phi_0 + (a_i^d - c/2)\,\phi_1\}\right]. \qquad (2\text{-}17)$$
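The closed form in (2-17) can be checked numerically. The following is a minimal sketch, assuming the linear influence function $\gamma(s) = \phi_0 + \phi_1 s$ with $s = t + a_i^d$; the function names, and the generalization to an arbitrary window [lower, upper] rather than [-c, 0], are illustrative assumptions and not part of the original analysis.

```python
import math


def shift_odds_ratio(m, phi0, phi1, age_dx, lower, upper=0.0):
    """Odds ratio for a uniform upward shift m of the exposure trajectory
    over the window [lower, upper] (years relative to diagnosis), assuming
    a linear influence function gamma(s) = phi0 + phi1 * s, s = t + age_dx.
    With lower = -c and upper = 0 this reduces to equation (2-17)."""
    width = upper - lower
    midpoint = 0.5 * (lower + upper)
    log_or = m * width * (phi0 + phi1 * (age_dx + midpoint))
    return math.exp(log_or)


def shift_odds_ratio_numeric(m, phi0, phi1, age_dx, lower, upper=0.0, n=10000):
    """Same quantity by midpoint-rule integration, as a sanity check."""
    h = (upper - lower) / n
    total = sum(phi0 + phi1 * (lower + (k + 0.5) * h + age_dx) for k in range(n))
    return math.exp(m * total * h)


if __name__ == "__main__":
    # Plug-in illustration with hypothetical inputs; a plug-in value at
    # posterior means generally differs from the posterior mean of the
    # odds ratio itself.
    print(shift_odds_ratio(0.5, 1.24, -0.015, 50, -10, -5))
```

Evaluating the odds ratio at each MCMC draw of $(\phi_0, \phi_1)$ and then averaging would give a posterior mean of the odds ratio, which need not coincide with the plug-in value computed from posterior means of the parameters.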









Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios
corresponding to different trajectory lengths and ages at diagnosis when m = 0.5. For a
fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This
seems to support the notion that younger subjects tend to have a more aggressive
form of prostate cancer than older ones and thus are most likely to benefit from
early detection (Catalona et al., 1998). For most ages at diagnosis, the odds ratios
steadily increase as longer exposure trajectories are considered, i.e., as past exposure
observations are taken into account. However, the rate of increase is higher for lower
ages at diagnosis. Thus, consideration of past exposure observations in addition to
recent ones results in a significant gain in information about the current disease status
of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios
decrease as longer exposure trajectories are considered. This may imply that for a
subject with a very high age at diagnosis, his/her past exposure observations may not
contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis
for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age   (-3,0)               (-5,0)               (-8,0)                (-10,0)
50    3.96 (2.10, 7.63)    4.57 (2.32, 8.73)    5.26 (2.41, 11.01)    5.46 (2.46, 11.24)
55    3.34 (2.02, 5.78)    3.75 (2.15, 6.43)    4.19 (2.24, 7.77)     4.32 (2.29, 7.95)
60    2.83 (1.92, 4.39)    3.08 (2.00, 4.77)    3.36 (2.08, 5.46)     3.44 (2.11, 5.63)
65    2.41 (1.79, 3.35)    2.55 (1.83, 3.59)    2.70 (1.90, 3.91)     2.76 (1.92, 4.02)
70    2.06 (1.62, 2.70)    2.12 (1.64, 2.77)    2.19 (1.68, 2.89)     2.22 (1.69, 2.97)
75    1.78 (1.41, 2.32)    1.77 (1.41, 2.24)    1.79 (1.41, 2.31)     1.80 (1.41, 2.38)
80    1.54 (1.16, 2.12)    1.48 (1.13, 1.98)    1.46 (1.11, 1.99)     1.47 (1.09, 2.07)

As before, we fitted the disease model on the interval I = (-10, -5). The posterior
means and 95% credible intervals of $\phi_0$ and $\phi_1$ are respectively 1.24 (0.29, 2.19) and
-0.015 (-0.029, 0.003), implying that exposure observations recorded 5-10 years prior
to diagnosis also have a significant effect on the current disease status. The posterior
means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the
above conclusion.









Table 2-2. Posterior means and 95% credible intervals of odds ratio for
I = (-10, -5) for the linear influence model

            Age at Diagnosis
            50               60              70              80
Mean        4.99             3.27            2.22            1.56
95% C.I.    (1.96, 10.41)    (1.91, 5.36)    (1.67, 2.98)    (1.10, 2.29)

2.6.3 Overall Model Comparison

For both the constant and linear influence models, we calculated the PPL criterion

(described in Section 5.1) corresponding to different trajectory intervals and number of

knots. These values are given in Table 2-3.
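As a rough illustration of how a posterior predictive loss can be computed from MCMC output, the sketch below uses the squared-error form of the Gelfand and Ghosh (1998) criterion, D = P + (k / (k + 1)) G, where G is a goodness-of-fit term and P a penalty term. This is an assumption: the exact form used in this dissertation is the one described in the section cited above and may differ. The function name and the array layout are hypothetical.

```python
import numpy as np


def posterior_predictive_loss(y_obs, y_rep, k=np.inf):
    """Squared-error posterior predictive loss.

    y_obs : observed responses, shape (n_obs,)
    y_rep : replicated responses drawn from the posterior predictive
            distribution, shape (n_draws, n_obs)
    k     : weight on the goodness-of-fit term; k = inf gives D = P + G
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    pred_mean = y_rep.mean(axis=0)          # predictive mean per observation
    pred_var = y_rep.var(axis=0)            # predictive variance per observation
    G = np.sum((pred_mean - y_obs) ** 2)    # goodness-of-fit term
    P = np.sum(pred_var)                    # penalty term
    weight = 1.0 if np.isinf(k) else k / (k + 1.0)
    return P + weight * G, G, P
```

Smaller values indicate a better trade-off between fit and predictive variability, which is how the entries of Table 2-3 are read.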

The PPL values for the linear model were smaller than those corresponding to the
constant influence model. Thus, we can conclude that for the prostate cancer data, the
class of linear influence models fits better than the class of constant influence models.
For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across
all trajectory lengths. For a given trajectory, the models tend to improve with an increase
in the number of knots until a certain number of knots is reached. Further increases
in the number of knots tend to worsen the fit; this agrees with the findings of Ruppert (2002). The
important point to note here is that the number of knots and the length of the exposure
trajectory seem to interact in their effect on model fit. The best fitting constant influence
model seems to be the one with exposure trajectory (-10, 0) and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer
exposure trajectories are taken into account. Thus, inclusion of past exposures results
in an improvement of model fit. This may indicate that past exposure
observations contain a significant amount of information about the current disease status.
In addition, for the trajectory interval I = (-10, -5), the PPL criteria corresponding
to the linear and constant influence models are moderately small. Thus, exposure
observations recorded 5-10 years prior to diagnosis also provide a modest amount of
information toward predicting the current disease status, corroborating the conclusions
reached earlier. For the linear setup, the model with exposure trajectory I = (-8, 0)
and 4 knots performs the best (has the lowest PPL criterion among all the models
considered).

Table 2-3. Posterior predictive losses (PPL) for the constant and linear
influence models for varying exposure trajectories and knots

Knots   Model      (-2,0)   (-5,0)   (-8,0)   (-10,0)   (-10,-5)
0       Constant   47.54    47.02    47.20    47.65     47.81
        Linear     43.61    43.32    43.17    43.33     43.82
1       Constant   46.61    46.65    46.77    46.57     45.29
        Linear     42.80    42.83    42.91    42.90     42.94
2       Constant   45.83    45.50    45.72    46.32     44.69
        Linear     43.20    43.01    42.74    42.66     43.33
3       Constant   45.47    45.23    45.24    44.82     45.17
        Linear     43.47    43.05    42.72    42.73     43.43
4       Constant   45.35    45.67    45.27    45.31     45.54
        Linear     43.70    43.13    42.56    42.61     43.47
5       Constant   46.67    46.06    45.42    45.75     46.01
        Linear     43.91    43.20    43.12    42.93     43.48

2.6.4 Model Assessment

As mentioned before, the number of knots and length of exposure trajectory tend

to interact in influencing the fit of the constant and linear influence models. Thus, for

a fixed trajectory length, the optimal model can be selected as the one with the lowest

value of the PPL criterion across all the knot choices. For the linear influence model, the

lowest PPL value was recorded for I = (-8, 0) and 4 knots. So, we perform our model

assessment procedure on this model.

For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval

(0.535, 0.680) which indicates substantial agreement beyond what is expected by

chance. We next performed case deletion analysis. We deleted each subject (with all

the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows

the case deleted posterior means and 95% credible intervals for $\beta_1$, $\phi_0$ and $\phi_1$. (In









these figures, the solid and dashed horizontal lines respectively indicate the estimated
posterior mean and 95% credible intervals of the respective parameters based on
the full data posterior. The solid points denote the importance weighted case-deleted
posterior means while the vertical line segments are the 95% case-deleted posterior
intervals). None of the subjects seem to be very influential on the parameter estimates.
For every subject, we also looked at the difference in the predicted probability of disease
based on the full data and with that subject deleted. Figure 2-2 (d) shows the plot of
the posterior means of the difference probabilities and the corresponding posterior
intervals. (In this figure, the solid line represents zero difference. The solid points
represent the difference in disease probabilities based on the full and case deleted
posteriors. The vertical line segments are the 95% posterior intervals of the differences).
Surprisingly, the observation for case number 108 departs significantly from
the rest. On analyzing this subject, it was found that it had the unique combination of
very high age and very high values of PSA. In fact, it had the highest mean age in the
sample, the highest age at diagnosis and the third highest mean Ptotal value. These
characteristics may have contributed to the exceptionally high difference in the predicted
probability of disease.
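The importance weighted case-deleted posterior means mentioned above can be approximated without refitting the model for each deleted subject. The sketch below assumes the standard importance weights, proportional to the reciprocal of the deleted subject's likelihood contribution at each posterior draw; the text does not spell out the weights actually used, so this is an assumption, and the function name and inputs are hypothetical.

```python
import numpy as np


def case_deleted_mean(theta_draws, loglik_i):
    """Importance-weighted estimate of the posterior mean of a scalar
    parameter with subject i deleted.

    theta_draws : full-data posterior draws of the parameter, shape (S,)
    loglik_i    : log-likelihood contribution of subject i at each draw
                  (summed over all of that subject's observations), shape (S,)
    """
    theta_draws = np.asarray(theta_draws, dtype=float)
    log_w = -np.asarray(loglik_i, dtype=float)   # w_s proportional to 1 / p(y_i | theta_s)
    log_w -= log_w.max()                         # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                                 # self-normalised weights
    return np.sum(w * theta_draws)
```

When the weights are very uneven for some subject, the estimate for that subject rests on few effective draws and should be read with caution.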

We also performed case deletion analysis of the intercept parameters of the

disease and trajectory models and the variance components. None of the subjects were

found to be influential on the posterior estimates of these parameters. Thus, based on

the above two measures, we may conclude that the semiparametric linear influence

model with trajectory I = (-8, 0) and 4 knots seems to fit the observed data relatively

well.
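For reference, a kappa statistic of the kind used in this assessment can be computed as follows. This is a minimal sketch assuming Cohen's kappa between observed and model-predicted disease status; the definition is not restated in this section, so that choice and the function name are assumptions.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa: agreement between two label lists beyond chance.

    Undefined (division by zero) when chance agreement equals 1, e.g.
    when both lists contain a single identical label throughout.
    """
    n = len(observed)
    # Observed proportion of agreement
    p_obs = sum(o == p for o, p in zip(observed, predicted)) / n
    # Agreement expected by chance from the marginal label frequencies
    labels = set(observed) | set(predicted)
    p_exp = sum((observed.count(c) / n) * (predicted.count(c) / n) for c in labels)
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Evaluating this at each MCMC draw of the predicted disease status would yield a posterior sample of kappa, from which a posterior mean and credible interval can be summarized.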

Figure 2-2. Sensitivity of $\beta_1$, $\phi_0$, $\phi_1$ and disease probability estimates to case-deletions.
(Panels: A $\beta_1$; B $\phi_0$; C $\phi_1$; D Disease Probability. Horizontal axis: Deleted Case.)

2.7 Conclusion and Discussion

Case control studies have witnessed a wide variety of research over the years.

Fundamental and far reaching contributions have been made both in the Frequentist

and Bayesian domains. Generally, the bulk of research has dealt with the standard

situation where exposure observations for cases and controls are collected at a single

time point in the past. Some medical studies however have suggested that it may be

worthwhile to take into account an entire exposure history, if available, in assessing the

disease-exposure relationship. Case-control studies involving longitudinal exposure

trajectories are a relatively unexplored area. At the same time, it is a promising one given

the wide variety of longitudinal data analytic tools that are now available. Moreover,

recent developments in the area of semiparametric and nonparametric regression

analysis have added more flexibility in this direction, especially when exposure trajectories

have complicated and unknown functional forms.

In this work, we have applied semiparametric regression techniques in analyzing

longitudinal case control studies. We have used penalized regression splines in

modeling the exposure trajectories for the cases and the controls. Thus our framework

can be used even when exposure observations are collected at different time points

across subjects, i.e., when exposures are unbalanced in nature. The exposure trajectory

is used as the predictor in a prospective logistic model for the binary disease outcome.

We have also modeled the slope parameter of the disease model as a p-spline to

account for any time varying influence pattern of the exposure trajectory on the current

disease status. In doing so, we have summarized the exposure history for the cases

and controls in a flexible way which allowed us to consider differential lengths of the

exposure trajectory in analyzing its effect on the current disease status. In order

to simplify the analysis, we used the logit-mixture of normal approximation (Albert

and Chib, 1993). We showed that the Bayesian equivalence results of Seaman and

Richardson (2004) essentially hold for our framework, thus allowing us to use a

prospective logistic model having fewer nuisance parameters although the dataset was

collected retrospectively. Analysis has been carried out in a hierarchical Bayesian

framework. Parameter estimates and associated credible intervals are obtained using

MCMC samplers. We have applied our methodology to a longitudinal case control









study dealing with the association between prostate specific antigen (PSA) and prostate

cancer.

We analyzed our model using differential lengths of exposure trajectories. In

doing so, we have concluded that past exposure observations do provide significant

information towards predicting the current disease status of a subject. Specifically,

we have shown that for most age-at-diagnosis groups, the odds of disease steadily

increase as past exposure observations are taken into account in addition to the recent

ones. We also observed that for a fixed trajectory length, the odds of disease steadily

decrease as the age at diagnosis increases corroborating the medical fact that younger

subjects tend to have a more aggressive form of prostate cancer and thus are most likely

to benefit from early detection. We performed model comparison using posterior

predictive loss (Gelfand and Ghosh, 1998). This criterion indicated that models with

longer exposure trajectories tend to perform better than those with shorter trajectories.

Lastly, model assessment was performed on the optimal model using the kappa statistic

and case deletion diagnostics. Both these tools suggested that our model fits the data

relatively well.

Some interesting extensions of our setup are possible. For richer datasets, it will

be interesting to model the subject specific deviation functions as p-splines. In addition,

we have only assumed constant and linear parameterizations of the influence function

of the prospective disease model. For a larger data set, a p-spline formulation can

also be used for the influence function which may bring out any underlying non-linear

pattern of influence of the exposure trajectory on the current disease status. Although

we have used a binary disease outcome, it will be interesting to extend our framework

to accommodate multi-category disease states. Our modeling framework can also be

generalized by incorporating a larger class of nonparametric distributional structures

(like Dirichlet processes or Polya trees) for the subject specific random effects.









CHAPTER 3
ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A BAYESIAN
SEMIPARAMETRIC APPROACH

3.1 Introduction

Sample survey methodologies are widely used for collecting relevant information

about a population of interest over time. Apart from providing population level estimates,

surveys are also designed to estimate various features of subpopulations or domains.

Domains may be geographic areas like state or province, county, school district etc. or

can even be identified by a particular socio-demographic characteristic like a specific

age-sex group. Sometimes, the domain-specific sample size may be too small to yield

direct estimates of adequate precision. This led to the development of small area

estimation procedures which specifically deal with the estimation of various features

of small domains. Generally, observations on various characteristics of small areas

are collected over time, and thus, may possess a complicated underlying time-varying

pattern. It is likely that models which exploit the time varying pattern in the observations

may perform better than classical small area models which do not utilize this feature. In

this study, we present a semiparametric Bayesian framework for the analysis of small

area level data which explicitly accommodates the longitudinal pattern in the response

and the covariates.

3.1.1 SAIPE Program and Related Methodology

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S.

Census Bureau was established with the aim of providing annual estimates of income

and poverty statistics for all states, counties and school districts across the United

States. The resulting estimates are generally used for the administration of federal

programs and the allocation of federal funds to local jurisdictions. There are also many

state and local programs that depend on these estimates. Prior to the creation of the

SAIPE program, the decennial census was the only source of income and poverty

statistics for households, families and individuals related to small geographic areas









like counties, cities and other substate areas. Due to the ten year lag in the release of

successive census values, there was a large gap in information concerning fluctuations

in the economic situation of the country in general and local areas in particular. The

establishment of the SAIPE program has largely mitigated this issue.

The current methodology of the SAIPE program is based on combining state

and county estimates of poverty and income obtained from the American Community

Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class

of models (Fay and Herriot, 1979). The indicators are generally the mean and median

adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly

known as Food Stamp Program data), the most recent decennial census, intercensal

population estimates, Supplemental Security Income Recipiency and other economic

data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have

been used since January 2005 on the recommendation of the National Academy of

Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income

and poverty estimates until 2004 were based on data from the Annual Social and

Economic Supplement (ASEC) of the Current Population Survey (CPS).

Apart from various poverty measures, the SAIPE program provides annual state

and county level estimates of median household income. At this point, direct ACS

estimates of median household income are only available for the period 2005-2008.

Thus, for illustration purposes, we have considered data from ASEC for the period

1995-1999 in order to estimate the state level median household income for 1999.

This is because the most recent census estimates correspond to the year 1999 and

these census values can be used for comparison purposes. The SAIPE regression

model for estimating the median household income for 1999 uses as covariates the

median adjusted gross income (AGI) derived from IRS tax returns and the median

household income estimate for 1999 obtained from the 2000 Census. The response

variable is the direct estimate of median household income for 1999 obtained from the









March 2000 CPS. Bayesian techniques are used to weigh the contributions of the CPS

median income estimates and the regression predictions of the median income based

on their relative precision. The standard deviations of the error terms are estimated

by fitting a model to the estimates of sampling error covariance matrices of the CPS

median household income estimates for several years. The mean function in this model

is referred to as a "generalized variance function" (Bell, 1999). Noninformative prior

distributions are placed on the regression parameter corresponding to the IRS median

income since it was found to be statistically significant even in the presence of census

data, both in the 1989 and 1999 models.

3.1.2 Related Research

Estimation of median income for small areas contributes to the policy making

process of many Federal and State agencies. Before the establishment of the SAIPE

program, the estimation of median income for four-person families was of general

interest. The Census Bureau used the ideas suggested by Fay (1987) in this regard.

Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay

et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and

also put forward univariate and multivariate hierarchical Bayes (HB) models. The

estimates from their EB and HB procedures significantly improved over the CPS median

income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the

state-specific CPS median income estimates and proposed a Bayesian time series

modeling framework to estimate the statewide median income of four-person families

for 1989. In doing so, they used a time specific random component and modeled it as a

random walk. They concluded that the bivariate time series model utilizing the median

incomes of four and five person families performs the best and produces estimates

which are much superior to both the CPS and Census Bureau estimates. In general, the

time series model always performed better than its non-time series counterpart.









Semiparametric regression methods have not been used in small area estimation

contexts until recently. This was mainly due to methodological difficulties in combining

the different smoothing techniques with the estimation tools generally used in small

area estimation. The pioneering contribution in this regard is the work by Opsomer

et al. (2008) in which they combined small area random effects with a smooth,

non-parametrically specified trend using penalized splines. In doing so, they expressed

the non-parametric small area estimation problem as a mixed effects regression model

and analyzed it using restricted maximum likelihood. Theoretical results were presented

on the prediction mean squared error and likelihood ratio tests for random effects.

Inference was based on a simple non-parametric bootstrap approach. The methodology

was used to analyze a non-longitudinal, spatial dataset concerning the estimation of

mean acid neutralizing capacity (ANC) of lakes in the north eastern states of U.S.

3.1.3 Motivation and Overview

The motivation of our work also originates from the repetitive nature of the CPS

median income estimates. But, in contrast to the approach of Ghosh et al. (1996), we

have viewed the state specific annual household median income values as longitudinal

profiles or "income trajectories". This gained more ground because we used the state

wide CPS median household income values for only five years (1995-1999) in our

estimation procedure. Figure 3-1 shows sample longitudinal CPS median household

income profiles for six states spanning 1995 to 2004 while Figure 3-2 shows the plots

of the CPS median income against the IRS mean and median incomes for all the states

for the years 1995 through 1999. It is apparent that CPS median income may have

an underlying non-linear pattern with respect to IRS mean income, especially for large

values of the latter. The above two features motivated us to use a semiparametric

regression approach. In doing so, we have modeled the income trajectory using

a penalized spline (or P-spline) (Eilers and Marx, 1996) which is a commonly used

but powerful function estimation tool in non-parametric inference. The P-spline is








































Figure 3-1. Longitudinal CPS median income profiles for 6 states plotted against IRS
mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income.)









expressed using truncated polynomial basis functions with varying degrees and

number of knots, although other types of basis functions like B-splines or thin plate

splines can also be used. We have worked with two types of models viz. a regular

semiparametric model and a semiparametric random walk model. For each of these

models, analysis has been carried out using a hierarchical Bayesian approach. Since

we chose non-informative improper priors for the regression parameters, propriety of

the posterior has been proved before proceeding with the computations. Markov chain

Monte Carlo methodologies, specifically, Gibbs sampling (Gelfand and Ghosh, 1998)

have been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for

1999 with the corresponding decennial census values in order to test for their accuracy.

In doing so, we observed that the semiparametric model estimates improve upon

both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had

significant influence on the results as will be discussed later on. We want to mention

here that the SAIPE model had a considerable advantage over ours in that they used the

census estimates of the median income for 1999 as a predictor. In small area estimation

problems, the census estimates are regarded as the "gold standard" since these are

the most accurate estimates available with virtually negligible standard errors. So,

using those as explanatory variables was an added advantage of the SAIPE state level

models. The fact that our estimates still improve on the SAIPE model based estimates is

a testament to the flexibility and strength of the semiparametric methodology, especially

when observations are collected over time. It also indicates that it may be worthwhile

to take into account the longitudinal income patterns in estimating the current income

conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two

types of semiparametric models we have used. Section 3.3 goes over the hierarchical

Bayesian analysis we performed. In Section 3.4, we describe the results of the data











Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all
the states of the U.S. from 1995 to 1999. (Panels: A IRS mean income plot; B IRS median income plot.)


analysis with regard to the median household income dataset. In Section 3.5, we

discuss the Bayesian model assessment procedure we used to test the goodness-of-fit

of our models. We end with a discussion in Section 3.6. The appendix contains the

proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2 Model Specification

3.2.1 General Notation

Let $\mathbf{Y}_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics
$\boldsymbol{\theta}_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the ith small area at the jth time (i = 1, 2, ..., m; j = 1, 2, ..., t).
The target of inference is generally $\boldsymbol{\theta}_{ij}$ or some function of it. Specifically, in our context,
$\boldsymbol{\theta}_{ij} = \theta_{ij}$, which denotes the median household income of the ith state at the jth year.
We are interested in estimating $(\theta_{1u}, \ldots, \theta_{mu})'$, i.e., the median household income for all
the states at time u. We may also want to estimate the difference in median household
incomes at times v and u, i.e., $(\theta_{1v} - \theta_{1u}, \ldots, \theta_{mv} - \theta_{mu})'$. We denote by $X_{ij}$ the covariate
corresponding to the ith state and jth year.


*


*









3.2.2 Semiparametric Income Trajectory Models

We assume the following two semiparametric models :

3.2.2.1 Model I : Basic Semiparametric Model (SPM)

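Let $Y_{ij}$ and $x_{ij}$ denote the CPS median household income and the IRS mean (or
median) income recorded for the ith state at the jth year. The basic semiparametric
model can be expressed as

$$Y_{ij} = f(x_{ij}) + b_i + u_{ij} + e_{ij} \qquad (3\text{-}1)$$

where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate
relationship.

We approximate $f(x_{ij})$ using a P-spline and rewrite (3-1) as

$$\begin{aligned}
Y_{ij} &= \beta_0 + \beta_1 x_{ij} + \cdots + \beta_p x_{ij}^p + \sum_{k=1}^{K} \gamma_k (x_{ij} - \tau_k)_+^p + b_i + u_{ij} + e_{ij} \\
       &= X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij} + e_{ij} \\
       &= \theta_{ij} + e_{ij}
\end{aligned} \qquad (3\text{-}2)$$

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$ is our target of inference.

Here $X_{ij} = (1, x_{ij}, \ldots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \ldots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \ldots, \beta_p)'$ is the
vector of regression coefficients while $\gamma = (\gamma_1, \ldots, \gamma_K)'$ is the vector of spline coefficients.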

The above spline model with degree p can adequately approximate any unspecified
smooth function. Typically, linear (p = 1) or quadratic (p = 2) splines serve most
practical purposes since they ensure adequate smoothness in the fitted curve. Here m and t
respectively denote the number of small areas and the number of time points at which
the response and covariates are measured. Thus, in our case, m = 51, for all the 50
states of the U.S. and the District of Columbia, and t = 5 for the years 1995-1999. $b_i$ is
a state-specific random effect while $u_{ij}$ represents an interaction effect between the ith
state and the jth year. We assume $b_i \overset{iid}{\sim} N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls
the amount of smoothing of the underlying income trajectory. Moreover, it is assumed









that $u_{ij}$ and $e_{ij}$ are mutually independent with $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$, where the $\sigma_{ij}$
are the sampling standard deviations corresponding to the CPS direct median income
estimates obtained using the "generalized variance function" technique mentioned in
Section 3.1.1. In the datasets provided by the Census Bureau, these estimates are
given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually
placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.
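The design vectors $X_{ij}$ and $Z_{ij}$ of the truncated polynomial basis can be assembled as in the following sketch, which places the knots at equally spaced sample quantiles of the covariate. The function name and the particular quantile levels k / (K + 1) are illustrative assumptions; the knot placement actually used in the analysis may differ.

```python
import numpy as np


def pspline_design(x, degree=1, n_knots=3):
    """Truncated polynomial spline basis for a single covariate.

    Returns
    X     : polynomial part with columns 1, x, ..., x^p, shape (n, p + 1)
    Z     : truncated part with columns (x - tau_k)_+^p, shape (n, K)
    knots : the K knot locations tau_1, ..., tau_K
    """
    x = np.asarray(x, dtype=float)
    # Assumed knot rule: equally spaced sample quantiles k / (K + 1)
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    knots = np.quantile(x, probs)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    return X, Z, knots
```

With these matrices, the mean structure in (3-2) is X @ beta + Z @ gamma, to which the state and interaction effects are added.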

From (3-1) and (3-2), we have

$$\theta_{ij} = f(x_{ij}) + b_i + u_{ij}$$

which reflects our basic assumption that the true unknown household median income
may have an unspecified variational pattern with the IRS mean (or median) income.
Thus, the covariate effect is expressed by the unspecified nonparametric function $f(x_{ij})$
which reflects the possible nonlinear effect of $x_{ij}$ on $\theta_{ij}$.

3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)

Since, for each state, the response and the covariates are collected over time,
there may be a definite trend in their behavior. Thus, we added a time specific random
component to (3-1) and modeled it as a random walk as follows

$$Y_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \qquad (3\text{-}3)$$

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Here, $v_j$ denotes the time specific random component. We assume that $(v_j \mid v_{j-1}, \sigma_v^2) \sim N(v_{j-1}, \sigma_v^2)$
with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{iid}{\sim} N(0, \sigma_v^2)$.
This is the so-called random walk model and is similar to the system equations used in
dynamic linear models.
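To make the structure of the random walk model concrete, the sketch below simulates data from (3-3). It is an illustration only: the function name, argument layout and array shapes are assumptions, and the variance symbols follow the reading of the model given in this section.

```python
import numpy as np


def simulate_sprwm(X, Z, beta, gamma, sigma_b, sigma_v, psi, sigma_e, rng):
    """Simulate responses from the semiparametric random walk model.

    X       : polynomial design, shape (m, t, p + 1)
    Z       : truncated spline design, shape (m, t, K)
    psi     : standard deviations of u_ij, one per time point, shape (t,)
    sigma_e : sampling standard deviations of e_ij, shape (m, t)
    """
    m, t = X.shape[:2]
    b = rng.normal(0.0, sigma_b, size=m)               # state effects b_i
    v = np.cumsum(rng.normal(0.0, sigma_v, size=t))    # v_j = v_{j-1} + w_j, v_0 = 0
    u = rng.normal(0.0, psi, size=(m, t))              # state-by-year effects u_ij
    theta = X @ beta + Z @ gamma + b[:, None] + v[None, :] + u
    y = theta + rng.normal(0.0, sigma_e)               # add sampling error e_ij
    return y, theta, v
```

Setting sigma_v to zero removes the time specific component and recovers data from the basic semiparametric model (3-2).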

Before proceeding to the next section, we may note that unlike the models of Ghosh

et al. (1996), the models given in (3-2) and (3-3) incorporate state specific random

effects (bi). This rectifies a limitation of the former as pointed out in Rao (2003).









3.3 Hierarchical Bayesian Inference

3.3.1 Likelihood Function

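Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $X_i = (X_{i1}, \ldots, X_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$ be the covariates for the $i$th state. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \psi^2, \sigma_b^2, \sigma_\gamma^2)$ be the parameter space corresponding to the $i$th state, where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$ and $\psi^2 = (\psi_1^2, \ldots, \psi_t^2)'$. Thus, the full parameter space is given by $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_m$. For the $i$th state, the likelihood corresponding to Model I (SPM) can be written as

$$\begin{aligned}
L(Y_i, X_i, Z_i \mid \Omega_i) &\propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \psi^2, X_i, Z_i)\, L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2) \\
&= \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \sigma_{ij}^2)\, L(\theta_{ij} \mid X_{ij}'\beta + Z_{ij}'\gamma + b_i, \psi_j^2) \right\} L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2)
\end{aligned} \qquad \text{(3-4)}$$

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal distributions with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.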

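For the random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \psi^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$ where $v = (v_1, \ldots, v_t)'$ is the vector of time specific random effects. Thus, the likelihood function for the $i$th state will have an extra component corresponding to $v$ as follows

$$L(Y_i, X_i, Z_i \mid \Omega_i) = \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \sigma_{ij}^2)\, L(\theta_{ij} \mid X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \psi_j^2)\, L(v_j \mid v_{j-1}, \sigma_v^2) \right\} \times L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2) \qquad \text{(3-5)}$$

where $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal distribution with mean $v_{j-1}$ and variance $\sigma_v^2$, with $v_0 = 0$.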

3.3.2 Prior Specification

To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume a noninformative improper uniform prior for the polynomial coefficients (or fixed effects) $\beta$ and proper conjugate gamma priors on the inverses of the variance components $(\psi_1^2, \ldots, \psi_t^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$. The prior distributions are assumed to be mutually independent. We choose small values (0.001) for the gamma shape and rate parameters to make the priors diffuse, so that inference is mainly controlled by the data distribution.
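
To see why a gamma prior on an inverse variance is convenient, the following sketch applies the textbook conjugate update for the precision of zero-mean normal deviations: a G(c, d) prior combined with n deviations gives a G(c + n/2, d + sum of squares/2) full conditional. This is the standard result, stated here as an assumption rather than quoted from the appendix, and the function names and example values are hypothetical.

```python
import random


def precision_full_conditional(deviations, c=0.001, d=0.001):
    """Shape and rate of the gamma full conditional for a normal precision.

    Assumes the deviations are independent N(0, variance) draws and that the
    precision (1 / variance) has a G(c, d) prior with shape c and rate d.
    """
    n = len(deviations)
    shape = c + n / 2.0
    rate = d + sum(r * r for r in deviations) / 2.0
    return shape, rate


def draw_variance(deviations, rng, c=0.001, d=0.001):
    """Draw one variance value from its full conditional."""
    shape, rate = precision_full_conditional(deviations, c, d)
    # random.gammavariate is parameterised by shape and scale, so scale = 1 / rate.
    precision = rng.gammavariate(shape, 1.0 / rate)
    return 1.0 / precision


# Hypothetical deviations, for illustration only.
shape, rate = precision_full_conditional([1.0, -1.0, 2.0])
variance_draw = draw_variance([1.0, -1.0, 2.0], random.Random(0))
```

With c = d = 0.001 the prior contributes almost nothing to the shape and rate, which is the sense in which the prior is diffuse.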

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$, $(\sigma_b^2)^{-1} \sim G(c_b, d_b)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$, having density $f(x) \propto x^{a-1}\exp(-bx)$, $x > 0$. Since we have chosen an improper prior for $\beta$, propriety of the full posterior has been shown. We have the following theorem.

Theorem 1. Let $\psi_{\max}^2 = \max(\psi_1^2, \ldots, \psi_t^2) = \psi_k^2$, say, for some $k \in \{1, \ldots, t\}$. Then, posterior propriety holds if the following conditions are satisfied:

1. $(m - p - 5)/2 + c_k > 0$ and $d_k > 0$

2. $m/2 + c_j - 2 > 0$ and $d_j > 0$, $j = 1, \ldots, t$; $j \neq k$

3.3.3 Posterior Distribution and Inference

The full posterior of the parameters given the data is obtained in the usual way by

combining the likelihood and the prior distribution as follows
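$$p(\Omega \mid Y, X, Z) \propto \prod_{i=1}^{m} L(Y_i, X_i, Z_i \mid \Omega_i)\, \pi(\beta)\, \pi(\sigma_b^2)\, \pi(\sigma_\gamma^2) \prod_{j=1}^{t} \pi(\psi_j^2) \qquad \text{(3-6)}$$

For the random walk model, there will be an additional term $\pi(\sigma_v^2)$. By the conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2] \times [\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_b^2]\,[\sigma_\gamma^2] \prod_{j=1}^{t} [\psi_j^2]$$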

Our target of inference is $\{\theta_{ij},\ i = 1, \ldots, m;\ j = 1, \ldots, t\}$, the true median household incomes of all the states. Since the marginal posterior distribution of $\theta_{ij}$ is analytically intractable, obtaining it in closed form would require high dimensional integration. However, this task can be easily accomplished in an MCMC framework by using the Gibbs sampler to sample from the full conditionals of $\theta_{ij}$ and the other relevant parameters. In implementing the Gibbs sampler, we follow the recommendation of Gelman and Rubin (1992) and run $n\ (\geq 2)$ parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both models are given in the appendix.
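
The multiple-chain scheme described above can be sketched as follows. The code discards the first half of each chain and computes a simplified potential scale reduction factor from the between-chain and within-chain variances. It omits the degrees-of-freedom correction of the original Gelman and Rubin (1992) proposal, and the function names are hypothetical.

```python
import statistics


def discard_burn_in(chain):
    """Keep the last d iterates of a chain of length 2d."""
    d = len(chain) // 2
    return chain[len(chain) - d:]


def potential_scale_reduction(chains):
    """Simplified Gelman-Rubin diagnostic for one scalar parameter.

    `chains` is a list of at least two equal-length lists of post burn-in draws.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical draws: two chains that agree, and two chains that do not.
agree = potential_scale_reduction([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]])
disagree = potential_scale_reduction([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]])
```

Values of the factor close to 1 indicate that the chains are sampling from similar distributions, whereas values well above 1 suggest that more iterations are needed.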

3.4 Data Analysis

We applied the semiparametric models in Section 3.2.2 to analyze the median

household income dataset referred to in Section 3.1.3. The response variable $Y_{ij}$ and
the covariate $X_{ij}$ respectively denote the CPS median household income estimate and
the corresponding IRS mean (or median) income estimate for the $i$th state in the $j$th
year ($i = 1, \ldots, 51$; $j = 1, \ldots, 5$). The state-specific mean or median income figures are

obtained from IRS tax return data. The Census Bureau gets files of individual tax return

data from the IRS for use in specifically approved projects such as SAIPE. For each

state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI)

across all the tax returns in that state. Like other SAIPE model covariates obtained from

administrative records data, these variables do not exactly measure the median income

across all households in the state. One reason is that the AGI is not necessarily the same as the exact income figure. Another is that the tax return universe does not cover the entire population: some households do not need to file tax returns, and those that do not file are likely to differ in income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not to be the same thing. Specifically for this study, we have used IRS mean income as our covariate because it seems to possess an underlying non-linear relationship with the CPS median income (Figure 3-2A), and so it is more suited to a semiparametric analysis.

3.4.1 Comparison Measures and Knot Specification

Our dataset originally contained the median household income of all the states of

the U.S. and the District of Columbia for the years 1995-2004. However, we only used

the information for the five year period 1995-1999 since our targets of inference are the state specific median household incomes for 1999. We evaluated the performance of our estimates by comparing them to the corresponding census figures for 1999, because in small area estimation problems the census estimates are often treated as the "gold standard" against which all other estimates are compared. However, such a comparison is only possible for those years which immediately precede the census year, e.g. 1969, 1979, 1989 and 1999.

In order to check the performance of our estimates, we used four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|}{c_i}$

* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|^2}{c_i^2}$

* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$

* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

Here $c_i$ and $e_i$ respectively denote the census and model based estimates of median household income for the $i$th state ($i = 1, \ldots, 51$). Clearly, lower values of these measures would imply a better model based estimate.
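
The four comparison measures can be computed directly from paired census and model based estimates. The sketch below generalizes the divisor from 51 to the number of areas supplied. The function name and the two-area example values are hypothetical and serve only to show the arithmetic.

```python
def comparison_measures(census, estimates):
    """Return ARB, ASRB, AAB and ASD for paired census (c_i) and model (e_i) values."""
    n = len(census)
    abs_dev = [abs(c - e) for c, e in zip(census, estimates)]
    rel_dev = [a / c for a, c in zip(abs_dev, census)]
    return {
        "ARB": sum(rel_dev) / n,                   # average relative bias
        "ASRB": sum(r * r for r in rel_dev) / n,   # average squared relative bias
        "AAB": sum(abs_dev) / n,                   # average absolute bias
        "ASD": sum(a * a for a in abs_dev) / n,    # average squared deviation
    }


# Hypothetical two-area example (not taken from the dataset).
measures = comparison_measures([100.0, 200.0], [110.0, 180.0])
```

ARB and ASRB are scale free, while AAB and ASD are expressed in dollars and squared dollars respectively, which is why the latter two take much larger numerical values in the tables.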

The basic structure of our models remains the same as in Section 3.2.2. We have used the truncated polynomial basis for the P-spline component in both models. Since Figure 3-2A does not indicate a high degree of non-linearity, we have restricted ourselves to a linear spline ($p = 1$). The selection of knots is always a subjective and tricky issue in this kind of problem. Sometimes experience with the subject matter may guide the placement of knots at the "optimum" locations where a sharp change in the curve pattern can be expected. Too few or too many knots generally worsen the fit. If too few knots are used, the complete underlying pattern may not be captured properly, resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may even degrade its quality (Ruppert, 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).
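
As an illustration of the knot convention and the linear truncated polynomial basis, the sketch below places K knots at the k/(K+1) sample quantiles of the covariate and builds one row of the design matrix. The interpolation rule used by `statistics.quantiles`, the function names and the income values are assumptions made for the example, and other quantile definitions would shift the knots slightly.

```python
import statistics


def quantile_knots(x, num_knots):
    """Place knots at the k / (num_knots + 1) sample quantiles, k = 1, ..., num_knots."""
    return statistics.quantiles(x, n=num_knots + 1, method="inclusive")


def linear_spline_row(x_value, knots):
    """Design row [1, x, (x - knot_1)_+, ..., (x - knot_K)_+] for a linear spline."""
    return [1.0, x_value] + [max(x_value - k, 0.0) for k in knots]


# Hypothetical covariate values standing in for IRS mean incomes.
incomes = [31000.0, 33500.0, 36000.0, 38000.0, 40500.0, 42000.0,
           44000.0, 47500.0, 52000.0, 61000.0, 70000.0]
knots = quantile_knots(incomes, num_knots=5)
row = linear_spline_row(45000.0, knots)
```

Because the knots follow the sample quantiles, they concentrate where the observations are dense, a feature that becomes relevant when the covariate distribution is skewed.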

3.4.2 Computational Details

We implemented and monitored the convergence of the Gibbs sampler following

the general guidelines given in Gelman and Rubin (1992). We ran three independent

chains each with a sample size of 10,000 and with a burn-in sample of another 5,000.

We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chains from overdispersed distributions. Once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and ACF plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the smallest of those as the final measures presented in the tables that follow.









3.4.3 Analytical Results

Data on CPS median income and IRS mean incomes were available for 50 states

and the District of Columbia for the time span 1995-2004. CPS median income ranged

from $24,879.68 to $52,778.94 with a mean of $36,868.48 and standard deviation of

$5954.94 while IRS mean annual income ranged from $27,910 to $72,769.38 with a

mean of $41,133.45 and standard deviation of $7196.56.

We fitted Model I (SPM) with all possible knot choices from 0 to 40 but the best

results were achieved with 5 knots. The estimates (with 5 knots) improved significantly

over the CPS estimates based on all the four comparison measures. Addition of more

knots seemed to degrade the fit of the model. This may happen as pointed out in

Ruppert (2002). On the other hand, the SAIPE model based estimates were slightly

superior to the SPM estimates.

Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory, so that the random walk component does not lead to any further improvement. Finally, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots.

It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit on the CPS median income observations. The same is true for the coefficients of SPM.









Table 3-1. Parameter estimates of SPRWM with 5 knots
Parameter     Mean       Median     95% CI
$\beta_0$     4677.71    4660.08    (4633.31, 4758.7)
$\beta_1$     0.8156     0.816      (0.814, 0.817)
$\gamma_1$    -0.154     -0.154     (-0.158, -0.149)
$\gamma_2$    0.02       0.024      (-0.016, 0.040)
$\gamma_3$    -0.008     -0.016     (-0.056, 0.066)
$\gamma_4$    -0.093     -0.119     (-0.127, -0.037)
$\gamma_5$    -0.165     -0.173     (-0.187, -0.139)

3.4.4 Knot Realignment

As mentioned in Section 3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over our models. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the possibility that the semiparametric models have room for improvement. In this section, we look for possible deficiencies in our models and try to come up with improvements, if any.

As mentioned in Section 3.4.1, selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.

Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure. Thus, in both figures, the coverage area of the knots (i.e. the part of the observational pattern which is captured by the knots) is the region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low density area of the plot, i.e. the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured.

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom. A) Positioning of 5 knots. B) Positioning of 7 knots. (Horizontal axis in both panels: IRS Mean Income.)

As a natural solution to this issue, we decided to place half of the knots in the low

density region of the graph while the other half in the high density region. The exact

boundary line between the high density and low density regions is hard to determine.

We tested different alternatives and came up with IRS mean = 47000 as a tentative

boundary because it gave the best results. In both the regions, we placed the knots at

equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new

knot positions for 5 knots.
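
The realignment can be sketched as two separate quantile grids, one on each side of the boundary. The text does not state how an odd number of knots such as 5 was divided between the two regions, so the sketch takes the two counts as explicit arguments. The function name, the interpolation rule and the example values are assumptions.

```python
import statistics


def realigned_knots(x, boundary, n_left, n_right):
    """Place knots at equally spaced sample quantiles within each region.

    Observations at or below `boundary` form the high density region and the
    remaining observations form the low density region. Each region that is
    assigned knots must contain at least two observations.
    """
    left = [v for v in x if v <= boundary]
    right = [v for v in x if v > boundary]
    knots = []
    if n_left > 0:
        knots += statistics.quantiles(left, n=n_left + 1, method="inclusive")
    if n_right > 0:
        knots += statistics.quantiles(right, n=n_right + 1, method="inclusive")
    return knots


# Hypothetical covariate values with a sparse upper tail.
x = [30000.0, 35000.0, 40000.0, 45000.0, 50000.0, 60000.0, 70000.0]
knots = realigned_knots(x, boundary=47000.0, n_left=1, n_right=1)
```

Compared with a single quantile grid over all observations, this forces some knots into the sparse upper range of the covariate, which is where the non-linear pattern was observed.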

It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e. beyond IRS mean = 56000), possibly due to the very low density of the observations in that area. Overall, it seems that the new knots can capture some of the underlying non-linear pattern in the dataset which the old knots failed to capture. We also experimented with placing all the knots in the low density region (beyond IRS mean = 47000) but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.

Figure 3-4. Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment. (Horizontal axis: IRS Mean Income.)

We have worked with 5 knots because it performed consistently well for both the SPM and SPRW models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports









the comparison measures for the raw CPS estimates, SAIPE estimates and the

semiparametric estimates with the knot realignment while Table 3-3 depicts the

percentage improvement of the semiparametric estimates over the CPS and SAIPE

estimates. Here, SPM(5)* and SPRWM(5)* respectively denote the semiparametric

models with the realigned 5 knots.


Table 3-2. Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment
Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010


Table 3-3. Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates
Estimate   Model       ARB      ASRB     AAB      ASD
SAIPE      SPM(5)*     14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)*   9.51%    13.33%   11.78%   12.37%
CPS        SPM(5)*     32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)*   28.92%   51.85%   28.36%   48.17%
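
The percentage improvements are consistent with the relative reduction (reference minus model, divided by reference) applied to the tabulated comparison measures. The following sketch reproduces the SPM(5)* row against SAIPE under that assumption. The formula is inferred from the tabulated values rather than stated in the text, and the function name is hypothetical.

```python
def percent_improvement(reference, model):
    """Relative reduction of a comparison measure, expressed in percent."""
    return 100.0 * (reference - model) / reference


# Comparison measures for SAIPE and SPM(5)* as tabulated.
saipe = {"ARB": 0.0326, "ASRB": 0.0015, "AAB": 1423.75, "ASD": 3134906.0}
spm5 = {"ARB": 0.028, "ASRB": 0.0012, "AAB": 1173.71, "ASD": 2334379.0}

improvement = {
    name: round(percent_improvement(saipe[name], spm5[name]), 2)
    for name in saipe
}
```

Since the inputs are themselves rounded, recomputed percentages can differ from the tabulated ones in the last digit for some entries.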


It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that, with the realigned knots, the semiparametric model estimates perform better than the SAIPE estimates. This improvement is apparently due to the additional coverage of the observational pattern that is achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment and with different numbers of knots. Table 3-4 and Table 3-5 report the posterior mean, median and 95% CI for the parameters in SPM(5)* and SPRWM(5)* respectively.


Table 3-4. Parameter estimates of SPM(5)*
Parameter     Mean       Median     95% CI
$\beta_0$     4767.48    4769.04    (4743.33, 4791.67)
$\beta_1$     0.811      0.810      (0.809, 0.812)
$\gamma_1$    -0.189     -0.191     (-0.198, -0.180)
$\gamma_2$    0.0389     0.0395     (0.0189, 0.059)
$\gamma_3$    0.104      0.102      (0.099, 0.126)
$\gamma_4$    -0.240     -0.253     (-0.305, -0.179)
$\gamma_5$    -0.127     -0.155     (-0.181, -0.081)


Table 3-5. Parameter estimates of SPRWM(5)*
Parameter     Mean       Median     95% CI
$\beta_0$     4826.28    4824.39    (4806.77, 4860.56)
$\beta_1$     0.806      0.809      (0.801, 0.810)
$\gamma_1$    -0.159     -0.156     (-0.183, -0.151)
$\gamma_2$    0.014      0.012      (0.004, 0.039)
$\gamma_3$    0.08       0.08       (0.027, 0.123)
$\gamma_4$    -0.237     -0.244     (-0.369, -0.125)
$\gamma_5$    -0.225     -0.183     (-0.538, -0.085)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e.
the $\gamma$'s) are significant for both SPM and SPRWM. For the old configuration, some of the

knot coefficients were not significant for the models. This corroborates the fact that, with

the knot realignment, all the five knots are significantly contributing to the curve fitting

process in terms of capturing the true underlying non-linear pattern in the observations.

3.4.5 Comparison with an Alternate Model

We also compared the semiparametric models (with 5 knots) with the model

proposed by Ghosh et al. (1996), henceforth referred to as the GNK model. Their

univariate model is as follows


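$$Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_j + u_{ij} + e_{ij} \qquad \text{(3-7)}$$

where $(b_j \mid b_{j-1}) \sim N(b_{j-1}, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.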









One of the major qualitative differences between the above model and our semiparametric models is that the former does not have a state specific random effect. In fact, it would also be interesting to compare the above model with the basic semiparametric model (SPM) with 0 knots, i.e.

$$Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_i + u_{ij} + e_{ij} \qquad \text{(3-8)}$$

where $b_i \sim$ i.i.d. $N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distributions as above. Clearly, the only difference between (3-7) and (3-8) is that the former contains a time specific random component while the latter contains an area specific random component. Ghosh et al. (1996) showed that the estimates from the bivariate version of the GNK model (3-7) perform much better than the Census Bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 reports the comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates
Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
GNK          0.0397   0.0025   1,709.58   5,229,869
SPM(0)       0.0337   0.0017   1,408.7    3,137,978
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010


It is clear that, although the estimates from the GNK model perform slightly better

than the CPS, those are quite inferior to the semiparametric and SAIPE estimates. This

may be because the state specific random effects in the semiparametric models can

account for the within-state correlations in the income values, something which the GNK

model fails to do. Since the comparison measures for SPM(0) are much lower than

those for the GNK model, we can also conclude that the area specific random effect is

much more critical than a time specific random component in this situation.









3.5 Model Assessment

To examine the goodness-of-fit of the semiparametric models, we used a Bayesian chi-square goodness-of-fit statistic (Johnson, 2004). This is essentially an extension of the classical chi-square goodness-of-fit test where the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained which can be used for constructing global goodness-of-fit diagnostics.

To construct this statistic, we form 10 equally spaced bins $((k-1)/10, k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde\theta)$ to be random, where $\tilde\theta$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e. $Y_{ij}$ is allocated to the $k$th bin if $F(Y_{ij} \mid \tilde\theta) \in ((k-1)/10, k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

$$R^B(\tilde\theta) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde\theta) - n p_k}{\sqrt{n p_k}} \right]^2$$

where $n$ is the number of observations being binned.

For the purpose of model assessment, two summary measures can be used, both derived from the posterior distribution of $R^B(\tilde\theta)$. The first is the proportion of times the generated values of $R^B$ exceed the 0.95 quantile of a $\chi^2_9$ distribution. Values quite close to 0.05 would suggest a good fit. The second diagnostic is the probability that $R^B(\tilde\theta)$ exceeds a $\chi^2_9$ deviate, i.e.

$$A = P(R^B(\tilde\theta) > X), \quad X \sim \chi^2_9$$

Since the nominal value of this probability is 0.5, values close to 0.5 would suggest a good fit.
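
The binning step and the first summary measure can be sketched as follows. The code takes the conditional distribution function values $F(Y \mid \tilde\theta)$ for one posterior draw, counts them in 10 equiprobable bins and returns the statistic. A second helper computes the proportion of draws exceeding the 0.95 quantile of a chi-square distribution with 9 degrees of freedom, which is approximately 16.919. The function names are hypothetical and a normal conditional distribution is assumed for the helper that evaluates $F$.

```python
import math


def normal_cdf(y, mean, sd):
    """Normal distribution function, assumed here as the conditional law of Y."""
    return 0.5 * (1.0 + math.erf((y - mean) / (sd * math.sqrt(2.0))))


def bayesian_chi_square(cdf_values, num_bins=10):
    """Bayesian chi-square statistic for one posterior draw of the parameters."""
    n = len(cdf_values)
    expected = n / num_bins  # n * p_k with p_k = 1 / num_bins
    counts = [0] * num_bins
    for u in cdf_values:
        k = min(int(u * num_bins), num_bins - 1)  # bin index for u in (0, 1)
        counts[k] += 1
    return sum((m - expected) ** 2 / expected for m in counts)


def exceedance_proportion(rb_draws, critical_value=16.919):
    """Proportion of R^B draws above the 0.95 quantile of chi-square(9)."""
    return sum(1 for r in rb_draws if r > critical_value) / len(rb_draws)


# Hypothetical inputs: one value in the middle of each bin, then all values in one bin.
even = bayesian_chi_square([0.05 + 0.1 * k for k in range(10)])
lumped = bayesian_chi_square([0.01] * 10)
```

In practice the statistic would be recomputed at every retained Gibbs iteration, and the collection of values would then be passed to `exceedance_proportion`.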

The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between them. Thus, instead of taking all the observations (i.e. the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5 while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Figure 3-5. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom. A) Basic Semiparametric Model. B) Semiparametric RW Model.

Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values, we think that our models provide a satisfactory fit to the data set and that there is no indication of coding errors.

3.6 Discussion

The proper estimation of median household income for different small areas is one

of the principal goals of the U.S. Census Bureau. These estimates are frequently used

by the Federal Government for the administration and maintenance of different federal

programs and also for the allotment of federal grants to local jurisdictions. Although

these estimates are available annually for every state, the U.S. Census Bureau generally

uses a non-longitudinal approach in their estimation procedure based on the Fay-Herriot

model (Fay and Herriot, 1979). In this study, we have proposed a semiparametric class

of models which exploit the longitudinal trend in the state-specific income observations.

In doing so, we have modeled the CPS median income observations as an "income

trajectory" using penalized splines (Eilers and Marx, 1996). We have also extended the

basic semiparametric model by adding a time series random walk component which can

explain any specific trend in the income levels over time. We have used as our covariate,

the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states.

Analysis has been carried out in a hierarchical Bayesian framework. Our target of

inference has been the median household incomes for all the states of the U.S. and the

District of Columbia for the year 1999. We have evaluated our estimates by comparing

those with the corresponding census estimates of 1999 using some commonly used

comparison measures.

Our analysis has shown that information on past median income levels of different
states does provide strength towards the estimation of state specific median incomes
for the current period. In fact, if there is an underlying non-linear pattern in the median

income levels, it may be worthwhile to capture that pattern as accurately as possible and

use that in the inferential procedure. In terms of modeling the underlying observational

pattern, the positioning of knots proved to be both important and interesting. The









quality (in terms of their "closeness" to the census estimates) of the estimates tended

to improve as the knots were positioned more uniformly throughout the range of the

independent variable. It became apparent that the contribution of the knots towards

deciphering the underlying observational pattern improved substantially when those

were properly placed with an optimal coverage area. This in turn improved the

approximation of the curve vis-a-vis the true unknown observational pattern. This
proved interesting because, as yet, there is no absolute rule which controls the
positioning of knots. Our final estimates proved to be superior, not only to the raw CPS
estimates, but also to the current U.S. Census Bureau (SAIPE) estimates. Although the
basic semiparametric model performed much better than the semiparametric random
walk model with 5 knots, more experiments need to be done with different knot positions

and number before anything conclusive can be said about their relative performance as

a whole. But, it seems that, if adequate knots are used and if those are placed uniformly

throughout the range of the independent variable, then a random walk component

may not improve the fit any further provided there is no strong trend in the income

levels. The main advantage of our modeling procedure is that it can be used for any

possible patterns in the response (income, poverty etc) observations of small areas. In

a subsequent work related to the estimation of median incomes of 4-person families,
we have shown that the multivariate version of the basic semiparametric model performs
quite well too and provides estimates which are consistently superior to the U.S. Census
Bureau estimates.

The above models can be extended in various ways based on the nature of the

observational pattern and the quality (or richness) of the dataset. Some obvious

extensions are given as follows: (1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible extension would be to model the state-specific deviations as unspecified non-parametric functions as follows

$$Y_{ij} = f(x_{ij}) + g_i(x_{ij}) + u_{ij} + e_{ij}, \quad \text{where } g_i(x_{ij}) = b_{i1} + b_{i2} x_{ij} + \sum_{k=1}^{K^*} w_{ik}(x_{ij} - \kappa_k)_+ \qquad \text{(3-9)}$$

Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2}x$, and a non-linear one, $\sum_{k=1}^{K^*} w_{ik}(x - \kappa_k)_+$, thus allowing for more flexibility. Both these components are random with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly

relevant in situations where the state-specific income trajectories are quite distinct

from the population mean curve and thus need to be modeled explicitly. We plan to

pursue this extension if we can procure a richer dataset with longer state specific

income trajectories. (2) Sometimes the function to be estimated (here the median

income pattern) may have varying degrees of smoothness in different regions. In

that case, a single smoothing parameter may not be proper and a spatially adaptive

smoothing procedure can be used (Ruppert and Carroll, 2000). (3) We used the

truncated polynomial basis function to model the income trajectory but other types

of bases like B-splines, radial basis functions etc can also be used. (4) Although we

used a parametric normal distributional assumption for the random state and time

specific effects, a broader class of distributions like the mixtures of Dirichlet processes

(MacEachern and Muller, 1998) or Polya trees (Hanson and Johnson, 2000) may be

tested.

Finally, we think that the semiparametric modeling approach holds a

lot of promise for small domain problems, especially when observations for each domain

are collected over time. The associated class of semiparametric models can well be an

attractive alternative to the models generally employed by the U.S. Census Bureau.









CHAPTER 4
ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES: A MULTIVARIATE
BAYESIAN SEMIPARAMETRIC APPROACH

4.1 Introduction

Small area estimation techniques have been widely used for estimating various

features of small domains, that is, domains for which the sample size is prohibitively small

for the application of direct survey based estimation procedures. Small domains can

be specific regions like a state, county or school district or can even be identified by a

particular socio-demographic characteristic like a specific ethnic group.

The U.S. Census Bureau has always been concerned with the estimation of

income and poverty characteristics of small areas across the United States. These

estimates play a vital role in the administration of federal programs and the

allocation of federal funds to local jurisdictions. For example, state level estimates of

median income for four-person families are needed by the U.S. Department of Health

and Human Services (HHS) in order to formulate its energy assistance program for low

income families. Since income characteristics for small areas are generally collected

over time, there may well be a time varying pattern in those observations. Neglecting

those patterns may lead to biased estimates that do not reflect the true picture. In

this study, we put forward a multivariate Bayesian semiparametric procedure for the

estimation of median income of four-person families for the different states of the U.S.

while explicitly accommodating the time varying pattern in the observations.

4.1.1 Census Bureau Methodology

The estimation of median incomes for different family sizes used to be carried out

by the U.S. Census Bureau until a few years ago. More recently, they have established

the Small Area Income and Poverty Estimates (SAIPE) program which exclusively

deals with the estimation of median household income and poverty estimates for small

areas across the United States. But the estimation of the median income of four-person









families remains interesting nevertheless. Now, we will briefly discuss the estimation

procedure that the U.S. Census Bureau used to follow towards that end.

In estimating the median income of four-person families, the U.S. Census Bureau

relied on data from three sources. The basic source was the annual demographic

supplement to the March sample of the Current Population Survey (CPS) which used to

provide the state specific median income estimates for different family sizes. The second

source was the decennial census estimates for the year preceding the census year, i.e.,

1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates

of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the

U.S. Department of Commerce. Each of the above data sources (and the resulting

estimates) had some disadvantages, which necessitated an estimation procedure that

used a combination of all three to produce the final median income estimates. The

CPS estimates were based on small samples which resulted in substantial variability.

On the other hand, decennial census estimates, although having negligible standard

errors, were only available every 10 years. Due to this lag in the release of successive

census estimates, there was a significant loss of information concerning fluctuations in

the economic situation of the country in general and small areas in particular. Lastly, the

per capita income estimates did not have associated sampling errors since they were not

obtained using the usual sampling techniques. The details of the estimation procedure

appear in Fay et al. (1993).

The Census Bureau based their estimation procedure on a bivariate regression

model suggested by Fay (1987). In doing so, they used median income observations

for three and five person families in addition to those of four person families. The basic

dataset for each state was a bivariate random vector with one component being the CPS

median income estimates of four person families and the other component being the

weighted average of CPS median incomes of three and five person families, with

weights 0.75 and 0.25 respectively. Both the regression equations used the base year









census median (b) and the adjusted census medians (c) corresponding to four person

families and the weighted average of three and five person families as covariates. The

base year census median denotes the median income estimate obtained from the most

recent decennial census while the adjusted census median (c) for the current year is

obtained by the relation

\[
\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)
\]

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA

for the current and base years respectively. Thus, in the above expression, the current

year adjusted census median estimate is obtained by adjusting the base year census

median by the proportional growth in the PCI between the base year and the current

year. In the regression equation, the base year census median adjusts for any possible

overstatement of the effect of change in the PCI in estimating the current median

incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay

(1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median

income estimate and the estimates obtained from the regression equation.
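
The adjustment described above can be illustrated with a minimal sketch. The function name and the numerical figures are hypothetical and chosen only for illustration; they are not taken from the Census Bureau data or from any Census Bureau software.

```python
def adjusted_census_median(census_median_base, pci_current, pci_base):
    """Scale the base-year census median by the proportional growth in PCI.

    All argument names are illustrative assumptions, not names from the source.
    """
    if pci_base <= 0:
        raise ValueError("base-year PCI must be positive")
    return (pci_current / pci_base) * census_median_base


# Hypothetical figures: PCI grows by 50% between the base and current year,
# so the base-year census median is scaled up by the same factor.
example = adjusted_census_median(census_median_base=20000.0,
                                 pci_current=15000.0,
                                 pci_base=10000.0)
print(example)
```

Because the adjusted median is driven entirely by PCI growth, including the base-year median as a second covariate gives the regression a way to correct for any overstatement of that growth.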

4.1.2 Related Literature

The estimation of median incomes for small areas has received sustained attention

over the years. Datta et al. (1993) extended and refined the ideas of Fay (1987) and

proposed a more appealing empirical Bayesian procedure. They also performed a

univariate and multivariate hierarchical Bayesian analysis of the same problem and

showed that both the EB and HB procedures resulted in significant improvement over

the CPS median income estimates for the univariate and multivariate models. However,

the multivariate model resulted in considerably lower standard error and coefficient of

variation than the univariate model although the point estimates were similar. Later,

Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series

analysis of the same problem by exploiting the inherent repetitive nature of the CPS

median income estimates. In doing so, they estimated the statewide median income









estimates of four-person families for 1989 using 1979 as the base year. They compared

their estimates with the CPS median income estimates and Bureau of Census estimates

by treating the decennial census values as "gold standard". They used both univariate

and bivariate model formulations. In all the cases, the time series model with the

adjusted census median income as covariates performed better than the ones with

either the base year census median as covariates or both the base year and adjusted

census medians as covariates. In all the cases, the time series model performed better

than the non-time series one which only utilized the census median income figures for

1979, the CPS median income estimates for 1989 and the per capita incomes

for 1979 and 1989. Finally, the bivariate time series model using the

median incomes of four and five person families performed the best and outperformed

both the CPS and Bureau of Census estimates of median income.

Semiparametric regression methods have not been used in small area estimation

contexts until recently. This was mainly due to methodological difficulties in combining

the different smoothing techniques with the estimation tools generally used in small

area estimation. The pioneering contribution in this regard is the work by Opsomer

et al. (2008) in which they combined small area random effects with a smooth,

non-parametrically specified trend using penalized splines (Eilers and Marx, 1996). In

doing so, they expressed the non-parametric small area estimation problem as a mixed

effects regression model and analyzed it using restricted maximum likelihood. They also

presented theoretical results on the prediction mean squared error and likelihood ratio

tests for random effects. Inference was based on a simple non-parametric bootstrap

approach. They applied their model to a non-longitudinal, spatial dataset concerning the

estimation of mean acid neutralizing capacity (ANC) of lakes in the north eastern states

of the U.S.









4.1.3 Motivation and Overview

The motivation of our work also originates from the longitudinal nature of the

CPS median income estimates. However, instead of viewing the observations as

a time series as in Ghosh et al. (1996), we have treated the state specific median

income observations as longitudinal profiles or "income trajectories". As with any

longitudinally varying observations, the income profiles (both state-specific and overall)

may have a non-linear pattern over time. Moreover, the successive income observations

may be unbalanced in nature. These features motivated us to use a semiparametric

regression approach in our modeling framework. In doing so, we have modeled the

income trajectory using a penalized spline (or P-spline), which is a commonly used and

powerful function estimation tool in non-parametric inference. The P-spline is expressed

using truncated polynomial basis functions with varying degrees and number of knots

although other types of basis functions like B-splines or thin plate splines can also be

used. As covariates, we have used the adjusted census median incomes since these were

found to be the most effective covariate by Ghosh et al. (1996). We tested four different

regression models viz (1) A univariate model with only the CPS median income of

four-person family as the response variable; (2) A bivariate model with the CPS median

incomes of three and four person families as the response variables; (3) A bivariate

model with the CPS median incomes of four and five person families as the response

variables; and lastly (4) A bivariate model with the CPS median incomes of four person

family and weighted average of the CPS median incomes of three and five person

families (with weights 0.75 and 0.25) as the response variables. In all the cases, our

primary objective has been the estimation of median incomes of four-person families of

all the 50 U.S. states and the District of Columbia for 1989. For each of these models,

analysis has been carried out using a hierarchical Bayesian approach. Since we chose

non-informative improper priors for the regression parameters, propriety of the posterior

has been rigorously proved before proceeding with the computations (see Theorem 3 in









the appendix). Markov chain Monte Carlo methodologies, specifically Gibbs sampling

(Gelfand and Smith, 1990), have been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for

1989 with the corresponding decennial census values in order to test for their accuracy.

In doing so, we observed that the semiparametric model estimates improve upon both

the CPS and the Census Bureau estimates. Interestingly, for all the above models,

the semiparametric estimates are generally superior or at least comparable to the

corresponding estimates from the time series models of Ghosh et al. (1996). This is a

testament to the flexibility and strength of the semiparametric methodology, especially

when observations are collected over time. It also indicates that it may be worthwhile

to take into account the longitudinal income patterns in estimating the current income

conditions of the U.S. states. Lastly, the semiparametric modeling framework is very

general and can be applied to any situation where various characteristics of small areas

are collected over time.

The rest of the chapter is organized as follows. In Section 4.2 we introduce the

bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical

Bayesian analysis we performed. In Section 4.4, we describe the results of the data

analysis with regard to the median household income dataset. Finally, we end with

a discussion and some references towards future work in Section 4.5. The appendix

contains the proofs of the posterior propriety and the expressions of the full conditional

distributions for our models.

4.2 Model Specification

4.2.1 Notation

Let $Y_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics
$\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the ith small area at the jth time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$).
In this study, we are concerned with the estimation of $\theta_{ij}$ or some function of it. For
example, $\theta_{ij1}$ may be the median income of four-person families for the ith state at
the jth year. In that case, we may be interested in estimating $(\theta_{1u1}, \ldots, \theta_{mu1})'$, the
median income of four-person families for all the states at time u. We may also want to
estimate the difference in median incomes of four-person families at times v and u, i.e.,
$(\theta_{1v1} - \theta_{1u1}, \ldots, \theta_{mv1} - \theta_{mu1})'$. Correspondingly, let $X_{ij} = (X_{ij1}, \ldots, X_{ijs})'$ be the predictors
corresponding to the ith state and jth year.

4.2.2 Semiparametric Modeling Framework

We consider both univariate and bivariate income trajectory models for the

family-size dataset. The univariate modeling framework is exactly the same as explained

in Chapter 3. Here, we will explain the bivariate framework which is of two types viz a

simple bivariate model and a bivariate random walk model. These can also be seen as

extensions of the univariate models explained in Section 3.2.2.

4.2.2.1 Simple bivariate model

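The bivariate non-random walk model is given by

\[
\begin{aligned}
Y_{ij1} &= \beta_{01} + \beta_{11} X_{ij1} + \cdots + \beta_{p1} X_{ij1}^{p} + \sum_{k=1}^{K_1} \gamma_{k1} (X_{ij1} - \tau_{k1})_+^{p} + b_{i1} + u_{ij1} + e_{ij1} \\
Y_{ij2} &= \beta_{02} + \beta_{12} X_{ij2} + \cdots + \beta_{q2} X_{ij2}^{q} + \sum_{k=1}^{K_2} \gamma_{k2} (X_{ij2} - \tau_{k2})_+^{q} + b_{i2} + u_{ij2} + e_{ij2}
\end{aligned}
\tag{4-1}
\]

This is the most general structure since the degrees of the splines as well as the number
and positions of the knots are different for the two equations. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$,
$\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and
$\tau_{k1} = \tau_{k2}$, $k = 1, 2, \ldots, K_1$ ($= K_2$).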

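Equation (4-1) can be rewritten as

\[
\begin{aligned}
Y_{ij} &= U_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij} + e_{ij} \\
&= \theta_{ij} + e_{ij},
\end{aligned}
\tag{4-2}
\]

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$.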









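Here $\theta_{ij} = (\theta_{ij1}, \theta_{ij2})'$, $u_{ij} = (u_{ij1}, u_{ij2})'$, $e_{ij} = (e_{ij1}, e_{ij2})'$, $b_i = (b_{i1}, b_{i2})'$,
$\beta = (\beta_{01}, \ldots, \beta_{p1}, \beta_{02}, \ldots, \beta_{q2})'$, $\gamma = (\gamma_{11}, \ldots, \gamma_{K_1 1}, \gamma_{12}, \ldots, \gamma_{K_2 2})'$,

\[
U_{ij}' = \begin{pmatrix}
1 & X_{ij1} & \cdots & X_{ij1}^{p} & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 1 & X_{ij2} & \cdots & X_{ij2}^{q}
\end{pmatrix}
\]

and

\[
Z_{ij}' = \begin{pmatrix}
(X_{ij1} - \tau_{11})_+^{p} & \cdots & (X_{ij1} - \tau_{K_1 1})_+^{p} & 0 & \cdots & 0 \\
0 & \cdots & 0 & (X_{ij2} - \tau_{12})_+^{q} & \cdots & (X_{ij2} - \tau_{K_2 2})_+^{q}
\end{pmatrix}
\]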

Analogous to the univariate case, we assume $b_i \overset{ind}{\sim} N(0, \Sigma_0)$ and $\gamma \sim N(0, \Sigma_\gamma)$.
$e_{ij}$ and $u_{ij}$ are mutually independent with $e_{ij} \overset{ind}{\sim} N(0, \Sigma_{ij})$ and $u_{ij} \overset{ind}{\sim} N(0, \Psi_j)$. For
simplification purposes, we assume that $\Sigma_0 = \mathrm{diag}(\sigma_{01}^2, \sigma_{02}^2)$ and $\Sigma_\gamma = \mathrm{diag}(\sigma_{\gamma 1}^2, \sigma_{\gamma 2}^2)$,
where $\Sigma_{ij}$ is assumed to be known and is estimated from the data as in the univariate
framework. The above bivariate model can easily be generalized to a multivariate
framework if the need arises.
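
To make the structure of the design matrices concrete, the following is a minimal sketch of how the two rows of the polynomial part and of the truncated polynomial part could be assembled for a single state-year observation. The function names, argument names and example values are hypothetical; this is an illustration of the truncated polynomial basis, not the code used for the analysis.

```python
def truncated_power(x, knot, degree):
    """Return (x - knot)_+^degree, the truncated polynomial basis function."""
    diff = x - knot
    return diff ** degree if diff > 0 else 0.0


def design_rows(x1, x2, knots1, knots2, p=1, q=1):
    """Build the two rows of the polynomial and spline design blocks.

    x1, x2         : covariates for the first and second response
    knots1, knots2 : knot locations for the two responses
    p, q           : spline degrees for the two responses
    Returns (u_rows, z_rows), each a list of two rows (block-diagonal layout).
    """
    poly1 = [x1 ** d for d in range(p + 1)]   # 1, x1, ..., x1^p
    poly2 = [x2 ** d for d in range(q + 1)]   # 1, x2, ..., x2^q
    u_rows = [poly1 + [0.0] * (q + 1),
              [0.0] * (p + 1) + poly2]
    trunc1 = [truncated_power(x1, k, p) for k in knots1]
    trunc2 = [truncated_power(x2, k, q) for k in knots2]
    z_rows = [trunc1 + [0.0] * len(knots2),
              [0.0] * len(knots1) + trunc2]
    return u_rows, z_rows


# Hypothetical covariate values and knots, linear splines (p = q = 1).
u_rows, z_rows = design_rows(3.0, 5.0, knots1=[1.0, 4.0], knots2=[2.0])
print(u_rows)
print(z_rows)
```

The block-diagonal layout means each response has its own polynomial and spline coefficients, while dependence between the two responses enters through the random effects and the error covariance.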

4.2.2.2 Bivariate random walk model

In order to model any conspicuous trend in the income observations for a specific
family size and/or a specific state, we add a time specific random component to the

simple bivariate model (4-2) as follows

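\[
\begin{aligned}
Y_{ij} &= U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} \\
&= \theta_{ij} + e_{ij}
\end{aligned}
\tag{4-3}
\]

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$.

Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{i.i.d.}{\sim} N(0, \Sigma_v)$.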
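
The random walk structure of the time specific effects can be illustrated by simulation. The sketch below assumes, purely to keep the example short, that the innovation covariance matrix is diagonal, so the two components of each innovation are drawn independently; the function name, standard deviations and seed are hypothetical.

```python
import random


def simulate_random_walk(t, sd1, sd2, seed=0):
    """Simulate v_1, ..., v_t with v_0 = 0 and v_j = v_{j-1} + w_j.

    w_j has independent N(0, sd1^2) and N(0, sd2^2) components, i.e. a
    diagonal innovation covariance (an assumption of this sketch).
    """
    rng = random.Random(seed)
    v_prev = (0.0, 0.0)          # v_0 = 0
    path = []
    for _ in range(t):
        w = (rng.gauss(0.0, sd1), rng.gauss(0.0, sd2))
        v_prev = (v_prev[0] + w[0], v_prev[1] + w[1])
        path.append(v_prev)
    return path


# Eleven time points, matching the number of years in a 1979-1989 span.
path = simulate_random_walk(t=11, sd1=1.0, sd2=2.0, seed=1)
print(len(path), path[0])
```

Since each effect is centered at its predecessor, a shock in one year persists in later years, which is what allows the model to pick up a common trend across states.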









4.3 Hierarchical Bayesian Analysis

In this section, the notation and expressions correspond to the bivariate

setup. The expressions for the univariate setup are analogous and are given in

detail in Chapter 3.

4.3.1 Likelihood Function

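Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $U_i = (U_{i1}, \ldots, U_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$
be the covariate vectors corresponding to the ith state. Here, $Y_{ij} = (Y_{ij1}, Y_{ij2})'$ and the
expressions for $U_{ij}$ and $Z_{ij}$ are given above. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$
be the parameter space corresponding to the ith state where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$. Thus, the
full parameter space will be given by $\Omega = \Omega_1 \times \cdots \times \Omega_m$. For the bivariate non-random
walk model, the likelihood function for the ith state is given by

\[
\begin{aligned}
L(Y_i, U_i, Z_i \mid \Omega_i) &\propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, U_i, Z_i)\, L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma) \\
&= \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \Sigma_{ij})\, L(\theta_{ij} \mid U_{ij}'\beta + Z_{ij}'\gamma + b_i, \Psi_j) \right\} L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma)
\end{aligned}
\tag{4-4}
\]

Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance
covariance matrix $\Sigma$.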

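For the bivariate random walk model, the parameter space for the ith state would be
$\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$ where $v = (v_1, \ldots, v_t)'$ is the vector of time
specific random effects. The hierarchical Bayesian framework is given by

1. $(Y_{ij} \mid \theta_{ij}) \sim N(\theta_{ij}, \Sigma_{ij})$

2. $(\theta_{ij} \mid \beta, \gamma, b_i, v_j, \Psi_j) \sim N(U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \Psi_j)$

3. $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$, assuming $v_0 = 0$

4. $(b_i \mid \Sigma_0) \sim N(0, \Sigma_0)$

5. $\gamma \sim N(0, \Sigma_\gamma)$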









Thus, the likelihood function for the ith state, (4-4), will have an extra component

corresponding to $v$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which has a normal distribution with mean

$v_{j-1}$ and covariance matrix $\Sigma_v$.

4.3.2 Prior Specification

To complete the Bayesian specification of our model, we need to assign prior

distributions to the unknown parameters. We assume a noninformative improper uniform
prior for the polynomial coefficients (or fixed effects) $\beta$ and proper conjugate inverse
Wishart priors on the variance covariance matrices $(\{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$. The prior
distributions are assumed to be mutually independent. We choose the inverse Wishart
parameters in such a way that the priors are diffuse in nature so that inference is mainly
controlled by the data distribution.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ ($j = 1, \ldots, t$),
$\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$
denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$
having the density $f(X) \propto |X|^{-(b+p+1)/2} \exp(-\mathrm{tr}(A X^{-1})/2)$, $p$ being the order of $A$.

4.3.3 Posterior Distribution and Inference

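The full posterior of the parameters given the data is obtained in the usual way by
combining the likelihood and the prior distribution as follows

\[
p(\Omega \mid Y, U, Z) \propto \prod_{i=1}^{m} L(Y_i, U_i, Z_i \mid \Omega_i)\, \pi(\beta)\, \pi(\Sigma_0)\, \pi(\Sigma_\gamma) \prod_{j=1}^{t} \pi(\Psi_j)
\tag{4-5}
\]

For the random walk model there will be an additional term $\pi(\Sigma_v)$. By conditional
independence properties, we can factorize the full posterior as

\[
\begin{aligned}
[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto{}& [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, U, Z] \\
&\times [b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_0]\,[\Sigma_\gamma] \prod_{j=1}^{t} [\Psi_j]
\end{aligned}
\]

Our target of inference is $\{\theta_{ij}, i = 1, \ldots, m; j = 1, \ldots, t\}$, the true median income
of four-person families for all the states. Since the marginal posterior distribution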









of $\theta_{ij}$ is analytically intractable, high dimensional integration would have to be carried out in

a theoretical framework. However, this task can be easily accomplished in an MCMC

framework by using the Gibbs sampler to sample from the full conditionals of $\theta_{ij}$ and

the other relevant parameters. In implementing the Gibbs sampler, we follow the

recommendation of Gelman and Rubin (1992) and run n (≥ 2) parallel chains. For each

chain, we run 2d iterations with starting points drawn from an overdispersed distribution.

To diminish the effects of the starting distributions, the first d iterations of each chain are

discarded and posterior summaries are calculated based on the rest of the d iterates.

The full conditionals for both the models are given in the appendix.
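
The burn-in rule and the multiple-chain convergence check can be sketched for a single scalar quantity as follows. This is a simplified version of the Gelman and Rubin (1992) potential scale reduction factor, without their degrees-of-freedom correction; the function names and the toy chains are hypothetical and are not output from the models in this chapter.

```python
from statistics import mean, variance


def retain_second_half(chain):
    """Discard the first half of a chain of 2d iterations as burn-in."""
    return chain[len(chain) // 2:]


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: list of n lists, each holding the d retained draws of the
    same scalar quantity from one of the parallel chains.
    """
    n = len(chains)
    d = len(chains[0])
    if n < 2 or d < 2 or any(len(c) != d for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    within = mean(variance(c) for c in chains)           # W
    if within == 0:
        raise ValueError("within-chain variance is zero")
    between = d * variance([mean(c) for c in chains])    # B
    pooled = (d - 1) / d * within + between / d          # pooled variance estimate
    return (pooled / within) ** 0.5


# Toy chains: the second chain sits far from the first, so the factor is large.
raw = [[9.0, 9.0, 0.0, 1.0, 0.0, 1.0], [9.0, 9.0, 10.0, 11.0, 10.0, 11.0]]
kept = [retain_second_half(c) for c in raw]
print(gelman_rubin(kept))
```

Values close to 1 suggest that the chains have mixed; values well above 1, as in the toy example, indicate that the chains are still exploring different regions and more iterations are needed.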

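Once posterior samples are generated from the full conditionals of the parameters,
Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$:

\[
E(\theta_{ij} \mid y) \approx (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1} (U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})\right)
\tag{4-6}
\]

and

\[
\begin{aligned}
V(\theta_{ij} \mid y) \approx{}& (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \\
&+ (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left[\left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1} (U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})\right)\right] \\
&\qquad \times \left[\left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1} (U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})\right)\right]' \\
&- (nd)^{-2} \left[\sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1} (U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})\right)\right] \\
&\qquad \times \left[\sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1} (U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})\right)\right]'
\end{aligned}
\tag{4-7}
\]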
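
The Rao-Blackwellized summaries can be illustrated in the scalar (univariate) case, where the matrix inverses reduce to reciprocals. The sketch below is an assumption-laden illustration: `prior_mean` stands for the linear predictor plus random intercept at a given draw, `psi` for the corresponding variance draw, and the numerical values are hypothetical rather than MCMC output from our models.

```python
def conditional_mean_var(y, sampling_var, prior_mean, psi):
    """Mean and variance of theta given y and one draw of the other parameters
    (scalar normal-normal case)."""
    precision = 1.0 / sampling_var + 1.0 / psi
    var = 1.0 / precision
    mean = var * (y / sampling_var + prior_mean / psi)
    return mean, var


def rao_blackwell(y, sampling_var, draws):
    """Average conditional moments over the retained draws.

    draws: list of (prior_mean, psi) pairs, one per retained MCMC draw.
    Returns (posterior mean, posterior variance), where the variance is the
    average conditional variance plus the variance of the conditional means.
    """
    pairs = [conditional_mean_var(y, sampling_var, m, psi) for m, psi in draws]
    n = len(pairs)
    post_mean = sum(m for m, _ in pairs) / n
    post_var = (sum(v for _, v in pairs) / n
                + sum(m * m for m, _ in pairs) / n
                - post_mean ** 2)
    return post_mean, post_var


# Two hypothetical draws with equal variances.
print(rao_blackwell(y=12.0, sampling_var=1.0, draws=[(10.0, 1.0), (14.0, 1.0)]))
```

Averaging the conditional moments rather than the sampled values themselves typically reduces Monte Carlo error, which is the reason for using the Rao-Blackwellized form.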

4.4 Data Analysis

We applied the semiparametric models in Section 4.2.2 to analyze the median
income dataset referred to in Section 4.1.3. The basic dataset for our problem is the
triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance covariance matrix $\Sigma_{ij}$ ($i = 1, \ldots, 51$; $j = 1, \ldots, 11$).
Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of
four, three and five person families for the ith state and the jth year. $Y_{iju}$ is assumed
to estimate the true unknown median income $\theta_{iju}$ ($u = 1, 2, 3$). The corresponding
adjusted census medians are denoted by $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$. The years correspond to
1979, ..., 1989.

For the univariate setup, the response and covariate are respectively $Y_{ij1}$ and
$X_{ij1}$. For the bivariate setup, the basic data vector is a duplet whose first component is $Y_{ij1}$
and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75 Y_{ij2} + 0.25 Y_{ij3}$. The adjusted census
medians are chosen analogously. As mentioned before, our targets of inference are the
state specific median incomes of four person families for 1989.

4.4.1 Comparison Measures and Knot Specification

In this study, our target of inference is the state specific median income corresponding

to four-person families for the year 1989. We judged our estimates by comparing those

to the corresponding census figures for 1989. In small area estimation problems, the

census estimates are often treated as "gold standard" against which all other estimates

are compared. However, such a comparison is only possible for those years which

immediately precede the census year, i.e., 1969, 1979, 1989 and 1999.

In order to check the performance of our estimates, we use four comparison

measures. These were originally recommended by the panel on small area estimates of

population and income set up by the Committee on National Statistics in July 1978 and

are available in their July 1980 report (p. 75). These are

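* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|}{c_i}$

* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} \frac{|c_i - e_i|^2}{c_i^2}$

* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$

* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$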









Here $c_i$ and $e_i$ respectively denote the census and model based estimates of median

income for the ith state ($i = 1, \ldots, 51$). Clearly, lower values of these measures would

imply a better model based estimate.
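
The four comparison measures are straightforward to compute once the census values and the model based estimates are available. The following minimal sketch uses a hypothetical function name and two made-up areas instead of the 51 states, purely for illustration.

```python
def comparison_measures(census, estimates):
    """Return ARB, ASRB, AAB and ASD for paired census values and estimates."""
    if len(census) != len(estimates) or not census:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(census)
    pairs = list(zip(census, estimates))
    return {
        "ARB": sum(abs(c - e) / c for c, e in pairs) / n,
        "ASRB": sum((c - e) ** 2 / c ** 2 for c, e in pairs) / n,
        "AAB": sum(abs(c - e) for c, e in pairs) / n,
        "ASD": sum((c - e) ** 2 for c, e in pairs) / n,
    }


# Hypothetical values for two areas (not taken from the dataset).
measures = comparison_measures([100.0, 200.0], [110.0, 180.0])
print(measures)
```

ARB and ASRB are scale free, whereas AAB and ASD are expressed in dollars and squared dollars respectively, so the latter two are dominated by states with larger absolute errors.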

The basic structure of our models would remain the same as in Section 4.2.2. We

have used linear truncated polynomial basis functions for the P-spline component

in our models since the median income profiles didn't exhibit a high degree of

non-linearity. For highly non-linear profiles a quadratic or cubic polynomial basis

function representation can be used. In non-parametric regression problems, the

proper selection of knots plays a critical role. Ideally, a sufficient number of knots

should be selected and placed uniformly throughout the range of the independent

variable so that the underlying observational pattern is properly captured. Too few or

too many knots generally degrade the quality of the fit. This is because, if too few

knots are used, the complete underlying pattern may not be captured properly, thus

resulting in a biased fit. On the other hand, once there are enough knots to fit important

features of the data, a further increase in the number of knots has little effect on the fit and may

lead to overparametrization (Ruppert, 2002). Generally, at most 35 to 40 knots are

recommended for effectively all sample sizes and for nearly all smooth regression

functions. Following the general convention, we have placed the knots on a grid of

equally spaced sample quantiles of the independent variable (adjusted census median

income).
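
Knot placement at equally spaced sample quantiles can be sketched as follows. The sketch assumes that the kth of K knots is placed at the k/(K+1) sample quantile and that quantiles are obtained by linear interpolation between order statistics; both conventions, as well as the function name, are assumptions made for illustration.

```python
def quantile_knots(x, num_knots):
    """Place knots at the k/(K+1) sample quantiles of x, k = 1, ..., K."""
    if not x:
        raise ValueError("x must be non-empty")
    xs = sorted(x)
    knots = []
    for k in range(1, num_knots + 1):
        pos = (len(xs) - 1) * k / (num_knots + 1)   # fractional index into xs
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        knots.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return knots


# Hypothetical covariate values 0, 1, ..., 10 with four knots.
print(quantile_knots([float(i) for i in range(11)], 4))
```

Placing knots at quantiles rather than on an equally spaced grid puts more knots where the covariate values are dense, so each spline segment is supported by a comparable number of observations.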

4.4.2 Computational Details

We implemented and monitored the convergence of the Gibbs sampler following the

general guidelines given in Gelman and Rubin (1992). We ran three parallel chains,

with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate

t-distributions with 2 df having the same location and scale matrices as the corresponding

multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin

idea of initializing the chain at overdispersed distributions. However, once initialized, the









successive samples of $\theta_{ij}$'s are generated from regular multivariate normal distributions.

Convergence of the Gibbs sampler was monitored by visually checking the dynamic

trace plots and autocorrelation (ACF) plots and by computing the Gelman-Rubin diagnostic. To diminish the

effect of the starting distributions, the first d iterations of each chain are discarded

and the posterior summaries are based on the subsequent iterates. The comparison

measures deviated slightly for different initial values. We report the smallest of these as the

final measures presented in the tables that follow.

4.4.3 Analytical Results

Data on CPS median income and adjusted census median incomes were available

for 50 states and the District of Columbia for the time span 1979-1989. CPS median

income ranged from $24,879.68 to $52,778.94 with a mean of $36,868.48 and standard

deviation of $5954.94 while adjusted census median income ranged from $27,910 to

$72,769.38 with a mean of $41,133.45 and standard deviation of $7196.56.

We fitted both the univariate and bivariate models to the median income dataset.

In doing so, we worked with all possible knot choices from 0 to 40. Here, we show only

the results corresponding to the best performing model, i.e., the model with the

lowest values of the comparison measures.

In the univariate framework, the model with 3 knots in the income trajectory

performed the best. Table 4-1 reports the comparison measures for this model (denoted

as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates

(Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS)

estimates. Table 4-2 reports the percentage improvement of the time series, non-time

series and the semiparametric estimates over the census bureau estimates.

From Table 4-1, it is clear that the semiparametric estimates significantly improve

upon the CPS, time series and non-time series estimates with respect to all the

comparison measures. In fact, the semiparametric estimates perform slightly better

than the bivariate Census Bureau estimates too with respect to ARB and AAB. This









Table 4-1. Comparison measures for univariate estimates
Estimate    ARB      ASRB     AAB        ASD
CPS         0.0735   0.0084   2,928.82   13,811,122.39
Bureau      0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS      0.0338   0.0018   1,351.67    3,095,736.14
GNK.NTS     0.0363   0.0021   1,457.47    3,468,496.61
USPM(3)     0.0289   0.0014   1,169.74    2,549,698.26

Table 4-2. Percentage improvements of univariate estimates over Census Bureau estimates
Estimate    ARB       ASRB      AAB       ASD
GNK.TS      -14.19%   -38.46%   -14.17%   -43.90%
GNK.NTS     -22.64%   -61.54%   -23.11%   -61.22%
USPM(3)       2.37%    -7.69%     1.20%   -18.52%


is also reflected in Table 4-2 where the semiparametric estimates marginally improve

upon the Bureau estimates for the above two comparison measures. Overall, the

degree of dominance of the Bureau estimates on the time series and non time series

estimates is much larger compared to that on the semiparametric estimates. These

results indicate that, in the univariate framework, the semiparametric model with 3 knots

performs significantly better than the time series and non-time series models of Ghosh

et al. (1996).
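
The percentage improvements can be recomputed from the comparison measures. The tables do not state the formula explicitly; the sketch below assumes it is 100 × (Bureau − model) / Bureau, which reproduces the AAB and ASD entries for USPM(3) in Table 4-2 from the rounded values in Table 4-1. The function name is hypothetical.

```python
def percent_improvement(bureau, model):
    """Percentage reduction of a comparison measure relative to the Bureau value.

    Positive values mean the model improves on the Census Bureau estimate.
    The formula itself is an assumption inferred from the tabulated values.
    """
    return 100.0 * (bureau - model) / bureau


# AAB and ASD for Bureau and USPM(3), as reported in Table 4-1.
aab = percent_improvement(1183.90, 1169.74)
asd = percent_improvement(2151350.18, 2549698.26)
print(round(aab, 2), round(asd, 2))
```

Entries for ARB and ASRB recomputed this way may differ in the last digit from the table, since the tabulated comparison measures are themselves rounded to four decimals.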

Now, we move on to the bivariate non-random walk setup. First, we consider

the model whose response vector consists of the CPS median incomes of 4 and 3 person families,

i.e., $(Y_{ij1}, Y_{ij2})$. The covariates are the corresponding adjusted census medians.

Since we assumed inverse Wishart priors for the variance covariance matrices, the

values of the comparison measures were dependent on the degrees of freedom of

the Wishart distribution and the number of knots in the income trajectory. We worked

with different combinations of the two in fitting these models. The best results (lowest

comparison measures) were obtained for two models, both with 6 knots but with degrees

of freedom 7 and 9 respectively. These models are denoted by BSPM(1)(4,3) and

BSPM(2)(4,3) respectively. When we consider the median incomes of 4 and 5 person









Table 4-3. Comparison measures for bivariate non-random walk estimates
Estimate          ARB      ASRB     AAB        ASD
CPS               0.0735   0.0084   2,928.82   13,811,122.39
Bureau            0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS(4,3)       0.0295   0.0013   1,171.71    2,194,553.67
GNK.NTS(4,3)      0.0323   0.0016   1,287.78    2,610,249.94
BSPM(1)(4,3)      0.0274   0.0013   1,079.63    2,182,669.56
BSPM(2)(4,3)      0.0286   0.0011   1,131.61    1,880,089.29
GNK.TS(4,5)       0.0230   0.0009     932.51    1,618,025.33
GNK.NTS(4,5)      0.0295   0.0013   1,179.94    2,216,738.06
BSPM(4,5)         0.0255   0.0010   1,033.12    1,859,373.98
GNK.TS(4,3+5)     0.0287   0.0013   1,150.24    2,116,692.71
GNK.NTS(4,3+5)    0.0324   0.0015   1,297.12    2,530,938.06
BSPM(1)(4,3+5)    0.0271   0.0012   1,078.50    2,128,679.65
BSPM(2)(4,3+5)    0.0289   0.0012   1,132.10    1,838,598.30

families, the lowest comparison measures were obtained for the model with 4 knots in

the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5).

Lastly, for the model with median incomes of 4 person families and the weighted

average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response

vectors, the best results were obtained for two models, both with 6 knots and with

degrees of freedom 7 and 9 respectively. We denote these models as BSPM(1)(4,3+5)

and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these

models along with those of CPS, Bureau, and the corresponding bivariate GNK time

series and non-time series estimates. Table 4-4 reports the percentage improvement of

the above estimates over the census bureau estimates.
From Table 4-3 and Table 4-4, it is clear that both the BSPM(4,3) and BSPM(4,3+5)

estimates improve upon the bivariate time series and non-time series estimates with

respect to nearly all of the four comparison measures. The semiparametric estimates also

generally improve upon the Census Bureau estimates and the raw CPS estimates. For the model

with median incomes of four and five person families as the response, the semiparametric

estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but

significantly improve upon the CPS and Census Bureau estimates.











Table 4-4. Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
Estimate          ARB      ASRB      AAB      ASD
GNK.TS(4,3)       -0.48%    -2.52%    1.03%    -2.01%
GNK.NTS(4,3)      -8.99%   -22.45%   -8.77%   -21.33%
BSPM(1)(4,3)       7.43%     0.00%    8.81%    -1.46%
BSPM(2)(4,3)       3.38%    15.38%    4.42%    12.61%
GNK.TS(4,5)       22.19%    30.52%   21.23%    24.79%
GNK.NTS(4,5)       0.31%    -0.18%    0.33%    -3.04%
BSPM(4,5)         13.85%    23.08%   12.74%    13.57%
GNK.TS(4,3+5)      2.94%     3.56%    2.84%     1.61%
GNK.NTS(4,3+5)    -9.36%   -17.18%   -9.56%   -17.64%
BSPM(1)(4,3+5)     8.45%     7.69%    8.90%     1.05%
BSPM(2)(4,3+5)     2.37%     7.69%    4.37%    14.54%

Now let us consider the bivariate random walk model. For the case with 4 and

3 person families, the lowest comparison measures were obtained for three models

with degrees of freedom and number of knots (3, 6), (5, 6) and (9, 1) respectively. We

denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively.

Each of these models significantly improves upon the CPS and Census Bureau

estimates and is also superior to the bivariate time series and non-time series models

proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to

improve marginally over those corresponding to the non-random walk semiparametric

model. When we consider the median income estimates of 4 and 5 person families,

the random walk model with degrees of freedom 5 and 1 knot in the trajectory seems

to perform the best. The comparison measures are significantly better than the CPS,

Bureau and the non-time series model of GNK. However, they fall marginally short of

the time series estimates but fare better than the corresponding estimates obtained

from the non-random walk model (BSPM(4, 5)). We denote this model as BRWM(4,

5). Lastly, for the model with median incomes of 4 person families and the weighted

average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response

vectors, the best results were obtained for the model with 5 degrees of freedom and 1

knot in the trajectory. The comparison measures were significantly better than the CPS, Bureau and GNK (both time series and non-time series) while it also improved upon the non-random walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Table 4-5. Comparison measures for bivariate random walk model

Estimate         ARB      ASRB     AAB       ASD
BRWM(1)(4,3)     0.0261   0.0011   1043.33   1,902,416.1
BRWM(2)(4,3)     0.0274   0.0010   1094.25   1,804,969.06
BRWM(3)(4,3)     0.0258   0.0012   1037.03   2,114,599.65
BRWM(4,5)        0.0245   0.0010    978.12   1,672,183.6
BRWM(4,3+5)      0.0244   0.0011    990.50   1,941,833.29
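The four comparison measures and the percentage improvements can be illustrated with a short sketch. It assumes the definitions commonly attributed to Ghosh et al. (1996), with the census figure c_i treated as the truth and e_i as the estimate for state i: ARB is the average of |e_i - c_i|/c_i, ASRB the average of (e_i - c_i)^2/c_i^2, AAB the average of |e_i - c_i| and ASD the average of (e_i - c_i)^2. The percentage improvement is assumed to be 100 x (Bureau - estimate)/Bureau. The function names and the toy numbers are hypothetical.

```python
def comparison_measures(truth, estimate):
    """Return ARB, ASRB, AAB and ASD of `estimate` relative to `truth`."""
    n = len(truth)
    errors = [e - c for c, e in zip(truth, estimate)]
    return {
        "ARB": sum(abs(d) / c for d, c in zip(errors, truth)) / n,
        "ASRB": sum((d / c) ** 2 for d, c in zip(errors, truth)) / n,
        "AAB": sum(abs(d) for d in errors) / n,
        "ASD": sum(d ** 2 for d in errors) / n,
    }


def pct_improvement(baseline, candidate):
    """Percentage by which `candidate` lowers a measure relative to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical census medians and model estimates for three states.
census = [40000.0, 50000.0, 60000.0]
model = [41000.0, 49000.0, 60600.0]
measures = comparison_measures(census, model)
```

Under this assumed definition, a drop in ASRB from 0.0013 to 0.0012 gives `pct_improvement(0.0013, 0.0012)`, roughly 7.69, which is the value that appears for several ASRB entries in Table 4-4.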

4.5 Conclusion and Discussion

Estimates of various characteristics of small areas are frequently used by

the federal government for formulating important policy decisions and to provide

developmental funds to different states and local jurisdictions. These are also used

by various local agencies to formulate business policies and other important decisions.

Often observations on these characteristics (for example, income and poverty estimates)

are available at multiple time points in the past thus resulting in a longitudinal profile or

trajectory. Taking proper account of these time varying profiles may result in a significant

improvement in the estimates of the same characteristics at some current or future time

points. In this scenario, spline based semiparametric procedures have a clear edge

over the usual parametric procedures since the former can take into account virtually any

possible pattern in the underlying profile and can also handle unbalanced observations

with ease.

Estimation of median incomes of four person families for the different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3, 4 and 5 person families for all the states and the District of Columbia for every year. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.



In this study, we put forward a multivariate Bayesian semiparametric procedure for

the estimation of median income of four-person families for the U.S. states while

explicitly accommodating the time varying pattern in the income observations.

We used a bivariate semiparametric modeling framework in which we modeled the

median incomes of a given pair of family sizes (for every state) as penalized splines.

In doing so, we came up with estimates of median incomes of 4 person families which

were significantly better than those obtained by the U.S. Bureau of the Census and were

comparable to those obtained by the time series methodology of Ghosh et al. (1996).

We also extended the basic semiparametric framework by incorporating a time series

(random walk) component to account for the within state dependence in the successive

income observations. The class of random walk models seemed to improve upon their

non-random walk counterparts, but further study is needed before reaching

a definite conclusion about their relative performance. Overall, we strongly believe that

semiparametric procedures hold a lot of promise for small area estimation problems,

specifically in situations where multiple time varying observations of some characteristic

are available for the small areas.



CHAPTER 5
CONCLUSION AND FUTURE RESEARCH

In my dissertation, I have concentrated on the application of semiparametric

methodologies in analyzing unorthodox data scenarios originating in diverse fields like

case control studies and small area estimation. In the former scenario, I have used

penalized splines to model longitudinal exposure profiles and their influence pattern on

the current disease status for a group of cases and controls. In doing so, I have come

to the conclusion that past exposure observations may have a significant effect on the

present disease status. Our modeling framework is quite general and flexible in the

sense that it can be used to model any possible patterns of exposure profiles and also

it can capture complex time varying patterns of influence of the exposure history on the

current disease status. We applied our modeling framework on a nested case control

study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA).

In the second scenario, we have used semiparametric procedures to model the income

trajectories of different small areas and have used that information to estimate the

median incomes of those small areas at a given time point in the future. Our model

based estimates seemed to perform better than the usual Bureau of Census estimates

which are based on the income observations from a particular time point and hence

are non-longitudinal in nature. We have also extended the semiparametric modeling

framework to the bivariate scenario in estimating the median income of varying family

sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above.

The flexibility and power of nonparametric and semiparametric procedures immediately imply that a multitude of interesting and useful extensions can be carried out over what has already been done above. I will briefly go over some of the possible extensions below. These extensions are independent of the specific area or setting where they are applied, i.e., they apply equally to the case control and small area scenarios we have mentioned before.

5.1 Adaptive Knot Selection

As mentioned before, we have used penalized splines to model the exposure and

influence profiles in the case control framework and the income trajectories in the small

area estimation problem. As explained in Section 1.4, selection and proper positioning

of knots is a vital aspect in any smoothing procedure involving splines. Traditionally,

knots are placed at equally spaced sample quantiles of the independent variables and

that is what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low density region of the data pattern where the non-linearity was mostly concentrated. This was probably because of the quantile-dependent placement of the knots.
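A minimal sketch of quantile-based knot placement may help make this drawback concrete. It assumes that knot k of K sits at the k/(K+1) sample quantile of the unique covariate values, with linear interpolation between order statistics; other conventions exist. The function names and the toy covariate are hypothetical.

```python
def sample_quantile(sorted_x, q):
    """Sample quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    frac = pos - lo
    return sorted_x[lo] * (1 - frac) + sorted_x[hi] * frac


def quantile_knots(x, num_knots):
    """Place knots at equally spaced sample quantiles of the unique x values."""
    xs = sorted(set(x))
    return [sample_quantile(xs, k / (num_knots + 1))
            for k in range(1, num_knots + 1)]


def truncated_linear_basis(x, knots):
    """Design rows (1, x, (x - k_1)+, ..., (x - k_K)+) of a linear penalized spline."""
    return [[1.0, xi] + [max(xi - k, 0.0) for k in knots] for xi in x]


# Hypothetical covariate: dense between 1 and 9, one isolated value at 20.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
knots = quantile_knots(years, 3)
design = truncated_linear_basis(years, knots)
```

With this toy covariate every knot falls inside the dense region, so the sparse stretch between 9 and 20 receives no knot at all, which is the kind of behaviour described above.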

Recently, there has been some research on data-driven or "adaptive" knot

placement procedures in which the number and locations of the knots are controlled

by the data itself rather than being pre-specified. The advantage of this procedure is that

fewer knots are required, and they are placed in "optimal" locations

along the domain. Thus, the resulting spline fit will be flexible enough to capture any

underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches

have been proposed towards this end. Some Frequentist contributions include Friedman

(1991) and Stone et al. (1997) who used forward and backward knot selection schemes

until the "best" model is identified. Zhou and Shen (2001) used an alternative algorithm

which led to the addition of knots at locations which already possessed some knots.

Bayesian treatment of this problem revolves around the notion of treating the knot number

and knot locations as free parameters. Some notable Bayesian contributions include



Denison et al. (1998) who placed priors on the number and locations of the knots. Then

they sampled from the full posteriors of the parameters (including knot locations and

numbers) using reversible jump MCMC methods (Green, 1995). However, they restricted

the knots to be located only at the design points of the independent variable. DiMatteo

et al. (2001) followed the same basic procedure as Denison et al. (1998) but they

did not restrict the knots to be located only at the design points of the experiment. They

also penalized models with an unnecessarily large number of knots. Botts and Daniels

(2008) proposed a flexible approach for fitting multiple curves to sparse functional

data. In doing so, they treated the numbers and locations of knots of the population

averaged and subject specific curves as distinct random variables and sampled from

their posterior distributions using reversible jump MCMC methods. They used free-knot

b-splines to model the population averaged and subject specific curves. In all the

above contributions, Poisson priors are placed on the knot numbers while flat priors are

placed on the knot positions. The usefulness and flexibility of the Bayesian approach

lies in the fact that the number and locations of knots are automatically determined

from the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive

Regression Splines. However, the sampling procedure is quite intensive since the

parameter dimension can change from one iteration to the next. Botts and Daniels substantially reduced

the computational burden by working with the approximate posterior distribution of only

the number and positions of the knots, integrating out the other parameters using

Laplace approximations.

An immediate but worthwhile extension to what we have already done would be

to incorporate an adaptive knot selection scheme into both the case control and small

area modeling frameworks. For the former setup, this would correspond to deciphering

the optimal number of knots for the population mean PSA trajectory and the influence

function. So, depending on the particular study or the dataset at hand, any underlying

pattern in the influence profile (of the exposure trajectory on the disease state) can be



automatically captured. For the second framework, an adaptive knot selection scheme

would result in a spline fit that would adequately reflect any underlying heterogeneity in

the time varying income trajectory of the different small areas.

Some other interesting extensions to our work are:

1. Incorporating informative (non-ignorable) missingness (Little and Rubin, 1987) in
the longitudinal exposure (case control) or income (small area) profiles.

2. Incorporating non-parametric distributional structures like mixtures of Dirichlet
processes (MacEachern and Muller, 1998), Polya trees (Hanson and Johnson,
2000) on the subject (or area) specific random effects.

3. Extending the semi-parametric case control modeling framework to situations
involving multiple (> 2) or even categorical disease states.

I now briefly describe some work in which we are currently engaged.

5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent
Class and Transitional Modelling

5.2.1 Introduction and Brief Literature Review

Longitudinal studies deal with repeated measurement of individuals over time. As

a result, missingness is an integral part of these studies. Missingness can result from

different causes like dropout or withdrawal from the course of treatment, intermittent

absence from a visit, death due to unrelated causes etc. In this study we will only

consider missingness induced by dropout. Depending on the precise nature or causes

of dropout, different missingness (or dropout) mechanisms have been formulated (Little

and Rubin, 1987). Broadly, these are of three types:

1. Missing completely at random (MCAR): Missingness induced by dropout is said
to be MCAR if it is completely independent of the response.

2. Missing at random (MAR): Missingness induced by dropout is said to be MAR
if dropout depends only on the observed data, i.e., dropout is unrelated to the
unobserved data conditional on the observed data.

3. Missing not at random (MNAR): This occurs when missingness depends on
the unobserved response at the time of dropout or at future times, even after
conditioning on the observed data. This type of missingness is also known as
non-ignorable or informative dropout.
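The three mechanisms can be contrasted with a small simulation sketch. The dropout hazards used here (a constant 0.2 for MCAR, and logistic functions of the previous or the current response for MAR and MNAR) are illustrative assumptions, as are the function names; they are not taken from any of the cited papers.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_dropout(y, mechanism, rng):
    """Return the observed prefix of y under a monotone dropout mechanism."""
    observed = [y[0]]  # the baseline measurement is always observed
    for t in range(1, len(y)):
        if mechanism == "MCAR":
            p_drop = 0.2                     # unrelated to any response
        elif mechanism == "MAR":
            p_drop = expit(-2.0 + y[t - 1])  # depends on the last observed value
        elif mechanism == "MNAR":
            p_drop = expit(-2.0 + y[t])      # depends on the value about to be missed
        else:
            raise ValueError(mechanism)
        if rng.random() < p_drop:
            break                            # dropout: nothing is observed from t on
        observed.append(y[t])
    return observed


rng = random.Random(2010)
y = [rng.gauss(0.0, 1.0) for _ in range(6)]
lengths = {m: len(simulate_dropout(y, m, rng)) for m in ("MCAR", "MAR", "MNAR")}
```

Only the MNAR branch uses a response that the analyst never sees, which is why that mechanism cannot be ignored when modeling the observed data.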



Non-ignorable missingness can be handled by two distinct classes of models, namely

pattern-mixture and selection models, first formulated by Little and Rubin (1987). These

approaches differ in the way they factor the joint distribution of the missing data and

the response. In the former approach, the population is first stratified by the pattern of

dropout resulting in a model for the whole population that is a mixture over the patterns.

On the other hand, the selection modelling approach first models the hypothetical

complete data and then a model for the missing data process (conditional on the

hypothetical complete data) is appended to the complete data model. In this study we

will focus on the pattern-mixture (PM) modeling approach.

Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and the dropout time for the $i$th subject. $D_i$ is such that

$$D_i = \begin{cases} t & \text{if the $i$th subject drops out between the $(t-1)$th and $t$th observation times,} \\ T+1 & \text{if the $i$th subject is a completer.} \end{cases}$$

Here we assume that a subject is first measured at baseline ($t = 0$). Thus, there are $T$ unique dropout times. In the PM approach, it is assumed that subjects with different dropout times have different response distributions, i.e.,

$$f(y_i \mid D_i) \neq f(y_i) \;\Longrightarrow\; f(y_i, D_i) \neq f(y_i)\, f(D_i) \qquad \text{(5-1)}$$

So, for the $i$th subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on

$$f(y) = \sum_{D} f(y \mid D)\, P(D).$$
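The coding of the dropout time and the mixture over patterns can be sketched as follows. The sketch assumes that a missing measurement is stored as `None`, that position 0 of the response list is the baseline, and that pattern-specific means and pattern probabilities are already available; all names and numbers are hypothetical.

```python
def dropout_time(responses, T):
    """Return t if the subject is first missing at time t (1..T), else T + 1."""
    for t in range(1, T + 1):
        if responses[t] is None:
            return t
    return T + 1  # completer


def marginal_mean(pattern_means, pattern_probs):
    """E(Y) = sum over D of E(Y | D) P(D): the mixture over dropout patterns."""
    assert abs(sum(pattern_probs.values()) - 1.0) < 1e-12
    return sum(pattern_means[d] * pattern_probs[d] for d in pattern_probs)


T = 3
d_dropout = dropout_time([1.2, 0.8, None, None], T)   # drops out before time 2
d_complete = dropout_time([1.0, 2.0, 3.0, 4.0], T)    # observed at every time

# Hypothetical pattern-specific means and dropout-time probabilities.
means = {1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}
probs = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
overall = marginal_mean(means, probs)
```

When $T$ is large, many of the entries of `probs` are estimated from only a handful of subjects, which is the sparsity problem discussed next.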
An important and realistic situation that may arise in longitudinal studies is that

the number of unique dropout times T (and hence the number of times a subject is

measured) may be large. As a result, the number of subjects having a particular dropout

time may be quite small. Thus, stratification by dropout pattern may lead to sparse



patterns, which may result in unstable parameter estimates in those patterns since

some of the parameters may be unidentifiable. There are different ways to get around

this problem. Hogan and Laird (1998) suggested that parameters be shared across

patterns. Hogan et al. (2004) suggested ways to group the T dropout times into m < T

groups in an ad hoc fashion. Roy (2003) proposed an automated mechanism to do

the above grouping using a latent variable approach within the context of normal

models for continuous data. This approach assumes the existence of a discrete latent

variable that explains the dependence between the response vector and the dropout

time and allows incorporation of uncertainty about the groupings, conditional on a

fixed number of groups. Roy and Daniels (2008) extended the above approach by

incorporating uncertainty in the number of classes through approximate Bayesian model

averaging. In their approach, the marginal mean is assumed to follow a generalized

linear model, while the mean conditional on the latent class and random effects is

specified separately. Since the dimension of the parameter vector of interest (the

marginal regression coefficients) does not depend on the assumed number of latent

classes, they treat the number of latent classes as a random variable. A prior distribution

is assumed for the number of classes and approximate posterior model probabilities

are calculated. In order to avoid the complications with implementing a fully Bayesian

model, they propose a simple approximation to these posterior probabilities. Lastly, they

apply their methodology to a dataset dealing with the longitudinal study of depression in
HIV-infected women.

Heagerty (1999) proposed marginally specified logistic normal models for

longitudinal binary data. In doing so, he proposed an alternative parametrization of

the logistic normal random effects model and studied both likelihood and estimating

equation approaches to parameter estimation. A notable feature of his approach

was that the marginal regression parameters still permit individual level predictions

or contrasts. Heagerty (2002) also proposed a general parametric class of serial



dependence models which permits likelihood based marginal regression analysis of

binary response data. These are known as marginalized transition models. Basically, it

is a combination of a marginal regression model used to characterize the dependence

of the response on covariates and a conditional regression model or transition model

(Diggle et al., 2002) used to capture the serial dependence in the response process.

There exists another class of models known as marginalized latent variable models

which takes care of the exchangeable or non-diminishing dependence pattern among

the repeated response observations using random intercepts. Schildcrout and Heagerty

(2007) combined the marginalized transition and latent variable models by proposing a

unifying model that takes into account both serial and long range dependence among

the response observations. Their model can be used in situations with moderate to large

number of repeated measurements per subject where both serial (short range) and

exchangeable (long range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002),

Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new

model which accounts for both serial (short-term) and long-range dependence among

the response observations in situations where the number of unique dropout times is

large. We group the dropout times using a latent variable approach taking into account

the uncertainty in the number of groups. We also model the marginal covariate effects of

interest.

5.2.2 Modeling Framework

Longitudinal observations collected on an individual over multiple time points are

always correlated since they correspond to the same subject. An established way

of accounting for this dependence (in the response vector Yi of the ith subject) is to

introduce subject specific random effects, say bi. In a typical longitudinal study the

principal aim of the researcher is to model the marginal covariate effects using the



marginal regression model. But this goal cannot be achieved directly with a random effects model and a non-linear link function, since the regression structure specified conditional on the random effects does not carry over to the marginal covariate effects once the random effects are integrated out.

Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is specified as

$$\text{logit}\,[E(Y_{it} \mid X_{it}, \beta)] = X_{it}'\beta \qquad \text{(5-2)}$$

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, the following conditional model is specified:

$$\text{logit}\,[E(Y_{it} \mid X_{it}, b_i)] = \Delta_{it} + b_i \qquad \text{(5-3)}$$

where $b_i \sim N(0, \theta)$. $\Delta_{it}$ can be computed by solving the following convolution equation:

$$P(Y_{it} = 1) = \int P(Y_{it} = 1 \mid X_{it}, b_i)\, dF(b_i) \qquad \text{(5-4)}$$

Thus $\Delta_{it}$ is a function of $\beta$ and $\theta$. In this study we will be proposing a model which will marginalize over the random effects and the dropout distribution to directly model the marginal covariate effects of interest, taking into account both the serial and exchangeable dependence structure among the $Y_{it}$'s.
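The convolution equation (5-4) has no closed-form solution for the logit link, but the intercept in (5-3) can be found numerically. The following is a minimal sketch, assuming a scalar normal random effect with variance `theta`, a trapezoidal rule for the integral and bisection for the root; the function names, grid size and search interval are illustrative choices, not those of Heagerty (1999).

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def marginalized_mean(delta, theta, grid=2001, width=8.0):
    """Approximate the integral of expit(delta + b) against N(0, theta)."""
    sd = math.sqrt(theta)
    if sd == 0.0:
        return expit(delta)  # no random effect: nothing to integrate
    lo = -width * sd
    h = 2.0 * width * sd / (grid - 1)
    total = 0.0
    for k in range(grid):
        b = lo + k * h
        weight = 0.5 if k in (0, grid - 1) else 1.0  # trapezoidal end weights
        density = math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * expit(delta + b) * density * h
    return total


def solve_delta(x_beta, theta, tol=1e-10):
    """Find delta so that the marginalized mean equals expit(x_beta)."""
    target = expit(x_beta)
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginalized_mean(mid, theta) < target:
            lo = mid  # the marginalized mean is increasing in delta
        else:
            hi = mid
    return 0.5 * (lo + hi)


delta = solve_delta(1.0, 1.0)  # marginal linear predictor 1, variance 1
```

For a positive marginal linear predictor the solved intercept exceeds it, reflecting the familiar attenuation of marginal effects relative to conditional ones.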

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \ldots, m$ ($m < T$), such that

$$S_{ij} = \begin{cases} 1 & \text{if the $i$th subject is in class $j$,} \\ 0 & \text{otherwise.} \end{cases}$$

We assume that, conditional on the past observations, $Y_{it}$ depends only on the previous $p$ observations, i.e., $(Y_{i,t-1}, Y_{i,t-2}, \ldots, Y_{i,t-p})$. Here we have to deal with the following three types of dependence structures:

1. Dependence between response and dropout time, modeled by the latent classes.

2. Short range (serial) dependence between $Y_{it}$ and $(Y_{i,t-1}, \ldots, Y_{i,t-p})$, modeled by a
marginalized transition model of order $p$, MTM($p$).

3. Long range or non-diminishing dependence among the $Y_{it}$'s, modeled by the
subject specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

$$\mu^{M}_{it} = E(Y_{it} \mid X_{it}, \beta) = g^{-1}(X_{it}'\beta) \qquad \text{(5-5)}$$

The above model marginalizes over the subject specific random effects and over the latent class distribution (and hence implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S_i$. This is reasonable because the specific latent class a subject belongs to would depend solely on his/her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D_i$ itself.

Before delving into the model, it is important to note that the conditional model

parameters are not of main interest, and in fact will be viewed as nuisance parameters.

This is because we are not interested in estimating either subject-specific effects (i.e.

effects conditional on the random effects) or class-specific covariate effects (i.e. effects

of covariates on Y given a particular dropout class). Moreover, the conditional model

should be so specified that it is compatible with the marginal model (5-5). As we will see

below, this leads to a somewhat complicated model. Specifying this conditional model



is necessary, as we will see, in order to account for the three types of dependencies

mentioned above.

We assume that the $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, are from an exponential family with distribution

$$f(y_{it} \mid y_{ik}, k < t, b_i, S_i) = \exp\left[\{y_{it}\eta_{it} - \psi(\eta_{it})\}/(m_i\phi) + h(y_{it}, \phi)\right]$$

where $E(Y_{it} \mid y_{ik}, k < t, b_i, S_i) = g^{-1}(\eta_{it}) = \psi'(\eta_{it})$. Here $\eta_{it}$ is the linear predictor, $\psi(\cdot)$ is a known function, $\phi$ is a scale parameter and $m_i$ is the prior weight. We next specify the conditional model as

$$g\{E(Y_{it} \mid y_{ik}, k < t, b_i, S_i)\} = \Delta_{it} + \sum_{j=1}^{m} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k} + b_i \qquad \text{(5-6)}$$

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\delta_k^{(j)}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, p$, where $V_{it,k}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $\delta^{(j)} = (\delta_1^{(j)}, \delta_2^{(j)}, \ldots, \delta_p^{(j)})$ determines how the dependence between $Y_{it}$ and $Y_{i,t-k}$ varies as a function of the covariates $V_{it,k}$ conditional on the latent classes. We also impose the sum-to-zero constraint $\alpha^{(m)} = -\sum_{j=1}^{m-1} \alpha^{(j)}$ for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\alpha^{(j)}$.

The probabilities of the latent classes given the dropout times are specified by a proportional odds model (Agresti, 2002), given by

$$\text{logit}\, P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big) = \lambda_{0k} + \lambda_1 D_i, \quad k = 1, \ldots, m-1, \qquad \text{(5-7)}$$

where $\lambda_{01} < \lambda_{02} < \cdots < \lambda_{0,m-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).



Instead of the proportional odds model, we can also assume a proportional hazards model, i.e.,

$$\log\Big[-\log\Big\{1 - P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big)\Big\}\Big] = \lambda_{0k} + \lambda_1 D_i, \quad k = 1, \ldots, m-1.$$

The other option would be to assume an ordinal probit formulation for the probabilities of the latent classes, given by

$$\Phi^{-1}\Big\{P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big)\Big\} = \lambda_{0k} + \lambda_1 D_i, \quad k = 1, \ldots, m-1.$$

The predicted probabilities obtained from the ordinal probit model are similar to those obtained from the proportional odds model. Moreover, an advantage of the former model is that sampling from its posterior distribution is particularly efficient. For this reason, the ordinal probit model is sometimes preferred if a Bayesian analysis needs to be performed.
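The three specifications differ only in the link applied to the cumulative class probabilities, so the class probabilities themselves follow by differencing. A minimal sketch, assuming increasing cutpoints (the intercepts of (5-7)), a common slope on the dropout time, and hypothetical parameter values and function names:

```python
import math


def cdf_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def cdf_cloglog(x):
    # Inverse of the complementary log-log link used in the proportional hazards form.
    return 1.0 - math.exp(-math.exp(x))


def cdf_probit(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def class_probabilities(d, cutpoints, slope, cdf):
    """P(class j | dropout time d), j = 1..m, from m - 1 cumulative probabilities."""
    cumulative = [cdf(c + slope * d) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)  # difference of adjacent cumulative terms
        previous = value
    return probs


# Hypothetical values: m = 3 classes, cutpoints -1.0 < 0.5, slope 0.3, dropout time 2.
links = {"logit": cdf_logit, "cloglog": cdf_cloglog, "probit": cdf_probit}
probs_by_link = {name: class_probabilities(2.0, [-1.0, 0.5], 0.3, f)
                 for name, f in links.items()}
```

Because the cutpoints are ordered, each difference is positive and the class probabilities sum to one under every link.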
Lastly, the dropout times $D_i$ are assumed to follow a multinomial distribution with mass at each possible dropout time, parameterized by $\rho$. Here we make the important assumption that $Y_i$ is independent of $D_i$ given $S_i$. Our main targets of inference are the covariate effects averaged over the classes, i.e., the marginal covariate effects in (5-5). The intercept $\Delta_{it}$ in (5-6) is determined by the following relationship between the marginal and conditional models:

$$E(Y_{it} \mid X_{it}) = \sum_{D}\sum_{S} p(S_i \mid D_i)\, P(D_i) \int \Big\{\sum_{A} E(Y_{it} \mid y_{i,t-1}, \ldots, y_{i,t-p}, b_i, S_i)\, p(y_{i,t-1}, \ldots, y_{i,t-p} \mid b_i, S_i)\Big\}\, p(b_i \mid S_i)\, db_i$$

where $A = \{y_{i,t-1}, \ldots, y_{i,t-p}\}$.
5.2.3 Likelihood, Priors and Posteriors

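Let the set of all parameters be denoted by $\omega = (\beta, \alpha, \sigma_1^2, \ldots, \sigma_m^2, \delta)$. We partition the complete response data for subject $i$, $Y_i^{c}$, into observed components (values of $Y_i^{c}$ prior to dropout), denoted by $Y_i$, and missing components (response observations after dropout), denoted by $Y_i^{m}$. Since the subjects are independent of one another, the likelihood for the parameters is the product of individual contributions (from each subject). Once $\Delta_i = (\Delta_{i1}, \Delta_{i2}, \ldots, \Delta_{iT})$ has been calculated, the evaluation of the individual contribution from subject $i$ becomes straightforward. In the following expressions, $m$ (the number of latent classes) will be implicitly conditioned upon. Thus we have

$$L(\omega \mid Y, X, D) = \prod_{i=1}^{N} L_i(\omega \mid Y_i, X_i, D_i)$$

where

$$L_i(\omega \mid Y_i, X_i, D_i) \propto \int \sum_{j=1}^{m} L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, p(S_{ij} = 1 \mid D_i; \lambda)\, p(D_i; \rho)\, dF(b_i \mid S_{ij} = 1, \sigma_j^2)$$

Here,

$$L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = L_i(Y_{i1} \mid S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, L_i(Y_{i2} \mid Y_{i1}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) \times \cdots \times L_i(Y_{iT} \mid Y_{i,T-1}, \ldots, Y_{i,T-p}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) \qquad \text{(5-8)}$$

Proportionality in (5-8) holds because we assume that the missing and observed responses from subject $i$ are independent, given $S_i$ and $b_i$ (i.e., $[Y_i^{m} \mid Y_i, b_i, S_i] = [Y_i^{m} \mid b_i, S_i]$). Following the OPEF formulation, we have

$$L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \exp\Big[\Big\{\sum_{t=1}^{T} \eta_{it}\, y_{it} - \sum_{t=1}^{T} \psi(\eta_{it})\Big\}/(m_i\phi) + \sum_{t=1}^{T} h(y_{it}, \phi)\Big]$$

where

$$\eta_{i1} = g\{E(Y_{i1} \mid b_i, S_{ij} = 1)\} = \Delta_{i1} + b_i + \sum_{j=1}^{m} S_{ij} Z_{i1}'\alpha^{(j)}$$

$$\eta_{i2} = g\{E(Y_{i2} \mid y_{i1}, b_i, S_{ij} = 1)\} = \Delta_{i2} + b_i + \sum_{j=1}^{m} S_{ij} Z_{i2}'\alpha^{(j)} + \gamma_{i2,1}\, y_{i1}$$

$$\vdots$$

$$\eta_{iT} = g\{E(Y_{iT} \mid y_{i,T-1}, \ldots, y_{i,T-p}, b_i, S_{ij} = 1)\} = \Delta_{iT} + b_i + \sum_{j=1}^{m} S_{ij} Z_{iT}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{iT,k}\, y_{i,T-k}$$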


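Since the number of latent classes $m$ is itself treated as a random variable, we assume a prior for $m$ along with $\omega$. Let the priors be respectively denoted by $\pi(m)$, $\pi(\beta)$, $\pi(\alpha)$, $\{\pi(\sigma_j^2), j = 1, 2, \ldots, m\}$, $\pi(\lambda)$, $\pi(\rho)$, $\pi(\phi)$ and $\pi(\delta)$. So the full posterior of $m$ and $\omega$ is given by

$$\pi(\omega, m \mid Y, X, D) \propto \Big\{\prod_{i=1}^{N} L_i(\omega \mid Y_i, X_i, D_i)\Big\}\, \pi(m)\, \pi(\beta)\, \pi(\alpha)\, \pi(\lambda)\, \pi(\rho)\, \pi(\phi)\, \pi(\delta)\, \Big\{\prod_{j=1}^{m} \pi(\sigma_j^2)\Big\} \qquad \text{(5-9)}$$

We can avoid the integral (w.r.t. $b_i$) in (5-8) if we also sample the $b_i$'s along with the other parameters from the full posterior (5-9). In that case, the full posterior may be rewritten as

$$\pi(\omega, m \mid Y, X, D) \propto \Big\{\prod_{i=1}^{N} L_i^{*}(\omega \mid Y_i, X_i, D_i)\Big\}\, \pi(m)\, \pi(\beta)\, \pi(\alpha)\, \pi(\lambda)\, \pi(\rho)\, \pi(\phi)\, \pi(\delta)\, \Big\{\prod_{j=1}^{m} \pi(\sigma_j^2)\Big\} \qquad \text{(5-10)}$$

where

$$L_i^{*}(\omega \mid Y_i, X_i, D_i) = \sum_{j=1}^{m} L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, p(S_{ij} = 1 \mid D_i; \lambda)\, p(D_i \mid \rho)\, p(b_i \mid S_{ij} = 1, \sigma_j^2) \qquad \text{(5-11)}$$

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution:

$$L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \prod_{t=1}^{T} (\mu^{c}_{it})^{y_{it}} (1 - \mu^{c}_{it})^{1 - y_{it}} \qquad \text{(5-12)}$$

where

$$\mu^{c}_{it} = E(Y_{it} \mid y_{i,t-1}, y_{i,t-2}, \ldots, y_{i,t-p}, b_i, S_{ij} = 1) = g^{-1}\Big(\Delta_{it} + b_i + Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k}\Big)$$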
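The Bernoulli form (5-12) is straightforward to evaluate once the intercepts, the random effect and the class are fixed. The sketch below assumes a logit link, a single subject, a given latent class (so that the class-specific covariate term is supplied as a precomputed number per time point) and transition coefficients that do not vary with covariates; these simplifications and all names are illustrative assumptions.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def conditional_loglik(y, delta, b, z_alpha, gamma):
    """Log of prod_t mu_t^y_t (1 - mu_t)^(1 - y_t) for one subject and one class.

    y        binary responses in time order
    delta    intercepts, one per time point
    b        subject-specific random effect
    z_alpha  class-specific covariate term, one per time point
    gamma    transition coefficients for lags 1..p
    """
    loglik = 0.0
    for t, y_t in enumerate(y):
        eta = delta[t] + b + z_alpha[t]
        for k, g in enumerate(gamma, start=1):
            if t - k >= 0:            # lags before the first response are skipped
                eta += g * y[t - k]
        mu = expit(eta)
        loglik += math.log(mu) if y_t == 1 else math.log(1.0 - mu)
    return loglik


# Hypothetical subject with three binary responses and an MTM of order 1.
ll = conditional_loglik([1, 1, 0], [0.0, 0.0, 0.0], 0.0, [0.0, 0.0, 0.0], [2.0])
```

Sampling the random effect alongside the other parameters, as described for (5-10), means that a function of this kind is all that is needed at each MCMC step; no numerical integration over the random effect is required.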









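Since $\text{logit}\, P\big(\sum_{j=1}^{k} S_{ij} = 1 \mid D_i\big) = \lambda_{0k} + \lambda_1 D_i$, we have

$$P(S_{ij} = 1 \mid D_i) = P(S_{i1} + \cdots + S_{ij} = 1 \mid D_i) - P(S_{i1} + \cdots + S_{i,j-1} = 1 \mid D_i) = \frac{e^{\lambda_1 D_i}\,(e^{\lambda_{0j}} - e^{\lambda_{0,j-1}})}{(1 + e^{\lambda_{0j} + \lambda_1 D_i})(1 + e^{\lambda_{0,j-1} + \lambda_1 D_i})} \qquad \text{(5-13)}$$

Now, as mentioned earlier, $D_i$ is the dropout time for the $i$th subject. Also, there are $T$ unique dropout times. Let, for $t = 1, 2, \ldots, T$,

$$\psi_{it} = \begin{cases} 1 & \text{if the $i$th subject drops out between the $(t-1)$th and $t$th observation times,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus $\psi_i = (\psi_{i1}, \psi_{i2}, \ldots, \psi_{iT}) = (0, 0, \ldots, 0)$ would imply that the $i$th subject is a completer. So, $D_i = t \Leftrightarrow \psi_{it} = 1$ and $D_i = T + 1 \Rightarrow (\psi_{i1}, \psi_{i2}, \ldots, \psi_{iT}) = (0, 0, \ldots, 0)$. Let $\rho_t$ denote the probability of dropping out between times $t - 1$ and $t$, $t = 1, 2, \ldots, T$. So, for the $i$th subject, the density of $D_i$ would be multinomial, i.e.,

$$P(D_i = d_i) = \rho_1^{\psi_{i1}} \cdots \rho_T^{\psi_{iT}}\, (1 - \rho_1 - \cdots - \rho_T)^{1 - \psi_{i1} - \cdots - \psi_{iT}}, \quad d_i = 1, 2, \ldots, T + 1 \qquad \text{(5-14)}$$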
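The multinomial density (5-14) can be evaluated directly from the dropout indicators. A minimal sketch, with hypothetical dropout probabilities for three possible dropout times and illustrative function names:

```python
def dropout_indicators(d, T):
    """Indicators for dropout between times t - 1 and t, t = 1..T (all 0 for completers)."""
    return [1 if d == t else 0 for t in range(1, T + 1)]


def dropout_probability(d, rho):
    """Multinomial mass of dropout time d in 1..T+1, where T + 1 marks a completer."""
    psi = dropout_indicators(d, len(rho))
    prob = 1.0
    for rho_t, psi_t in zip(rho, psi):
        prob *= rho_t ** psi_t                     # picks out the observed dropout time
    prob *= (1.0 - sum(rho)) ** (1 - sum(psi))     # completer term
    return prob


rho = [0.1, 0.2, 0.3]  # hypothetical dropout probabilities, T = 3
masses = [dropout_probability(d, rho) for d in range(1, len(rho) + 2)]
```

The completer probability is whatever mass is left after the T dropout probabilities, so the T + 1 masses sum to one.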

5.2.4 Specification of Priors
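We assume that the number of latent classes $m$ follows a truncated Poisson distribution with rate parameter $\nu$, truncated at an integer $s$ between 1 and $T$ (the number of unique dropout times), i.e.,

$$p(m) \propto \frac{e^{-\nu}\,\nu^{m}}{m!}, \quad m = 0, 1, \ldots, s, \quad \text{where } 1 \le s \le T.$$

For the other parameters, we assume the following priors:

1. Let $\beta \sim N_q(\beta_0, \Sigma_{\beta_0})$, assuming that $X_{it}$ is $q$-dimensional for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$.

2. Let $\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(m)} \overset{iid}{\sim} N_r(\alpha_0, \Sigma_{\alpha_0})$, where $r < q$ since $Z_{it} \subset X_{it}$ for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$.

3. Let $\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2 \overset{iid}{\sim} U(a, b)$, where $0 < a < b < \infty$.

4. $\lambda \sim N_m(\lambda_0, \Sigma_{\lambda_0})$.

5. $(\rho_1, \rho_2, \ldots, \rho_T) \sim \text{Dirichlet}(\tau_1, \tau_2, \ldots, \tau_T)$.

6. $\delta_1^{(j)}, \ldots, \delta_p^{(j)} \overset{iid}{\sim} N_r(\delta_0, \Sigma_{\delta_0})$, for the same reasons as in (2).

7. For the time being we keep the prior of $\phi$, $\pi(\phi)$, unspecified.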
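The truncated Poisson prior on the number of latent classes only requires normalizing Poisson weights over the allowed range. A minimal sketch, in which the rate, the truncation point and the lower end of the support are supplied by the user (the names and the numbers are hypothetical):

```python
import math


def truncated_poisson_pmf(rate, s, lowest=0):
    """Poisson weights rate^m / m! on m = lowest..s, renormalized to sum to one."""
    support = range(lowest, s + 1)
    weights = [rate ** m / math.factorial(m) for m in support]
    total = sum(weights)  # the factor exp(-rate) cancels in the normalization
    return {m: w / total for m, w in zip(support, weights)}


# Hypothetical choice: rate 2, at most 5 latent classes.
prior_m = truncated_poisson_pmf(2.0, 5)
```

If at least one latent class is required, `lowest=1` restricts the support accordingly.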

Now, combining (5-10) to (5-14) and the priors specified above, we can write down the full posterior distribution of $m$ and $\omega$, $\pi(\omega, m \mid Y, X, D)$, up to a constant. Thus, we can obtain the full conditional distributions of all the relevant parameters and proceed with sample generation using MCMC.

The assumption of conditional independence between $Y_i$ and $D_i$ given $S_i$ and the covariates can be verified by performing a likelihood ratio test (Frequentist) or using Bayes factors (Bayesian). The null model is given by (5-6) and the alternative model may be written as

$$g\{E(Y_{it} \mid y_{ik}, k < t, b_i, S_i, D_i)\} = \Delta_{it} + \sum_{j=1}^{m} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k} + b_i + f(D_i) \qquad \text{(5-15)}$$

where $f(D_i)$ may be a smooth but unspecified function of $D_i$. Thus, the null hypothesis of conditional independence (between $Y_i$ and $D_i$ given $S_i$ and $X_i$) would be simply $f(D_i) = 0$. The test can be carried out by first fitting the null model (5-6). Then, the posterior probability of class membership for each subject can be estimated by

$$P(S_{ij} = 1 \mid D_i, Y_i, X_i, \hat{\omega}) = \frac{\int L_i(Y_i \mid Y_{\{-i\}}, S_{ij} = 1, b_i, \hat{\alpha}^{(j)}, \hat{\phi})\, p(S_{ij} = 1 \mid D_i; \hat{\lambda})\, p(D_i \mid \hat{\rho})\, dF(b_i \mid S_{ij} = 1, \hat{\sigma}_j^2)}{L_i(D_i, Y_i, \hat{\omega})}$$

where $\hat{\omega}$ is obtained by performing a full Bayesian analysis on the full conditionals of $\omega$. The likelihood ratio test (LRT) is then performed by fitting models (5-6) and (5-15) using a weighted likelihood (the weights being the above posterior probabilities of class membership). An alternative way of doing the above conditional independence tests would be to use score tests based on smoothing splines, as used in proportional hazards models by Lin et al. (2006).



The model proposed in (5-14) has the most general form. We can simplify it by

assuming a linear effect of drop-out time in which case the alternative (simpler) model

would be
m p J
g{E(Yt Yk, k < t b,,S,, Di)} = Ai + SyZ' + it,kYit-k + b,+ hj(Di)j (5-16)
j=l k=1 j=l

where each h(... ) is a known function and the b's are parameters. The null hypotheses

would be Ho : i = ... = J = 0. The linear drop-out effect would imply J = 1 and

h(Di) = Di. The LRT can then be performed as before by fitting models (5-6) and
(5-16) using the same weights given above. We can also use Bayes factors for carrying
out these analyses.
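As a rough illustration of the last step of the LRT under a linear drop-out effect (J = 1), the sketch below turns two maximized (weighted) log-likelihood values into a test statistic and a p-value. It assumes a chi-square reference distribution with one degree of freedom, which the text does not state and which may hold only approximately under a weighted likelihood. The function name and the numeric log-likelihood values are hypothetical.

```python
import math

def lrt_one_df(loglik_null, loglik_alt):
    """Return (statistic, p-value) for an LRT with one extra parameter.

    Assumes a chi-square(1) reference distribution; this is an assumption
    made for the sketch, not a result taken from the text.
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical maximized weighted log-likelihoods for models (5-6) and (5-16).
stat, p_value = lrt_one_df(loglik_null=-512.40, loglik_alt=-509.10)
print(round(stat, 2), round(p_value, 4))
```

For J > 1 the same statistic would be referred to a chi-square distribution with J degrees of freedom, which needs an incomplete gamma function and is omitted here.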

Note I : The above methodology is based on the fundamental assumption that
the drop-out time is discrete; we assumed that there are T possible dropout
times and then modeled the dropout distribution as a multinomial. A possible and
interesting extension would be to assume that the dropout distribution is
continuous, i.e., the ith individual can drop out at any time point
within an interval. In that case, we do not have to introduce latent classes to summarize
the dropout times. We believe that the proposed methodology can be modified to
accommodate this situation.

Note II : As mentioned before, Heagerty (1999) proposed marginally specified
logistic normal models for longitudinal binary data. He proposed two models. The first
is a marginal logistic regression model which links the average response to the
covariates through the equation

$$\mathrm{logit}\, E(Y_{ij} \mid X_i) = X_{ij}'\beta \qquad \text{(5-17)}$$

Here $Y_{ij}$ and $X_{ij}$ respectively denote the binary response and the exogenous covariate
vector recorded at time j for the ith subject, i = 1, 2, ..., N; j = 1, 2, ..., ni. The second
model is a conditional model which explains the within-subject dependence among
the longitudinal measurements. This is achieved by conditioning on a vector of latent
variables (or random effects) $b_i$ such that

$$\mathrm{logit}\, E(Y_{ij} \mid b_i, X_i) = \Delta_{ij} + b_{ij} \qquad \text{(5-18)}$$

An important assumption is that, conditional on $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, the
components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$, where
$\Sigma_i$ models the dependence among the $b_{ij}$'s (and thus, indirectly, among the $Y_{ij}$'s) and can
be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \ldots, t_{in_i})$ and a parameter
vector $\alpha$. Heagerty (1999) referred to the models given in (5-17) and (5-18) as the
marginally specified logistic normal models.

Under the above modelling framework, the parameter $\Delta_{ij}$ can be expressed as a
function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation
of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the
following convolution equation :

$$h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz \qquad \text{(5-19)}$$

where h(.) is the inverse of the logit link and $\phi(.)$ is the standard normal density function.
Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and
Newton-Raphson iteration.
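The solution of the convolution equation (5-19) can be sketched as follows. This is a minimal illustration, not Heagerty's implementation: the integral is approximated with a composite Simpson rule on [-8, 8] (a choice made here for simplicity), and the function names `marginal_mean` and `solve_delta` are hypothetical.

```python
import math

def expit(x):
    """Inverse logit h(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def simpson(f, lo, hi, n=400):
    """Composite Simpson rule with n (even) subintervals."""
    step = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * step)
    return total * step / 3.0

def marginal_mean(delta, sigma):
    """Right-hand side of (5-19): integral of h(delta + sigma z) phi(z) dz."""
    return simpson(lambda z: expit(delta + sigma * z) * normal_pdf(z), -8.0, 8.0)

def solve_delta(eta, sigma, tol=1e-10, max_iter=50):
    """Newton-Raphson for delta such that marginal_mean(delta, sigma) = h(eta)."""
    target = expit(eta)
    delta = eta  # the marginal predictor is a natural starting value
    for _ in range(max_iter):
        resid = marginal_mean(delta, sigma) - target
        # Derivative in delta uses h'(x) = h(x) * (1 - h(x)).
        slope = simpson(
            lambda z: expit(delta + sigma * z)
            * (1.0 - expit(delta + sigma * z)) * normal_pdf(z),
            -8.0, 8.0)
        step = resid / slope
        delta -= step
        if abs(step) < tol:
            break
    return delta

delta = solve_delta(eta=1.0, sigma=2.0)
print(delta, marginal_mean(delta, 2.0), expit(1.0))
```

When sigma is zero the conditional and marginal predictors coincide; for sigma greater than zero the solution is larger in absolute value than eta, which reflects the usual attenuation of marginal relative to conditional effects.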

$\Delta_{ij}$, thus obtained from (5-19), will be a function of the marginal mean parameters
$\beta$ and the random effects covariance parameters $\alpha$, and should be computed for both
the maximum likelihood and the estimating equation methodology (Heagerty, 1999). For
maximum likelihood estimation, the contribution of the ith subject to the observed data
likelihood is ascertained by first assuming a linear transformation of the form $b_i = C_i z_i$,
where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. The above transformation effectively
links $b_i$ to a lower dimensional random effect $z_i$. The contribution of the ith subject
(to the observed data likelihood) can now be expressed as a mixture over the random
effects distribution as

$$
\begin{aligned}
L_i(\beta, \alpha) &= \int \cdots \int \prod_{j=1}^{n_i} P(Y_{ij} = y_{ij} \mid b_i, X_i)\, f(b_i \mid X_i)\, db_i \\
&= \int \cdots \int \prod_{j=1}^{n_i} h(\Delta_{ij} + C_{ij} z_i)^{y_{ij}} \{1 - h(\Delta_{ij} + C_{ij} z_i)\}^{1 - y_{ij}}\, \phi_q(z_i)\, dz_i \qquad \text{(5-20)}
\end{aligned}
$$

where $C_{ij}$ denotes the jth row of $C_i$ and $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since $L_i(\beta, \alpha)$ cannot be
evaluated analytically, numerical procedures are required to find its value. Heagerty (2002)
used Gauss-Hermite quadrature to perform the calculation but assumed q = 1. With
increasing values of q, the computational burden increases exponentially and quickly
becomes infeasible. We are currently trying to develop alternative and less
computationally intensive methodologies to accomplish the above objectives. We are
working with multivariate logistic and multivariate t distributions within a Bayesian
framework as in O'Brien and Dunson (2004). We hope that this methodology will provide
a better alternative to the arduous numerical methods mentioned above.
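To make the q = 1 case of (5-20) concrete, the following sketch evaluates one subject's likelihood contribution with a three-point Gauss-Hermite rule for a standard normal weight (nodes 0 and ±√3, weights 2/3 and 1/6). Three nodes are far too few for real use and are chosen only so that the rule can be written down exactly; the function names and the input values are hypothetical. The helper `grid_size` shows why a tensor-product rule with K nodes per dimension becomes costly as q grows.

```python
import math
from itertools import product

# Three-point Gauss-Hermite rule for the N(0, 1) weight function.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)

def expit(x):
    """Inverse logit h(x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def likelihood_contribution(y, delta, c):
    """Quadrature approximation of (5-20) for a scalar random effect (q = 1).

    y, delta and c hold y_ij, Delta_ij and C_ij for j = 1, ..., n_i.
    """
    total = 0.0
    for z, w in zip(NODES, WEIGHTS):
        cond = 1.0  # conditional likelihood of the whole response vector at z
        for y_j, d_j, c_j in zip(y, delta, c):
            mu = expit(d_j + c_j * z)
            cond *= mu if y_j == 1 else 1.0 - mu
        total += w * cond
    return total

def grid_size(points_per_dim, q):
    """Number of function evaluations for a tensor-product rule in q dimensions."""
    return points_per_dim ** q

delta = [0.2, -0.5, 1.0]   # hypothetical Delta_ij values
c = [1.0, 1.0, 1.0]        # hypothetical C_ij values (random intercept)
print(likelihood_contribution([1, 0, 1], delta, c))
print([grid_size(20, q) for q in (1, 2, 5)])
```

Because the weights sum to one, the approximate contributions over all possible response patterns also sum to one, which gives a simple sanity check on the implementation.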









APPENDIX A
PROOF OF BAYESIAN EQUIVALENCE RESULTS

Proof of Theorem 1. Let Ydj (d = 0, 1;j = 1,..., J) be independently distributed as

Poisson(Adj) where

logAd = log/ + dlog9 + logj + d4' / Z(t)(t)dt (A-1)

Thus, the likelihood will be
1 J
L(ji, O,i6, f fA= i ( )exp(-Ad,)
d
and hence the log likelihood will be
1 J
1(p,0,, 6)= -{ydjlog(Ad)- Ad}
d= Oj 1
Now, replacing the expression of logAdj from (A-1) we have


1(p, ) = yyj (log+( dlog,' ogJ dq3' Zj(t)W(t)dt
d=0j=1l c
1 J 0
-ddjexpp (d' Zj(t)W (t)dt) (A-2)
d=Oj=1 -c

Differentiating (A-2) w.r.t p and 0 and solving the resulting equations we have


= EYyoj/CE and 0= J
>yoj 5jexp (q' Zj(t)(J(t)dt)
J J
Replacing the above expressions in (A-2) and then exponentiating, we obtain the

expression of L(6, 4) in (2-8).

Again, differentiating (A-2) w.r.t 6j, we have


J = d j 1...J (A-3)
5 Odexp d Zj(t)xW(t)dt)
d J-c
It is easy to show that if we replace (A-3) in (A-2) and then exponentiate, we get the

expression for L(O, 4) in (2-9). Since the order of maximization is immaterial, it follows

that, L(6, 4) and L(O, 4), once maximized over the nuisance parameters (0 and 6










respectively) yield the same profile likelihood of 0. Thus, inferences about the parameter
of interest 4 can be obtained using the prospective likelihood which has fewer nuisance
parameters than the retrospective one.
Proof of Theorem 2.(i) The posterior density of (0, 6, 4) is


(A-4)


J 1 J
p(0'6, 6,y) o p(4)fj 6 i-1 If (AdJ)~exp(-A,)
j 1 d Oj 1
Replacing the expression of Adj from (2-10), we have


p(O, 6,41y) oc


x


P() {Oexp Zjy(t)M (t)dt)} y+a-1
j-1
exp ([I +exp (1 Z' Z,(t)(t))dt)) 6


Integrating out 6, from the above expression, we have


p(0, |y) oc o ) F(y+j + aj) exp Z Wt)
j1 1 + exp(0/ Z (t)x(t) dt) d j

Now, performing the transformation from 0 to w yields expression (2-11).
J
(ii) First, we perform the transformation from 6 to (0, b), where = yj. Thus,
j=1
6j = Ojy, j = 1,...,J. The jacobian of transformation will be J-1.
Using this transformation in (A-4) and after some manipulation, we have


J
p(iO, 0, 4,\ly) oc p(a)Y+++-1-yI'+-1I ojY


a'-1exp ('YJ ZJ(t)W(t)dt
j 1 -c


x exp[


(A-5)




j= 1


,'exp Z (t) (t)dt)
\ -c /








Now, integrating (A-5) w.r.t 0 we obtain


J NJ
p(e, O, rly) ox p(-) J y,


j 1 c
x exp (' yi Zt(t)(dt 0 Y (A-6)
j 1 J -c j= 1
Integration of (A-6) w.r.t b yields (2-12) after some minor manipulation.

(iii) The order in which p(O, 6, |0y) is integrated w.r.t the parameters does not make
any difference in the marginal posterior density of p(0). Thus, integration of p(w, 01y)
w.r.t w or p(O, 01y) w.r.t 0 will yield the same marginal posterior density p(0|y) of 0.
Remarks :

1. As in Seaman and Richardson (2004), the assumption of existence and finiteness
of E (04' J Zq(t)W(t)dt and E 4' Z,(t)V(t)dt is automatically satisfied
provided the prior density p(O) ensures that E(O) exists and is finite.

2. The posterior propriety of p(O, 6, 0 y) in (A -10) can be shown in a similar way
to that in Seaman and Richardson (2001).

3. The prior distribution p(O) of 0 induces a prior distribution on the "influence
function" {1(t), -c < t < 0} in the logistic case-control model in (2 -3) since
7(t) = O'(t), -c < t < 0.

Proof of Theorem 3. Let D denote the disease status with r + 1 categories. As
before, let {X(t), -c < t < 0} be the exposure trajectory with support S =
{Z_1(t), ..., Z_K(t), -c < t < 0}, the set of all exposure trajectories.
Let $P(D = d \mid X(t) = Z_k(t), -c < t < 0) = p_{dk}$ (d = 0, 1, ..., r; k = 1, ..., K) and
$P(X(t) = Z_k(t), -c < t < 0 \mid D = 0) = \delta_k / \sum_{l=1}^{K} \delta_l$. Let $n_{dk}$ be the number of individuals
with D = d and X(t) = Z_k(t), -c < t < 0. It can be shown that

$$P(X(t) = Z_k(t), -c < t < 0 \mid D = d) = \frac{\delta_k\, p_{dk}/p_{0k}}{\sum_{l=1}^{K} \delta_l\, p_{dl}/p_{0l}}$$
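The identity for the exposure distribution given disease status can be checked numerically. The sketch below compares the formula in terms of the δ's and the p's with a direct application of Bayes' theorem. All numbers are made up for illustration (K = 3 trajectories, r = 2), and the function names are hypothetical.

```python
def exposure_given_disease(delta, p, d):
    """P(X = Z_k | D = d) from delta_k * p_dk / p_0k, normalised over k."""
    weights = [delta[k] * p[d][k] / p[0][k] for k in range(len(delta))]
    total = sum(weights)
    return [w / total for w in weights]

def exposure_given_disease_bayes(q, p, d):
    """Same quantity from Bayes' theorem, with q_k = P(X = Z_k)."""
    joint = [q[k] * p[d][k] for k in range(len(q))]
    total = sum(joint)
    return [v / total for v in joint]

# Hypothetical marginal exposure distribution and disease probabilities p[d][k].
q = [0.5, 0.3, 0.2]
p = [[0.7, 0.5, 0.2],
     [0.2, 0.3, 0.3],
     [0.1, 0.2, 0.5]]
# delta_k is proportional to P(X = Z_k | D = 0), i.e. to q_k * p_0k.
delta = [q[k] * p[0][k] for k in range(3)]

for d in range(3):
    print(d, exposure_given_disease(delta, p, d))
```

Only the ratios of the δ's matter, so any rescaling of `delta` leaves the result unchanged.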










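The prospective likelihood will be given by

$$L_P = \prod_{d=0}^{r} \prod_{k=1}^{K} p_{dk}^{\,n_{dk}}$$

while the retrospective likelihood will be

$$L_R = \prod_{k=1}^{K} \left( \frac{\delta_k}{\sum_{l=1}^{K} \delta_l} \right)^{n_{0k}} \prod_{d=1}^{r} \prod_{k=1}^{K} \left( \frac{\delta_k\, p_{dk}/p_{0k}}{\sum_{l=1}^{K} \delta_l\, p_{dl}/p_{0l}} \right)^{n_{dk}}$$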


Let Pdk/POk

rewritten as


K
idldk. Assuming Y Td/


1, we have Od


K
Sd/. Thus, Lp can be
= 1


r
since 1 = Pdk
d=O


rK K
L, = fn (Pdk POknd f(POk)=O n
d lk 1 k 1

= n f)(l Odldk)n1d (+ oedritdk')
d=1k=1 k=l \ d=l


Pok (1 + Y dildk LR can also be written as
d=1


\nd1
K K no r K
LR (kE n/ 1kldk
k= 1 /=1 d=1 k=l 1 ~ T r


since Pdk/POk = drldk.

The augmented model is given by Zdk Adk poisson(Adk) where


log(Adk) = Ig(d) + Iog(ldk)+ log(1 k),

log(AOk) = log(6k), d = 1 ..., r; k = 1 ..., K.

The prior distribution on the parameters is assumed to be


O( r \J ( K \-
d-1 k-1


(A-7)




n7









The likelihood for the augmented model will be


LA =exp(- Adk hAdk/ndk1)
d=O k=l d=O k=l
K r r K
cx exp (- 6k 1(i +zd)dk HH( ,'/:5). ndk
k=1 dl d=0 k=1

exp k d (+ dldk (6k)E0 d ndk fJ(od)1 kndk dk)n
Sk=1 d=1 k=l d=l d=l k=l
The posterior based on the augmented likelihood will be


n) r K ( ')
, 6, Idn) N LA k = d 1 7r )
(d-1 kl-


(A-8)


r \
Noting that /exp (k ((1 Odldk) (k)2= Oc
Sd )
we have, by integrating out 6 in (A-8),


-ld6k OC (


r K K r Yd0 do
7(7, dn) x n (Oddk) ndk i ddk
dN wk1 k=1 d= 1


S(rd=l 1
Now, integrating out (1, ..., r,) from (A-8), we have


r 0o n
- Z rdrldk
d= 1


(A-9)


( K K r K r '1 n,
T(7q,6 n) oc exp Z j k H(k)E-O nd-1 H( )(dk) ndk f N 6kd
Sk=1 k=l d=lk=l d=l k=l


Next, we make the transformation 6k = Ok and o

the prior distribution in (A-7) becomes


K
6i having jacobian '-1. Hence
= 1




(A-1 0)


( r ( K
d=1 k=1









Using the above transformation, (A-10) can be rewritten as



d= Ok=1 d= lk=1
r K k= ndkK

z [exp(_ )(5)Z H ) 'o1 ( i (]fio" K no(- 1) ( Hifio )k
Sr K r
kd 7Onk 1 \x nd
d=1 k=) I k=1 d=1k=
r K k n ndkK




r ld=1 \k=l \1




d LR ( 1 71(k ) (A-13)
Integrating out (/ from (A-11), we have










From (A-9) and (A-12), it is clear that posterior inference for the parameter
of interest, η, remains the same under either the prospective likelihood $L_P$ or the
retrospective likelihood $L_R$ as long as the posterior is proper. It can be shown that
the posterior will be proper for any proper prior for η if $n_{0k} > 1$ for all k = 1, ..., K.










APPENDIX B
PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS
B.1 Univariate Small Area Model
The proof of posterior propriety for the basic univariate semiparametric model
(Model I) is outlined below. The necessary changes to the proof for the random walk
model are mentioned at the end.

Proof of Theorem : The basic parameter space is Q = (0, 0, 7, b, o-, o {,...,)
where 0 = (0'i,..., 0')' and b = (bl,..., b,)'. Let

S= ... / p(|Y, X, Z)dQ

= ... {L(Yi j i)L(0i j,-7, bi, d Xi, Zi)L(bi b )I L(, i7 (0)7(07b) ( 7) 7 ( j)df
i=l j=1
(B-1)

We have to show that I < M, where M is some finite positive constant.
Integrating first w.r.t. β, we have

/ = w(/3) [L(O /3, b ,2, Xi, Zi)d/

= exp[- (Oi- Xi/3 Z -- bil)'W- (Oi- Xi3 Zi7 bil)] d
i

= X we Xi exp W'-Wi +
(B-2)

where Q = wlx ) xl 1X ) X(Z Xl -~ W Wi = 0 Z,7 b1l
and V-1 = diag(b-2, .b-2 .. 2).
Now, W -1'Wi, = W '-1/2-1/2Wi = S'S, where Si = X-1/2W,. Similarly,
W/ -1Xi = S'Ti, X'V-JWi = TtSi and X'V- Xi = T'T, where T, = X-1/2X,.











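On replacing these, the expression in the exponent of (B-2) becomes

$$
\begin{aligned}
&-\frac{1}{2}\left[\sum_i S_i'S_i - \Big(\sum_i S_i'T_i\Big)\Big(\sum_i T_i'T_i\Big)^{-1}\Big(\sum_i T_i'S_i\Big)\right] \\
&\quad = -\frac{1}{2}\left[S'S - S'T(T'T)^{-1}T'S\right] \\
&\quad = -\frac{1}{2}\, S'\left[I - T(T'T)^{-1}T'\right]S = Q, \text{ say}
\end{aligned}
$$

where $S = (S_1', \ldots, S_m')'$ and $T = (T_1', \ldots, T_m')'$. Since $I - T(T'T)^{-1}T'$ is symmetric and
idempotent, $S'[I - T(T'T)^{-1}T']S$ is non-negative, implying $Q \le 0$ and thus $\exp(Q) \le 1$.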
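The bound on the exponent rests on the residual projection matrix being symmetric and idempotent. The sketch below checks this numerically in the simplest case where T has a single column, so that (T'T)^{-1} is a scalar; this restriction and the numeric vectors are assumptions made only to keep the example free of matrix inversion.

```python
import math

def residual_projection(t):
    """Return M = I - t (t't)^{-1} t' for a single column t (list of floats)."""
    n = len(t)
    tt = sum(v * v for v in t)
    return [[(1.0 if r == c else 0.0) - t[r] * t[c] / tt for c in range(n)]
            for r in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def quadratic_form(s, m):
    """Compute s' m s."""
    n = len(s)
    return sum(s[r] * m[r][c] * s[c] for r in range(n) for c in range(n))

t = [1.0, 2.0, -1.0, 0.5]   # hypothetical stacked T
s = [0.3, -1.2, 2.0, 0.7]   # hypothetical stacked S
M = residual_projection(t)
MM = matmul(M, M)
Q = -0.5 * quadratic_form(s, M)
print(Q, math.exp(Q))
```

Since M M = M and M is symmetric, s'Ms equals the squared length of Ms, which is why Q cannot be positive.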
Next, we consider integration w.r.t b2 i.e


x/ = / 1 I/2 1 ( :)-"n/2-c exp(-dj/)d d ... b
j=1
t
"'. I Y X.' ,-2X' 1-1/2 | (,b2)-m/2-cj lexp(-dJ/ 2)d 2 ..."d 2
ij j=1
(B-3)

Assuming ,,x = max( ,..., t), we have, Vj 1,..., t 2 > m-x X .,'2X, >

X Y,' X X -2X> x i J XyX' and thus


(B-4)


I. X X, 1-12 _< )(p+l)P /2 Xj X 1-1/2
,J iJ


Combining (B-3) and (B-4), we have


Assuming 2x,


S<


'7 for some k e [1,..., t], we have,

X 1-1/ ... [ (,') P -ck exp(- )d'7]


2.. .d ,

Sck) t F(m/2 c 2)
S/ -2 = W, sa(B-5)
j=1,jk jk




x () )-3r-c~+lexp d)d
j=ljk k
S12 ((m p 5)/2
I(m-p-5)/2+ck
'J "/<


/ < i xx 1-1/2 ... ( ax)( 2 ( )-m/2-cj+exp(-dj/ )d ...d
; ; j ... j W 1








where W is finite if (m p 5)/2 + ck > 0, dk > 0, m/2+ cj 2 > 0 and dj > 0 for
j = 1, ..., t;j / k.
Combining (B-1) and (B-5), we have

/ < W I ... Jf {L{(Y, ,)L(ba) } L(i) ~) ()di2* (B-6)

where f* = (0 3 b). Since all the components of the integrand in (B-5) have proper
distributions, the above integral would be finite thus proving posterior propriety.
For the random walk model, the integrand in (B-1) will have an additional likelihood
term nli L(vI vjv_, oi) and a prior term 7(ao2). The derivation would then proceed
exactly as above and the integrand in (B-5) will also contain these additional terms. But
since both of these are proper distributions (normal and inverse gamma respectively), I
will still be finite under the conditions stated in the theorem.
B.2 Bivariate Small Area Model
The proof of posterior propriety for the bivariate semiparametric model is outlined
below.

Proof of Theorem : Here, the parameter space is Q* = (0, 0, 7, b, Zo, -7, {( ,.... }).
Here also, due to the same logic as in the univariate case, we just need to show

I p()p(0| y, b, { l,..., J})d3 < oo

/ ( ,,- ,/ ,6
or, J exp( (, X.3 Z71 b),) (0 X.3 Z>7 bi) df < oo (B-7)

in order to prove posterior propriety.
Using the same type of algebraic manipulations as in the univariate case, the L.H.S
of (B-7) can be shown to be

| X
-X WX/- exp W./WW (B-8)
2J /J











where Q = ( WId'X X ( XyWX. 'x ( XyJ--i VW, and W, = O, Z -
ij ij j
bi.
As before, the expression within the exponent in (B-8) can be rewritten as

K* = -\ ( STS-( U) (5 TTU) (5T/Su)
iJ j j i,j
S S' [I- T(T'T)-T']S < 0.
2

where S = (S' ..., S')', T = (T ,..., T')', S, = V1/2W, and T, = V/2X,.
Thus,

exp- W.' W -1 + < 1 (B-9)
/,J
So, in order to prove posterior propriety, we have to show
./ /. t (m d r 1) [ V trace J /)-1-l
/ = ... I| Xolx. -1/2 1 f'- 1 2 exp -trace 2 d 1...d 1
ij j=j 1
< oo (B-10)

Here r is the order of j,j = 1, 2,..., t. (r = 2 in our case).
Let A1, Aj2,..., Ajr be the distinct eigen values of .l,j = 1, 2,..., t. Since ,j is a
variance covariance matrix, it is positive definite and symmetric. Hence, W-1 also has
the same properties. Thus, Ajk > 0, Vk = 1, 2,... r.
Now, Vj = 1,2, ..., r,

1JX 1 > Aj- YIr where Aj'" = min(A, A2..., Ar).

y XuW JfX' > Y A7in X isrXje

yx,,J u l'X. > Am/in5X .X where Am1n = min(A'.

SX,,-,'X Amin Y XX is non-negative definite.
ij ij









Thus, we have


| XV Y -1X' I > I A min XyUX |
=| Z> X' |-, < I(AmnZ xJ- x I

=11. -x, xx/ 1-1/2 < min pxq2
^ ^-V<(A) 2 YLXUXW


Since I W 1 = J Ajk, V 1... t
k=1

(m+d,-r-1) (m+d,-r-1)
1 1 2 I (A k) 2
k=1


Now, replacing (B-11) and (B-12)

S< xx | .('-n)-
/ | XUX'y | .. (A in


where T denotes "trace". Let Am"n

Then, I < /1 x 2I where


in the expression of I in (B-10), we have

p+q+2 t r (mdj-r-1) V1 J-1
2 H (,Ajk) 2 exp -T 2( d ...d2
j1( k=1
(B-13)


= Aim, / [1 ..., t]; m [1 ..., r].


i f (m +d-r-1) (m+dp-p-q-2)-r-
l1 = I> XyX 2 (Ak) 2 (AIn) 2
i, {k= 1,k m}
Sp q 2 (m+d -p-q-2)-r-1
= | XyX'- | n (A/,k) 2 1 |--1 2
ij {k=l,k m}
and


1exp -T(V) 2 ) d


exp -T (V2)] d1


t m .n dj -r-1
/2 { 2 -Td md(VJ Fr-1d-..}


2r 2-d 2 2
f= 1J7i}

which is finite.

Thus, in order to show posterior propriety, we have to prove that /2 < oo.




(B-11)


(B-12)


(B-14)









Let us consider the integral


exp -T (


2- d1 (B-15)


p q 2 (m+d/-p-q-2)-r-1
/*= = (A k) 2 l1 2-
{k=1,k m}


By the AM-GM inequality, we have,


k=km
{kZl, k4m}


Ak < (r


A/1,km}


< (r


1- r

-1 l
{k=l,kjm}


)r-1
r l 1


{k =,k4m}


( k=k)
S{k l,k4m}


r r
k A/k < km Ik
{k=1,k m} k=1


p+q+2
2


< r1


r (r-1)(p+q+2)

k=1


(r-1)(p+q+2)
trace(2
tr race(1 1) (r
r-l 1 r


1-
1
k-1
k=l


where 'm denotes the kth diagonal element of I -1.

Since v 1 has a Wishart distribution, ,'m o- kkX< ,, (k

b g 1 a 1 w
Combining (B-1 5) and (B-1 6), we have,


1 r

k= 1


(r-1)(p+q+2)
,() 2
/lii~l


(r-1)(p+q+2)
2


(B-16)


1,..., r) implying that


(m+d -p-q-2)-r- 1
| 71 | 2 exp


(r-1)(p+ q2)
1 ) 2


S l

v k=l 1


(r-1)(p+q+2)
2


(m+di-p-q-2)-r-1
Il |1 1 2


exp -T (V/, ) dW


r (r-1)(p q+ 2)
C E (l) 2
k=l


where C = (
G


1 1


(r-1)(p+q+2)
2




=> ({k


Since V k


1)(p+q+2)
2


1, ..., r, A/k > 0,


S<



G r(


T( 2 d
2










the expectation being taken over the Wishart pdf.


2
S (r-1)(p q 2)

k=1


(r-1)(p+q+2)

bk) 2


(r


< ((r


q2) r
-q+2) )


q+2) r


S ,M (r-1)(p+q+2)
k=1


(E ( () (r-1)(p q 2)
k= k 2 ( 17)
k -


which is finite because


(r-1)(p+q+2)
! k ) 2 < 00 V


kr (r-(p+q2)
k-1


SE(r-1)(p q 2) < 00

(k=-


(r-1)(p+ q2)
S2)


Thus /* is finite implying posterior propriety.


Now,


'< 00


k = 1, ..., r


< 00


< 00


(B-18)


r
4 E
k=-1


r
> E
( k=1









APPENDIX C
FULL CONDITIONAL DISTRIBUTIONS

C.1 Semiparametric Case Control Model

The full conditional distributions of the parameters for the semiparametric case control
model are as follows :

1. [/pa, A, b, A,, a,, Y, D, a] ~ N(M3, V) where
/j + N n N -1
V- ( YZZ Pp,(a,)cp, (a +.)' I AIM 'M:) ,
e l j1 =i=1
N n, N
M3 = Vp ( Y ,,(au)(yd q(aP)b,) + AMc(Zi a bQ),
Je j=1 i= 1
and Zp is the p + K + 1 order prior variance-covariance matrix of 3.

2. [Zia, 4, bi, A,, Di] ~ N(a + O'MI + b'Qi, A, ) truncated at the left (right) by 0 if
Di = 1(Di = 0), i = 1,..., N.

3. [bil|, a, ,, A, ao, ao, Y, D, a] N(Mb, b)( 1,..., N) where

v = (zb + a (q,(a,).q,(a,)'+ AQ,/')Q ,
e =1

Mb = vb V q,,(a,)(y. _.(a)') + Ai (Z a 'Mi) ,

and Zb is the q + M + 1 order variance-covariance matrix of b.

4. [0*| b, A, ao, Y,D,a] ~ N(Ma,, V,) where* = (a, 4)',
/ N -1
V. = + (1, 0'M + b ,Q,)A,(1, 0'M, + b/Q,) ,
i 1
N
Mo* = A i (1, 'Mi + yQj)/Zi

and Z,* is the r + K* + 2 order variance-covariance matrix of (a, 0).

5. [Ail, a, b,Y, D,a] ~ G( +, v + (Z, a -'M -bQ ) where


6. [(r)-11, a, b, Y, D, a] G( 1, b) j= 0 q.
(i 12











7. [(,7a)-11,, ,b,Y,D,a] G nI
i= 1
q,,(ay)'bi) .


N ni
1, (yui p,,(aij)'
i= 1 j=1


8. [(,72)-110, a, (K + 1 K )
8. [(V )- |4, b, Y, D, a] G 2-+1, Yr *
k=1
9_ 1M N q+M 2
9. [()- bY, D,a]G -+1,+ Y b .




i=1 j= q1
K'
10. [(r )-ll|, ,0, bY, D,a] G +-1,2-Y
k=1
Here, G(x, y) denotes a Gamma density with shape parameter x and rate parameter y.
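Item 2 of the list above requires draws of the latent variable Zi from a normal distribution truncated at 0, to the left for cases (Di = 1) and to the right for controls (Di = 0), in the spirit of the data augmentation of Albert and Chib (1993). A minimal inverse-CDF sketch of such a draw is given below. The function name, the argument names and the numeric values are hypothetical; the mean and standard deviation would be computed from the current values of the other parameters in the sampler.

```python
import random
from statistics import NormalDist

def draw_latent(mean, sd, is_case, rng):
    """Draw Z ~ N(mean, sd^2) restricted to Z > 0 (case) or Z <= 0 (control).

    Uses the inverse-CDF method. The clipping below guards against
    probabilities of exactly 0 or 1; it is not reliable when the
    truncation region has extremely small probability.
    """
    dist = NormalDist(mean, sd)
    p_zero = dist.cdf(0.0)          # P(Z <= 0) before truncation
    u = rng.random()
    if is_case:
        prob = p_zero + u * (1.0 - p_zero)
    else:
        prob = u * p_zero
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(prob)

rng = random.Random(2010)
cases = [draw_latent(0.3, 1.0, True, rng) for _ in range(1000)]
controls = [draw_latent(0.3, 1.0, False, rng) for _ in range(1000)]
print(min(cases), max(controls))
```

The inverse-CDF approach is chosen here because it needs only the standard library; rejection or specialised tail samplers are preferable when the truncation point lies far in the tail.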
C.2 Semiparametric Small Area Models
C.2.1 Semiparametric Univariate Small Area Model

The full conditional distributions of the parameters for the univariate semiparametric
small area model are as follows :

1. [O, 3 2,7 2,b, X, Z] ~ N(Mb, V) where

--( --1 1a 1 1 Y (X -- Z -- hi b )
V = + and Mo + 6 6

2. [bi /3, b, N(, 2a, X, Z] N(Mb, Vb) where

+ = and M = (0 X'..- Z' .) .
b j= 1 j= 1 j= 1

3. [3|7, 0, b, b2, X, Z] ~ N(M3, V3) where

V,3= ( and M,3= (- (m X-.))
i 1= 1 i=i 1 = 1 i= 1 j= 1

4. [y|/, b, b,2, 2 X, Z] ~ N(M7, V.) where

V= --+'/) and M. = -( ,Z 11 d-).











5. [(a,)- G|7] ~ G c 7 /7+ d


6. [0(@ )-|, -,y ,b, X, Z] G cji (2 (Ou X- Z' b) )
i 1

7. [()-b] G c, Y d+
i=1
Here G(a, b) denotes a gamma distribution with shape = a and rate = b.

C.2.2 Univariate Random Walk Model

The full conditional distribution of the parameters for the semiparametric random walk

model will follow similarly as above. In this case, v and ao will have normal and inverse

gamma full conditionals respectively while the full conditionals of the other parameters

will depend on v.

C.2.3 Bivariate Random Walk Model

The full conditional distributions of the parameters for the bivariate random-walk model
are as follows :

1. [0o| ,b,X, Z] ~ N(M,, Vj) (i = l... m,j = 1,..., t) where

V = ( -u I,1)-- and
M = (-,1 + 1 (-Y + J(X + Z(X 0 + bi + v,)).

2. [bi, 3,y, 0, '1, ao, X, Z] ~ N(M Vib) where

V=b _(_ 1 -0) and
J )-1

J J


3. [/3P 0, b, 1, X, Z] N(M3, V3) where

V3( = X-IY;) and

1
4, 8, b, X,)

4. [7|y3, 0, b, 11, E7, X, Z] ~ N(M V-1) where


Z7 bi- v).












M-y (Zzu 6zs7Z"

MVY = 6Z> Zy
U


and


Zu q 1(06
- E zi-/
';


X'/ b v,)


5. [vl, /3,, 0, b, I, Ev, v2, X, Z] ~ N(MA ZE) where


-1 and


Mv m\q!-1
M~


V ')


6. [vjlP,7, 0, b, Zv, vj_, vj+, X,Z] ~ N(M,Z) (j = 2,... t 1) where


- = (m l -

M = (m1j


2-)- and


2 Z V


7. [vt /3,7, I, b, v, Vt-, X, Z] ~ N(Mtv, Z) where


v m\q-1
t t

Mv m\q!-1
t t


an-1
Zv1) and


-1)


(q i tt


A = S,+ (0, -X' -Z bi


9. [ZEv] ~/W(S


10. [Zo b]~ /W so


11. [ZE,-7]~ /W(S,.


v)(0e, x, -


1,..., t) where
Z b ,v)'


assuming vo


- bib', do


77', d. + 1)




"(q jOil


- X Zi,"7 bi) + EvV .


q( (06


X'/ Z'7 bi) + l(v,+ + v,)).


X/ Zt, bi) +


8. [\| 7, b, V, vt_-, X, Z] ~ /W(Aj, d +m) (j


(v v-1)(v viy1)', dv + t)
I









REFERENCES


Agresti, A. (2002). Categorical data analysis. Wiley.

Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response
data. Journal of the American Statistical Association 88, 669-679.

Altham, P. (1971). The analysis of matched proportions. Biometrika 58, 561-576.

Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for
case-controlled studies in cancer epidemiology. Statistician 42, 385-389.

Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction
of county crop areas using survey and satellite data. Journal of the American
Statistical Association 83, 28-36.

Bell, W. (1999). Accounting for uncertainty about variances in small area estimation.
Bulletin of the International Statistical Institute.

Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting.
Computational Statistics and Data Analysis 52, 5100-5120.

Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference.
Journal of Computational and Graphical Statistics 6, 314-331.

Breslow, E. T. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1.
International Agency for Research on Cancer, Lyon.

Breslow, E. T., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978).
Estimation of multiple relative risk functions in matched case-control studies.
American Journal of Epidemiology 108, 299-307.

Breslow, N. (1996). Statistics in epidemiology : The case-control study. Journal of the
American Statistical Association 91, 14-28.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case
control studies. Journal of the American Statistical Association 90, 157-169.

Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage
of free prostate-specific antigen to enhance differentiation of prostate cancer from
benign prostatic disease : A prospective multicenter clinical trial. Journal of the
American Medical Association 19, 1542-1547.

Cornfield, J. (1951). A method of estimating comparative rates from clinical data:
applications to cancer of the lung, breast, and cervix. Journal of the National Cancer
Institute 11, 1269-1275.

Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for
experimentally uncontrolled variables. Bulletin of the International Statistical Institute
38, 97-115.











Datta, G., Ghosh, M., Nangia, N., and Natarajan, K. (1993). Estimation of median
income of four-person families : A Bayesian approach, in W.A. Berry, K.M. Chaloner
and J.K. Geweke (Eds.), Bayesian Analysis in Statistics and Econometrics, pages
129-140.

Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal
of the Royal Statistical Society, Series B 60, 333-350.

Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal
data, 2nd Edition. New York : Oxford University Press.

Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched
case-control data. Biostatistics 1, 89-109.

DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free knot
splines. Biometrika 88, 1055-1071.

Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject
specific curves for longitudinal data. Statistics in Medicine 00, 1-24.

Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties.
Statistical Science 11, 89-121.

Ericksen, E. and Kadane, J. (1985). Estimating the population in census year : 1980 and
beyond (with discussion). Journal of the American Statistical Association 80, 98-131.

Escobar, M. and West, M. (1995). Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association 90, 577-588.

Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the
time dimension in receiver operating characteristic curves : A case study of prostate
cancer. Medical Decision Making 19, 242-251.

Eubank, R. (1988). Spline smoothing and nonparametric regression. New York : Marcel
Dekker.

Eubank, R. (1999). Nonparametric regression and spline smoothing. New York : Marcel
Dekker.

Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman
and Hall.

Fay, R. (1987). Application of multivariate regression to small domain estimation, in R.
Platek, J.N.K. Rao, C.E. Särndal, and M.P. Singh (Eds). Small Area Statistics.

Fay, R. and Herriot, R. (1979). Estimation of income from small places : an application
of James-Stein procedures to census data. Journal of the American Statistical
Association 74, 269-277.











Fay, R., Nelson, C., and Litow, L. (1993). Estimation of median income of four-person
families by state, in Statistical Policy Working Paper 21, Indirect Estimators in Federal
Programs.

Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics
19, 1-141.

Gelfand, A. and Ghosh, S. (1998). Model choice : A minimum posterior predictive loss
approach. Biometrika 85, 1-11.

Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal
densities. Journal of the American Statistical Association 85, 398-409.

Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple
sequences (with discussion). Statistical Science 7, 457-511.

Ghosh, M. and Chen, M.-H. (2002). Bayesian inference for matched case control
studies. Sankhya, B 64, 107-127.

Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person
families : A Bayesian time series approach. Journal of the American Statistical
Association 91, 1423-1431.

Ghosh, M. and Rao, J. N. K. (1994). Small area estimation : An appraisal. Statistical
Science 9, 55-76.

Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating
equations. Biometrika 63, 277-284.

Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian
model determination. Biometrika 82, 711-732.

Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear
models : a roughness penalty approach. Chapman and Hall/CRC.

Gustafson, P., Le, N., and Valle, M. (2002). A Bayesian approach to case-control studies
with errors in covariables. Biostatistics 3, 229-243.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics : The
approach based on influence functions. Wiley.

Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting.
Australian and New Zealand Journal of Statistics 2, 205-224.

Heagerty, P. (1999). Marginally specified logistic normal models for longitudinal binary
data. Biometrics 55, 688-698.

Heagerty, P. (2002). Marginalized transition models and likelihood inference for
longitudinal categorical data. Biometrics 58, 342-351.









Henderson, C. (1950). Estimation of genetic parameters (abstract). Annals of
Mathematical Statistics 21, 309-310.

Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated
measures and event times. Statistics in Medicine 16, 239-257.

Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics : Handling
drop-out in longitudinal studies. Statistics in Medicine 23, 1455-1497.

Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test
15, 1-96.

Johnson, V. (2004). A Bayesian χ² test for goodness-of-fit. Annals of Statistics 32,
2361-2384.

Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The
increased risk of venous thromboembolism and the use of third generation
progestagens : Role of bias in observational research. Contraception 54, 5-13.

Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for
proportional hazards models. Biometrics 62, 803-812.

Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of
Computational and Graphical Statistics 8, 333-352.

Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic
regression with missing covariates. Biometrics 54, 295-303.

Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley
& Sons.

MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models.
Journal of Computational and Graphical Statistics 2, 223-238.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from
retrospective studies of disease. Journal of the National Cancer Institute 22, 719-748.

Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7,
1223-1230.

Morris, C. (1983). Parametric empirical Bayes inference : theory and applications.
Journal of the American Statistical Association 78, 47-54.

Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian
hierarchical approach for combining case-control and prospective studies. Biometrics
55, 858-866.

Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control
studies with errors in variables. Biometrika 84, 523-537.











Nurminen, M. and Mutanen, P. (1987). Exact Bayesian analysis of two proportions.
Scandinavian Journal of Statistics 14, 67-77.

O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics
60, 739-746.

Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Non-parametric small
area estimation using penalized spline regression. Journal of the Royal Statistical
Society, Series B 70, 265-286.

Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing
covariates. Applied Statistics 49, 145-156.

Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies.
Biometrika 91, 321-330.

Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case control
studies. Biometrika 66, 403-411.

Rao, J. N. K. (2003). Small Area Estimation. Wiley-Interscience, New York.

Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched
case-control studies with missing covariate data. Biometrika 89, 905-916.

Robinson, G. (1991). That BLUP is a good thing : the estimation of random effects.
Statistical Science 6, 15-31.

Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to
case-control studies with errors in covariables. Journal of the American Statistical
Association 91, 722-732.

Roy, J. (2003). Modeling longitudinal data with non-ignorable dropouts using a latent
dropout class model. Statistics in Medicine 59, 829-836.

Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for
nonignorable dropouts with many possible dropout times. Biometrics 64, 538-545.

Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130-134.

Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of
Computational and Graphical Statistics 11, 735-757.

Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting.
Australian and New Zealand Journal of Statistics 42, 205-224.

Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge
University Press, Cambridge, U.K.

Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression
models with missing covariates. Biometrics 56, 384-388.











Satten, G. and Kupper, L. (1993). Inferences about exposure-disease associations using
probability-of-exposure information. Journal of the American Statistical Association
88, 200-208.

Schildcrout, J. and Heagerty, P. (2007). Marginalized models for moderate to long series
of longitudinal binary response data. Biometrics 63, 322-331.

Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies
with categorical covariates. Biometrika 88, 1073-1088.

Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective
models in the Bayesian analysis of case-control studies. Biometrika 91, 15-25.

Sinha, S., Mukherjee, B., and Ghosh, M. (2004). Bayesian semiparametric modeling for
matched case-control studies with multiple disease states. Biometrics 60, 41-49.

Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B., and Carroll, R. (2005). Semiparametric
Bayesian analysis of matched case-control studies with missing exposure. Journal of
the American Statistical Association 100, 591-601.

Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines
and their tensor products in extended linear modeling. The Annals of Statistics 25,
1371-1470.

Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional
Conference Series in Applied Mathematics.

Wand, M. (2003). Smoothing and mixed models. Computational Statistics 18, 223-249.

Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall.

Zelen, M. and Parker, R. (1986). Case control studies and Bayesian inference. Statistics
in Medicine 5, 261-269.

Zhang, D., Lin, X., and Sowers, M. (2007). Two stage functional mixed models for
evaluating the effect of longitudinal covariate profiles on a scalar outcome. Biometrics
63, 351-362.

Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot
selection schemes. Journal of the American Statistical Association 96, 247-259.









BIOGRAPHICAL SKETCH

Dhiman Bhadra received his Bachelor of Science in statistics from Presidency
College, Calcutta (India) in 2002 and his Master of Science in statistics from Calcutta
University in 2004. He joined the Department of Statistics at the University of Florida in
January 2005 to pursue a PhD in statistics. He plans to graduate in August 2010.








ACKNOWLEDGMENTS

I had the good fortune to be a student at the Department of Statistics at University of Florida. It is here that I came in close contact with some of the preeminent statisticians of the day and learnt a lot from them. I deeply acknowledge the tremendous help, encouragement and endless support that I received from my advisor Prof. Malay Ghosh, my co-advisor Prof. Michael J. Daniels and Prof. Alan Agresti throughout the highs and lows of doing my research work. They not only taught me statistics or the art of writing papers or solving problems - they introduced me to the spirit of discovery and the joy of learning, something that will stay with me forever and would motivate me in ways I can never imagine. However the list doesn't end here since each and every member of the faculty opened up new doors for me through which knowledge flowed past and enriched me along the way. My endless gratitude to each and every one of them. I also wish to thank Prof. Bhramar Mukherjee (currently at the Department of Biostatistics at University of Michigan) for her help and inspiration over the years. Last but not the least, my unending gratitude to my mother whose sacrifice, unconditional love and blessing was always with me, guiding me along the way. I would end by conveying my deepest respect to the memory of my father - he was there with me always throughout this journey.

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ..... 4
LIST OF TABLES ..... 8
LIST OF FIGURES ..... 9
ABSTRACT ..... 10

CHAPTER

1 INTRODUCTION ..... 13
  1.1 Overview of Dissertation ..... 13
  1.2 Review of Case-Control Studies ..... 14
  1.3 Review of Small Area Estimation ..... 21
  1.4 Non-Parametric Regression Methodology ..... 25

2 BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES WITH TIME VARYING EXPOSURES ..... 31
  2.1 Introduction ..... 31
    2.1.1 Setting ..... 32
    2.1.2 Motivating Dataset: Prostate Cancer Study ..... 34
  2.2 Model Specification ..... 35
    2.2.1 Notation ..... 35
    2.2.2 Model Framework ..... 35
  2.3 Posterior Inference ..... 40
    2.3.1 Likelihood Function ..... 40
    2.3.2 Priors ..... 40
    2.3.3 Posterior Computation ..... 41
  2.4 Bayesian Equivalence ..... 42
  2.5 Model Comparison and Assessment ..... 46
    2.5.1 Posterior Predictive Loss ..... 46
    2.5.2 Kappa statistic ..... 47
    2.5.3 Case Influence Analysis ..... 48
  2.6 Analysis of PSA Data ..... 49
    2.6.1 Constant Influence Model ..... 50
    2.6.2 Linear Influence Model ..... 51
    2.6.3 Overall Model Comparison ..... 53
    2.6.4 Model Assessment ..... 54
  2.7 Conclusion and Discussion ..... 55

................... 59
  3.1 Introduction ..... 59
    3.1.1 SAIPE Program and Related Methodology ..... 59
    3.1.2 Related Research ..... 61
    3.1.3 Motivation and Overview ..... 62
  3.2 Model Specification ..... 65
    3.2.1 General Notation ..... 65
    3.2.2 Semiparametric Income Trajectory Models ..... 66
      3.2.2.1 Model I: Basic Semiparametric Model (SPM) ..... 66
      3.2.2.2 Model II: Semiparametric Random Walk Model (SPRWM) ..... 67
  3.3 Hierarchical Bayesian Inference ..... 68
    3.3.1 Likelihood Function ..... 68
    3.3.2 Prior Specification ..... 68
    3.3.3 Posterior Distribution and Inference ..... 69
  3.4 Data Analysis ..... 70
    3.4.1 Comparison Measures and Knot Specification ..... 71
    3.4.2 Computational Details ..... 72
    3.4.3 Analytical Results ..... 73
    3.4.4 Knot Realignment ..... 74
    3.4.5 Comparison with an Alternate Model ..... 78
  3.5 Model Assessment ..... 80
  3.6 Discussion ..... 82

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES: A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH ..... 85
  4.1 Introduction ..... 85
    4.1.1 Census Bureau Methodology ..... 85
    4.1.2 Related Literature ..... 87
    4.1.3 Motivation and Overview ..... 89
  4.2 Model Specification ..... 90
    4.2.1 Notation ..... 90
    4.2.2 Semiparametric Modeling Framework ..... 91
      4.2.2.1 Simple bivariate model ..... 91
      4.2.2.2 Bivariate random walk model ..... 92
  4.3 Hierarchical Bayesian Analysis ..... 93
    4.3.1 Likelihood Function ..... 93
    4.3.2 Prior Specification ..... 94
    4.3.3 Posterior Distribution and Inference ..... 94
  4.4 Data Analysis ..... 95
    4.4.1 Comparison Measures and Knot Specification ..... 96
    4.4.2 Computational Details ..... 97
    4.4.3 Analytical Results ..... 98
  4.5 Conclusion and Discussion ..... 102

.................... 104
  5.1 Adaptive Knot Selection ..... 105
  5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent Class and Transitional Modelling ..... 107
    5.2.1 Introduction and Brief Literature Review ..... 107
    5.2.2 Modeling Framework ..... 110
    5.2.3 Likelihood, Priors and Posteriors ..... 114
    5.2.4 Specification of Priors ..... 117

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS ..... 122

B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS ..... 128
  B.1 Univariate Small Area Model ..... 128
  B.2 Bivariate Small Area Model ..... 130

C FULL CONDITIONAL DISTRIBUTIONS ..... 135
  C.1 Semiparametric Case Control Model ..... 135
  C.2 Semiparametric Small Area Models ..... 136
    C.2.1 Semiparametric Univariate Small Area Model ..... 136
    C.2.2 Univariate Random Walk Model ..... 137
    C.2.3 Bivariate Random Walk Model ..... 137

REFERENCES ..... 139

BIOGRAPHICAL SKETCH ..... 145

LIST OF TABLES

Table page

1-1 A typical 2×2 table ..... 15
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model ..... 52
2-2 Posterior means and 95% confidence intervals of odds ratio for I = (10, 5) for the linear influence model ..... 53
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots ..... 54
3-1 Parameter estimates of SPRWM with 5 knots ..... 74
3-2 Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment ..... 77
3-3 Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates ..... 77
3-4 Parameter estimates of SPM(5) ..... 78
3-5 Parameter estimates of SPRWM(5) ..... 78
3-6 Comparison measures for time series and other model estimates ..... 79
4-1 Comparison measures for univariate estimates ..... 99
4-2 Percentage improvements of univariate estimates over Census Bureau estimates ..... 99
4-3 Comparison measures for bivariate non-random walk estimates ..... 100
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates ..... 101
4-5 Comparison measures for bivariate random walk model ..... 102

LIST OF FIGURES

Figure page

2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age. ..... 36
2-2 Sensitivity of 1, 0, 1 and disease probability estimates to case-deletions. ..... 56
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income). ..... 63
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. ..... 65
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom. ..... 75
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment. ..... 76
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a χ² distribution with 9 degrees of freedom. ..... 81

Case-control studies and small area estimation are two distinct areas of modern Statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease with the aim of capturing disease-exposure association, especially for rare diseases. The latter area is concerned with the measurement of characteristics of small domains - regions whose sample size is so small that the usual survey based estimation procedures cannot be applied in the inferential routines. Both these areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by Federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation.

The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e. case-control studies for which time varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status.

The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also a state specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time varying pattern in the income observations. Our estimates tend to have better performances than those provided by the Census Bureau and also have


Eilers and Marx 1996).

In Chapter 2, I present an analysis of a case-control study when longitudinal, time varying exposure observations are available for the cases and controls. Semiparametric regression procedures are used to flexibly model the subject specific exposure profiles and also the influence pattern of the exposure profiles on the disease status. This enables us to analyze whether past exposure observations affect the current disease status of a subject conditional on his/her current exposure condition. The proposed methodology is motivated by and applied to a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. We also show the details of the hierarchical Bayesian implementation of our models and some equivalence results that have enabled us to use a prospective modeling framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for estimating the median household income of small areas when area-specific longitudinal income observations are available. Our models include a nonparametric functional

Chapter 4 deals with an extension of the methodology in Chapter 3 where a bivariate semiparametric procedure has been used to estimate the median income of families of varying sizes across small areas. This can also be seen as an extension of the time series modeling framework of Ghosh et al. (1996). We show that the semiparametric models generally have better performance than their time series counterparts and, in a few situations, the performances are comparable. We want to convey the message that semiparametric regression methodology can provide an attractive alternative to the traditional modeling techniques, especially when time varying information is available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some interesting open problems and areas for future research that may be worth pursuing.

Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years. Notable work in the frequentist domain includes Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let D and E be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between D and E is the (disease) odds ratio

$$\psi = \frac{P(D=1 \mid E=1)/P(D=0 \mid E=1)}{P(D=1 \mid E=0)/P(D=0 \mid E=0)}.$$

By applying the Bayes theorem, the above expression can be rewritten as

$$\psi = \frac{P(E=1 \mid D=1)/P(E=0 \mid D=1)}{P(E=1 \mid D=0)/P(E=0 \mid D=0)},$$

which is the exposure odds ratio. Another well known measure of association is the relative risk (RR) of disease for different exposure values given by $P(D=1 \mid E=1)/P(D=1 \mid E=0)$. For rare diseases, both $P(D=0 \mid E=0)$ and $P(D=0 \mid E=1)$ are close to one and the disease odds ratio is approximately equal to the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of 2×2 tables as in Table 1-1.

Table 1-1. A typical 2×2 table

Disease Status    Exposed    Not Exposed    Total
Case              n_{11i}    n_{10i}        n_{1i}
Control           n_{01i}    n_{00i}        n_{0i}
Total             e_{1i}     e_{0i}         N_i
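The equality of the disease and exposure odds ratios, and the rare-disease approximation to the relative risk, can be checked numerically. The following is a minimal Python sketch rather than anything taken from the dissertation: the function names are introduced here for illustration only, the counts describe a hypothetical population in which the disease is rare, and the cells are labeled as in Table 1-1 (first index 1 = diseased, 0 = not diseased; second index 1 = exposed, 0 = not exposed).

```python
from fractions import Fraction


def disease_odds_ratio(n11, n10, n01, n00):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    odds_exposed = Fraction(n11, n01)      # P(D=1|E=1) / P(D=0|E=1)
    odds_unexposed = Fraction(n10, n00)    # P(D=1|E=0) / P(D=0|E=0)
    return odds_exposed / odds_unexposed


def exposure_odds_ratio(n11, n10, n01, n00):
    """Odds of exposure among the diseased divided by odds among the non-diseased."""
    odds_diseased = Fraction(n11, n10)     # P(E=1|D=1) / P(E=0|D=1)
    odds_healthy = Fraction(n01, n00)      # P(E=1|D=0) / P(E=0|D=0)
    return odds_diseased / odds_healthy


def relative_risk(n11, n10, n01, n00):
    """P(D=1|E=1) / P(D=1|E=0); meaningful only for population (cohort) counts."""
    risk_exposed = Fraction(n11, n11 + n01)
    risk_unexposed = Fraction(n10, n10 + n00)
    return risk_exposed / risk_unexposed


# Hypothetical population counts (assumed for illustration): disease is rare.
n11, n10, n01, n00 = 30, 10, 9970, 19990

psi_disease = disease_odds_ratio(n11, n10, n01, n00)
psi_exposure = exposure_odds_ratio(n11, n10, n01, n00)
rr = relative_risk(n11, n10, n01, n00)

print("disease odds ratio :", float(psi_disease))
print("exposure odds ratio:", float(psi_exposure))
print("relative risk      :", float(rr))
```

Both odds ratios reduce to the cross-product ratio n11·n00/(n10·n01), which is why a case-control sample, although it fixes the numbers of cases and controls, still identifies the disease odds ratio. The relative risk, in contrast, is computed here from population counts and is only approximated by the odds ratio because the disease is rare in this example.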

$$\hat{\psi}_{MH} = \frac{\sum_{i=1}^{I} n_{11i}\, n_{00i}/N_i}{\sum_{i=1}^{I} n_{01i}\, n_{10i}/N_i} \qquad (1)$$

It may be of interest to test for the equality of the odds ratios across the I tables, i.e. $H_0: \psi_1 = \cdots = \psi_I$, where $\psi_i$ is the odds ratio of the ith table, using a statistic which follows an approximate $\chi^2$ distribution with $I-1$ degrees of freedom under the null hypothesis. The derivation of the variance of the above estimator initially posed some challenge but was eventually addressed in several subsequent papers (Breslow 1996). Breslow and Day (1980) marked the development of likelihood based inference methods for the odds ratio. Methods to evaluate the simultaneous effects of multiple quantitative risk factors on disease rates were pioneered in the 1960's.

In a case-control study, the appropriate likelihood is the retrospective likelihood of exposure given the disease status. Cornfield et al. (1961) noted that if the exposure distributions in the case and control populations are normal with different means but a common covariance matrix, then the prospective probability of disease (D) given the exposure (X) has the logistic form, i.e.

$$P(D=1 \mid X=x) = L(\alpha + \beta' x),$$

where $L(u) = 1/\{1+\exp(-u)\}$. However, there is a conceptual complication in using a prospective likelihood based on $P(D \mid X)$ whereas a case-control sampling

Prentice and Pyke (1979) who showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from the retrospective likelihood are the same as those obtained from a prospective likelihood under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood which generally involves fewer nuisance parameters than a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to the situation of missing data and measurement error in the exposure variables.

In a case-control set-up, matching is often used for selecting comparable controls to eliminate bias due to confounding. Statistical techniques for analyzing matched case-control data were first developed by Breslow et al. (1978). In the simplest setting, the data consist of m matched sets, say, $S_1, \ldots, S_m$, with $M_i$ controls matched with a case in each set or stratum. A prospective stratified logistic disease incidence model given by

$$P(D=1 \mid X=x, S_i) = L(\alpha_i + \beta' x), \qquad L(u) = 1/\{1+\exp(-u)\},$$

is assumed. The $\alpha_i$'s are the stratum specific intercept terms; they are treated as nuisance parameters and are eliminated by conditioning on the number of cases in each stratum. The generated conditional logistic likelihood yields the optimum estimating function (Godambe 1976) for estimating $\beta$. The classical methods for analyzing unmatched and matched studies suffer from loss of efficiency when the exposure variable is partially missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric method of estimation which took into account missing exposures in matched case-control studies. Satten and Kupper (1993), Paik and Sacco (2000) and Satten and Carroll (2000) addressed the problem of missing exposure from a full likelihood approach by assuming a distribution of the exposure variable in the control population.
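The way conditioning removes the stratum specific intercepts can be illustrated with a small numerical check. The sketch below is not the dissertation's code: it assumes a single scalar exposure, one case matched to several controls, a stratum intercept called `alpha` and a log odds ratio called `beta`, and all numbers and function names are hypothetical. It computes, directly from the prospective logistic model, the probability that the observed case is the diseased subject given that exactly one subject in the matched set is diseased, and compares it with the usual conditional logistic term exp(beta·x_case) / Σ_j exp(beta·x_j).

```python
import math


def logistic(u):
    """Logistic function 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def conditional_probability(alpha, beta, x_case, x_controls):
    """P(observed case is the diseased subject | exactly one diseased subject in the set),
    computed from the prospective model with stratum intercept alpha."""
    xs = [x_case] + list(x_controls)
    p = [logistic(alpha + beta * x) for x in xs]

    def only_subject_diseased(j):
        # Probability that subject j is diseased and every other subject is not.
        prob = 1.0
        for k, pk in enumerate(p):
            prob *= pk if k == j else (1.0 - pk)
        return prob

    total = sum(only_subject_diseased(j) for j in range(len(xs)))
    return only_subject_diseased(0) / total


def conditional_logistic_term(beta, x_case, x_controls):
    """Contribution of one matched set to the conditional likelihood (no intercept)."""
    xs = [x_case] + list(x_controls)
    return math.exp(beta * x_case) / sum(math.exp(beta * x) for x in xs)


# Hypothetical matched set: one case and three controls.
beta = 0.7
x_case, x_controls = 1.2, [0.3, -0.5, 2.0]

for alpha in (-3.0, 0.0, 2.5):
    print(alpha, conditional_probability(alpha, beta, x_case, x_controls))
print("conditional logistic term:", conditional_logistic_term(beta, x_case, x_controls))
```

Whatever value is used for the intercept, the conditional probability is the same, because the factor exp(alpha) appears in both the numerator and every term of the denominator and cancels. The product of such terms over the matched sets therefore depends on the log odds ratio only.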

Althman (1971) is probably the first Bayesian work which considered several 2×2 contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case-control model with a single binary exposure. These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case-control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating knowledge available at each stage.

Muller and Roeder (1997) introduced a novel aspect to the Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the non-parametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West 1995). The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters. This paper pioneered the use of continuous covariates, measurement error and flexible non-parametric modeling of exposures in a Bayesian setting and brought to light the tremendous possibility of modern Bayesian computational techniques in solving complex data scenarios in case-control studies.

Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical

Muller et al. (1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin 1981). They assigned a Dirichlet(1, 1, ..., 1) prior on the probability vector corresponding to the grid points.

Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies when cases are individually matched to controls. They introduced nuisance parameters

Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000), they based their analysis on the unconditional rather than the conditional likelihood after elimination of the nuisance parameters. Their framework included a wide variety of links like complementary log links and some symmetric and skewed links in addition to the usual logit and probit links. Recently Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a non-parametric Dirichlet process prior on the stratum specific effects in the distribution of the exposure variable and parametric priors on all other parameters. The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject vis-a-vis more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. Unfortunately, rigorous statistical methods of incorporating longitudinally varying exposure information inside the case-control framework have not yet been properly developed. In this work,

Ghosh and Rao (1994) provide a nice review of the different types of estimators and inferential procedures used in survey sampling and small area estimation.

Since sample surveys are generally designed for large areas, the estimates of means or totals obtained thereof are reliable for large domains. Direct survey based estimators for small domains often yield large standard errors due to the small sample size of the concerned area. This is due to the fact that the original survey was designed to provide accuracy at a much higher level of aggregation than for local areas. This makes it a necessity to borrow strength from adjacent or related areas to find indirect estimators that increase the effective sample size and thus increase the precision of the resulting estimate for a given small area. Broadly speaking, a small area model has a generalized linear form with a mean term, a random area-specific effect term and a measurement error term which reflects the noise for not sampling the entire domain.

During the last 10-15 years, model based inference has been widely used in the small area context. This is mainly due to the wide range of functionalities that comes with the linear mixed effects modeling framework. Some of the main advantages of this framework are:

(i) Random area-specific effects account for between area variation above and beyond that explained by auxiliary variables in the model.
(ii) Different variations like non-linear mixed effects models, logistic regression models and generalized linear models can be entertained.
(iii) Area specific measures of precision can be associated with each small area estimate, unlike the global measures.
(iv) Complex data structures like spatial dependence, time series structures and longitudinal measurements can be explored.
(v) Recent methodological developments for random effects models can be utilized to achieve accurate small area inferences.

Generally, there are two kinds of small area models depending on whether the response is observed at the area or the unit level.

1. Area (or aggregate) level models relate small area means to area specific auxiliary variables.
2. Unit level models relate the unit values of the study variable to unit-specific auxiliary variables.

The basic area level model is given by

$$\theta_i = z_i'\beta + b_i v_i, \qquad i = 1, \ldots, m.$$

Here $\theta_i$ is often assumed to be a function of the population mean, $\bar{Y}_i$, of the ith small area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, $v_i$'s are area specific random

In order to infer about the small area means, $\bar{Y}_i$, direct estimators, $\hat{\bar{Y}}_i$, are assumed to be known and available. The linear model

$$\hat{\theta}_i = \theta_i + e_i$$

is assumed, where the sampling errors $e_i$ are independent with

$$E_p(e_i \mid \theta_i) = 0, \qquad V_p(e_i \mid \theta_i) = \psi_i, \qquad \psi_i \text{ known},$$

which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in ( 1 ), we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining ( 1 ) and ( 1 ), we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i,$$

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent. Fay and Herriot (1979) studied the above area level model ( 1 ) in the context of estimating the per capita income (PCI) for small places in the United States and proposed an Empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model based county estimates of poor school age children in the United States.

In the unit level model, it is assumed that unit specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element j in each small area i. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model

Battese et al. (1988) studied the nested error regression model ( 1 ) in estimating the area under corn and soybeans for counties in North-Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means.

Over the years, numerous extensions have been proposed for the above modeling frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random-effects structure. Rao (2003) presented a nice overview of the different estimation methods while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context.

A proper review of model based small area estimation will be incomplete without explaining the EBLUP, EB and HB approaches that are being widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. So, they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown variance parameters in the BLUP with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications. In an EB approach, the posterior distribution of the parameters of

Morris (1983). Last but not the least, in the HB approach, a prior distribution is specified on the model parameters and the posterior distribution of the parameter of interest is obtained. Inferences about the parameters are based on the posterior distribution. The parameter of interest is estimated by its posterior mean while its precision is estimated by its posterior variance. Recent advances in Markov chain Monte Carlo techniques, specifically Gibbs and Metropolis-Hastings samplers, have considerably simplified the computational aspect of HB procedures.

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. The SAIPE program also provides annual state and county level estimates of median household income. Generally, observations on various characteristics of small areas that are collected over time may possess a complicated underlying time-varying pattern. It is likely that models which take into account this longitudinal pattern in the observations may perform better than classical small area models which do not utilize this information. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal time varying pattern in the response and the covariates.
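The idea of borrowing strength through the basic area level model reviewed earlier in this section can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the dissertation's semiparametric model and not the SAIPE production method: it takes b_i = 1, treats the model variance (`sigma_v2`) and the sampling variances (`psi`) as known, uses an intercept plus a single covariate, and runs on made-up numbers. Under these assumptions the BLUP of each area mean is the weighted average gamma_i × (direct estimate) + (1 − gamma_i) × (regression-synthetic estimate), with gamma_i = sigma_v2 / (sigma_v2 + psi_i) and the regression coefficients estimated by weighted least squares. The function name `fay_herriot_blup` is hypothetical.

```python
def fay_herriot_blup(direct, z, psi, sigma_v2):
    """BLUP of area means under direct_i = b0 + b1 * z_i + v_i + e_i,
    with Var(v_i) = sigma_v2 (known) and Var(e_i) = psi_i (known)."""
    # Weighted least squares with weights 1 / (sigma_v2 + psi_i).
    w = [1.0 / (sigma_v2 + p) for p in psi]
    sw = sum(w)
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, direct)) / sw
    sxy = sum(wi * (zi - z_bar) * (yi - y_bar) for wi, zi, yi in zip(w, z, direct))
    sxx = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    b1 = sxy / sxx
    b0 = y_bar - b1 * z_bar

    # Regression-synthetic estimates and shrinkage weights.
    synthetic = [b0 + b1 * zi for zi in z]
    gamma = [sigma_v2 / (sigma_v2 + p) for p in psi]

    # Composite estimate: shrink the direct estimate towards the synthetic one.
    blup = [g * y + (1.0 - g) * s for g, y, s in zip(gamma, direct, synthetic)]
    return blup, synthetic, gamma


# Hypothetical data for five small areas (all values assumed for illustration).
direct = [42.0, 55.0, 47.5, 61.0, 50.0]   # direct survey estimates
z = [40.0, 52.0, 49.0, 58.0, 51.0]        # auxiliary covariate
psi = [1.0, 9.0, 4.0, 16.0, 0.25]         # known sampling variances

blup, synthetic, gamma = fay_herriot_blup(direct, z, psi, sigma_v2=4.0)
for i in range(len(direct)):
    print(i, round(gamma[i], 3), round(synthetic[i], 2), round(blup[i], 2))
```

Areas with a large sampling variance get a small gamma_i and are pulled strongly towards the regression line, while precisely measured areas stay close to their direct estimates. Setting `sigma_v2=0` makes every gamma_i zero, so the function returns the synthetic estimates, which matches the remark that a zero model variance ignores local variation beyond the auxiliary variables.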

Suppose the response y and the covariate x are related as

$$y_i = f(x_i) + e_i, \qquad i = 1, \ldots, n,$$

where $f(x)$ is an unknown and unspecified smooth function of x and $e_i \sim N(0, \sigma_e^2)$. The basic problem of nonparametric regression is to estimate the function $f(\cdot)$ using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate $f(\cdot)$, viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions while smoothing splines may be better for sparse regions. Here, we will briefly review the main characteristics of splines in general and penalized splines in particular.

The basic idea behind splines is to express the unknown function $f(x)$ using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of x known as knots. The knots, say, $(\kappa_1, \ldots, \kappa_K)$ partition the range of x into K distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of certain degree is defined. A polynomial spline of degree p has $(p-1)$ continuous derivatives and a discontinuous pth derivative at any interior knot. The pth derivative reflects the jump of the splines at the knots. Thus, a spline of degree 0 is a

PAGE 27

Here $(x-\tau_k)^p_+$ is the function $(x-\tau_k)^p I\{x > \tau_k\}$. Using the above basis, a spline of degree $p$ can be expressed as

$$f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x-\tau_k)^p_+ .$$

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. $p = 1, 2, 3$ correspond to a linear, quadratic or cubic spline respectively. The above basis is one of the most commonly used, although other bases such as radial basis functions or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as sensors in relaying information about the underlying true observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting vis-a-vis overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be placed uniformly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of $x$, and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert 2002). Recently, there have been interesting contributions on knot

PAGE 28

Friedman (1991); Stone et al. (1997); Denison et al. (1998); Lindstrom (1999); DiMatteo et al. (2001); Botts and Daniels (2008)). The flexibility and wide applicability of splines is due to the fact that, provided the knots are evenly spread out over the range of $x$, $f(x \mid \beta, \gamma)$ can accurately estimate a very large class of smooth functions $f(\cdot)$ even if the degree of the spline is kept relatively low (say, 1 or 2).

The spline coefficients $(\gamma_1, \ldots, \gamma_K)$ in (1) correspond to the discontinuous $p$th derivative of the spline; thus, they measure the jumps of the spline at the knots $(\tau_1, \ldots, \tau_K)$ and contribute to the roughness of the resulting spline. In order to smooth out the fit, a roughness penalty is placed on these parameters. This is often done by minimizing the expression

$$\sum_i \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda\, \gamma'\gamma,$$

where $\lambda$ is known as the smoothing parameter. This is synonymous with minimizing the first part of (1) subject to an upper bound on $\gamma'\gamma$. $\lambda$ plays a crucial role in the smoothing process since it controls the goodness of fit and the roughness of the fitted model. As $\lambda$ decreases, the spline will tend to overfit, becoming an interpolating curve as $\lambda \to 0$. As $\lambda$ increases, the spline will become smoother and will tend to the least squares fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, such as cross-validation, generalized cross-validation and Mallows' $C_p$ criterion.

Broadly speaking, there are three main types of splines: regression splines, smoothing splines and penalized splines (or P-splines). All of them are based on the same principle as detailed above but differ in the specific manner in which smoothing is done or the knots are selected. In regression splines, smoothing is achieved by the deletion of non-essential knots or, equivalently, by setting the jumps at those knots to zero while keeping the jumps at the other knots undisturbed. In smoothing and penalized splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using

PAGE 29

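1 ). A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots, whereas in the latter the number of knots is much smaller, resulting in more flexibility. In fact, penalized splines can be seen as a generalization of regression and smoothing splines.

The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite (1) as

$$\sum_i \{y_i - f(x_i \mid \theta)\}^2 + \lambda\, \theta' D\, \theta,$$

where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)'$ and $D$ is a known positive semi-definite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma_K^{-1} \end{pmatrix}.$$

Expression (1) corresponds to setting $\Sigma_K = I$.

Let $X$ be the matrix with the $i$th row $X_i = (1, x_i, \ldots, x_i^p)$ and $Z$ be the matrix with the $i$th row $Z_i = \{(x_i-\tau_1)^p_+, \ldots, (x_i-\tau_K)^p_+\}$. Using this formulation in (1) with the basis function in (1) and dividing by the error variance $\sigma_e^2$, we have

$$\frac{1}{\sigma_e^2}\,\|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_e^2}\,\|\gamma\|^2. \qquad (1)$$

By assuming that $\gamma$ is a vector of random effects with $\mathrm{Cov}(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_e^2/\lambda$, while treating $\beta$ as the set of fixed effects parameters, the above penalized spline framework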

PAGE 30

where $\mathrm{Cov}(e) = \sigma_e^2 I$ and $\gamma$ and $e$ are independent.

Bayesian P-splines have recently become popular because they combine the flexibility of non-parametric models with the exact inference provided by the Bayesian inferential procedure. This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand 2003) as shown above. This equivalence also carries over to the manner in which smoothing is done. Smoothing can be achieved by imposing penalties on the spline coefficients, as shown in (1), or by assuming a distributional form for $\gamma$, for example $N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian context, priors are placed on $\sigma_\gamma^2$ and the other parameters and the usual posterior sampling is carried out. Since the smoothing parameter is sampled alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above.
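The penalized least squares criterion reviewed above can be illustrated with a minimal sketch. It assumes a truncated polynomial basis of degree at least 1, knots at equally spaced sample quantiles, and a ridge-type penalty on the spline coefficients only. The function names (`truncated_power_basis`, `quantile_knots`, `fit_penalized_spline`, `predict`) and the small Gaussian elimination helper are hypothetical and are not taken from the dissertation; the smoothing parameter `lam` is supplied by the user rather than estimated.

```python
def truncated_power_basis(x, knots, degree):
    """Row [1, x, ..., x^p, (x - t_1)_+^p, ..., (x - t_K)_+^p]; degree must be >= 1."""
    poly = [x ** d for d in range(degree + 1)]
    trunc = [max(x - t, 0.0) ** degree for t in knots]
    return poly + trunc


def quantile_knots(xs, num_knots):
    """Knots at the k/(K+1) sample quantiles, k = 1..K, by linear interpolation."""
    s = sorted(xs)
    n = len(s)
    knots = []
    for k in range(1, num_knots + 1):
        pos = k * (n - 1) / (num_knots + 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        knots.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return knots


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def fit_penalized_spline(xs, ys, knots, degree, lam):
    """Minimise sum (y_i - f(x_i))^2 + lam * sum_k gamma_k^2 over all coefficients."""
    rows = [truncated_power_basis(x, knots, degree) for x in xs]
    q = degree + 1 + len(knots)
    # Normal equations (C'C + lam * D) theta = C'y with D = diag(0, ..., 0, 1, ..., 1).
    ctc = [[sum(r[i] * r[j] for r in rows) for j in range(q)] for i in range(q)]
    cty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(q)]
    for k in range(degree + 1, q):
        ctc[k][k] += lam  # only the spline (jump) coefficients are penalised
    return solve(ctc, cty)


def predict(x, theta, knots, degree):
    """Evaluate the fitted spline at x."""
    return sum(c * b for c, b in zip(theta, truncated_power_basis(x, knots, degree)))
```

In the mixed model reading of the same criterion, `lam` plays the role of the ratio of the error variance to the variance of the spline coefficients, so a Bayesian fit would sample that ratio instead of fixing it as this sketch does.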

PAGE 31

Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951), Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and far-reaching contributions have been made in virtually every aspect of the field. Some of the notable ones are equivalence of prospective and retrospective likelihoods (Prentice and Pyke 1979), measurement error in exposures (Roeder et al. 1996) and matched case-control studies (Breslow et al. 1978). Important contributions in the Bayesian paradigm include binary exposures (Zelen and Parker 1986), continuous exposures (Muller and Roeder 1997), categorical exposures (Seaman and Richardson 2001), equivalence (Seaman and Richardson 2004) and matching (Diggle et al. (2000); Ghosh and Chen (2002)).

The analysis of complex data scenarios in a case-control framework is a relatively new area of research. Specifically, analysis of longitudinal case-control studies has only

PAGE 32

Park and Kim (2004) were among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case-control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independent correlation structure fail in this framework.

In view of the above challenges, we propose to use functional data analytic techniques, especially nonparametric regression methodology, to model both the time

PAGE 33

Eilers and Marx (1996); Ruppert et al. (2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time-varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible non-linear time-varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and

PAGE 34

Ghosh, 1998). Once the optimal model is identified, model assessment is carried out using case deletion diagnostics (Bradlow and Zaslavsky 1997).

Etzioni et al. 1999). This data set is based on a biomarker-based screening procedure for prostate cancer to elucidate the association between prostate cancer and prostate-specific antigen (PSA). The effectiveness of biomarker-based screening procedures for prostate cancer is currently a topic of intense debate and investigation in the realms of health care practice, policy and research. Since the discovery of prostate-specific antigen (PSA) and the observation that serum PSA levels may be significantly increased in prostate cancer patients, a lot of effort has been dedicated to identifying effective PSA-based testing programs with favorable diagnostic properties.

In this study, the levels of free and total PSA were measured in the sera of 71 prostate cancer cases and 70 controls. Participants in this study included men aged 50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or beta-carotene and retinol. The intervention had no noticeable effect on the incidence of prostate cancer, with similar numbers of cases observed in the intervention and control arms. Several PSA measurements recorded for the cases were taken as long as 10 years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between September 1988 and September 1995 inclusive. The individuals deemed controls were selected among individuals not yet diagnosed as having cancer by the time of analysis. As the exposure variable, we use the natural logarithm of the total PSA (Ptotal), although the negative logarithm of the ratio of free to total PSA (Pratio) can also be considered. In addition to the above measurements, observations were collected on time (years) relative to prostate cancer diagnosis and age at blood draw for the cases

PAGE 35

Figure 2-1 shows the PSA trajectory against age for some randomly chosen cases and controls. Etzioni et al. (1999) analyzed this data set by modeling the receiver operating characteristic (ROC) curves associated with both the biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis. They observed that although the two markers performed similarly eight years prior to diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.

The rest of the chapter is organized as follows. In Section 2.2, we introduce the semiparametric modeling framework. Section 2.3 describes the details of posterior inference. In Section 2.4, we discuss relevant Bayesian equivalence results for our framework. Section 2.5 outlines the model comparison and model assessment procedures we performed. We describe the data analysis results based on the prostate cancer data set in Section 2.6 and end with a discussion in Section 2.7.

2.2.1 Notation

PAGE 36

Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.

PAGE 37

Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two-stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for modeling the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key difference between their framework and ours is that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models. Moreover, instead of a linear modeling framework, we use a combination of linear and logistic models since our response is binary.

The exposure trajectory is modeled as

$$Y_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij},$$

where $e_{ij} \sim N(0, \sigma_e^2)$, $f(a)$ is the population mean function modeling the overall PSA trend as a function of age for all the subjects, while $g_i(a)$ is the subject-specific deviation function reflecting the deviation of the $i$th subject-specific profile from the mean population profile.

The reason for modeling exposure as a function of age is that for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point controlling for the time with respect

PAGE 38

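We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using p-splines as follows:

$$f(a_{ij}) = \Phi_{p,\tau}(a_{ij})'\beta, \qquad g_i(a_{ij}) = \Phi_{q,\eta}(a_{ij})'b_i,$$

where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij}-\tau_1)^p_+, \ldots, (a_{ij}-\tau_K)^p_+]'$ and $\Phi_{q,\eta}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij}-\eta_1)^q_+, \ldots, (a_{ij}-\eta_M)^q_+]'$ are truncated polynomial basis functions of degrees $p$ and $q$ with knots $(\tau_1, \ldots, \tau_K)$ and $(\eta_1, \ldots, \eta_M)$ respectively (Durban et al. 2004). Generally, $M$ is no larger than $K$.

The disease model is

$$P(D_i = 1 \mid X_i(t+ad_i),\, t \in [-c, 0]) = L\left(\alpha + \int_{-c}^{0} X_i(t+ad_i)\,\gamma(t+ad_i)\,dt\right),$$

where $L(\cdot)$ is the logistic distribution function, $X_i(t+ad_i)$ is the true, error-free, unobserved subject-specific exposure profile modeled as $f(t+ad_i) + g_i(t+ad_i)$, while $\gamma(t+ad_i)$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the $i$th subject. In (2), we use the relation $a_{ij} = t_{ij} + ad_i$ to model the exposure trajectory $X(\cdot)$ and the influence function $\gamma(\cdot)$ as functions of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. $c$ is the time by which we go back in the past to record the exposure history for the $i$th subject; e.g. $c = 8$ would imply that, for the $i$th subject, the exposure observations recorded since eight years prior to diagnosis are being considered for analysis. Thus, by changing the value of $c$, the effect of different lengths of PSA trajectories on the current disease status can be studied.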

PAGE 39

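We model the influence function as a p-spline, $\gamma(t+ad_i) = \Phi_{r,\tau}(t+ad_i)'\gamma$, where $\Phi_{r,\tau}(t+ad_i) = [1, (t+ad_i), \ldots, (t+ad_i)^r, (t+ad_i-\tau_1)^r_+, \ldots, (t+ad_i-\tau_K)^r_+]'$, $\gamma = (\gamma_0, \ldots, \gamma_{K+r})'$ and $(\tau_1, \ldots, \tau_K)$ are the knots.

As special cases of (2), we may consider $\gamma(t+ad_i) = \gamma_0$, in which case the covariate is the area under the PSA process $\{X_i(t+ad_i), -c \le t \le 0\}$ and $\gamma_0$ is its effect on the disease probability (or the logit of the disease probability). We can also assume $\gamma(t+ad_i) = \gamma_0 + \gamma_1 (t+ad_i)$, which signifies a linear pattern of the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages.

Replacing (2) and (2) in the R.H.S. of (2), we have

$$P(D_i = 1 \mid X_i(t+ad_i),\, t \in [-c, 0]) = L(\alpha + \beta' M_i \gamma + b_i' Q_i \gamma),$$

where $M_i = \int_{-c}^{0} \Phi_{p,\tau}(t+ad_i)\,\Phi_{r,\tau}(t+ad_i)'\,dt$ and $Q_i = \int_{-c}^{0} \Phi_{q,\eta}(t+ad_i)\,\Phi_{r,\tau}(t+ad_i)'\,dt$.

For pre-chosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are matrices and are available in closed form. We assume normal distributional forms for the spline coefficients in (2) and (2) in order to penalize the jumps of the spline at the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$ and $\gamma_{r+k} \sim N(0, \sigma_\gamma^2)$ $(k = 1, \ldots, K)$. Finally, the random subject-specific deviation function $g_i(a_{ij})$ is modeled through $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\ j = 0, \ldots, q)$.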
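The integrated cross-product matrices above are available in closed form, but a quick numerical check can be useful when implementing them. The following is a minimal sketch that approximates such a matrix with the composite Simpson rule instead of the closed-form expressions. The function names `basis` and `cross_moment_matrix`, and all argument names, are hypothetical; degrees are assumed to be at least 1 and `panels` is assumed to be even.

```python
def basis(age, knots, degree):
    """Truncated polynomial basis [1, a, ..., a^d, (a - k_1)_+^d, ...] at one age (degree >= 1)."""
    return [age ** d for d in range(degree + 1)] + [max(age - k, 0.0) ** degree for k in knots]


def cross_moment_matrix(age_dx, c, knots_row, deg_row, knots_col, deg_col, panels=2000):
    """Approximate the integral over t in [-c, 0] of basis_row(t + age_dx) * basis_col(t + age_dx)'.

    Uses the composite Simpson rule; `panels` must be even. `age_dx` is the age at
    diagnosis and `c` is how far back (in years) the exposure window reaches.
    """
    h = c / panels
    n_row = deg_row + 1 + len(knots_row)
    n_col = deg_col + 1 + len(knots_col)
    out = [[0.0] * n_col for _ in range(n_row)]
    for s in range(panels + 1):
        t = -c + s * h
        # Simpson weights: 1 at the end points, then alternating 4 and 2.
        w = 1.0 if s in (0, panels) else (4.0 if s % 2 else 2.0)
        b_row = basis(t + age_dx, knots_row, deg_row)
        b_col = basis(t + age_dx, knots_col, deg_col)
        for i in range(n_row):
            for j in range(n_col):
                out[i][j] += w * b_row[i] * b_col[j] * h / 3.0
    return out
```

With a constant influence function (a single column equal to 1) the first row of the result reduces to the length of the exposure window and the integrated age, which gives an easy sanity check against hand calculation.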

PAGE 40

2.3.1 Likelihood Function

The likelihood for the $i$th subject, conditional on the random effects, is the product of the following components: $p(Y_i \mid \beta, a_i, b_i, \sigma_e^2)$ is the probability distribution corresponding to the trajectory model, $p(D_i \mid \alpha, \beta, \gamma)$ denotes the logistic distribution corresponding to the disease model, while the rest deals with the distributional structures on the spline coefficients and random effects.

Since the trajectory model (2) has a normal distributional structure while the disease model (2) has a logistic structure, the likelihood function and hence the posterior have a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well-known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 3.3.
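One common way to realise a mixture-of-normals representation in the spirit of Albert and Chib (1993) is to write a Student-t latent variable as a normal whose precision has a gamma distribution. The sketch below draws such a latent variable and thresholds it at zero to obtain a binary outcome. The function names and the default of 8 degrees of freedom are illustrative assumptions, and the code does not reproduce the full sampler used in this chapter.

```python
import random


def draw_latent_t(location, df, rng):
    """Draw Z from a t distribution with the given location, unit scale and df degrees of freedom.

    Scale mixture of normals: lam ~ Gamma(shape = df/2, rate = df/2),
    then Z | lam ~ N(location, 1 / lam).
    """
    lam = rng.gammavariate(df / 2.0, 2.0 / df)  # gammavariate takes (shape, scale)
    return rng.gauss(location, lam ** -0.5)


def simulate_disease_status(location, df=8, rng=None):
    """Binary outcome D = 1 if the latent variable is positive, else 0."""
    rng = rng or random.Random()
    return 1 if draw_latent_t(location, df, rng) > 0 else 0
```

Conditional on the mixing variable, the latent variable is normal, which is what makes the full conditionals of the regression-type parameters tractable in a Gibbs sampler.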

PAGE 41

Likelihood Approximation

We use the method of Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture of normal regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.

As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that $D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let the $Z_i$ be independently distributed from a t distribution with location $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$, scale parameter 1 and $\nu$ degrees of freedom. Equivalently, with the introduction of the additional random variable $\lambda_i$, the distribution of $Z_i$ can be expressed as a scale mixture of normal distributions, $Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1})$ with $\lambda_i \sim \mathrm{Gamma}(\nu/2, \nu/2)$.

26 ) as

PAGE 42

2 ). Since the marginal posterior distribution of the parameters is analytically intractable, we construct an MCMC algorithm to sample from the full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin 1992).

As discussed in Section (1.2), Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model is equivalent to that based on a retrospective one. As a result, a prospective modeling framework can be used to analyze case-control data, which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section (2.2.2)) to analyze the PSA data set.

Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations is available. We use this exposure trajectory or exposure profile in analyzing the present

PAGE 43

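Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e. $\{X_i(t), -c \le t \le 0\}$ can be assumed to be a discrete random variable with support $\{Z_1(t), \ldots, Z_J(t), -c \le t \le 0\}$, the set of all observable exposure trajectories, where $\{Z_j(t), -c \le t \le 0, j = 1, \ldots, J\}$ is a finite collection of elements in the support of the $X_{ij}$'s. Let $Y_{0j}$ and $Y_{1j}$ be the numbers of controls and cases having exposure profile $\{Z_j(t), -c \le t \le 0\}$. We denote the null or baseline trajectory as $\{X(t) = 0, -c \le t \le 0\}$.

The odds ratio of disease corresponding to $\{Z_j(t), -c \le t \le 0\}$ with respect to baseline exposure is $\exp\left\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\right\}$. Assuming that a control has exposure profile $\{Z_j(t), -c \le t \le 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that

$$P(X(t) = Z_j(t), -c \le t \le 0 \mid D = 1) = \frac{\delta_j \exp\left\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\right\}}{\sum_{k=1}^{J} \delta_k \exp\left\{\int_{-c}^{0} Z_k(t)\,\gamma(t)\,dt\right\}}$$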

PAGE 44

since(t)=(t)0=0(t)by( 2 ).Weassume1=1foridentiability.Hered=0and1standsforcontrolsandcasesrespectively.Assuming#tobethebaselineoddsofdisease,theprospectivelikelihoodisgivenby Basedontheabovesetup,wehavethefollowingequivalenceresults: 2 )withrespecttoisthesameasthatobtainedbymaximizingL(#,)in( 2 )withrespectto#. 44

PAGE 45

(ii) Assuming $\theta = (\theta_1, \ldots, \theta_J)$ and $\theta_j = \delta_j / \sum_{k=1}^{J} \delta_k$, the posterior density of $(\theta, \gamma)$ is

(iii) The marginal posterior densities of $\gamma$ obtainable from $p(\omega, \gamma \mid y)$ and $p(\theta, \gamma \mid y)$ are the same.

The proofs of the above theorem are similar in nature to those in Seaman and Richardson (2004) and are given in Appendix A. Since we have considered a near-uniform prior for the baseline odds parameter and our prior on the exposure probability parameters ensures the existence and finiteness of the required expectation, the conditions of Theorem 2 are essentially satisfied for our framework.

Based on the above results, it can be concluded that the marginal posterior distribution of $\gamma$, the parameter of interest, will be the same regardless of whether we fit a prospective or a retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general case of a multicategory case-control setup, i.e. when there are multiple (>2) disease states. We have the following result

PAGE 46

KXl=1ldl1CCCCCAndkKYk=11k!() TheproofoftheabovetheoremisgiveninAppendixA. 2.5.1PosteriorPredictiveLoss GelfandandGhosh ( 1998 ).Thiscriterionisbasedontheideathatanoptimalmodelshouldprovideaccuratepredictionofareplicateoftheobserveddata. 46

PAGE 47

( 1998 ) obtained this criterion by minimizing the posterior loss for a given model and then, for all models under consideration, selecting the one which minimizes this criterion. For a general loss function, this criterion can be expressed as a linear combination of two distinct parts, i.e. a goodness-of-fit part and a penalty part. For our framework, the posterior predictive loss can be written as

$$\mathrm{PPL} = \frac{k}{k+1}\sum_{i=1}^{N} (D_i - \hat{D}_i)^2 + \sum_{i=1}^{N} \mathrm{Var}(\hat{D}_i), \qquad (2)$$

where $\hat{D}_i = E(D_i^{rep} \mid y, D)$ and $\mathrm{Var}(\hat{D}_i) = \mathrm{Var}(D_i^{rep} \mid y, D) = E(D_i^{rep} \mid y, D) - (E(D_i^{rep} \mid y, D))^2$. For our framework, $D^{rep} = (D_1^{rep}, \ldots, D_N^{rep})$ is the replicated disease status vector for all the subjects. It is straightforward to calculate the expected value of the above criterion using the posterior samples obtained from the Gibbs sampler. Lower values of this criterion imply a better model fit. We assume $k = 1$ and obtain the values of posterior predictive loss for different lengths of exposure trajectories and different numbers of knots. The results are given in Table 2-3 and explained in Section 2.6. For the optimal model selected using the posterior predictive loss criterion, model assessment was performed using Kappa measures of agreement and case deletion diagnostics. The methodology is described below.

The kappa statistic $\kappa$ (Agresti 2002) compares agreement against that which might be expected by chance. The value of $\kappa$ ranges from $-1$ to $1$; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies complete disagreement. A value of 0 indicates no agreement above and beyond that expected by chance.
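As a minimal sketch of how the two summaries just described might be computed from sampler output, the code below assumes the squared-error form of the Gelfand and Ghosh (1998) criterion, in which the goodness-of-fit term is weighted by k/(k+1) and the penalty is the sum of predictive variances. The function names, the argument layout (one list of fitted disease probabilities per posterior draw) and the use of Cohen's kappa for two binary classifications are assumptions made for illustration.

```python
def posterior_predictive_loss(observed, prob_draws, k=1.0):
    """Posterior predictive loss for binary outcomes.

    observed:   list of 0/1 disease indicators, one per subject.
    prob_draws: list of posterior draws; each draw is a list of P(D_i = 1), one per subject.
    """
    n_draws = len(prob_draws)
    fit = 0.0
    penalty = 0.0
    for i, d in enumerate(observed):
        # Predictive mean of a replicate = posterior mean of the disease probability.
        d_hat = sum(draw[i] for draw in prob_draws) / n_draws
        fit += (d - d_hat) ** 2
        # A binary replicate has predictive variance mean - mean^2.
        penalty += d_hat - d_hat ** 2
    return k / (k + 1.0) * fit + penalty


def cohen_kappa(observed, predicted):
    """Chance-corrected agreement between two 0/1 classifications.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(observed)
    p_agree = sum(1 for o, p in zip(observed, predicted) if o == p) / n
    share_obs = sum(observed) / n
    share_pred = sum(predicted) / n
    p_chance = share_obs * share_pred + (1 - share_obs) * (1 - share_pred)
    return (p_agree - p_chance) / (1 - p_chance)
```

Computing kappa once per posterior draw, with predicted statuses obtained by thresholding that draw's fitted probabilities, yields a posterior sample of kappa from which a mean and an interval can be read off.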

PAGE 48

The observed disease status (vis-a-vis case or control status) of a subject is obtained from the data set while the predicted disease status is calculated from the posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can calculate the quantity

$$\hat{p}_i^{(n)} = \hat{P}^{(n)}(D_i = 1 \mid X_i(t+ad_i),\, t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \gamma + b_i' Q_i \gamma),$$

where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign

$$\hat{D}_i^{(n)} = \begin{cases} 1 & \text{if } \hat{p}_i^{(n)} > 0.5, \\ 0 & \text{if } \hat{p}_i^{(n)} \le 0.5. \end{cases}$$

Hampel et al. 1987). These diagnostics can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in

PAGE 49

Let $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$ and $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\eta}(a_{ij})'b_i$. Let $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ be the density function corresponding to the trajectory model and $L(D_i \mid H_i)$ be the one for the disease model. We worked with the following three types of weighting schemes based on those proposed by Bradlow and Zaslavsky (1997).

Here $n$ denotes the $n$th iteration of the Gibbs sampler, the subscript $i$ denotes the deletion of $y_i$ and the superscript denotes unnormalized weights. In the last weighting scheme, $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ and $L(D_i \mid H_i)$ are the usual likelihoods with the population-level parameters, i.e. $(\alpha, \beta, \gamma, \sigma_e^2)$, replaced by the full data posterior medians. Here the full data posterior is the posterior distribution obtained from the complete data set, i.e. the one having all the subjects.

We applied the methodology described in Section 2.2 to analyze the prostate cancer data set described in Section 2.1.2. Multiple observations on free and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some subjects, observations were collected as far as 10 years prior to diagnosis. We use the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal

PAGE 50

For the purpose of our analysis, we have used a linear p-spline ($p = 1$) with a subject-specific slope parameter to model the exposure trajectory. For the prospective disease model (2), we considered two specific scenarios, viz. constant influence, $\gamma(t+ad_i) = \gamma_0$, and linear influence, $\gamma(t+ad_i) = \gamma_0 + \gamma_1 (t+ad_i)$. The results for these two cases are summarized below.

On fitting the above model, we observed that for all trajectory lengths, $\gamma_0$ is significant (its 95% credible interval does not contain 0). For any particular interval (i.e. choice of $c$), the posterior means and 95% credible intervals of $\gamma_0$ do not change much with the number of knots ($K$). In addition, $\gamma_0$ increases as the trajectory length decreases, i.e. as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process, but it also seems to support the well-known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al. 1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval $I = (-10, -5)$ and 3 knots in the trajectory. The posterior mean of $\gamma_0$ is 0.298

PAGE 51

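2.6.3

Parameterizing $\{Z_i(t+ad_i), -c \le t \le 0\}$ as $\Phi_{p,\tau}(t+ad_i)'\beta^{*} + \Phi_{q,\eta}(t+ad_i)'d_i$, as in (2), we can rewrite (2) as

$$\exp\left\{(\beta^{*}-\beta)'\int_{-c}^{0}\Phi_{p,\tau}(t+ad_i)\,\Phi_{r,\tau}(t+ad_i)'\,dt\;\gamma\right\}\exp\left\{(d_i-b_i)'\int_{-c}^{0}\Phi_{q,\eta}(t+ad_i)\,\Phi_{r,\tau}(t+ad_i)'\,dt\;\gamma\right\}.$$

For a vertical shift of the exposure trajectory by $m$ under the linear influence model, this reduces to

$$\exp\left\{m\int_{-c}^{0}\gamma(t+ad_i)\,dt\right\} = \exp\left\{c\,m\,(\gamma_0 + (ad_i - c/2)\,\gamma_1)\right\}. \qquad (2)$$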

PAGE 52

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and ages at diagnosis when $m = 0.5$. For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al. 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e. as past exposure observations are taken into account. However, the rate of increase is higher for lower age at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios decrease as longer exposure trajectories are considered. This may imply that for a subject with a very high age at diagnosis, his/her past exposure observations may not contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age   (-3, 0)   (-5, 0)   (-8, 0)   (-10, 0)

As before, we fitted the disease model on the interval $I = (-10, -5)$. The posterior means and 95% credible intervals of $\gamma_0$ and $\gamma_1$ are respectively 1.24 (0.29, 2.19) and -0.015 (-0.029, 0.003), implying that exposure observations recorded 5-10 years prior to diagnosis also have a significant effect on the current disease status. The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.
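The odds ratio for a vertical shift of the exposure trajectory under a linear influence function has a simple closed form, because the integral of a linear function over a window equals the window length times the function's value at the window midpoint. The sketch below evaluates it for an arbitrary window relative to diagnosis. The function and argument names are hypothetical, and evaluating it at posterior means gives a plug-in value that will generally differ from the posterior mean of the odds ratio reported in the tables.

```python
import math


def shift_odds_ratio(m, influence_intercept, influence_slope, age_dx, start, end):
    """Odds ratio for shifting the exposure trajectory vertically by m.

    The influence function is influence_intercept + influence_slope * age, and the
    exposure window is start <= t <= end in years relative to diagnosis (both <= 0).
    age_dx is the age at diagnosis.
    """
    length = end - start
    midpoint_age = age_dx + (start + end) / 2.0
    return math.exp(m * length * (influence_intercept + influence_slope * midpoint_age))
```

For a window reaching from c years before diagnosis up to diagnosis (`start = -c`, `end = 0`), the expression reduces to exp{c m (intercept + (age_dx - c/2) slope)}. With a negative slope the value decreases as age at diagnosis increases, which is the qualitative pattern described in the text.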

PAGE 53

Table 2-2. Posterior means and 95% credible intervals of the odds ratio for I = (-10, -5) for the linear influence model

Age at diagnosis   50              60             70             80
Mean               4.99            3.27           2.22           1.56
95% C.I.           (1.96, 10.41)   (1.91, 5.36)   (1.67, 2.98)   (1.10, 2.29)

Table 2-3 reports the posterior predictive loss (PPL) values. The PPL values for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across all trajectory lengths. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. A further increase in the number of knots tends to worsen the fit; this agrees with the findings of Ruppert (2002). The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. The best-fitting constant influence model seems to be the one with exposure trajectory $(-10, 0)$ and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures results in an improvement of model fit. This may be indicative of the fact that past exposure observations contain a significant amount of information about the current disease status. In addition, for the trajectory interval $I = (-10, -5)$, the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions

PAGE 54

reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (it has the lowest PPL criterion among all the models considered).

Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots

Knots   Model   (-2, 0)   (-5, 0)   (-8, 0)   (-10, 0)   (-10, -5)

For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis. We deleted each subject (with all the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case-deleted posterior means and 95% credible intervals for $\beta_1$, $\gamma_0$ and $\gamma_1$. (In

PAGE 55

Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case-deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences.) Surprisingly, the observation for case number 108 departs significantly from the rest. On analyzing this subject, it was found that this subject had the unique combination of very high age and very high values of PSA: the highest mean age in the sample, the highest age at diagnosis and the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease.

We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components. None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory $I = (-8, 0)$ and 4 knots seems to fit the observed data relatively well.

PAGE 56

Sensitivity of $\beta_1$, $\gamma_0$, $\gamma_1$ and disease probability estimates to case deletions.

PAGE 57

In this work, we have applied semiparametric regression techniques in analyzing longitudinal case-control studies. We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e. when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a p-spline to account for any time-varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way, which allowed us to consider different lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the logit-mixture of normals approximation (Albert and Chib 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the data set was collected retrospectively. Analysis has been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case-control

PAGE 58

We analyzed our model using different lengths of exposure trajectories. In doing so, we have concluded that past exposure observations do provide significant information towards predicting the current disease status of a subject. Specifically, we have shown that for most age-at-diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones. We also observed that for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh 1998). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case deletion diagnostics. Both these tools suggested that our model fits the data relatively well.

Some interesting extensions can be made to our setup. For richer data sets, it will be interesting to model the subject-specific deviation functions as p-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger data set, a p-spline formulation can also be used for the influence function, which may bring out any underlying non-linear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multi-category disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

PAGE 59


PAGE 60

The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from the ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS).

Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999. This is because the most recent census estimates correspond to the year 1999 and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 uses as covariates the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the


Bell 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, in both the 1989 and 1999 models.

Fay (1987) in this regard. Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time-specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four- and five-person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates. In general, the time series model always performed better than its non-time series counterpart.


Opsomer et al. (2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines. In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have viewed the state-specific annual household median income values as longitudinal profiles or income trajectories. This gained more ground because we used the statewide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying non-linear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx 1996), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is


Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).


Gelfand and Ghosh 1998) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on. We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the gold standard since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state level models. The fact that our estimates still improve on the SAIPE model based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data


Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999 (panel B: IRS median income plot).

analysis with regard to the median household income dataset. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness-of-fit of our models. We end with a discussion in Section 3.6. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2.1 General Notation


where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate relationship.

We approximate $f(x_{ij})$ using a P-spline and rewrite (3) as

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$ is our target of inference.

Here $X_{ij} = (1, x_{ij}, \ldots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \ldots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \ldots, \beta_p)'$ is the vector of regression coefficients while $\gamma = (\gamma_1, \ldots, \gamma_K)'$ is the vector of spline coefficients. The above spline model with degree $p$ can adequately approximate any unspecified smooth function. Typically, linear ($p = 1$) or quadratic ($p = 2$) splines serve most practical purposes since they ensure adequate smoothness in the fitted curve. $m$ and $t$ respectively denote the number of small areas and the number of time points at which the response and covariates are measured. Thus, in our case, $m = 51$, for all the 50 states of the U.S. and the District of Columbia, and $t = 5$ for the years 1995-1999. $b_i$ is a state-specific random effect while $u_{ij}$ represents an interaction effect between the $i$th state and the $j$th year. We assume $b_i \overset{iid}{\sim} N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls the amount of smoothing of the underlying income trajectory. Moreover, it is assumed


3.1.1. In the datasets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

From (3) and (3), we have

3) and modeled it as a random walk as follows

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Before proceeding to the next section, we may note that unlike the models of Ghosh et al. (1996), the models given in (3) and (3) incorporate state-specific random effects ($b_i$). This rectifies a limitation of the former, as pointed out in Rao (2003).
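The truncated polynomial basis and the quantile-based knot placement described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the analysis: the function names `quantile_knots` and `truncated_power_basis` are hypothetical, and the quantile rule (the $k$th knot at the $k/(K+1)$ sample quantile, with linear interpolation between order statistics) is an assumed convention for "equally spaced sample quantiles".

```python
def quantile_knots(x, num_knots):
    """Place knots at the k/(K+1) sample quantiles of x, k = 1..K.

    Assumed convention: linear interpolation between order statistics.
    """
    xs = sorted(x)
    n = len(xs)
    knots = []
    for k in range(1, num_knots + 1):
        pos = (n - 1) * k / (num_knots + 1)  # fractional index into xs
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        knots.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return knots


def truncated_power_basis(x, knots, degree=1):
    """Return (X, Z) for a scalar covariate value x.

    X = (1, x, ..., x^p) is the polynomial part and
    Z = ((x - tau_1)_+^p, ..., (x - tau_K)_+^p) is the spline part,
    for degree p >= 1.
    """
    X = [float(x) ** d for d in range(degree + 1)]
    Z = [max(x - tau, 0.0) ** degree for tau in knots]
    return X, Z


if __name__ == "__main__":
    covariate = list(range(11))          # toy covariate values 0..10
    taus = quantile_knots(covariate, 4)  # four interior knots
    print(taus)
    print(truncated_power_basis(5.0, taus, degree=1))
```

Each row of the design matrices is obtained by calling `truncated_power_basis` on one covariate value; the spline columns are zero to the left of their knot, which is why knots placed only where the data are dense cannot pick up curvature elsewhere.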


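3.3.1 Likelihood Function

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$ while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal distributions with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.

For the random walk model, the parameter space for the $i$th state would be $(\theta_i, \beta, \gamma, b_i, v, \psi^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$ where $v = (v_1, \ldots, v_t)$ is the vector of time-specific random effects. Thus, the likelihood function for the $i$th state will have an extra component corresponding to $v$ as follows

where $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal distribution with mean $v_{j-1}$ and variance $\sigma_v^2$, with $v_0 = 0$.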


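Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$, $(\sigma_b^2)^{-1} \sim G(c, d)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the expression $f(x) \propto x^{a-1}\exp(-bx)$, $x \ge 0$. Since we have chosen an improper prior for $\beta$, propriety of the full posterior has been shown. We have the following theorem

For the random walk model, there will be an additional term $\pi(\sigma_v^2)$. By the conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2]\,[\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_\gamma^2]\,[\sigma_b^2]\prod_{j=1}^{t}[\psi_j^2]$$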


Gelman and Rubin (1992) and run $n$ ($\ge 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

3.2.2 to analyze the median household income dataset referred to in Section 3.1.3. The response variable $Y_{ij}$ and the covariate $X_{ij}$ respectively denote the CPS median household income estimate and the corresponding IRS mean (or median) income estimate for the $i$th state at the $j$th year ($i = 1, \ldots, 51$; $j = 1, \ldots, 5$). The state-specific mean or median income figures are obtained from IRS tax return data. The Census Bureau gets files of individual tax return data from the IRS for use in specifically approved projects such as SAIPE. For each state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI) across all the tax returns in that state. Like other SAIPE model covariates obtained from administrative records data, these variables do not exactly measure the median income across all households in the state. One of the reasons for this is that the AGI would not necessarily be the same as the exact income figure, and the tax return universe does not cover the entire population, i.e. some households do not need to file tax returns, and those that do not are likely to differ in income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not necessarily to be the same thing. Specifically for this study, we have used IRS mean income as our covariate. This is because it seems to possess


3-2A), and so it is more suited to a semiparametric analysis.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 3.2.2. We have used the truncated polynomial basis for the P-spline component in both the models. Since Figure 3-2A does not indicate a high degree of non-linearity, we have restricted


Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and with a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.
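For readers unfamiliar with the Gelman-Rubin diagnostic mentioned above, the following Python sketch computes the potential scale reduction factor for a single scalar parameter from several parallel chains. It is a minimal illustration of the standard between-chain/within-chain variance comparison, assuming the burn-in draws have already been discarded; the function name `gelman_rubin` and the toy chains are hypothetical and are not taken from the analysis.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of post burn-in draws,
            one list per parallel chain (at least two chains).
    """
    n = len(chains[0])                        # draws kept per chain
    chain_means = [mean(c) for c in chains]
    W = mean([variance(c) for c in chains])   # within-chain variance
    B = n * variance(chain_means)             # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return math.sqrt(var_plus / W)


if __name__ == "__main__":
    mixed = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 2.0], [2.0, 3.0, 1.0, 4.0]]
    stuck = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
    print(gelman_rubin(mixed))   # near 1: chains agree
    print(gelman_rubin(stuck))   # far above 1: chains have not mixed
```

Values close to 1 suggest that the chains are sampling from the same distribution, while values well above 1 indicate that the chains still reflect their overdispersed starting points.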


We fitted Model I (SPM) with all possible knot choices from 0 to 40 but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all the four comparison measures. Addition of more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model based estimates were slightly superior to the SPM estimates.

Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory and the random walk component does not lead to any further improvement. Last but not the least, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots.

It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit on the CPS median income observations. The same is true for the coefficients of SPM.


Parameter estimates of SPRWM with 5 knots

Parameter   Mean   Median   95% CI

3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the fact that the semiparametric models have room for improvement. In this section, we will look for any possible deficiencies in our models and will try to come up with some improvements, if there are any.

As mentioned in Section 3.4.1, selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.

Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both the cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both the figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure. Thus, in both the figures, the coverage area of knots (i.e. the part of the observational pattern which is captured by the knots) is the


Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income (panel B: positioning of 7 knots). The knots are depicted as the boldfaced triangles at the bottom.

region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low density area of the plot, i.e. the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both the cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured.

As a natural solution to this issue, we decided to place half of the knots in the low density region of the graph and the other half in the high density region. The exact boundary line between the high density and low density regions is hard to determine. We tested different alternatives and came up with IRS mean = 47000 as a tentative boundary because it gave the best results. In both the regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots.

It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot


Positions of 5 knots after realignment. The knots are the boldfaced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.

rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e. beyond IRS mean 56000), possibly due to the very low density of the observations in that area. Overall, it seems that the new knots can capture some of the underlying non-linear pattern in the dataset which the old knots failed to capture. We also experimented by placing all the knots in the low density region (beyond IRS mean = 47000) but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.

We have worked with 5 knots because this choice performed consistently well for both the SPM and SPRW models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports


3-3 depicts the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5) and SPRWM(5) respectively denote the semiparametric models with the realigned 5 knots.

Table 3-2. Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0415   0.0027   1,753.33   5,300,023
SAIPE       0.0326   0.0015   1,423.75   3,134,906
SPM(5)      0.0280   0.0012   1,173.71   2,334,379
SPRWM(5)    0.0295   0.0013   1,256.08   2,747,010

Table 3-3. Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates

Estimate   Model       ARB      ASRB     AAB      ASD
SAIPE      SPM(5)      14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)     9.51%   13.33%   11.78%   12.37%
CPS        SPM(5)      32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)    28.92%   51.85%   28.36%   48.17%

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially so for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that the semiparametric model estimates perform better than the SAIPE estimates with the realigned knots. This improvement is apparently due to the additional coverage of the observational pattern that is being achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying non-linear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment


3-4 and Table 3-5 report the posterior mean, median and 95% CI for the parameters in SPM(5) and SPRWM(5) respectively.

Table 3-4. Parameter estimates of SPM(5)

Table 3-5. Parameter estimates of SPRWM(5)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e. the $\gamma$'s) are significant for both SPM and SPRWM. For the old configuration, some of the knot coefficients were not significant for the models. This corroborates the fact that, with the knot realignment, all five knots are significantly contributing to the curve fitting process in terms of capturing the true underlying non-linear pattern in the observations.

Ghosh et al. (1996), henceforth referred to as the GNK model. Their univariate model is as follows

where $(b_j \mid b_{j-1}) \sim N(b_{j-1}, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.


where $b_i \overset{iid}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distributions as above. Clearly, the only difference between (3) and (3) is that the former contains a time-specific random component while the latter contains an area-specific random component.

Ghosh et al. (1996) showed that the estimates from the bivariate version of the GNK model (3) perform much better than the Census Bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 depicts the comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0415   0.0027   1,753.33   5,300,023
SAIPE       0.0326   0.0015   1,423.75   3,134,906
GNK         0.0397   0.0025   1,709.58   5,229,869
SPM(0)      0.0337   0.0017   1,408.70   3,137,978
SPM(5)      0.0280   0.0012   1,173.71   2,334,379
SPRWM(5)    0.0295   0.0013   1,256.08   2,747,010

It is clear that, although the estimates from the GNK model perform slightly better than the CPS, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state-specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area-specific random effect is much more critical than a time-specific random component in this situation.
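The percentage improvements quoted for the semiparametric estimates follow directly from the comparison measures tabulated above, as the relative reduction of each measure with respect to a baseline estimate. The short Python sketch below reproduces that arithmetic from the tabulated values; the dictionary layout and the function name `pct_improvement` are illustrative choices, not part of the original analysis.

```python
# Comparison measures copied from the table above (ARB, ASRB, AAB, ASD).
measures = {
    "CPS":      {"ARB": 0.0415, "ASRB": 0.0027, "AAB": 1753.33, "ASD": 5300023},
    "SAIPE":    {"ARB": 0.0326, "ASRB": 0.0015, "AAB": 1423.75, "ASD": 3134906},
    "SPM(5)":   {"ARB": 0.0280, "ASRB": 0.0012, "AAB": 1173.71, "ASD": 2334379},
    "SPRWM(5)": {"ARB": 0.0295, "ASRB": 0.0013, "AAB": 1256.08, "ASD": 2747010},
}


def pct_improvement(baseline, estimate):
    """Relative reduction (in percent) of a comparison measure."""
    return 100.0 * (baseline - estimate) / baseline


if __name__ == "__main__":
    for base in ("SAIPE", "CPS"):
        for model in ("SPM(5)", "SPRWM(5)"):
            row = {
                name: round(pct_improvement(measures[base][name],
                                            measures[model][name]), 2)
                for name in ("ARB", "ASRB", "AAB", "ASD")
            }
            print(base, model, row)
```

Because all four measures are "smaller is better", a positive percentage means the model estimate is closer to the census value than the baseline on that measure.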


Johnson (2004). This is essentially an extension of the classical chi-square goodness-of-fit test where the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained which can be used for constructing global goodness-of-fit diagnostics.

To construct this statistic, we form 10 equally spaced bins $((k-1)/10, k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities, $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde{\theta})$ to be random, where $\tilde{\theta}$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e. $Y_{ij}$ would be allocated to the $k$th bin if $F(Y_{ij} \mid \tilde{\theta}) \in ((k-1)/10, k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

$$R^B(\tilde{\theta}) = \sum_{k=1}^{10}\left[\frac{m_k(\tilde{\theta}) - n p_k}{\sqrt{n p_k}}\right]^2$$

The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The


Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models (panel B: semiparametric RW model). The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom.

second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between those. Thus, instead of taking all the observations (i.e. the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5 while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both the plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values,


Fay and Herriot 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an income trajectory using penalized splines (Eilers and Marx 1996). We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. We have used as our covariate the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. Analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures.

Our analysis has shown that information on past median income levels of different states does provide strength towards the estimation of state-specific median incomes for the current period. In fact, if there is an underlying non-linear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use it in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting. The


The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the dataset. Some obvious extensions are given as follows: (1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible


Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2}x$, and a non-linear one, $\sum_{k=1}^{K} w_{ik}(x - \tau_k)_+$, thus allowing for more flexibility. Both these components are random with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly relevant in situations where the state-specific income trajectories are quite distinct from the population mean curve and thus need to be modeled explicitly. We plan to pursue this extension if we can procure a richer dataset with longer state-specific income trajectories. (2) Sometimes the function to be estimated (here the median income pattern) may have varying degrees of smoothness in different regions. In that case, a single smoothing parameter may not be appropriate and a spatially adaptive smoothing procedure can be used (Ruppert and Carroll 2000). (3) We used the truncated polynomial basis function to model the income trajectory but other types of bases like B-splines, radial basis functions, etc. can also be used. (4) Although we used a parametric normal distributional assumption for the random state and time-specific effects, a broader class of distributions like the mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) may be tested.

Last but not the least, we think that the semiparametric modeling approach holds a lot of promise for small domain problems, especially when observations for each domain are collected over time. The associated class of semiparametric models can well be an attractive alternative to the models generally employed by the U.S. Census Bureau.


The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low income families. Since income characteristics for small areas are generally collected over time, there may well be a time varying pattern in those observations. Neglecting those patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median income of four-person families for the different states of the U.S. while explicitly accommodating the time varying pattern in the observations.


In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state-specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e. 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce. Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993).

The Census Bureau based their estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, they used median income observations for three- and five-person families in addition to those of four-person families. The basic dataset for each state was a bivariate random vector with one component being the CPS median income estimates of four-person families and the other component being the weighted average of CPS median incomes of three- and five-person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year


$$\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)$$

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay (1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure. They also performed a univariate and a multivariate hierarchical Bayesian analysis of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard error and coefficient of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median income


Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines (Eilers and Marx 1996). In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. They applied their model to a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.


Ghosh et al. (1996), we have treated the state-specific median income observations as longitudinal profiles or income trajectories. As with any longitudinally varying observations, the income profiles (both state-specific and overall) may have a non-linear pattern over time. Moreover, the successive income observations may be unbalanced in nature. These features motivated us to use a semiparametric regression approach in our modeling framework. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and number of knots, although other types of basis functions like B-splines or thin plate splines can also be used. As covariates, we have used the adjusted census median incomes since these were found to be the most effective covariates by Ghosh et al. (1996). We tested four different regression models, viz. (1) a univariate model with only the CPS median income of four-person families as the response variable; (2) a bivariate model with the CPS median incomes of three- and four-person families as the response variables; (3) a bivariate model with the CPS median incomes of four- and five-person families as the response variables; and lastly (4) a bivariate model with the CPS median incomes of four-person families and the weighted average of the CPS median incomes of three- and five-person families (with weights 0.75 and 0.25) as the response variables. In all the cases, our primary objective has been the estimation of median incomes of four-person families of all the 50 U.S. states and the District of Columbia for 1989. For each of these models, analysis has been carried out using a hierarchical Bayesian approach. Since we chose non-informative improper priors for the regression parameters, propriety of the posterior has been rigorously proved before proceeding with the computations (see Theorem 3 in


Gelfand and Smith 1990) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states. Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time.

The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some references towards future work in Section 4.5. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions for our models.

4.2.1 Notation


3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2

This is the most general structure since the degrees of the spline as well as the number and position of the knots are different for the two models. If for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\tau_{k1} = \tau_{k2}$, $k = 1, 2, \ldots, K_1 (= K_2)$.

Equation (4) can be rewritten as


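4) as follows

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{iid}{\sim} N(0, \Sigma_v)$.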
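The random walk structure for the time-specific effects, in which each effect equals the previous one plus an independent zero-mean normal increment and the walk starts from zero, can be illustrated with a short simulation. The Python sketch below is only an illustration under stated assumptions: it treats the bivariate case with a hypothetical 2 x 2 increment covariance matrix, the function name `simulate_bivariate_random_walk` is invented for this example, and the correlated increments are generated from independent standard normals through a hand-written Cholesky factor.

```python
import math
import random


def simulate_bivariate_random_walk(t, cov, rng):
    """Simulate v_1, ..., v_t with v_j = v_{j-1} + w_j and v_0 = (0, 0).

    cov: 2 x 2 covariance matrix of the increments w_j (assumed
         symmetric and positive definite), given as nested lists.
    rng: a random.Random instance, so that runs are reproducible.
    Returns (path, increments), each a list of t pairs.
    """
    (a, b), (_, c) = cov
    # Cholesky factor of cov: [[l11, 0], [l21, l22]]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)

    v_prev = (0.0, 0.0)
    path, increments = [], []
    for _ in range(t):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w = (l11 * z1, l21 * z1 + l22 * z2)      # w_j ~ N(0, cov)
        v_prev = (v_prev[0] + w[0], v_prev[1] + w[1])
        increments.append(w)
        path.append(v_prev)
    return path, increments


if __name__ == "__main__":
    cov = [[1.0, 0.5], [0.5, 2.0]]   # hypothetical increment covariance
    path, _ = simulate_bivariate_random_walk(11, cov, random.Random(0))
    for j, v in enumerate(path, start=1):
        print(j, v)
```

Because every effect is the running sum of the increments, the variance of the effect grows with the time index, which is what lets this component absorb a gradual common drift across years.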

PAGE 93

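3 Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance-covariance matrix $\Sigma$.

For the bivariate random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Sigma_1, \ldots, \Sigma_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$ where $v = (v_1', \ldots, v_t')'$ is the vector of time-specific random effects. The hierarchical Bayesian framework is given by

1.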

PAGE 94

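4) will have an extra component corresponding to $v$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which has a normal distribution with mean $v_{j-1}$ and covariance matrix $\Sigma_v$.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Sigma_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$, $\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$ denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$, having the expression $f(X) \propto |X|^{-(b+p+1)/2} \exp(-\mathrm{tr}(A X^{-1})/2)$, $p$ being the order of $A$.

For the random walk model there will be an additional term $\pi(\Sigma_v)$. By conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Sigma_1, \ldots, \Sigma_t\} \mid Y, U, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Sigma_1, \ldots, \Sigma_t\}, X, Z]\,[b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_\gamma]\,[\Sigma_0] \prod_{j=1}^{t} [\Sigma_j]$$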

PAGE 95

Gelman and Rubin (1992) and run $n$ ($\geq 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

Once posterior samples are generated from the full conditionals of the parameters, Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$

and 4.2.2 to analyze the median income dataset referred to in Section 4.1.3. The basic dataset for our problem is the triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance-covariance matrix $\Sigma_{ij}$ $(i = 1, \ldots, 51;\ j = 1, \ldots, 11)$. Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of
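The convergence check described above (several parallel chains of length 2d, first d draws discarded) can be sketched as follows. This is a minimal illustration of the basic Gelman-Rubin potential scale reduction factor for a scalar quantity, without the degrees-of-freedom correction; the function names are hypothetical and this is not the code used for the analysis.

```python
from statistics import mean, variance


def discard_burn_in(chain):
    """Drop the first half of a chain of length 2d, keeping the last d draws."""
    return chain[len(chain) // 2:]


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for n >= 2 chains of equal length d."""
    d = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = d * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (d - 1) / d * within + between / d         # pooled variance estimate
    return (pooled / within) ** 0.5
```

Values close to 1 suggest that the chains have mixed; values well above 1 indicate that the chains still disagree and that longer runs are needed.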

PAGE 96

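For the univariate setup, the response and covariates are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a duplet whose first component is $Y_{ij1}$ and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75\,Y_{ij2} + 0.25\,Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our targets of inference are the state-specific median incomes of four-person families for 1989.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are the average relative bias (ARB), the average squared relative bias (ASRB), the average absolute bias (AAB) and the average squared deviation (ASD).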
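The formulas for the four comparison measures did not survive in this copy of the text. The sketch below assumes the definitions commonly used in the small area literature, with $c_i$ the census value and $e_i$ the estimate for area $i$: ARB is the mean of $|c_i - e_i|/c_i$, ASRB the mean of $(c_i - e_i)^2/c_i^2$, AAB the mean of $|c_i - e_i|$ and ASD the mean of $(c_i - e_i)^2$. The function name and the toy numbers are hypothetical.

```python
def comparison_measures(census, estimates):
    """Return ARB, ASRB, AAB and ASD under the assumed standard definitions."""
    n = len(census)
    pairs = list(zip(census, estimates))
    return {
        "ARB": sum(abs(c - e) / c for c, e in pairs) / n,
        "ASRB": sum((c - e) ** 2 / c ** 2 for c, e in pairs) / n,
        "AAB": sum(abs(c - e) for c, e in pairs) / n,
        "ASD": sum((c - e) ** 2 for c, e in pairs) / n,
    }


# Toy illustration with two areas (not data from the study)
measures = comparison_measures([100.0, 200.0], [110.0, 180.0])
print(measures)
```

Lower values of all four measures indicate estimates that lie closer to the census benchmark.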

PAGE 97

The basic structure of our models would remain the same as in Section 4.2.2. We have used linear truncated polynomial basis functions for the P-spline component in our models since the median income profiles did not exhibit a high degree of non-linearity. For highly non-linear profiles a quadratic or cubic polynomial basis function representation can be used. In non-parametric regression problems, the proper selection of knots plays a critical role. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable so that the underlying observational pattern is properly captured. Too few or too many knots generally degrade the quality of the fit. This is because, if too few knots are used, the complete underlying pattern may not be captured properly, thus resulting in a biased fit. On the other hand, once there are enough knots to fit important features of the data, further increases in the number of knots have little effect on the fit and may lead to overparametrization (Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (adjusted census median income).

Gelman and Rubin (1992). We ran three parallel chains, with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate t-distributions with 2 df having the same location and scale matrices as the corresponding multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chain at overdispersed distributions. However, once initialized, the
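The knot placement and the linear truncated polynomial basis described above can be sketched as follows. The quantile convention used here (knot $k$ at the $k/(K+1)$ sample quantile, with linear interpolation between order statistics) is an assumption, since the text only states that equally spaced sample quantiles were used; the function names are hypothetical.

```python
def quantile_knots(x, num_knots):
    """Place num_knots knots at equally spaced sample quantiles of x."""
    xs = sorted(x)
    n = len(xs)
    knots = []
    for k in range(1, num_knots + 1):
        pos = k / (num_knots + 1) * (n - 1)  # fractional index into the sorted sample
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        knots.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return knots


def truncated_linear_basis(x, knots):
    """Rows [1, x, (x - k_1)_+, ..., (x - k_K)_+] of the spline design matrix."""
    return [[1.0, xi] + [max(xi - k, 0.0) for k in knots] for xi in x]


x = [float(v) for v in range(11)]
knots = quantile_knots(x, 3)
design = truncated_linear_basis(x, knots)
```

Each truncated term is zero to the left of its knot and linear to the right, so the fitted curve is piecewise linear with slope changes only at the knots.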

PAGE 98

We fitted both the univariate and bivariate models to the median income dataset. In doing so, we worked with all possible knot choices from 0 to 40. Here, we show only the results corresponding to the best performing model, i.e. the model with the lowest values of the comparison measures.

In the univariate framework, the model with 3 knots in the income trajectory performed the best. Table 4-1 reports the comparison measures for this model (denoted as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates (Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS) estimates. Table 4-2 reports the percentage improvement of the time series, non-time series and the semiparametric estimates over the Census Bureau estimates.

From Table 4-1, it is clear that the semiparametric estimates significantly improve upon the CPS, time series and non-time series estimates with respect to all the comparison measures. In fact, the semiparametric estimates perform slightly better than the bivariate Census Bureau estimates too with respect to ARB and AAB. This

PAGE 99

Table 4-1. Comparison measures for univariate estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0735   0.0084   2,928.82   13,811,122.39
Bureau      0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS      0.0338   0.0018   1,351.67    3,095,736.14
GNK.NTS     0.0363   0.0021   1,457.47    3,468,496.61
USPM(3)     0.0289   0.0014   1,169.74    2,549,698.26

Table 4-2. Percentage improvements of univariate estimates over Census Bureau estimates

Estimate    ARB       ASRB      AAB       ASD
GNK.TS      -14.19%   -38.46%   -14.17%   -43.90%
GNK.NTS     -22.64%   -61.54%   -23.11%   -61.22%
USPM(3)       2.37%    -7.69%     1.2%    -18.52%

is also reflected in Table 4-2 where the semiparametric estimates marginally improve upon the Bureau estimates for the above two comparison measures. Overall, the degree of dominance of the Bureau estimates over the time series and non-time series estimates is much larger than that over the semiparametric estimates. These results indicate that, in the univariate framework, the semiparametric model with 3 knots performs significantly better than the time series and non-time series models of Ghosh et al. (1996).

Now, we move on to the bivariate non-random walk setup. First, we consider the model whose response vector consists of the CPS median incomes of 4 and 3 person families, i.e. $(Y_{ij1}, Y_{ij2})$. The covariates are the corresponding adjusted census medians. Since we assumed inverse Wishart priors for the variance-covariance matrices, the values of the comparison measures were dependent on the degrees of freedom of the Wishart distribution and the number of knots in the income trajectory. We worked with different combinations of the two in fitting these models. The best results (lowest comparison measures) were obtained for two models, both with 6 knots but with degrees of freedom 7 and 9 respectively. These models are denoted by BSPM(1)(4,3) and BSPM(2)(4,3) respectively. When we consider the median incomes of 4 and 5 person

PAGE 100

Table 4-3. Comparison measures for bivariate non-random walk estimates

Estimate          ARB      ASRB     AAB        ASD
CPS               0.0735   0.0084   2,928.82   13,811,122.39
Bureau            0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS(4,3)       0.0295   0.0013   1,171.71    2,194,553.67
GNK.NTS(4,3)      0.0323   0.0016   1,287.78    2,610,249.94
BSPM(1)(4,3)      0.0274   0.0013   1,079.63    2,182,669.56
BSPM(2)(4,3)      0.0286   0.0011   1,131.61    1,880,089.29
GNK.TS(4,5)       0.0230   0.0009     932.51    1,618,025.33
GNK.NTS(4,5)      0.0295   0.0013   1,179.94    2,216,738.06
BSPM(4,5)         0.0255   0.0010   1,033.12    1,859,373.98
GNK.TS(4,3+5)     0.0287   0.0013   1,150.24    2,116,692.71
GNK.NTS(4,3+5)    0.0324   0.0015   1,297.12    2,530,938.06
BSPM(1)(4,3+5)    0.0271   0.0012   1,078.5     2,128,679.65
BSPM(2)(4,3+5)    0.0289   0.0012   1,132.10    1,838,598.30

families, the lowest comparison measures were obtained for the model with 4 knots in the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5).

Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for two models, both with 6 knots and with degrees of freedom 7 and 9 respectively. We denote these models as BSPM(1)(4,3+5) and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these models along with those of CPS, Bureau, and the corresponding bivariate GNK time series and non-time series estimates. Table 4-4 reports the percentage improvement of the above estimates over the Census Bureau estimates.

From Table 4-3 and Table 4-4, it is clear that both BSPM(4,3) and BSPM(4,3+5) estimates improve upon the bivariate time series and non-time series estimates with respect to nearly all of the four comparison measures. The semiparametric estimates also improve upon the Census Bureau estimates and the raw CPS estimates. For the model with median incomes of four and five person families as response, the semiparametric estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but significantly improve upon the CPS and Census Bureau estimates.

PAGE 101

Table 4-4. Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

Estimate          ARB      ASRB      AAB      ASD
GNK.TS(4,3)       -0.48%    -2.52%    1.03%    -2.01%
GNK.NTS(4,3)      -8.99%   -22.45%   -8.77%   -21.33%
BSPM(1)(4,3)       7.43%     0.00%    8.81%    -1.46%
BSPM(2)(4,3)       3.38%    15.38%    4.42%    12.61%
GNK.TS(4,5)       22.19%    30.52%   21.23%    24.79%
GNK.NTS(4,5)       0.31%    -0.18%    0.33%    -3.04%
BSPM(4,5)         13.85%    23.08%   12.74%    13.57%
GNK.TS(4,3+5)      2.94%     3.56%    2.84%     1.61%
GNK.NTS(4,3+5)    -9.36%   -17.18%   -9.56%   -17.64%
BSPM(1)(4,3+5)     8.45%     7.69%    8.90%     1.05%
BSPM(2)(4,3+5)     2.37%     7.69%    4.37%    14.54%

Now let us consider the bivariate random walk model. For the case with 4 and 3 person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3,6), (5,6) and (9,1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to improve marginally over those corresponding to the non-random walk semiparametric model. When we consider the median income estimates of 4 and 5 person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. The comparison measures are significantly better than those of the CPS, Bureau and the non-time series model of GNK. However, they fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the non-random walk model (BSPM(4,5)). We denote this model as BRWM(4,5). Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory. The comparison measures were significantly better than the CPS,

PAGE 102

Table 4-5. Comparison measures for bivariate random walk model

Estimate        ARB      ASRB     AAB        ASD
BRWM(1)(4,3)    0.0261   0.0011   1,043.33   1,902,416.1
BRWM(2)(4,3)    0.0274   0.0010   1,094.25   1,804,969.06
BRWM(3)(4,3)    0.0258   0.0012   1,037.03   2,114,599.65
BRWM(4,5)       0.0245   0.0010     978.12   1,672,183.6
BRWM(4,3+5)     0.0244   0.0011     990.50   1,941,833.29

Bureau and GNK (both time series and non-time series) while it also improved upon the non-random walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Estimation of median incomes of four-person families for different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3, 4 and 5 person families for all the states and the District of Columbia for every year. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.
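The percentage improvements reported in Tables 4-2 and 4-4 appear to be relative reductions of each comparison measure with respect to the Census Bureau value, i.e. 100 x (Bureau - estimate) / Bureau. This formula is an assumption inferred from the tabulated values rather than a definition stated in the text; the sketch below applies it to entries of Table 4-1, and the function name is hypothetical.

```python
def pct_improvement(bureau, estimate):
    """Percentage reduction of a comparison measure relative to the Bureau value."""
    return 100.0 * (bureau - estimate) / bureau


# Entries taken from Table 4-1 (Bureau value first, then the competing estimate)
arb_gnk_ts = pct_improvement(0.0296, 0.0338)          # GNK.TS, ARB
aab_uspm = pct_improvement(1183.90, 1169.74)          # USPM(3), AAB
asd_uspm = pct_improvement(2151350.18, 2549698.26)    # USPM(3), ASD
print(round(arb_gnk_ts, 2), round(aab_uspm, 2), round(asd_uspm, 2))
```

A positive value means the estimate has a smaller comparison measure than the Bureau estimate. Because the tables report rounded measures, recomputing from them may differ from the published percentages in the last digit.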

PAGE 103

Ghosh et al. (1996). We also extended the basic semiparametric framework by incorporating a time series (random walk) component to account for the within-state dependence in the successive income observations. The class of random walk models seemed to improve upon their non-random walk counterparts, but more studies are required before reaching a definite conclusion about their relative performance. Overall, we strongly think that semiparametric procedures hold a lot of promise for small area estimation problems, specifically in situations where multiple time-varying observations of some characteristic are available for the small areas.

PAGE 104

In my dissertation, I have concentrated on the application of semiparametric methodologies in analyzing unorthodox data scenarios originating in diverse fields like case control studies and small area estimation. In the former scenario, I have used penalized splines to model longitudinal exposure profiles and their influence pattern on the current disease status for a group of cases and controls. In doing so, I have come to the conclusion that past exposure observations may have a significant effect on the present disease status. Our modeling framework is quite general and flexible in the sense that it can be used to model any possible pattern of exposure profiles and it can also capture complex time-varying patterns of influence of the exposure history on the current disease status. We applied our modeling framework to a nested case control study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA). In the second scenario, we have used semiparametric procedures to model the income trajectories of different small areas and have used that information to estimate the median incomes of those small areas at a given time point in the future. Our model based estimates seemed to perform better than the usual Bureau of the Census estimates, which are based on the income observations from a particular time point and hence are non-longitudinal in nature. We have also extended the semiparametric modeling framework to the bivariate scenario in estimating the median income of varying family sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above.

The flexibility and power of the nonparametric and semiparametric procedures immediately imply that a multitude of interesting and useful extensions can be carried

PAGE 105

1.4, selection and proper positioning of knots is a vital aspect in any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variables and that is what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low density region of the data pattern where the non-linearity was mostly concentrated. This was probably because of the quantile dependent placement procedure of the knots.

Recently, there has been some research on data-driven or adaptive knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being pre-specified. The advantage of this procedure is that fewer knots would be required, and these would be placed in optimal locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the best model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatment of this problem revolves around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include

PAGE 106

(1998) who placed priors on the number and locations of the knots. Then they sampled from the full posteriors of the parameters (including knot locations and numbers) using reversible jump MCMC methods (Green 1995). However, they restricted the knots to be located only at the design points of the independent variable. DiMatteo et al. (2001) followed the same basic procedure as Denison et al. (1998) but they did not restrict the knots to be located only at the design points of the experiment. They also penalized models with an unnecessarily large number of knots. Botts and Daniels (2008) proposed a flexible approach for fitting multiple curves to sparse functional data. In doing so, they treated the numbers and locations of knots of the population averaged and subject specific curves as distinct random variables and sampled from their posterior distributions using reversible jump MCMC methods. They used free-knot B-splines to model the population averaged and subject specific curves. In all the above contributions, Poisson priors are placed on the knot numbers while flat priors are placed on the knot positions. The usefulness and flexibility of the Bayesian approach lies in the fact that the number and locations of knots are automatically determined from the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive Regression Splines. However, the sampling procedure is quite intensive since the parameter dimension varies at every iteration. Botts and Daniels substantially reduced the computational burden by dealing with the approximate posterior distribution of only the number and positions of the knots, integrating out the other parameters by using Laplace transformations.

An immediate but worthwhile extension to what we have already done would be to incorporate an adaptive knot selection scheme into both the case control and small area modeling frameworks. For the former setup, this would correspond to deciphering the optimal number of knots for the population mean PSA trajectory and the influence function. So, depending on the particular study or the dataset at hand, any underlying pattern in the influence profile (of the exposure trajectory on the disease state) can be

PAGE 107

Some other interesting extensions to our work can be

1. Incorporating informative (non-ignorable) missingness (Little and Rubin 1987) in the longitudinal exposure (case control) or income (small area) profiles.
2. Incorporating non-parametric distributional structures like mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) on the subject (or area) specific random effects.
3. Extending the semi-parametric case control modeling framework to situations involving multiple (> 2) or even categorical disease states.

Now, I briefly explain some work that we are currently engaged in.

5.2.1 Introduction and Brief Literature Review

Little and Rubin 1987). Broadly these are of three types viz: 1. 2. 3.

PAGE 108

Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response. In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns. On the other hand, the selection modelling approach first models the hypothetical complete data and then a model for the missing data process (conditional on the hypothetical complete data) is appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach.

Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the $i$th subject. $D_i$ is such that

$$D_i = \begin{cases} t & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ T+1 & \text{if the } i\text{th subject is a completer.} \end{cases}$$

So, for the $i$th subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on $f(y) = \sum_{D} f(y \mid D) P(D)$.

An important and realistic situation that may arise in longitudinal studies is that the number of unique dropout times $T$ (vis-a-vis the number of times a subject is measured) may be large. As a result the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse
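The pattern mixture identity $f(y) = \sum_D f(y \mid D) P(D)$ stated above implies that marginal summaries are pattern-specific summaries weighted by the dropout probabilities. A minimal sketch for a marginal mean follows; the function name and the numbers are hypothetical illustrations, not estimates from any study.

```python
def marginal_mean(pattern_means, pattern_probs):
    """E[Y] = sum over dropout patterns D of E[Y | D] * P(D)."""
    total_prob = sum(pattern_probs.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("dropout pattern probabilities must sum to 1")
    return sum(pattern_means[d] * pattern_probs[d] for d in pattern_probs)


# Hypothetical example with T = 3: dropout at times 2 or 3, or completer (T + 1 = 4)
means = {2: 0.2, 3: 0.4, 4: 0.6}
probs = {2: 0.25, 3: 0.25, 4: 0.5}
print(marginal_mean(means, probs))
```

When many dropout times are possible, some of the pattern probabilities and pattern-specific means rest on very few subjects, which is the sparsity problem that motivates grouping the dropout times.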

PAGE 109

Hogan and Laird (1998) suggested parameters to be shared across patterns. Hogan et al. (2004) suggested ways to group the $T$ dropout times into $m$
PAGE 110

Diggle et al. 2002) used to capture the serial dependence in the response process.

There exists another class of models known as marginalized latent variable models which take care of the exchangeable or non-diminishing dependence pattern among the repeated response observations using random intercepts. Schildcrout and Heagerty (2007) combined the marginalized transition and latent variable models by proposing a unifying model that takes into account both serial and long range dependence among the response observations. Their model can be used in situations with a moderate to large number of repeated measurements per subject where both serial (short range) and exchangeable (long range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002), Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new model which accounts for both serial (short term) and long-range dependence among the response observations in situations where the number of unique dropout times is large. We group the dropout times using a latent variable approach, taking into account the uncertainty in the number of groups. We also model the marginal covariate effects of interest.

PAGE 111

Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is specified as

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ the following conditional model is specified

where $b_i$ is a mean-zero normal random effect. $\Delta_{it}$ can be computed by solving the following convolution equation

Thus $\Delta_{it}$ is a function of the marginal regression parameters and the random effects variance. In this study we will be proposing a model which will marginalize over the random effects and the drop-out distribution to directly model the marginal covariate effects of interest, taking into account both the serial and exchangeable dependence structure among the $Y_{it}$'s.

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \ldots, m$ $(m < T)$, i.e. $S_{ij} = 1$ if the $i$th subject is in class $j$ and $S_{ij} = 0$ otherwise.

PAGE 112

1. Dependence between response and dropout time modelled by the latent classes.
2. Short range (serial) dependence between $Y_{it}$ and $(Y_{it-1}, \ldots, Y_{it-p})$ modelled by an MTM(p).
3. Long range or non-diminishing dependence among the $Y_{it}$'s modelled by the subject specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

The above model marginalizes over the subject specific random effects and over the latent class distribution (implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S$. This follows because the specific latent class a subject belongs to depends solely on his/her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D$ itself.

Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e. effects conditional on the random effects) or class-specific covariate effects (i.e. effects of covariates on $Y$ given a particular dropout class). Moreover, the conditional model should be so specified that it is compatible with the marginal model (5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model

PAGE 113

We assume that $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, are from an exponential family with distribution

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\alpha_{jk}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, p$, where $V_{it}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $(\alpha_{1k}, \alpha_{2k}, \ldots, \alpha_{mk})$ determines how the dependence between $Y_{it}$ and $Y_{it-k}$ varies as a function of the covariates $V_{it,k}$ conditional on the latent classes. We also impose a sum-to-zero constraint on the class-specific coefficients for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\beta^{(j)}$.

The probabilities of the latent classes given the drop-out times are specified as a proportional odds model (Agresti 2002) given by

where $\lambda_{0,1} \leq \lambda_{0,2} \leq \ldots \leq \lambda_{0,M-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).
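The proportional odds specification for the latent classes can be illustrated numerically. The sketch below assumes the usual cumulative logit form, $P(S_i \leq j \mid D_i) = \mathrm{expit}(\lambda_{0,j} + \lambda_1 D_i)$ for $j = 1, \ldots, M-1$, with class probabilities obtained as successive differences; the function names, the sign convention and the parameter values are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def class_probabilities(cutpoints, slope, dropout_time):
    """Return [P(S = 1 | D), ..., P(S = M | D)] from ordered cutpoints."""
    # Cumulative probabilities P(S <= j | D); the last class closes the sum at 1.
    cumulative = [expit(c + slope * dropout_time) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


# Hypothetical example: M = 3 classes, dropout at time 3
print(class_probabilities([-1.0, 0.5], 0.2, 3))
```

The ordering of the cutpoints guarantees non-negative class probabilities, and the single slope makes every cumulative logit shift by the same amount per unit of dropout time.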

PAGE 114

Lastly,thedrop-outtimesDiareassumedtofollowamultinomialdistributionwithmassateachpossibledrop-outtimes,parameterizedby'.HerewemaketheimportantassumptionthatYitisindependentofDigivenSi.Ourmaintargetofinferencearethecovariateeffectsaveragedovertheclassesi.eMaveragedoverM.Theinterceptitin( 5 )isdeterminedbythefollowingrelationshipbetweenthemarginalandconditionalmodelsE(Yitj)=XDXSp(SijDi)P(Di)ZXAfE(Yitjyit1,...,yitp,bi,Si)p(yit1,...,yitpjbi,Si)gp(bijSi)dbi 114

PAGE 115

Proportionalityin( 5 )holdsbecauseweassumethatthemissingandobservedresponsesfromsubjectiareindependent,givenSiandbi(i.e.[YmijYi,bi,Si]=[Ymijbi,Si]).FollowingtheOPEFformulation,wehaveLi(YijYfig,Sij=1,bi,(j),)=expTXt=1yititTXt=1(it)=(mi)+TXt=1h(Yit,)

PAGE 116

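We can avoid the integral (w.r.t. $b_i$) in (5) if we also sample the $b_i$'s along with the other parameters from the full posterior (5). In that case, the full posterior may be rewritten as

where

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution, i.e.

where $\mu^c_{it} = E(Y_{it} \mid y_{it-1}, y_{it-2}, \ldots, y_{it-p}, b_i, S_{ij} = 1) = g^{-1}\left(\Delta_{it} + b_i + \sum_{j=1}^{M} S_{ij} Z_{ij}'\beta^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{it-k}\right)$.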

PAGE 117

1+e0j+1Di1+e0j1+1Di Now,asmentionedearlier,Diisthedropouttimefortheithsubject.Also,thereareTuniquedropouttimes.Let,fort=1,2,...,Tit=8><>:1iftheithsubjectdropsoutbetweenthe(t1)thandtthobservationtimes0otherwise. 1. LetNq(0,0)assumingthat8i=1,2,...,Nandt=1,2,...,T,Xitisqdimensional. 2. Let(1),(2),...,(m)iidNr(0,0).whererqsinceZitXit8i=1,2,...,Nandt=1,2,...,T. 3. Let21,22,...,2miidU(a,b)where0
PAGE 118

7. Forthetimebeingwekeepthepriorof,()unspecied. Now,combining( 5 5 )andthepriorsspeciedabove,wecanwritedownthefullposteriordistributionofmandw,(w,mjY,X,D)uptoaconstant.Thus,wecangetthefullconditionaldistributionofalltherelevantparametersandproceedwithsamplegenerationusingMCMC. TheassumptionofconditionalindependencebetweenYiandDigivenSiandthecovariatescanbeveriedbyperformingalikelihoodratiotest(Frequentist)orusingBayesfactors(Bayesian).Thenullmodelisgivenby( 5 )andthealternativemodelmaybewrittenas wheref(Di)maybeasmoothbutunspeciedfunctionofDi.Thus,thenullhypothesisofconditionalindependence(betweenYiandDigivenSiandXi)wouldbesimplyf(Di)=0.Thetestcanbecarriedoutbyrstttingthenullmodel(??).Then,theposteriorprobabilityofclassmembershipforeachsubjectcanbeestimatedby^P(Sij=1jDi,Yi,Xi,^w)=RLi(YijYfig,Sij=1,bi,^j,^)p(Sij=1jDi;^)p(Dij^)dF(bijSij,^2j) 5 )usingaweightedlikelihood(theweightsbeingtheaboveposteriorprobabilityofclassmembership).Analternativewayofdoingtheaboveconditionalindependencetestswouldbetousescoretestsbasedonsmoothingsplinesasusedinproportionalhazardsmodelsby Linetal. ( 2006 ). 118

PAGE 119

5) has the most general form. We can simplify it by assuming a linear effect of drop-out time, in which case the alternative (simpler) model would be

where each $h_j(\cdot)$ is a known function with an associated coefficient. The null hypothesis would be that all $J$ coefficients are zero. The linear drop-out effect would imply $J = 1$ and $h(D_i) = D_i$. The LRT can then be performed as before by fitting models (5) and (5) using the same weights given above. We can also use Bayes factors for carrying out these analyses.

Heagerty (1999) proposed Marginally Specified Logistic Normal models for longitudinal binary data. He proposed two models: the first one was a marginal logistic regression model which links the average response to the covariates by the following equation:

Here $Y_{ij}$ and $X_{ij}$ respectively denote the binary response and the exogenous covariate vector recorded at time $j$ for the $i$th subject, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, n_i$. The second model is a conditional model which explains the within-subject dependence among

PAGE 120

An important assumption that is made is that conditional on $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, the components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$ where $\Sigma_i$ models the dependence among the $b_i$'s (and thus, indirectly among the $Y_i$'s) and can be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \ldots, t_{in_i})$ and a parameter vector. Heagerty (1999) referred to the models given in (5) and (5) as the marginally specified logistic normal models.

Under the above modelling framework, the parameter $\Delta_{ij}$ can be expressed as a function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the following convolution equation:

$$h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz$$

where $h(\cdot)$ is the inverse of the logit link and $\phi(\cdot)$ is the standard normal density function. Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and Newton-Raphson iteration.

5) will be a function of the marginal mean parameters and the random effects covariance parameters and should be computed for both the maximum likelihood and estimating equation methodology (Heagerty 1999). For maximum likelihood estimation, the contribution of the $i$th subject to the observed data likelihood is ascertained by first assuming a linear transformation of the form $b_i = C_i z_i$ where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. The above transformation effectively links $b_i$ to a lower dimensional random effect $z_i$. The contribution of the $i$th subject (to the observed data likelihood) can now be expressed as a mixture over the random

PAGE 121

where $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since the likelihood contribution $L_i$ cannot be evaluated analytically, numerical procedures are required to find its value. Heagerty (2002) used Gauss-Hermite quadrature to perform the calculation but assumed $q = 1$. With increasing values of $q$, the computational burden increases exponentially and quickly becomes infeasible. We are currently trying to develop alternative and less computationally intensive methodologies to accomplish the above objectives. We are working with multivariate logistic and multivariate t distributions within a Bayesian framework as in O'Brien and Dunson (2004). We hope that this methodology will provide a better alternative to the arduous numerical methods mentioned above.
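The convolution equation that links the conditional intercept to the marginal linear predictor, described earlier as solvable by numerical integration and Newton-Raphson iteration, can be sketched for the scalar case as follows. The sketch solves $h(\eta) = \int h(\Delta + \sigma z)\phi(z)\,dz$ for $\Delta$ with $h$ the inverse logit. It uses a simple trapezoid rule instead of Gauss-Hermite quadrature, and the function names, grid size and tolerances are hypothetical choices rather than the procedure used by Heagerty.

```python
import math


def expit(x):
    """Inverse logit h(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_mean_and_slope(delta, sigma, n=801, lim=8.0):
    """Trapezoid-rule integral of h(delta + sigma z) phi(z) dz and its derivative in delta."""
    step = 2.0 * lim / (n - 1)
    value = 0.0
    slope = 0.0
    for i in range(n):
        z = -lim + i * step
        weight = step * (0.5 if i in (0, n - 1) else 1.0)
        dens = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        p = expit(delta + sigma * z)
        value += weight * p * dens
        slope += weight * p * (1.0 - p) * dens  # h'(x) = h(x)(1 - h(x))
    return value, slope


def solve_delta(eta, sigma, tol=1e-10, max_iter=100):
    """Newton-Raphson iteration for the conditional intercept delta."""
    target = expit(eta)
    delta = eta  # exact solution when sigma = 0
    for _ in range(max_iter):
        value, slope = marginal_mean_and_slope(delta, sigma)
        update = (value - target) / slope
        delta -= update
        if abs(update) < tol:
            break
    return delta


print(solve_delta(1.0, 2.0))
```

With a single random effect the integral is one-dimensional and cheap; with q-dimensional random effects a product quadrature rule needs a number of nodes that grows exponentially in q, which is the computational burden referred to above.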

PAGE 122

logdj=log+dlog#+logj+d0Z0cZj(t)(t)dt Thus,thelikelihoodwillbe A )wehave Differentiating( A )w.r.tand#andsolvingtheresultingequationswehave A )andthenexponentiating,weobtaintheexpressionofL(,)in( 2 ). Again,differentiating( A )w.r.tj,wehave Itiseasytoshowthatifwereplace( A )in( A )andthenexponentiate,wegettheexpressionforL(#,)in( 2 ).Sincetheorderofmaximizationisimmaterial,itfollowsthat,L(,)andL(#,),oncemaximizedoverthenuisanceparameters(#and

PAGE 123

Replacingtheexpressionofdjfrom( 2 ),wehave 2 ). (ii)First,weperformthetransformationfromto(,),where=JXj=1j.Thus,j=j,j=1,...,J.ThejacobianoftransformationwillbeJ1. Usingthistransformationin( A )andaftersomemanipulation,wehave 123

PAGE 124

A )w.r.t#weobtain Integrationof( A )w.r.tyields( 2 )aftersomeminormanipulation. (iii)Theorderinwhichp(#,,jy)isintegratedw.r.ttheparametersdoesnotmakeanydifferenceinthemarginalposteriordensityofp().Thus,integrationofp(w,jy)w.r.tworp(,jy)w.r.twillyieldthesamemarginalposteriordensityp(jy)of. 1. AsinSeamanandRichardson(2004),theassumptionofexistenceandnitenessofE0Z0cZq(t)(t)dtandE0Z0cZr(t)(t)dtisautomaticallysatisedprovidedthepriordensityp()ensuresthatE()existsandisnite. 2. Theposteriorproprietyofp(#,,jy)in( )canbeshowninasimilarwaytothatinSeamanandRichardson(2001). 3. Thepriordistributionp()ofinducesapriordistributionontheinuencefunctionf(t),ct0ginthelogisticcase-controlmodelin( 23 )since(t)=0(t),ct0. LetP(D=djX(t)=Zk(t),ct0)=pdk,(d=0,1,...,r;k=1,...,K)andP(X(t)=Zk(t),ct0jD=0)=k=PKl=1l.LetndkbethenumberofindividualswithD=dandfX(t)=Zk(t),ct0g.ItcanbeshownthatP(X(t)=Zk(t),ct0jD=d)=kpdk=p0k KXl=1lpdl=p0l

PAGE 125

KXl=1lpdl=p0l1CCCCCAndk. KXl=1ldl1CCCCCAndk. TheaugmentedmodelisgivenbyZdkjdkpoisson(dk)wherelog(dk)=log(#d)+log(dk)+log(k),log(0k)=log(k),d=1,...,r;k=1,...,K. 125

PAGE 126

NotingthatZ10expk(1+rXd=1#ddk)!(k)Prd=0ndk1dk/1+rXd=1#ddk!Prd=0ndk,wehave,byintegratingoutin( A ), Now,integratingout(#1,...,#r)from( A ),wehave Next,wemakethetransformationk='kand'=KXl=1lhavingjacobian'1.Hencethepriordistributionin( A )becomes(,#,',)/rYd=1#1d!'1KYk=11k!().

PAGE 127

A )canberewrittenas Integratingout'from( A ),wehave KXl=1ldl1CCCCCAndkKYk=11k!() From( A )and( A ),itisclearthatposteriorinferencefortheparameterofinterest,remainsthesameundereithertheprospectivelikelihoodLportheretrospectivelikelihoodLRaslongastheposteriorisproper.Itcanbeshownthattheposteriorwillbeproperforanyproperpriorforifn0k18k=1,...,K. 127

PAGE 128

WehavetoshowthatIMwhereMisanynitepositiveconstant. Integratingrstw.r.t,wehave 2Xi(iXiZibi1)01(iXiZibi1)d=jXiX0i1Xij1=2exp1 2XiW0i1Wi+Q 2PiW0i1XiPiX0i1Xi1PiX0i1Wi,Wi=iZibi1and1=diag(21,22,...,2t). Now,W0i1Wi=W0i1=21=2Wi=S0iSiwhereSi=1=2Wi.Similarly,W0i1Xi=S0iTi,X0i1Wi=T0iSiandX0i1Xi=T0iTiwhereTi=1=2Xi. 128

PAGE 129

B )becomes1 2XiS0iSiXiS0iTiXiT0iTi1XiT0iSi=1 2S0SS0T(T0T)1T0S=1 2S0IT(T0T)1T0S=Q,say whereS=(S01,...,S0m)0andT=(T01,...,T0m)0.Since(IT(T0T)1T0)isidempotent,S0IT(T0T)1T0Sisnon-negative,implyingQ0andthusexp(Q)1. Next,weconsiderintegrationw.r.t2i.e Assumingmax=max(1,...,t),wehave,8j=1,...,t,2j2max)Xij2jX0ijXij2maxX0ij)Pi,jXij2jX0ij2maxPi,jXijX0ijandthus Combining( B )and( B ),wehaveIjXi,jXijX0ijj1=2Z...Z(2max)(p+1)=2tYj=1(2j)m=2cj+1exp(dj=2j)d21...d2t 2expdk (B) 129

PAGE 130

Combining( B )and( B ),wehave where=().Sinceallthecomponentsoftheintegrandin( B )haveproperdistributions,theaboveintegralwouldbenitethusprovingposteriorpropriety. Fortherandomwalkmodel,theintegrandin( B )willhaveanadditionallikelihoodtermQtj=1L(vjjvj1,2v)andapriorterm(2v).Thederivationwouldthenproceedexactlyasaboveandtheintegrandin( B )willalsocontaintheseadditionalterms.Butsincebothoftheseareproperdistributions(normalandinversegammarespectively),Iwillstillbeniteundertheconditionsstatedinthetheorem. 2Xi,j(ijX0ijZ0ijbi)01j(ijX0ijZ0ijbi)d<1(B) inordertoproveposteriorpropriety. Usingthesametypeofalgebraicmanipulationsasintheunivariatecase,theL.H.Sof( B )canbeshowntobe 2Xi,jW0ij1jWij+1 2Q 130

PAGE 131

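As before, the expression within the exponent in ( B ) can be rewritten as
\[
K=-\frac12\Big[\sum_{i,j}S_{ij}'S_{ij}-\Big(\sum_{i,j}S_{ij}'T_{ij}\Big)\Big(\sum_{i,j}T_{ij}'T_{ij}\Big)^{-1}\Big(\sum_{i,j}T_{ij}'S_{ij}\Big)\Big]
=-\frac12\,S'\big(I-T(T'T)^{-1}T'\big)S\le 0.
\]
Thus,
\[
\exp\Big[-\frac12\sum_{i,j}W_{ij}'\Psi_j^{-1}W_{ij}+\frac12\,Q\Big]\le 1.
\]
So, in order to prove posterior propriety, we have to show

Here $r$ is the order of $\Psi_j$, $j=1,2,\ldots,t$ ($r=2$ in our case).

Let $\lambda_{j1},\lambda_{j2},\ldots,\lambda_{jr}$ be the distinct eigenvalues of $\Psi_j^{-1}$, $j=1,2,\ldots,t$. Since $\Psi_j$ is a variance-covariance matrix, it is positive definite and symmetric. Hence, $\Psi_j^{-1}$ also has the same properties. Thus, $\lambda_{jk}>0,\ \forall\,k=1,2,\ldots,r$.

Now, $\forall\,j=1,2,\ldots,r$,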

PAGE 132

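\[
\cdots\Big|\sum_{i,j}X_{ij}X_{ij}'\Big|^{-\frac12}
\]
Since $|\Psi_j^{-1}|=\prod_{k=1}^{r}\lambda_{jk},\ \forall\,j=1,\ldots,t$,
\[
|\Psi_j^{-1}|^{\frac{m+d_j-r-1}{2}}=\prod_{k=1}^{r}(\lambda_{jk})^{\frac{m+d_j-r-1}{2}}
\]
Now, replacing ( B ) and ( B ) in the expression of $I$ in ( B ), we have
\[
I\le\cdots\int\!\cdots\!\int(\lambda_{\min})^{-\frac{p+q+2}{2}}\prod_{j=1}^{t}\prod_{k=1}^{r}(\lambda_{jk})^{\frac{m+d_j-r-1}{2}}\exp\big[-T(V_j^{-1}\Psi_j^{-1})\big]\cdots
\]
where $T$ denotes trace. Let $\lambda_{\min}=\lambda_{lm}$, $l\in[1,\ldots,t]$; $m\in[1,\ldots,r]$.

Then, $I\le I_1I_2$ where
\[
\cdots\int\prod_{k=1,\,k\ne m}^{r}(\lambda_{lk})^{\frac{m+d_l-r-1}{2}}\,(\lambda_{lm})^{\frac{(m+d_l-p-q-2)-r-1}{2}}\exp\big[-T(V_l^{-1}\Psi_l^{-1})\big]\cdots
\]
\[
\cdots\int\prod_{k=1,\,k\ne m}^{r}(\lambda_{lk})^{\frac{p+q+2}{2}}\,|\Psi_l^{-1}|^{\frac{(m+d_l-p-q-2)-r-1}{2}}\exp\big[-T(V_l^{-1}\Psi_l^{-1})\big]\cdots
\]
\[
\cdots\exp\big[-T(V_j^{-1}\Psi_j^{-1})\big]\cdots|V_j|^{\frac{m+d_j}{2}}
\]
which is finite. Thus, in order to show posterior propriety, we have to prove that $I_2<\infty$.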

PAGE 133

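\[
\cdots|\Psi_l^{-1}|^{\frac{(m+d_l-p-q-2)-r-1}{2}}\exp\big[-T(V_l^{-1}\Psi_l^{-1})\big]\cdots
\]
By the AM-GM inequality, we have,

where $\psi_{(l)}^{kk}$ denotes the $k$th diagonal element of $\Psi_l^{-1}$. Since $\Psi_l^{-1}$ has a Wishart distribution, each $\psi_{(l)}^{kk}$ is distributed as a scaled $\chi^2_{d_l}$ variable $(k=1,\ldots,r)$, implying that $\sum_{k=1}^{r}\psi_{(l)}^{kk}<\infty$.

Combining ( B ) and ( B ), we have,
\[
I\le\int\cdots|\Psi_l^{-1}|^{\frac{(m+d_l-p-q-2)-r-1}{2}}\exp\big[-T(V_l^{-1}\Psi_l^{-1})\big]\cdots
=C\int\Big(\sum_{k=1}^{r}\psi_{(l)}^{kk}\Big)^{\frac{(r-1)(p+q+2)}{2}}|\Psi_l^{-1}|^{\frac{(m+d_l-p-q-2)-r-1}{2}}\exp\big[-T(V_l^{-1}\Psi_l^{-1})\big]\cdots
\]
where $C$ is a constant not depending on $\Psi_l$.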

PAGE 134

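Now,
\[
\Big(\sum_{k=1}^{r}\psi_{(l)}^{kk}\Big)^{\frac{(r-1)(p+q+2)}{2}}\le r^{\frac{(r-1)(p+q+2)}{2}-1}\sum_{k=1}^{r}\big(\psi_{(l)}^{kk}\big)^{\frac{(r-1)(p+q+2)}{2}}
\]
\[
\Rightarrow\ E\Big(\sum_{k=1}^{r}\psi_{(l)}^{kk}\Big)^{\frac{(r-1)(p+q+2)}{2}}\le r^{\frac{(r-1)(p+q+2)}{2}-1}\,E\sum_{k=1}^{r}\big(\psi_{(l)}^{kk}\big)^{\frac{(r-1)(p+q+2)}{2}}
\]
which is finite because
\[
E\big(\psi_{(l)}^{kk}\big)^{\frac{(r-1)(p+q+2)}{2}}<\infty\ \ \forall\,k=1,\ldots,r
\ \Rightarrow\ \sum_{k=1}^{r}E\big(\psi_{(l)}^{kk}\big)^{\frac{(r-1)(p+q+2)}{2}}<\infty
\ \Rightarrow\ E\sum_{k=1}^{r}\big(\psi_{(l)}^{kk}\big)^{\frac{(r-1)(p+q+2)}{2}}<\infty
\ \Rightarrow\ E\Big(\sum_{k=1}^{r}\psi_{(l)}^{kk}\Big)^{\frac{(r-1)(p+q+2)}{2}}<\infty
\]
Thus $I$ is finite, implying posterior propriety.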
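The two elementary inequalities behind this last part of the proof can be checked on a random example. The sketch below assumes the AM-GM step takes its usual form (the product of all eigenvalues except the smallest is bounded by a power of their average, which is in turn bounded through the trace) and that the final step is the convexity bound for a power of a sum. The matrix `Pm`, the order `r`, and the exponent `a` are hypothetical stand-ins, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 4  # hypothetical matrix order

# Random symmetric positive definite matrix standing in for an inverse covariance matrix.
A = rng.normal(size=(r, r + 2))
Pm = A @ A.T + 0.1 * np.eye(r)

eig = np.linalg.eigvalsh(Pm)  # eigenvalues in ascending order, all positive
rest = eig[1:]                # every eigenvalue except the smallest

# AM-GM: the product of the remaining eigenvalues is at most (their mean)^(r-1);
# their sum is at most the trace because the dropped eigenvalue is positive.
product_rest = float(np.prod(rest))
amgm_bound = float((rest.sum() / (r - 1)) ** (r - 1))
trace_bound = float((np.trace(Pm) / (r - 1)) ** (r - 1))

# Convexity bound: (sum of x_k)^a <= r^(a-1) * sum of x_k^a for x_k >= 0 and a >= 1.
a = 3.0  # hypothetical exponent
diag = np.diag(Pm)
lhs = float(diag.sum() ** a)
rhs = float(r ** (a - 1) * np.sum(diag ** a))

print("product of non-minimal eigenvalues:", product_rest)
print("AM-GM bound:", amgm_bound, "trace bound:", trace_bound)
print("power of sum:", lhs, "convexity bound:", rhs)
```

The trace appears because the sum of the eigenvalues equals the sum of the diagonal elements, which is what lets the bound be expressed through the diagonal elements of the matrix.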

PAGE 135

1. andisthep+K+1orderpriorvariance-covariancematrixof. 2. 3. andbistheq+M+1ordervariance-covariancematrixofb. 4. andisther+K+2ordervariance-covariancematrixof(,). 5. 2,+(Zi0Mib0iQi) 2where 6. 2NXi=1b2ij,j=0,...,q. 135

PAGE 136

2NXi=1ni+1,1 2NXi=1niXj=1yijp,(aij)0q,(aij)0bi2.
8. 2KXk=12p+k.
9. 2NXi=1q+MXj=q+1b2ij.
10. 2KXk=12r+k.

Here, G(x, y) denotes a Gamma density with shape parameter x and rate parameter y.

C.2.1 Semiparametric Univariate Small Area Model

1.

PAGE 137

20+d 2mXi=1(ijX0ijZ0ijbi)2+dj 2mXi=1b2i+d 1. 137

PAGE 138

10.


PAGE 145

Dhiman Bhadra received his Bachelor of Science in statistics from Presidency College, Calcutta (India) in 2002 and his Master of Science in statistics from Calcutta University in 2004. He joined the Department of Statistics at the University of Florida in January 2005 to pursue a PhD in statistics. He plans to graduate in August 2010.