BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS

By
DHIMAN BHADRA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2010

© 2010 Dhiman Bhadra

To my mother and to the memory of my father

ACKNOWLEDGMENTS

I had the good fortune to be a student at the Department of Statistics at the University of Florida. It is here that I came in close contact with some of the preeminent statisticians of the day and learnt a lot from them. I deeply acknowledge the tremendous help, encouragement and endless support that I received from my advisor Prof. Malay Ghosh, my coadvisor Prof. Michael J. Daniels and Prof. Alan Agresti throughout the highs and lows of doing my research work. They not only taught me statistics, the art of writing papers and solving problems; they introduced me to the spirit of discovery and the joy of learning, something that will stay with me forever and will motivate me in ways I can never imagine. However, the list doesn't end here, since each and every member of the faculty opened up new doors for me through which knowledge flowed and enriched me along the way. My endless gratitude to each and every one of them. I also wish to thank Prof. Bhramar Mukherjee (currently at the Department of Biostatistics at the University of Michigan) for her help and inspiration over the years. Last but not least, my unending gratitude to my mother, whose sacrifice, unconditional love and blessing were always with me, guiding me along the way. I would end by conveying my deepest respect to the memory of my father; he was there with me always throughout this journey.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Overview of Dissertation
  1.2 Review of Case-Control Studies
  1.3 Review of Small Area Estimation
  1.4 Non-Parametric Regression Methodology

2 BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES WITH TIME VARYING EXPOSURES
  2.1 Introduction
    2.1.1 Setting
    2.1.2 Motivating Dataset : Prostate Cancer Study
  2.2 Model Specification
    2.2.1 Notation
    2.2.2 Model Framework
  2.3 Posterior Inference
    2.3.1 Likelihood Function
    2.3.2 Priors
    2.3.3 Posterior Computation
  2.4 Bayesian Equivalence
  2.5 Model Comparison and Assessment
    2.5.1 Posterior Predictive Loss
    2.5.2 Kappa statistic
    2.5.3 Case Influence Analysis
  2.6 Analysis of PSA Data
    2.6.1 Constant Influence Model
    2.6.2 Linear Influence Model
    2.6.3 Overall Model Comparison
    2.6.4 Model Assessment
  2.7 Conclusion and Discussion

3 ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A BAYESIAN SEMIPARAMETRIC APPROACH
  3.1 Introduction
    3.1.1 SAIPE Program and Related Methodology
    3.1.2 Related Research
    3.1.3 Motivation and Overview
  3.2 Model Specification
    3.2.1 General Notation
    3.2.2 Semiparametric Income Trajectory Models
      3.2.2.1 Model I : Basic Semiparametric Model (SPM)
      3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)
  3.3 Hierarchical Bayesian Inference
    3.3.1 Likelihood Function
    3.3.2 Prior Specification
    3.3.3 Posterior Distribution and Inference
  3.4 Data Analysis
    3.4.1 Comparison Measures and Knot Specification
    3.4.2 Computational Details
    3.4.3 Analytical Results
    3.4.4 Knot Realignment
    3.4.5 Comparison with an Alternate Model
  3.5 Model Assessment
  3.6 Discussion

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH
  4.1 Introduction
    4.1.1 Census Bureau Methodology
    4.1.2 Related Literature
    4.1.3 Motivation and Overview
  4.2 Model Specification
    4.2.1 Notation
    4.2.2 Semiparametric Modeling Framework
      4.2.2.1 Simple bivariate model
      4.2.2.2 Bivariate random walk model
  4.3 Hierarchical Bayesian Analysis
    4.3.1 Likelihood Function
    4.3.2 Prior Specification
    4.3.3 Posterior Distribution and Inference
  4.4 Data Analysis
    4.4.1 Comparison Measures and Knot Specification
    4.4.2 Computational Details
    4.4.3 Analytical Results
  4.5 Conclusion and Discussion

5 CONCLUSION AND FUTURE RESEARCH
  5.1 Adaptive Knot Selection
  5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent Class and Transitional Modelling
    5.2.1 Introduction and Brief Literature Review
    5.2.2 Modeling Framework
    5.2.3 Likelihood, Priors and Posteriors
    5.2.4 Specification of Priors

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS
B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS
  B.1 Univariate Small Area Model
  B.2 Bivariate Small Area Model
C FULL CONDITIONAL DISTRIBUTIONS
  C.1 Semiparametric Case Control Model
  C.2 Semiparametric Small Area Models
    C.2.1 Semiparametric Univariate Small Area Model
    C.2.2 Univariate Random Walk Model
    C.2.3 Bivariate Random Walk Model

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
1-1 A typical 2 x 2 table
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model
2-2 Posterior means and 95% confidence intervals of odds ratio for / = (10, 5) for the linear influence model
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
3-1 Parameter estimates of SPRWM with 5 knots
3-2 Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment
3-3 Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates
3-4 Parameter estimates of SPM(5)*
3-5 Parameter estimates of SPRWM(5)*
3-6 Comparison measures for time series and other model estimates
4-1 Comparison measures for univariate estimates
4-2 Percentage improvements of univariate estimates over Census Bureau estimates
4-3 Comparison measures for bivariate non-random walk estimates
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
4-5 Comparison measures for bivariate random walk model

LIST OF FIGURES

Figure
2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.
2-2 Sensitivity of /3, 0o, 1i and disease probability estimates to case-deletions.
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column : IRS Mean Income; 2nd column : IRS Median Income).
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999.
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom.
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS

By
Dhiman Bhadra

August 2010

Chair: Malay Ghosh
Cochair: Michael J. Daniels
Major: Statistics

Case-control studies and small area estimation are two distinct areas of modern Statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease, with the aim of capturing disease-exposure association, especially for rare diseases.
The latter area is concerned with the measurement of characteristics of small domains, i.e., regions whose sample size is so small that the usual survey-based estimation procedures cannot be applied in the inferential routines. Both these areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by Federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation.

The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e., case-control studies for which time varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status. This enables us to analyze how the present disease status of a subject is influenced by his/her past exposure conditions conditional on the current ones. Analysis is carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) algorithms.
The proposed methodology is motivated by, and applied to, a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls.

The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also a state specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time varying pattern in the income observations. Our estimates tend to perform better than those provided by the Census Bureau and perform comparably to some established methodologies, especially those involving time series modeling techniques.
Based on our findings in parts two and three, we come to the conclusion that semiparametric and nonparametric regression models can be an attractive alternative to the more traditional modeling frameworks, especially in situations where information on different characteristics of small areas is available at multiple time points in the past.

CHAPTER 1
INTRODUCTION

1.1 Overview of Dissertation

My dissertation primarily deals with the application of Bayesian semiparametric methodologies to unorthodox data scenarios arising in case-control studies and small area estimation. So, before going into the details of the specific problems, I will introduce some of the basic principles and techniques that provide the necessary background to understand the key ideas. I will start by reviewing the existing literature on case-control studies (Cornfield, 1951; Breslow et al., 1978, 1980; Breslow, 1996) and small area estimation (Ghosh and Rao, 1994; Pfefferman, 1999; Rao, 2003). I will then give a broad overview of nonparametric regression approaches (Ruppert, Wand and Carroll, 2003; Wand, 2003), specifically those related to penalized splines (Eilers and Marx, 1996).

In Chapter 2, I present an analysis of a case-control study when longitudinal, time varying exposure observations are available for the cases and controls. Semiparametric regression procedures are used to flexibly model the subject specific exposure profiles and also the influence pattern of the exposure profiles on the disease status. This enables us to analyze whether past exposure observations affect the current disease status of a subject conditional on his/her current exposure condition. The proposed methodology is motivated by and applied to a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls.
We also show the details of the hierarchical Bayesian implementation of our models and some equivalence results that have enabled us to use a prospective modeling framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for estimating the median household income of small areas when area-specific longitudinal income observations are available. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also an area specific random effect to account for the dependence in the income observations within each area. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) sampling schemes. We apply our methodology to estimate the median household income of all fifty U.S. states and the District of Columbia for a particular year. In doing so, we come to the conclusion that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of small areas.

Chapter 4 deals with an extension of the methodology in Chapter 3 where a bivariate semiparametric procedure has been used to estimate the median income of families of varying sizes across small areas. This can also be seen as an extension of the time series modeling framework of Ghosh et al. (1996). We show that the semiparametric models generally have better performance than their time series counterparts and, in a few situations, the performances are comparable. We want to convey the message that semiparametric regression methodology can provide an attractive alternative to the traditional modeling techniques, especially when time varying information is available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some interesting open problems and areas for future research that may be worth pursuing.
1.2 Review of Case-Control Studies

Case-control studies are one area of Public Health and Epidemiology where statisticians have made far-reaching contributions over the years. The fundamental principle of these studies is the comparison of a group of diseased subjects (cases) and a group of disease-free subjects (controls) with respect to one or more risk factors of the disease. A primary goal of these studies is to analyze whether any or all of the risk factors are associated with the disease. Case-control studies are useful for detecting disease-exposure association for rare diseases like cancer, where following a healthy population (or cohort) over time is often impractical. Thus, case-control studies are generally retrospective in nature.

Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years. Notable work in the frequentist domain includes Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed, and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let D and E be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between D and E is the (disease) odds ratio

\[ \frac{P(D = 1 \mid E = 1)/P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0)/P(D = 0 \mid E = 0)}. \tag{1-1} \]

By applying Bayes' theorem, the above expression can be rewritten as

\[ \frac{P(E = 1 \mid D = 1)/P(E = 0 \mid D = 1)}{P(E = 1 \mid D = 0)/P(E = 0 \mid D = 0)}, \tag{1-2} \]

which is the exposure odds ratio. Another well known measure of association is the relative risk (RR) of disease for different exposure values, given by $P(D = 1 \mid E = 1)/P(D = 1 \mid E = 0)$.
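As a small numerical illustration of the measures just defined, the following Python sketch computes the disease odds ratio, the exposure odds ratio and the relative risk from the four cell counts of a single 2 x 2 table. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from any study discussed in this dissertation.

```python
def disease_odds_ratio(n11, n10, n01, n00):
    """Odds of disease among the exposed divided by odds among the unexposed.

    Hypothetical cell layout: n11 = diseased and exposed, n10 = diseased and
    unexposed, n01 = disease-free and exposed, n00 = disease-free and unexposed.
    """
    return (n11 / n01) / (n10 / n00)


def exposure_odds_ratio(n11, n10, n01, n00):
    """Odds of exposure among the diseased divided by odds among the disease-free."""
    return (n11 / n10) / (n01 / n00)


def relative_risk(n11, n10, n01, n00):
    """P(D = 1 | E = 1) / P(D = 1 | E = 0), estimated from the cell counts."""
    return (n11 / (n11 + n01)) / (n10 / (n10 + n00))


if __name__ == "__main__":
    # Illustrative counts for a rare disease (about 0.2% and 0.1% risk).
    counts = (20, 10, 9980, 9990)
    print("disease odds ratio :", disease_odds_ratio(*counts))
    print("exposure odds ratio:", exposure_odds_ratio(*counts))
    print("relative risk      :", relative_risk(*counts))
```

With these counts both odds ratios reduce to the same cross-product ratio (about 2.002), and the relative risk is 2, so the two measures nearly coincide when the disease is rare.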
For rare diseases, both $P(D = 0 \mid E = 0)$ and $P(D = 0 \mid E = 1)$ are close to one, and the disease odds ratio is approximately equal to the relative risk of disease.

The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of 2 x 2 tables as in Table 1-1.

Table 1-1. A typical 2 x 2 table

Disease Status   Exposed    Not Exposed   Total
Case             n_{11i}    n_{10i}       n_{1i}
Control          n_{01i}    n_{00i}       n_{0i}
Total            e_{1i}     e_{0i}        N_i

Let there be I tables of the above form. Then, the Mantel-Haenszel (MH) estimator of the common odds ratio across the tables is given by

\[ \hat{\theta}_{mh} = \frac{\sum_{i=1}^{I} n_{11i}\, n_{00i}/N_i}{\sum_{i=1}^{I} n_{01i}\, n_{10i}/N_i}. \tag{1-3} \]

It may be of interest to test for the equality of the odds ratios across the I tables, i.e., $H_0: \theta_1 = \dots = \theta_I$. The test statistic for the above hypothesis is given by the Mantel-Haenszel test statistic

\[ T_{mh} = \sum_{i=1}^{I} \frac{\{n_{11i} - E(n_{11i} \mid \hat{\theta}_{mh})\}^2}{\mathrm{Var}(n_{11i} \mid \hat{\theta}_{mh})}, \tag{1-4} \]

which follows an approximate $\chi^2$ distribution with $I - 1$ degrees of freedom under the null hypothesis. The derivation of the variance of the above estimator initially posed some challenge but was eventually addressed in several subsequent papers (Breslow, 1996). Breslow and Day (1980) marked the development of likelihood based inference methods for the odds ratio.

Methods to evaluate the simultaneous effects of multiple quantitative risk factors on disease rates were pioneered in the 1960's. In a case-control study, the appropriate likelihood is the "retrospective likelihood" of exposure given the disease status. Cornfield et al. (1961) noted that if the exposure distributions in the case and control populations are normal with different means but a common covariance matrix, then the prospective probability of disease (D) given the exposure (X) has the logistic form, i.e.,

\[ P(D = 1 \mid X = x) = L(\alpha + \beta' x), \tag{1-5} \]

where $L(u) = 1/\{1 + \exp(-u)\}$.
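The step from normal exposure distributions to a logistic disease model is a short application of Bayes' theorem. The following derivation is a sketch under notation introduced here only for illustration: $\pi_d = P(D = d)$ for the disease prevalence, and $X \mid D = d \sim N(\mu_d, \Sigma)$ with density $f_d$ for the exposure distribution in group $d \in \{0, 1\}$.

```latex
\begin{align*}
\log\frac{P(D = 1 \mid X = x)}{P(D = 0 \mid X = x)}
  &= \log\frac{\pi_1 f_1(x)}{\pi_0 f_0(x)} \\
  &= \log\frac{\pi_1}{\pi_0}
     - \tfrac{1}{2}(x - \mu_1)'\Sigma^{-1}(x - \mu_1)
     + \tfrac{1}{2}(x - \mu_0)'\Sigma^{-1}(x - \mu_0) \\
  &= \underbrace{\log\frac{\pi_1}{\pi_0}
     - \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_0'\Sigma^{-1}\mu_0\right)}_{\alpha}
     + \underbrace{\left\{\Sigma^{-1}(\mu_1 - \mu_0)\right\}'}_{\beta'}\, x ,
\end{align*}
so that
\[
P(D = 1 \mid X = x) = \frac{1}{1 + \exp\{-(\alpha + \beta' x)\}} .
\]
```

The quadratic terms in $x$ cancel only because the two groups share the covariance matrix $\Sigma$; with unequal covariances the log odds would be quadratic in $x$ and the disease model would no longer be linear logistic.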
However, there is a conceptual complication in using a prospective likelihood based on $P(D \mid X)$, since a case-control sampling design naturally leads to a retrospective likelihood, i.e., one involving $P(X \mid D)$. This issue was resolved by Prentice and Pyke (1979), who showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from the retrospective likelihood are the same as those obtained from a prospective likelihood under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to the situation of missing data and measurement error in the exposure variables.

In a case-control setup, matching is often used for selecting "comparable" controls to eliminate bias due to confounding. Statistical techniques for analyzing matched case-control data were first developed by Breslow et al. (1978). In the simplest setting, the data consist of m matched sets, say $S_1, \dots, S_m$, with $M_i$ controls matched with a case in each set or stratum. A prospective stratified logistic disease incidence model given by

\[ P(D = 1 \mid z, S_i) = L(\alpha_i + \beta'(z - z_0)) \tag{1-6} \]

is assumed. The $\alpha_i$'s are the stratum specific intercept terms, which are treated as nuisance parameters and are eliminated by conditioning on the number of cases in each stratum. The resulting conditional logistic likelihood yields the optimum estimating function (Godambe, 1976) for estimating $\beta$.

The classical methods for analyzing unmatched and matched studies suffer from loss of efficiency when the exposure variable is partially missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric method of estimation which took into account missing exposures in matched case-control studies.
Satten and Kupper (1993), Paik and Sacco (2000) and Satten and Carroll (2000) addressed the problem of missing exposure from a full likelihood approach by assuming a distribution of the exposure variable in the control population.

Although a great amount of work has been done in the frequentist domain, Bayesian modeling for case-control studies did not really start until the late 1980's. The development of Markov chain Monte Carlo techniques led to rapid progress on this front. Altham (1971) is probably the first Bayesian work which considered several 2 x 2 contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case-control model with a single binary exposure. These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case-control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating knowledge available at each stage.

Muller and Roeder (1997) introduced a novel aspect to the Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the nonparametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West, 1995).
The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters $\beta$. This paper pioneered the use of continuous covariates, measurement error and flexible nonparametric modeling of exposures in a Bayesian setting and brought to light the tremendous potential of modern Bayesian computational techniques for handling complex data scenarios in case-control studies.

Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical exposures. They achieved this by replacing the usual binomial model by a multinomial one and using an MCMC scheme to estimate the log odds ratio of disease at each category with respect to the baseline category. As in Muller and Roeder, they assumed a prospective logistic likelihood and a flexible prior for the exposure distribution and derived the implied retrospective likelihood. Muller et al. (1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously. Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine.

Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin, 1981). They assigned a Dirichlet(1, 1, ..., 1) prior on the probability vector corresponding to the grid points.
Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies when cases are individually matched to controls. They introduced nuisance parameters to represent the separate effect of matching in each matched set. Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000), they based their analysis on the unconditional rather than the conditional likelihood after elimination of the nuisance parameters. Their framework included a wide variety of links like complementary log links and some symmetric and skewed links in addition to the usual logit and probit links.

Recently, Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a nonparametric Dirichlet process prior on the stratum specific effects in the distribution of the exposure variable and parametric priors on all other parameters.
The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al., 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject, and hence to more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is influenced by past exposure conditions conditional on the current ones. Unfortunately, rigorous statistical methods for incorporating longitudinally varying exposure information inside the case-control framework have not yet been properly developed. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

1.3 Review of Small Area Estimation

Sample survey methodologies are widely used for collecting relevant information about a population of interest. In many surveys it may be of interest to estimate characteristics of small domains within the population of interest. Domains may be geographic areas like a state or province, county, school district etc., or can even be identified by a particular socio-demographic characteristic like a specific age-sex-race group within a large geographical area. Sometimes, the domain-specific sample size may be too small to yield direct design-based estimates of adequate precision. This led to the development of small area estimation procedures which provide accurate model-based estimators of various features of small domains.
In recent years, there has been a growing demand for small area statistics from both the public and private sectors. This is because these statistics are increasingly being used for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. Ghosh and Rao (1994) provide a nice review of the different types of estimators and inferential procedures used in survey sampling and small area estimation.

Since sample surveys are generally designed for large areas, the estimates of means or totals obtained from them are reliable for large domains. Direct survey based estimators for small domains often yield large standard errors due to the small sample size of the concerned area. This is because the original survey was designed to provide accuracy at a much higher level of aggregation than for local areas. This makes it necessary to "borrow strength" from adjacent or related areas to find indirect estimators that increase the effective sample size and thus increase the precision of the resulting estimate for a given small area.

Broadly speaking, a small area model has a generalized linear form with a mean term, a random area-specific effect term and a measurement error term which reflects the noise from not sampling the entire domain. Both the random effect and the noise are assumed to be independent realizations from underlying distributions, usually assumed to be Gaussian. In the past decade, newer methodologies have been proposed for analyzing small area level data, like empirical Bayes (EB), hierarchical Bayes (HB) and empirical best linear unbiased prediction (EBLUP). These have gone a long way in broadening the scope of application of small area estimation techniques.

During the last 10-15 years, model based inference has been widely used in the small area context. This is mainly due to the wide range of functionalities that comes with the linear mixed effects modeling framework.
Some of the main advantages of this framework are:
(i) Random area-specific effects account for between-area variation above and beyond that explained by auxiliary variables in the model.
(ii) Different variations such as nonlinear mixed effects models, logistic regression models and generalized linear models can be entertained.
(iii) Area-specific measures of precision can be associated with each small area estimate, unlike global measures.
(iv) Complex data structures such as spatial dependence, time series structures and longitudinal measurements can be explored.
(v) Recent methodological developments for random effects models can be utilized to achieve accurate small area inferences.

Generally, there are two kinds of small area models, depending on whether the response is observed at the area or the unit level.
1. Area (or aggregate) level models relate small area means to area-specific auxiliary variables.
2. Unit level models relate the unit values of the study variable to unit-specific auxiliary variables.

The basic area level model is given by

$$\theta_i = z_i'\beta + b_i v_i, \quad i = 1, \ldots, m \tag{1-7}$$

Here $\theta_i$ is often assumed to be a function of the population mean, $\bar{Y}_i$, of the $i$th small area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, and the $v_i$'s are area-specific random effects assumed to be independent and identically distributed with mean 0 and constant variance $\sigma_v^2$. Lastly, the $b_i$'s are known positive constants and $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients. In order to infer about the small area means $\bar{Y}_i$, direct estimators $\hat{\bar{Y}}_i$ are assumed to be known and available. The linear model

$$\hat{\theta}_i = g(\hat{\bar{Y}}_i) = \theta_i + e_i, \quad i = 1, \ldots, m \tag{1-8}$$

is assumed, where the sampling errors $e_i$ are independent with $E_p(e_i \mid \theta_i) = 0$ and $V_p(e_i \mid \theta_i) = \psi_i$, with $\psi_i$ known, which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in (1-7), we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$.
Combining (1-7) and (1-8), we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i \tag{1-9}$$

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent. Fay and Herriot (1979) studied the area level model (1-9) in the context of estimating the per capita income (PCI) for small places in the United States and proposed an empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model-based county estimates of poor school-age children in the United States.

In the unit level model, it is assumed that unit-specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element $j$ in each small area $i$. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, N_i \tag{1-10}$$

Here the area-specific effects $v_i$ are assumed to be independently and identically distributed (i.i.d.) with mean 0 and constant variance $\sigma_v^2$, and $e_{ij} = k_{ij}\tilde{e}_{ij}$, where $k_{ij}$ is known and the $\tilde{e}_{ij}$'s are i.i.d. random variables independent of the $v_i$'s with mean 0 and constant variance $\sigma_e^2$. Normality of the $v_i$'s and $e_{ij}$'s is often assumed. For these models, the parameters of interest are the small area means $\bar{Y}_i$ or the totals $Y_i$. Battese et al. (1988) studied the nested error regression model (1-10) in estimating the area under corn and soybeans for counties in North-Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means.

Over the years, numerous extensions have been proposed for the above modeling frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random-effects structures.
Rao (2003) presented a nice overview of the different estimation methods, while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context.

A proper review of model-based small area estimation would be incomplete without explaining the EBLUP, EB and HB approaches that are widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects, such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. In this sense they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between-area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown variance parameters in the BLUP with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications.

In an EB approach, the posterior distribution of the parameters of interest given the data is first obtained assuming that the model parameters are known. The model parameters are then estimated from the marginal distribution of the data, and inferences are based on the estimated posterior distribution. An excellent account of the EB approach along with its important applications is given in Morris (1983).

Finally, in the HB approach, a prior distribution is specified on the model parameters and the posterior distribution of the parameter of interest is obtained. Inferences about the parameters are based on the posterior distribution. The parameter of interest is estimated by its posterior mean, while its precision is measured by its posterior variance.
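The shrinkage behaviour of the BLUP under the area level model can be sketched numerically. The following is only a minimal illustration under stated assumptions: $b_i = 1$, an intercept plus a single auxiliary variable, known sampling variances $\psi_i$ and, unlike a real EBLUP, a known between-area variance $\sigma_v^2$. The function name `fay_herriot_blup` and all data values are hypothetical and are not taken from any of the studies cited above.

```python
def fay_herriot_blup(direct, z, psi, sigma_v2):
    """BLUP of theta_i when theta_hat_i = b0 + b1*z_i + v_i + e_i.

    direct   : direct estimates theta_hat_i
    z        : one auxiliary value per area
    psi      : known sampling variances psi_i
    sigma_v2 : between-area variance (assumed known in this sketch)
    """
    # GLS weights: inverse of the total variance of each direct estimate.
    w = [1.0 / (sigma_v2 + p) for p in psi]
    # Weighted normal equations for (b0, b1), solved in closed form.
    s0 = sum(w)
    s1 = sum(wi * zi for wi, zi in zip(w, z))
    s2 = sum(wi * zi * zi for wi, zi in zip(w, z))
    t0 = sum(wi * yi for wi, yi in zip(w, direct))
    t1 = sum(wi * zi * yi for wi, zi, yi in zip(w, z, direct))
    det = s0 * s2 - s1 * s1
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    # Synthetic (regression-only) estimates.
    synthetic = [b0 + b1 * zi for zi in z]
    # Shrinkage weight: share of the total variance due to the area effect.
    gamma = [sigma_v2 / (sigma_v2 + p) for p in psi]
    # BLUP: weighted average of the direct and synthetic estimates.
    blup = [g * y + (1.0 - g) * s for g, y, s in zip(gamma, direct, synthetic)]
    return blup, synthetic, gamma


# Hypothetical data for five small areas.
direct = [10.2, 12.9, 9.1, 15.4, 11.0]
z = [1.0, 2.0, 1.5, 3.0, 2.2]
psi = [0.5, 4.0, 1.0, 9.0, 0.2]
blup, synthetic, gamma = fay_herriot_blup(direct, z, psi, sigma_v2=1.0)
```

In this sketch, areas with a large sampling variance receive a small weight `gamma` and are pulled strongly towards the synthetic estimate, which is one way of reading the statement that the BLUP accounts for between-area variation relative to the precision of the direct estimator.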
Recent advances in Markov chain Monte Carlo techniques, specifically the Gibbs and Metropolis-Hastings samplers, have considerably simplified the computational aspect of HB procedures.

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. The SAIPE program also provides annual state and county level estimates of median household income.

Generally, observations on various characteristics of small areas that are collected over time may possess a complicated underlying time-varying pattern. It is likely that models which take this longitudinal pattern into account may perform better than classical small area models which do not utilize this information. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal time-varying pattern in the response and the covariates.

1.4 Non-Parametric Regression Methodology

Nonparametric regression methods provide a powerful and flexible alternative to parametric approaches when the relationship between response and predictor variables is too complex to be expressed using a known functional form. One of the main differences between parametric and nonparametric regression methodologies is that, in the former, the shape of the functional pattern is determined by the model while in the latter, the shape is determined by the data itself. Suppose the response $y$ and the covariate $x$ are related as

$$y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n \tag{1-11}$$

where $f(x)$ is an unknown and unspecified smooth function of $x$ and $e_i \sim N(0, \sigma^2)$.
The basic problem of "nonparametric regression" is to estimate the function $f(\cdot)$ using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate $f(\cdot)$, viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions while smoothing splines may be better for sparse regions. Here, we briefly review the main characteristics of splines in general and penalized splines in particular.

The basic idea behind splines is to express the unknown function $f(x)$ using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of $x$ known as "knots". The knots, say $(\tau_1, \ldots, \tau_K)$, partition the range of $x$ into $K + 1$ distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of a certain degree is defined. A polynomial spline of degree $p$ has $(p - 1)$ continuous derivatives and a discontinuous $p$th derivative at any interior knot. The $p$th derivative reflects the "jump" of the spline at the knots. Thus, a spline of degree 0 is a step function, a spline of degree 1 is a piecewise linear function and so on. For example, $f(\cdot)$ can be represented as a linear combination of a $p$th degree truncated polynomial basis having $K$ knots, given by

$$1, x, \ldots, x^p, (x - \tau_1)_+^p, \ldots, (x - \tau_K)_+^p \tag{1-12}$$

Here $(x - \tau_k)_+^p$ is the function $(x - \tau_k)^p I\{x > \tau_k\}$. Using the above basis, a spline of degree $p$ can be expressed as

$$f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x - \tau_k)_+^p \tag{1-13}$$

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. The choices $p = 1, 2, 3$ correspond to a linear, quadratic or cubic spline respectively. The above basis is one of the most commonly used, although other bases such as radial bases or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as "sensors" in relaying information about the underlying "true" observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting through overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be spread evenly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of $x$, and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert, 2002). Recently, there have been interesting contributions on knot selection which are more "data driven" or "adaptive" in nature, in both the frequentist and Bayesian domains (Friedman (1991); Stone et al. (1997); Denison et al. (1998); Lindstrom (1999); DiMatteo et al. (2001); Botts and Daniels (2008)).

The flexibility and wide applicability of splines is due to the fact that, provided the knots are evenly spread out over the range of $x$, $f(x \mid \beta, \gamma)$ can accurately estimate a very large class of smooth functions $f(\cdot)$
even if the degree of the spline is kept relatively low (say, 1 or 2).

The spline coefficients $(\gamma_1, \ldots, \gamma_K)$ in (1-13) correspond to the discontinuous $p$th derivative of the spline; that is, they measure the jumps of the spline at the knots $(\tau_1, \ldots, \tau_K)$. Thus, they contribute to the roughness of the resulting spline. In order to "smooth out" the fit, a "roughness penalty" is placed on these parameters. This is often done by minimizing the expression

$$S = \sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda \sum_{k=1}^{K} \gamma_k^2 \tag{1-14}$$

where $\lambda$ is known as the smoothing parameter. This is equivalent to minimizing the first part of (1-14) subject to a constraint of the form $\gamma'\gamma \le C$, where the bound $C$ is determined by $\lambda$. The parameter $\lambda$ plays a crucial role in the smoothing process since it controls the trade-off between goodness of fit and roughness of the fitted model. As $\lambda$ decreases, the spline tends to overfit, becoming an interpolating curve as $\lambda \to 0$. As $\lambda$ increases, the spline becomes smoother and tends to the least squares polynomial fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, such as cross-validation, generalized cross-validation and Mallows' $C_p$ criterion.

Broadly speaking, there are three main types of splines: regression splines, smoothing splines and penalized splines (or P-splines). All of them are based on the principle detailed above but differ in the specific manner in which smoothing is done or the knots are selected. In regression splines, smoothing is achieved by deleting non-essential knots or, equivalently, by setting the jumps at those knots to zero while keeping the jumps at the other knots undisturbed. In smoothing and penalized splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using a penalty function as shown in (1-14). A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots, while in the latter the number of knots is much smaller, resulting in more flexibility.
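The role of the smoothing parameter can be illustrated with a small numerical sketch of the penalized criterion (1-14) using the truncated polynomial basis (1-12). This is only an illustration under stated assumptions: the helper names (`truncated_basis`, `solve`, `fit_pspline`, `rss`), the noise-free sine data and the equally spaced knots are all hypothetical choices made here, and the ridge-type normal equations are solved with a plain Gaussian elimination rather than any particular library routine.

```python
import math


def truncated_basis(x, knots, p=1):
    """Row of the degree-p truncated polynomial basis at x (requires p >= 1)."""
    poly = [x ** d for d in range(p + 1)]
    trunc = [max(x - t, 0.0) ** p for t in knots]
    return poly + trunc


def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u


def fit_pspline(xs, ys, knots, p=1, lam=1.0):
    """Minimize sum (y - f)^2 + lam * sum gamma_k^2; returns (beta, gamma) stacked."""
    C = [truncated_basis(x, knots, p) for x in xs]
    m = p + 1 + len(knots)
    A = [[sum(row[i] * row[j] for row in C) for j in range(m)] for i in range(m)]
    # Penalize only the spline coefficients, not the polynomial part.
    for k in range(p + 1, m):
        A[k][k] += lam
    b = [sum(row[i] * y for row, y in zip(C, ys)) for i in range(m)]
    return solve(A, b)


def rss(xs, ys, knots, theta, p=1):
    """Residual sum of squares of the fitted spline."""
    total = 0.0
    for x, y in zip(xs, ys):
        fit = sum(c * t for c, t in zip(truncated_basis(x, knots, p), theta))
        total += (y - fit) ** 2
    return total


# Hypothetical data: a smooth curve observed on an even grid.
n = 50
xs = [i / (n - 1) for i in range(n)]
ys = [math.sin(2 * math.pi * x) for x in xs]
knots = [k / 10 for k in range(1, 10)]

rough = fit_pspline(xs, ys, knots, lam=1e-6)   # almost unpenalized
smooth = fit_pspline(xs, ys, knots, lam=1e8)   # jumps shrunk to nearly zero
```

With a very large `lam` the jump coefficients are driven essentially to zero and the fit reduces to a straight line, while a very small `lam` leaves the jumps free and follows the data closely; choosing a value between these extremes is the task of the criteria mentioned above.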
In fact, penalized splines can be seen as a generalization of regression and smoothing splines.

The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite (1-14) as

$$S = \sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda\, \theta' D \theta \tag{1-15}$$

where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \ldots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)'$ and $D$ is a known positive semidefinite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1} \end{pmatrix}$$

Different types of penalties can be accommodated by specifying different forms of $D$. For example, the penalty $\int \{f^{(2)}(x \mid \beta, \gamma)\}^2 dx$ used for smoothing splines can be achieved with $D$ being the sample second moment matrix of the second derivatives of the spline basis functions. However, the above form of $D$ only penalizes the spline coefficients $(\gamma_1, \ldots, \gamma_K)$. Specifically, the penalty in (1-14) corresponds to setting $\Sigma = I$.

Let $X$ be the matrix with $i$th row $X_i = (1, x_i, \ldots, x_i^p)$ and $Z$ be the matrix with $i$th row $Z_i = \{(x_i - \tau_1)_+^p, \ldots, (x_i - \tau_K)_+^p\}$. Using this formulation in (1-15) with the basis functions in (1-12) and dividing by the error variance $\sigma_e^2$, we have

$$\frac{1}{\sigma_e^2} S = \frac{1}{\sigma_e^2} \|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_e^2} \|\gamma\|^2 \tag{1-16}$$

By treating $\gamma$ as a vector of random effects with $Cov(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_e^2 / \lambda$, and $\beta$ as the set of fixed effects parameters, the above penalized spline framework yields the following well-known mixed effects model representation:

$$y = X\beta + Z\gamma + e \tag{1-17}$$

where $Cov(e) = \sigma_e^2 I$ and $\gamma$ and $e$ are independent.

Bayesian P-splines have recently become popular because they combine the flexibility of nonparametric models with the exact inference provided by the Bayesian inferential procedure. This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand, 2003) as shown above. This equivalence also carries over to the manner in which smoothing is done.
Smoothing can be achieved by imposing penalties on the spline coefficients $\gamma$ as shown in (1-14), or by assuming a distributional form for $\gamma$, for example $\gamma \sim N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian context, priors are placed on $\sigma_\gamma^2$ and the other parameters, and the usual posterior sampling is carried out. Since the smoothing parameter is sampled alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above.

CHAPTER 2
BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES WITH TIME VARYING EXPOSURES

2.1 Introduction

The fundamental problem of case-control studies is the comparison of a group of subjects having a particular disease (cases) to a group of disease-free subjects (controls) with respect to some potential risk factors (or exposures) of the disease. Typically, in a case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al., 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is influenced by past exposure conditions, conditional on the current ones. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951), Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and far-reaching contributions have been made in virtually every aspect of the field.
Some of the notable ones are the equivalence of prospective and retrospective likelihoods (Prentice and Pyke, 1979), measurement error in exposures (Roeder et al., 1996) and matched case-control studies (Breslow et al., 1978). Important contributions in the Bayesian paradigm include binary exposures (Zelen and Parker, 1986), continuous exposures (Muller and Roeder, 1997), categorical exposures (Seaman and Richardson, 2001), equivalence (Seaman and Richardson, 2004) and matching (Diggle et al. (2000); Ghosh and Chen (2002)).

The analysis of complex data scenarios in a case-control framework is a relatively new area of research. Specifically, the analysis of longitudinal case-control studies has only started to receive some attention. Park and Kim (2004) were among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case-control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independence correlation structure fail in this framework.

2.1.1 Setting

Case-control study designs generally incorporate exposure information for a single time point in the past. In some situations, however, an entire exposure history may be available for the cases and controls, containing relevant exposure information collected at multiple time points in the past. However, rigorous statistical methods for incorporating longitudinally varying exposure information inside the case-control framework have not yet been adequately developed. This may be due to the obvious complications in properly handling a longitudinal exposure profile and integrating it into an existing case-control framework. But once this is done, there may be significant payoffs, notably the ability to learn how the present disease status of a subject is influenced by his/her past exposure conditions, conditional on the current ones.
It can also lead to valuable insights about differences in the exposure patterns between the cases and controls over a long time span.

In analyzing the effect of a longitudinally varying exposure profile on a binary outcome variable (such as disease status), some of the possible challenges are: (1) the longitudinal exposure observations may be unbalanced, i.e., the number of observations and the observation times may differ from subject to subject; (2) the exposure trajectory may be highly nonlinear; (3) the exposure observations may be subject to considerable measurement error; and (4) the effect of the exposure profile on the disease outcome may itself be complex and can even change over time.

In view of the above challenges, we propose to use functional data analytic techniques, specifically nonparametric regression methodology, to model both the time-varying exposure profile and the influence pattern of the exposure profile on the binary outcome. Specifically, we model the underlying exposure trajectory using penalized splines or P-splines (Eilers and Marx (1996); Ruppert et al. (2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time-varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible nonlinear time-varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status.
Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and Ghosh, 1998). Once the optimal model is identified, model assessment is carried out using case deletion diagnostics (Bradlow and Zaslavsky, 1997).

2.1.2 Motivating Dataset: Prostate Cancer Study

We illustrate our methodology on a data set from the Beta Carotene and Retinol Efficacy Trial conducted by the Fred Hutchinson Cancer Research Center in Seattle, Washington (Etzioni et al., 1999). This data set is based on a biomarker-based screening procedure for prostate cancer, intended to elucidate the association between prostate cancer and prostate-specific antigen (PSA).
The effectiveness of biomarker-based screening procedures for prostate cancer is currently a topic of intense debate and investigation in the realms of health care practice, policy and research. Since the discovery of prostate-specific antigen (PSA) and the observation that serum PSA levels may be significantly increased in prostate cancer patients, a lot of effort has been dedicated to identifying effective PSA-based testing programs with favorable diagnostic properties.

In this study, the levels of free and total PSA were measured in the sera of 71 prostate cancer cases and 70 controls. Participants in this study included men aged 50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or Beta Carotene and Retinol. The intervention had no noticeable effect on the incidence of prostate cancer, with similar numbers of cases observed in the intervention and control arms. Several PSA measurements recorded for the cases were taken as long as 10 years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between September 1988 and September 1995 inclusive. The individuals deemed "controls" were selected from among individuals not yet diagnosed as having cancer by the time of analysis.

As the exposure variable, we use the natural logarithm of the total PSA (Ptotal), although the negative logarithm of the ratio of free to total PSA (Pratio) can also be considered. In addition to the above measurements, observations were collected on time (years) relative to prostate cancer diagnosis and age at blood draw for the cases and controls at each of the time points. Figure 2-1 shows the PSA trajectory against age for some randomly chosen cases and controls. Etzioni et al. (1999) analyzed this data set by modeling the receiver operating characteristic (ROC) curves associated with both biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis.
They observed that although the two markers performed similarly eight years prior to diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.

The rest of the chapter is organized as follows. In Section 2.2, we introduce the semiparametric modeling framework. Section 2.3 describes the details of posterior inference. In Section 2.4, we discuss relevant Bayesian equivalence results for our framework. Section 2.5 outlines the model comparison and model assessment procedures we performed. We describe the data analysis results based on the prostate cancer data set in Section 2.6 and end with a discussion in Section 2.7.

2.2 Model Specification

2.2.1 Notation

Let $Y_{ij}$ be the $j$th exposure (PSA) observation recorded for the $i$th subject, $a_{ij}$ the age of the $i$th subject when the $j$th PSA observation is collected, and $t_{ij}$ the time of the $j$th PSA measurement relative to the time of diagnosis for the $i$th subject ($i = 1, \ldots, N$; $j = 1, \ldots, n_i$). For cases, the time of diagnosis is the time when cancer was detected, and no PSA measurement is available at that time. For controls, the time of diagnosis is synonymous with the last observation time or the time of normal digital rectal examination (DRE). Denoting the age at diagnosis of the $i$th subject by $a_i^d$, we have the simple linear relation $a_{ij} = t_{ij} + a_i^d$. This relationship will be used to simplify notation below.

Figure 2-1. Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.

2.2.2 Model Framework

Our framework is composed of two models: (1) a trajectory model for the longitudinal exposure trajectory and (2) a disease model for the effect of the exposure trajectory on the binary disease outcome. Inference on these two models will be done simultaneously and is described in Section 2.3.
Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two-stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key difference between their framework and ours is that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models. Moreover, instead of a linear modeling framework, we use a combination of linear and logistic models since our response is binary.

Exposure Trajectory Model

The exposure trajectory model is given by

$$Y_{ij} = X_i(a_{ij}) + e_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij} \tag{2-1}$$

where $e_{ij} \sim N(0, \sigma_e^2)$, $f(a)$ is the population mean function modeling the overall PSA trend as a function of age for all the subjects, and $g_i(a)$ is the subject-specific deviation function reflecting the deviation of the $i$th subject-specific profile from the mean population profile. The reason for modeling exposure as a function of age is that, for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point, controlling for the time with respect to diagnosis. In other words, the same exposure observation recorded at the same time relative to diagnosis for two subjects of widely different ages should have different significance.
We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using P-splines as follows:

$$f(a_{ij}) = \beta_0 + \beta_1 a_{ij} + \cdots + \beta_p a_{ij}^p + \sum_{k=1}^{K} \beta_{p+k} (a_{ij} - \tau_k)_+^p = \Phi_{p,\tau}(a_{ij})'\beta$$
$$g_i(a_{ij}) = b_{i0} + b_{i1} a_{ij} + \cdots + b_{iq} a_{ij}^q + \sum_{m=1}^{M} b_{i,q+m} (a_{ij} - \kappa_m)_+^q = \Phi_{q,\kappa}(a_{ij})'b_i \tag{2-2}$$

where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij} - \tau_1)_+^p, \ldots, (a_{ij} - \tau_K)_+^p]'$ and $\Phi_{q,\kappa}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij} - \kappa_1)_+^q, \ldots, (a_{ij} - \kappa_M)_+^q]'$ are truncated polynomial basis functions of degrees $p$ and $q$ with knots $(\tau_1, \ldots, \tau_K)$ and $(\kappa_1, \ldots, \kappa_M)$ respectively (Durban et al., 2004). Generally, $M < K$.

Disease Model

The prospective disease model is given by

$$P(D_i = 1 \mid X_i(t + a_i^d),\; c < t < 0) = L\left(\alpha + \int_c^0 X_i(t + a_i^d)\, \gamma(t + a_i^d)\, dt\right) \tag{2-3}$$

where $L(\cdot)$ is the logistic distribution function, $X_i(t + a_i^d)$ is the true, error-free, unobserved subject-specific exposure profile modeled as $f(t + a_i^d) + g_i(t + a_i^d)$, and $\gamma(t + a_i^d)$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the $i$th subject. In (2-3), we use the relation $a_{ij} = t_{ij} + a_i^d$ to model the exposure trajectory $X_i(\cdot)$ and the influence function $\gamma(\cdot)$ as functions of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. The constant $c$ is the time by which we go back into the past to record the exposure history for the $i$th subject; e.g., $c = -8$ would imply that, for the $i$th subject, the exposure observations recorded since eight years prior to diagnosis are being considered for analysis. Thus, by changing the value of $c$, the effect of differential lengths of PSA trajectories on the current disease status can be studied.

In the most general case, $\gamma(t + a_i^d)$ can also be modeled as a P-spline, i.e.,

$$\gamma(t + a_i^d) = \phi_0 + \phi_1 (t + a_i^d) + \cdots + \phi_r (t + a_i^d)^r + \sum_{k=1}^{K^*} \phi_{r+k} (t + a_i^d - \xi_k)_+^r = \Psi_{r,\xi}(t + a_i^d)'\phi \tag{2-4}$$

where $\Psi_{r,\xi}(t + a_i^d) = [1, (t + a_i^d), \ldots, (t + a_i^d)^r, (t + a_i^d - \xi_1)_+^r, \ldots, (t + a_i^d - \xi_{K^*})_+^r]'$, $\phi = (\phi_0, \ldots, \phi_{r+K^*})'$ and $(\xi_1, \ldots, \xi_{K^*})$ are the knots.
As special cases of (2-4), we may consider $\gamma(t + a_i^d) = \phi_0$, in which case the covariate is the area under the PSA process $\{X_i(t + a_i^d),\; c < t < 0\}$ and $\phi_0$ is its effect on the disease probability (or the logit of the disease probability). We can also assume $\gamma(t + a_i^d) = \phi_0 + \phi_1 (t + a_i^d)$, which signifies a linear pattern of the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages. Substituting (2-2) and (2-4) into the right-hand side of (2-3), we have

$$P(D_i = 1 \mid X_i(t + a_i^d),\; c < t < 0) = L\left(\alpha + \int_c^0 \left\{\Phi_{p,\tau}(t + a_i^d)'\beta + \Phi_{q,\kappa}(t + a_i^d)'b_i\right\} \Psi_{r,\xi}(t + a_i^d)'\phi\, dt\right) = L(\alpha + \beta' M_i \phi + b_i' Q_i \phi) \tag{2-5}$$

where $M_i = \int_c^0 \Phi_{p,\tau}(t + a_i^d)\, \Psi_{r,\xi}(t + a_i^d)'\, dt$ and $Q_i = \int_c^0 \Phi_{q,\kappa}(t + a_i^d)\, \Psi_{r,\xi}(t + a_i^d)'\, dt$. For prechosen degrees of the basis functions and knots, both $M_i$ and $Q_i$ are matrices and are available in closed form.

We assume normal distributional forms for the spline coefficients in (2-2) and (2-4) in order to penalize the jumps of the splines at the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$ and $\phi_{r+k} \sim N(0, \sigma_\phi^2)$ $(k = 1, \ldots, K^*)$. Finally, the random subject-specific deviation function $g_i(a_{ij})$ is modeled through $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\; j = 0, \ldots, q)$.

2.3 Posterior Inference

2.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ and $D_i$ be the exposure vector and disease status, and let $a_i = (a_{i1}, \ldots, a_{in_i})'$ and $t_i = (t_{i1}, \ldots, t_{in_i})'$ be the observed values of age and time with respect to diagnosis for the $i$th subject respectively. So, the response vector for the $i$th subject will be the pair $(Y_i, D_i)$. Let $\Omega_i = (\alpha, \beta, \phi, b_i, \sigma_e^2, \sigma_\beta^2, \sigma_b^2, \sigma_\phi^2, \{\sigma_0^2, \ldots, \sigma_q^2\})$ be the parameter space corresponding to the $i$th subject. Thus, the full parameter space will be given by $\Omega = \Omega_1 \cup \Omega_2 \cup \cdots \cup \Omega_N$.
The likelihood for the ith subject, conditional on the random effects, is given by

\[ L(Y_i, D_i, a_i \mid \Omega_i) \propto p(Y_i \mid \beta, a_i, b_i, \sigma_\varepsilon^2)\, p(D_i \mid \alpha, \beta, \phi, b_i)\, p(\beta^{(2)} \mid \sigma_\beta^2)\, p(\phi^{(2)} \mid \sigma_\phi^2) \times p(b^{(2)} \mid \sigma_b^2) \prod_{i=1}^{N}\prod_{j=0}^{q} p(b_{ij} \mid \sigma_j^2) \tag{2-6} \]

where $p(Y_i \mid \beta, a_i, b_i, \sigma_\varepsilon^2)$ is the probability distribution corresponding to the trajectory model, $p(D_i \mid \alpha, \beta, \phi, b_i)$ denotes the logistic distribution corresponding to the disease model, and the remaining terms deal with the distributional structures on the spline coefficients and random effects. Since the trajectory model (2-1) has a normal distributional structure while the disease model (2-3) has a logistic structure, the likelihood function, and hence the posterior, has a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 2.3.3.

2.3.2 Priors

To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume diffuse normal priors for the polynomial coefficients $(\beta_0, \dots, \beta_p)$ and $(\alpha, \phi_0, \dots, \phi_r)$. For the variance components $(\sigma_\varepsilon^2, \sigma_\beta^2, \sigma_\phi^2, \sigma_b^2, \{\sigma_0^2, \dots, \sigma_q^2\})$, we assume uniform priors with large upper bounds. The prior distributions are assumed to be mutually independent. We choose large values for the normal prior variances to make the priors diffuse so that inference is mainly controlled by the data distribution.

2.3.3 Posterior Computation

Likelihood Approximation

As mentioned in Section 2.3.1, we have used the data augmentation algorithm of Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture of normal regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.
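The quality of the t(8) approximation to the logit link can be checked by simulation. The sketch below is illustrative only: it draws the latent variable through the standard normal scale-mixture representation of the t distribution, and it rescales the linear predictor by matching the variances of the t(8) and logistic distributions, which is an assumption made here for the comparison and is not stated in the text. The function names are hypothetical.

```python
import math
import random


def logistic_cdf(h):
    """Exact logistic distribution function L(h)."""
    return 1.0 / (1.0 + math.exp(-h))


def mixture_prob_positive(h, nu=8, n_draws=50_000, seed=1):
    """Monte Carlo estimate of P(Z > 0) when Z follows a t_nu distribution
    written as a scale mixture of normals.

    The location is h times a variance-matching factor (an assumption of
    this sketch), so that the result is comparable with logistic_cdf(h).
    """
    rng = random.Random(seed)
    # sd(t_nu) / sd(logistic): roughly 0.64 for nu = 8.
    factor = math.sqrt((nu / (nu - 2.0)) / (math.pi ** 2 / 3.0))
    location = factor * h
    hits = 0
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale); mean of lam is 1.
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
        z = rng.gauss(location, 1.0 / math.sqrt(lam))
        if z > 0:
            hits += 1
    return hits / n_draws


# Largest discrepancy over a few values of the linear predictor.
max_gap = max(
    abs(mixture_prob_positive(h) - logistic_cdf(h)) for h in (-2.0, 0.0, 1.0, 3.0)
)
```

Under these assumptions the two disease probabilities should differ only in the second or third decimal place, which is the sense in which the latent-variable model can stand in for the logistic one.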
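As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \dots, Z_N$ such that $D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let the $Z_i$ be independently distributed from a t distribution with location $H_i = \alpha + \beta' M_i \phi + b_i' Q_i \phi$, scale parameter 1 and degrees of freedom $\nu$. Equivalently, with the introduction of an additional random variable $\lambda_i$, the distribution of $Z_i$ can be expressed as a scale mixture of normal distributions:

\[ Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \qquad \lambda_i \sim \text{Gamma}(\nu/2, 2/\nu), \]

where the Gamma pdf is proportional to $\lambda_i^{\nu/2 - 1}\exp(-\nu\lambda_i/2)$. Using this approximation, we can replace the logit link by a mixture of normals and rewrite (2-6) as

\[ L(Y_i, D_i, a_i \mid \Omega_i) \propto \prod_{j=1}^{n_i} p(Y_{ij} \mid S_{ij}, \sigma_\varepsilon^2) \int\!\!\int A_i\, p(Z_i \mid H_i, 1/\lambda_i)\, G(\lambda_i \mid \nu/2, 2/\nu)\, dZ_i\, d\lambda_i \times p(\beta^{(2)} \mid \sigma_\beta^2)\, p(\phi^{(2)} \mid \sigma_\phi^2)\, p(b^{(2)} \mid \sigma_b^2) \prod_{i=1}^{N}\prod_{j=0}^{q} p(b_{ij} \mid \sigma_j^2) \]

where $p(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $G(V \mid a, b)$ denotes a gamma density with shape $a$ and scale $b$. Moreover, $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\kappa}(a_{ij})'b_i$ and $A_i = \{I(Z_i > 0)I(D_i = 1) + I(Z_i \le 0)I(D_i = 0)\}$.

Posterior Sampling

The full posterior of the parameters is given by

\[ p(\Omega \mid Y, D, a) \propto \prod_{i=1}^{N} L(Y_i, D_i, a_i \mid \Omega_i)\, \pi(\alpha)\, \pi(\beta^{(1)})\, \pi(\phi^{(1)})\, \pi(\sigma^2) \]

where $\beta^{(1)} = (\beta_0, \dots, \beta_p)$, $\phi^{(1)} = (\phi_0, \dots, \phi_r)$ and $\pi(\sigma^2)$ denotes the joint prior of the variance components. The full posterior can be factorized as

\[ [\Omega \mid Y, D, a] \propto \prod_{i=1}^{N} \left\{ [Y_i \mid \beta, b_i, \sigma_\varepsilon^2] \int\!\!\int A_i\, [Z_i \mid \lambda_i, \alpha, \beta, \phi, b_i]\,[\lambda_i \mid \nu]\, dZ_i\, d\lambda_i \right\} [\beta^{(2)} \mid \sigma_\beta^2]\,[\phi^{(2)} \mid \sigma_\phi^2]\,[b^{(2)} \mid \sigma_b^2] \prod_{i=1}^{N}\prod_{j=0}^{q} [b_{ij} \mid \sigma_j^2] \times [\alpha]\,[\beta^{(1)}]\,[\phi^{(1)}]\,[\sigma_\beta^2]\,[\sigma_\phi^2]\,[\sigma_b^2]\,[\sigma_\varepsilon^2] \prod_{j=0}^{q} [\sigma_j^2] \]

where $\Omega$ is the entire parameter space. Our main parameter of interest is $\phi$ in (2-5). Since the marginal posterior distribution of $\phi$ is analytically intractable, we construct an MCMC algorithm to sample from its full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin, 1992).

2.4 Bayesian Equivalence

As mentioned in Section (1.2), Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model is equivalent to that based on a retrospective one.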
As a result, a prospective modeling framework can be used to analyze case-control data, which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section (2.2.2)) to analyze the PSA dataset. Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations is available. We use this "exposure trajectory" or "exposure profile" in analyzing the present disease status of a subject. In the spirit of our dataset, we assume that the exposure observations are continuous. Let the exposure profile for the ith subject be $X_i(t) = \{X_{i1}, \dots, X_{in_i};\ i = 1, \dots, N;\ -c < t < 0\}$ where $X_{ij}$ is the jth exposure observation recorded for the ith subject. Let $X = \{X_{11}, \dots, X_{1n_1}, \dots, X_{N1}, \dots, X_{Nn_N}\}$ be the set of all exposure observations. Since an exposure trajectory is composed of a finite set of exposure observations, the discretizing mechanism proposed by Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e. $\{X_i(t), -c < t < 0\}$ can be assumed to be a discrete random variable with support $\{Z_1(t), \dots, Z_J(t), -c < t < 0\}$, the set of all observable exposure trajectories, where $\{Z_j(t), -c < t < 0, j = 1, \dots, J\}$ is a finite collection of elements in the support of the $X_i$'s. Let $y_{0j}$ and $y_{1j}$ be the number of controls and cases having exposure profile $\{Z_j(t), -c < t < 0\}$. We denote the "null" or "baseline" trajectory as $\{X(t) = 0, -c < t < 0\}$.
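The odds ratio of disease corresponding to $\{Z_j(t), -c < t < 0\}$ with respect to baseline exposure is $\exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)$. Assuming that a control has exposure profile $\{Z_j(t), -c < t < 0\}$ with probability $\delta_j / \sum_{k=1}^{J}\delta_k$, it can be easily shown that

\[ P(X(t) = Z_j(t), -c < t < 0 \mid D = 1) = \frac{\delta_j \exp\left(\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(\int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)}. \]

Thus, the retrospective likelihood is

\[ L(\delta, \gamma) = \prod_{d=0}^{1}\prod_{j=1}^{J} \left\{ \frac{\delta_j \exp\left(d\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(d\int_{-c}^{0} Z_k(t)\gamma(t)\,dt\right)} \right\}^{y_{dj}} \tag{2-7} \]
\[ L(\delta, \phi) = \prod_{d=0}^{1}\prod_{j=1}^{J} \left\{ \frac{\delta_j \exp\left(d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}{\sum_{k=1}^{J} \delta_k \exp\left(d\,\phi'\int_{-c}^{0} Z_k(t)\Psi(t)\,dt\right)} \right\}^{y_{dj}} \tag{2-8} \]

since $\gamma(t) = \Psi(t)'\phi = \phi'\Psi(t)$ by (2-4). We assume $\delta_1 = 1$ for identifiability. Here $d = 0$ and $d = 1$ stand for controls and cases respectively. Assuming $\psi$ to be the baseline odds of disease, the prospective likelihood is given by

\[ L(\psi, \phi) = \prod_{d=0}^{1}\prod_{j=1}^{J} \left\{ \frac{\psi^d \exp\left(d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}{\sum_{k=0}^{1} \psi^k \exp\left(k\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)} \right\}^{y_{dj}} \tag{2-9} \]

Based on the above setup, we have the following equivalence results:

Theorem 1. The profile likelihood of $\phi$ obtained by maximizing $L(\delta, \phi)$ in (2-8) with respect to $\delta$ is the same as that obtained by maximizing $L(\psi, \phi)$ in (2-9) with respect to $\psi$.

Theorem 2. Let $Y_{dj}$ $(d = 0, 1;\ j = 1, \dots, J)$ be independently distributed as Poisson$(\lambda_{dj})$ where

\[ \log\lambda_{dj} = d\log\psi + \log\delta_j + d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt. \tag{2-10} \]

We assume independent priors $p(\psi) \propto \psi^{-1}$ and $p(\delta_j) \propto \delta_j^{a_j - 1}$ for $\psi$ and $\delta$. The prior for $\phi$, $p(\phi)$, is chosen to be independent of $\psi$ and $\delta$ and such that, for some $q$ and $r$ with $y_{1q} \ge 1$ and $y_{0r} \ge 1$, $E\left(\phi'\int_{-c}^{0} Z_q(t)\Psi(t)\,dt\right)$ and $E\left(\phi'\int_{-c}^{0} Z_r(t)\Psi(t)\,dt\right)$ exist and are finite (i.e. $p(\phi)$ is such that $E(\phi)$ exists and is finite).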
Let $y_{+j} = y_{0j} + y_{1j}$ and $y_{d+} = \sum_{j=1}^{J} y_{dj}$. Then the following three statements hold:

(i) Assuming $\omega = \log\psi$, the posterior density of $(\omega, \phi)$ is

\[ p(\omega, \phi \mid y) \propto p(\phi) \prod_{j=1}^{J} \frac{\left\{\exp\left(\omega + \phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)\right\}^{y_{1j}}}{\left\{1 + \exp\left(\omega + \phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)\right\}^{y_{+j} + a_j}} \tag{2-11} \]

(ii) Assuming $\theta = (\theta_1, \dots, \theta_J)$ and $\theta_j = \delta_j / \sum_{k=1}^{J}\delta_k$, the posterior density of $(\theta, \phi)$ is

\[ p(\theta, \phi \mid y) \propto p(\phi) \prod_{j=1}^{J} \theta_j^{a_j - 1} \prod_{d=0}^{1}\prod_{j=1}^{J} \left\{ \frac{\theta_j \exp\left(d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}{\sum_{k=1}^{J} \theta_k \exp\left(d\,\phi'\int_{-c}^{0} Z_k(t)\Psi(t)\,dt\right)} \right\}^{y_{dj}} \tag{2-12} \]

(iii) The marginal posterior densities of $\phi$ obtainable from $p(\omega, \phi \mid y)$ and $p(\theta, \phi \mid y)$ are the same.

The proof of the above theorem is similar in nature to those in Seaman and Richardson (2004) and is given in Appendix A. Since we have considered a near uniform prior for $\alpha$ and our prior on $\phi$ ensures the existence and finiteness of $E(\phi)$, the conditions of Theorem 2 are essentially satisfied for our framework. Based on the above results, it can be concluded that the marginal posterior distribution of $\phi$, the parameter of interest, will be the same regardless of whether we fit a prospective or retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general multicategory case-control setup, i.e. when there are multiple (> 2) disease states. We have the following result.

Theorem 3. Let $\{X(t), -c < t < 0\}$ be any exposure trajectory with support $\{Z_1(t), \dots, Z_K(t), -c < t < 0\}$, the set of all observable exposure trajectories. Suppose there are $r + 1$ disease categories. Let $Y_{dk}$ $(d = 0, 1, \dots, r;\ k = 1, \dots, K)$ be independently distributed as Poisson$(\lambda_{dk})$ where $\log(\lambda_{dk}) = \log(\psi_d) + \log(\eta_{dk}) + \log(\delta_k)$ for $d = 1, \dots, r$ and $\log(\lambda_{0k}) = \log(\delta_k)$, $\psi_d$ being the baseline odds for disease category $d$ and $\eta$ being the parameter of interest. Assume independent improper priors $\pi(\psi_d) \propto \psi_d^{-1}$ and $\pi(\delta_k) \propto \delta_k^{-1}$ for $\psi$ and $\delta$, and a prior $\pi(\eta)$ for $\eta$ that is independent of $\psi$ and $\delta$ and proper, i.e. $E(\eta)$ exists and is finite.
Let $n_{dk}$ be the number of individuals with $D = d$ and $\{X(t) = Z_k(t), -c < t < 0\}$. Then the following three statements hold:

(i) The posterior density of $(\eta, \psi)$ is

\[ \pi(\eta, \psi \mid n) \propto \prod_{d=1}^{r}\prod_{k=1}^{K} (\psi_d\,\eta_{dk})^{n_{dk}} \prod_{k=1}^{K} \left\{1 + \sum_{d=1}^{r} \psi_d\,\eta_{dk}\right\}^{-\sum_{d=0}^{r} n_{dk}} \pi(\eta) \prod_{d=1}^{r} \psi_d^{-1} \]

(ii) Assuming $\theta = (\theta_1, \dots, \theta_K)$ and $\theta_k = \delta_k / \sum_{l=1}^{K}\delta_l$, the posterior density of $(\theta, \eta)$ is

\[ \pi(\theta, \eta \mid n) \propto \pi(\eta) \prod_{k=1}^{K} \theta_k^{n_{0k} - 1} \prod_{d=1}^{r}\prod_{k=1}^{K} \left\{ \frac{\theta_k\,\eta_{dk}}{\sum_{l=1}^{K} \theta_l\,\eta_{dl}} \right\}^{n_{dk}} \]

(iii) The marginal posterior densities of $\eta$ obtainable from $\pi(\eta, \psi \mid n)$ and $\pi(\theta, \eta \mid n)$ are the same.

The proof of the above theorem is given in Appendix A.

2.5 Model Comparison and Assessment

2.5.1 Posterior Predictive Loss

We performed model comparison using the posterior predictive loss (PPL) criterion proposed by Gelfand and Ghosh (1998). This criterion is based on the idea that an optimal model should provide accurate prediction of a replicate of the observed data. Gelfand and Ghosh (1998) obtained this criterion by minimizing the posterior predictive loss for a given model; among all models under consideration, the one with the smallest value of the criterion is then selected. For a general loss function, this criterion can be expressed as a linear combination of two distinct parts, a goodness-of-fit part and a penalty part. For our framework, the posterior predictive loss can be written as

\[ \text{PPL} = \frac{k}{k+1}\sum_{i=1}^{N} (D_i - \hat{D}_i)^2 + \sum_{i=1}^{N} \text{Var}(D_i) \tag{2-13} \]

where $\hat{D}_i = E(D_{rep,i} \mid y, D)$ and $\text{Var}(D_i) = \text{Var}(D_{rep,i} \mid y, D) = E(D_{rep,i}^2 \mid y, D) - (E(D_{rep,i} \mid y, D))^2$. Here $D_{rep} = (D_{rep,1}, \dots, D_{rep,N})$ is the replicated disease status vector for all the subjects. It is straightforward to calculate the value of the above criterion using the posterior samples obtained from the Gibbs sampler. Lower values of this criterion imply a better model fit. We assume $k = \infty$ and obtain the values of posterior predictive loss for different lengths of exposure trajectories and different numbers of knots. The results are given in Table 2-3 and explained in Section 2.6.
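The posterior predictive loss can be computed from replicated disease statuses drawn alongside the posterior samples. The following is a minimal sketch under stated assumptions: the replicates are stored as one list per MCMC draw, the weight k/(k+1) is placed on the goodness-of-fit term as in Gelfand and Ghosh (1998), and the function name and toy data are hypothetical.

```python
import math


def posterior_predictive_loss(observed, replicate_draws, k=float("inf")):
    """Goodness-of-fit term plus penalty term of the PPL criterion.

    observed        : list of 0/1 disease statuses, one per subject
    replicate_draws : list of lists; each inner list is one replicated
                      disease-status vector (one MCMC draw)
    k               : weight parameter; k = infinity gives weight 1
    """
    n_draws = len(replicate_draws)
    weight = 1.0 if math.isinf(k) else k / (k + 1.0)
    fit, penalty = 0.0, 0.0
    for i, d in enumerate(observed):
        draws = [row[i] for row in replicate_draws]
        mean = sum(draws) / n_draws                       # E(D_rep,i | data)
        second_moment = sum(x * x for x in draws) / n_draws
        fit += (d - mean) ** 2
        penalty += second_moment - mean ** 2              # Var(D_rep,i | data)
    return weight * fit + penalty


# Toy input: two subjects, four replicated vectors.
toy_observed = [1, 0]
toy_draws = [[1, 0], [1, 1], [0, 0], [1, 0]]
toy_ppl = posterior_predictive_loss(toy_observed, toy_draws)
```

For a binary outcome the penalty term for each subject is at most 0.25, so large differences in PPL between models are driven mainly by the goodness-of-fit term.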
For the optimal model selected using the posterior predictive loss criterion, model assessment was performed using Kappa measures of agreement and case deletion diagnostics. The methodology is described below.

2.5.2 Kappa Statistic

We formed 2 x 2 tables cross-classifying the observed and predicted numbers of cases and controls for different combinations of trajectory lengths and numbers of knots. We summarized the agreement in these tables using the Kappa statistic ($\kappa$) (Agresti, 2002), which compares the observed agreement against that which might be expected by chance. The value of $\kappa$ ranges from $-1$ to $1$; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies complete disagreement. A value of 0 indicates no agreement above and beyond that expected by chance. Let $n_{ab}$ be the number of subjects for whom $(D = a, \hat{D} = b;\ a = 0, 1;\ b = 0, 1)$, $D$ and $\hat{D}$ being the observed and predicted disease status for a particular subject. Then,

\[ \kappa = \frac{\dfrac{n_{00} + n_{11}}{n} - \left\{ \left(\dfrac{n_{00} + n_{01}}{n}\right)\left(\dfrac{n_{00} + n_{10}}{n}\right) + \left(\dfrac{n_{10} + n_{11}}{n}\right)\left(\dfrac{n_{01} + n_{11}}{n}\right) \right\}}{1 - \left\{ \left(\dfrac{n_{00} + n_{01}}{n}\right)\left(\dfrac{n_{00} + n_{10}}{n}\right) + \left(\dfrac{n_{10} + n_{11}}{n}\right)\left(\dfrac{n_{01} + n_{11}}{n}\right) \right\}} \]

where $n = n_{00} + n_{01} + n_{10} + n_{11}$. The observed disease status (i.e. case or control status) of a subject is obtained from the dataset while the predicted disease status is calculated from the posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can calculate the quantity

\[ \hat{p}_i^{(n)} = P^{(n)}(D_i = 1 \mid X_i(t + a_i^d), t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \phi + b_i' Q_i \phi) \]

where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign

\[ \hat{D}_i^{(n)} = \begin{cases} 1 & \text{if } \hat{p}_i^{(n)} > 0.5 \\ 0 & \text{if } \hat{p}_i^{(n)} \le 0.5 \end{cases} \]

Based on the values of $\{(D_i, \hat{D}_i^{(n)});\ i = 1, \dots, N\}$, we can form a 2 x 2 table, and hence can calculate a value of kappa, say $\kappa^{(n)}$, at iteration $n$ of the Gibbs sampler. The posterior mean and 95% credible interval of $\kappa$ provide a measure of the amount of agreement that our model provides.
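The per-iteration kappa calculation described above can be sketched as follows. This is an illustration rather than the code used for the analysis; the function names and the toy probabilities are hypothetical, and the 0.5 cutoff follows the classification rule in the text.

```python
def cohen_kappa(n00, n01, n10, n11):
    """Kappa for a 2 x 2 table; n_ab counts subjects with observed status a
    and predicted status b. Undefined (division by zero) when the chance
    agreement equals 1."""
    n = n00 + n01 + n10 + n11
    observed_agreement = (n00 + n11) / n
    chance_agreement = (
        (n00 + n01) * (n00 + n10) + (n10 + n11) * (n01 + n11)
    ) / n ** 2
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)


def kappa_for_draw(observed, disease_probs, cutoff=0.5):
    """Kappa at one MCMC iteration, given the fitted disease probabilities."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for d, p in zip(observed, disease_probs):
        predicted = 1 if p > cutoff else 0
        counts[(d, predicted)] += 1
    return cohen_kappa(
        counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]
    )


# Toy table: 80 of 100 subjects classified correctly, balanced margins.
toy_kappa = cohen_kappa(40, 10, 10, 40)
```

Repeating `kappa_for_draw` over all retained iterations gives a sample from the posterior of kappa, from which the posterior mean and a 95% credible interval can be read off.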
2.5.3 Case Influence Analysis

Case influence (or case deletion) diagnostics are often used as a tool for model assessment in various statistical problems. The procedure hinges on the idea that the influence of a particular observation on a parameter can be measured by the difference between the parameter estimate based on the full data and that based on the data with that observation deleted (Hampel et al., 1987). These diagnostics can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in Bayesian hierarchical modeling. The basic tenet is that samples obtained from the full posterior of the parameters, when importance weighted, can reflect the effect of deleting a particular observation from the dataset. They presented an easily applicable graphical technique for checking influential observations based on importance weighting. The local dependence structure, often present in hierarchical models, makes the importance weights inexpensive to calculate. Let $H_i = \alpha + \beta' M_i \phi + b_i' Q_i \phi$ and $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\kappa}(a_{ij})'b_i$. Let $L(Y_{ij} \mid S_{ij}, \sigma_\varepsilon^2)$ be the density function corresponding to the trajectory model and $L(D_i \mid H_i)$ the one for the disease model. We worked with the following three types of weighting schemes based on those proposed by Bradlow and Zaslavsky (1997):

\[ w^{(n),*}_{(y,d)i} = \left\{ \prod_{j=1}^{n_i} L(Y_{ij} \mid S_{ij}, \sigma_\varepsilon^2)\, L(D_i \mid H_i) \right\}^{-1}, \]
\[ w^{(n),*}_{(y,d,b)i} = w^{(n),*}_{(y,d)i}\, N(b_i; 0, \sigma_b^2)^{-1}, \]
\[ w^{(n),*}_{(y,d,b,\ast)i} = w^{(n),*}_{(y,d,b)i} \prod_{j=1}^{n_i} L^*(Y_{ij} \mid S_{ij}, \sigma_\varepsilon^2)\, L^*(D_i \mid H_i). \tag{2-14} \]

Here $n$ denotes the nth iteration of the Gibbs sampler, the subscript $i$ denotes the deletion of the ith subject's data and the superscript $*$ denotes unnormalized weights. In the last weighting scheme, $L^*(Y_{ij} \mid S_{ij}, \sigma_\varepsilon^2)$ and $L^*(D_i \mid H_i)$ are the usual likelihoods with the population level parameters, i.e. $(\alpha, \beta, \phi, \sigma^2)$, replaced by their full data posterior medians. Here "full data posterior" is the posterior distribution obtained from the complete dataset, i.e. the one having all the subjects.
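The first weighting scheme can be sketched numerically as follows. This is a minimal illustration, assuming that the log-likelihood contribution of the deleted subject has already been evaluated at every retained MCMC draw; the function names and the toy numbers are hypothetical, and the weights are computed on the log scale for numerical stability.

```python
import math


def case_deletion_weights(subject_loglik_per_draw):
    """Normalized importance weights proportional to 1 / L_i(draw).

    subject_loglik_per_draw: log-likelihood of the deleted subject's data
    (trajectory and disease parts summed) at each posterior draw.
    """
    negated = [-value for value in subject_loglik_per_draw]
    shift = max(negated)                      # guards against overflow in exp
    unnormalized = [math.exp(v - shift) for v in negated]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def case_deleted_mean(parameter_draws, weights):
    """Importance-weighted estimate of the case-deleted posterior mean."""
    return sum(w * x for w, x in zip(weights, parameter_draws))


# Toy input: three draws with subject likelihoods 1, 2 and 4.
toy_weights = case_deletion_weights([math.log(1.0), math.log(2.0), math.log(4.0)])
toy_mean = case_deleted_mean([1.0, 2.0, 3.0], toy_weights)
```

Draws under which the deleted subject's data were less likely receive larger weight, so the reweighted sample approximates the posterior that would have been obtained without that subject.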
2.6 Analysis of PSA Data

We have used the semiparametric framework explained in Section 2.2 to analyze the prostate cancer dataset described in Section 2.1.2. Multiple observations on free and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some subjects, observations were collected as far as 10 years prior to diagnosis. We use the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal aim is to examine whether past exposure observations can contribute significantly towards predicting the current disease status of a subject given his current exposure information. In doing so, we will also test how differential lengths of the PSA trajectories affect the current probability of disease for a particular individual. For the purpose of our analysis, we have used a linear P-spline ($p = 1$) with a subject-specific slope parameter to model the exposure trajectory as follows:

\[ Y_{ij} = \beta_0 + \beta_1(t_{ij} + a_i^d) + \sum_{k=1}^{K} \beta_{1+k}(t_{ij} + a_i^d - \tau_k)_+ + b_i(t_{ij} + a_i^d) + \varepsilon_{ij} \tag{2-15} \]

For the prospective disease model (2-3), we considered two specific scenarios, viz. constant influence, $\gamma(t + a_i^d) = \phi_0$, and linear influence, $\gamma(t + a_i^d) = \phi_0 + \phi_1(t + a_i^d)$. The results for these two cases are summarized below.

2.6.1 Constant Influence Model

In this parametrization, the area under the PSA process $\{X_i(t + a_i^d), -c < t < 0\}$ acts as the covariate and $\phi_0$ signifies its effect on the disease probability. We have used different values of $c$ (the time, in years, by which we go back in the past to record the exposure history of a subject) to analyze the effect of differential areas under the PSA process on the current disease state. On fitting the above model, we observed that for all trajectory lengths, $\phi_0$ is significant (its 95% credible interval does not contain 0). For any particular interval (i.e. choice of $c$), the posterior means and 95% credible intervals of $\phi_0$ do not change much with the number of knots ($K$).
In addition, $\phi_0$ increases as the trajectory length decreases, i.e. as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process, but it also seems to support the well known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al., 1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval $I = (-10, -5)$ and 3 knots in the trajectory. The posterior mean of $\phi_0$ is 0.298 and the 95% credible interval is (0.196, 0.421). Thus, even exposure observations recorded as far as 5-10 years prior to diagnosis seem to have a significant influence on the current disease status of a subject. We formally compare the different models using the PPL criterion in Section 2.6.3.

2.6.2 Linear Influence Model

We next fitted the model permitting a linear pattern of influence of the exposure trajectory on the disease outcome. For all trajectory lengths, $\phi_0$ and $\phi_1$ were significant since the 95% credible intervals excluded 0. To better understand the influence of differential lengths of exposure trajectories on disease status, we calculated the odds ratios for different ages at diagnosis and trajectory lengths. Suppose the exposure trajectory for the ith subject changes from $\{X_i(t + a_i^d), -c < t < 0\}$ to $\{Z_i(t + a_i^d), -c < t < 0\}$. The corresponding odds ratio of disease is given by

\[ \frac{P(D_i = 1 \mid Z_i(t + a_i^d), -c < t < 0)}{P(D_i = 0 \mid Z_i(t + a_i^d), -c < t < 0)} \times \frac{P(D_i = 0 \mid X_i(t + a_i^d), -c < t < 0)}{P(D_i = 1 \mid X_i(t + a_i^d), -c < t < 0)} = \exp\left[\int_{-c}^{0} \{Z_i(t + a_i^d) - X_i(t + a_i^d)\}\,\gamma(t + a_i^d)\,dt\right]. \tag{2-16} \]

Parameterizing $\{Z_i(t + a_i^d), -c < t < 0\}$ as $\Phi_{p,\tau}(t + a_i^d)'\eta + \Phi_{q,\kappa}(t + a_i^d)'d_i$, as in (2-2), we can rewrite (2-16) as

\[ \exp\left[(\eta - \beta)'\left(\int_{-c}^{0} \Phi_{p,\tau}(t + a_i^d)\,\Psi_{r,\zeta}(t + a_i^d)'\,dt\right)\phi\right] \times \exp\left[(d_i - b_i)'\left(\int_{-c}^{0} \Phi_{q,\kappa}(t + a_i^d)\,\Psi_{r,\zeta}(t + a_i^d)'\,dt\right)\phi\right]. \]
If there is a uniform increase of $m$ in the trajectory, i.e. $\{Z_i(t + a_i^d) - X_i(t + a_i^d) = m,\ t \in [-c, 0]\}$ (this can also be looked upon as a vertical shift of the trajectory upwards by $m$), the above odds ratio simplifies to

\[ \exp\left[m\int_{-c}^{0} \gamma(t + a_i^d)\,dt\right] = \exp\left[cm\left(\phi_0 + (a_i^d - c/2)\phi_1\right)\right]. \tag{2-17} \]

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and ages at diagnosis when $m = 0.5$.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age   (-3,0)               (-5,0)               (-8,0)                (-10,0)
50    3.96 (2.10, 7.63)    4.57 (2.32, 8.73)    5.26 (2.41, 11.01)    5.46 (2.46, 11.24)
55    3.34 (2.02, 5.78)    3.75 (2.15, 6.43)    4.19 (2.24, 7.77)     4.32 (2.29, 7.95)
60    2.83 (1.92, 4.39)    3.08 (2.00, 4.77)    3.36 (2.08, 5.46)     3.44 (2.11, 5.63)
65    2.41 (1.79, 3.35)    2.55 (1.83, 3.59)    2.70 (1.90, 3.91)     2.76 (1.92, 4.02)
70    2.06 (1.62, 2.70)    2.12 (1.64, 2.77)    2.19 (1.68, 2.89)     2.22 (1.69, 2.97)
75    1.78 (1.41, 2.32)    1.77 (1.41, 2.24)    1.79 (1.41, 2.31)     1.80 (1.41, 2.38)
80    1.54 (1.16, 2.12)    1.48 (1.13, 1.98)    1.46 (1.11, 1.99)     1.47 (1.09, 2.07)

For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al., 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e. as past exposure observations are taken into account. However, the rate of increase is higher for lower ages at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios tend to decrease as longer exposure trajectories are considered.
This may imply that for a subject with very high age at diagnosis, his past exposure observations may not contain significant amounts of information about the present disease status. As before, we fitted the disease model on the interval $I = (-10, -5)$. The posterior means and 95% credible intervals of $\phi_0$ and $\phi_1$ are respectively 1.24 (0.29, 2.19) and -0.015 (-0.029, -0.003), implying that exposure observations recorded 5-10 years prior to diagnosis also have a significant effect on the current disease status. The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.

Table 2-2. Posterior means and 95% credible intervals of the odds ratio for $I = (-10, -5)$ for the linear influence model

Age at diagnosis   50              60             70             80
Mean               4.99            3.27           2.22           1.56
95% C.I.           (1.96, 10.41)   (1.91, 5.36)   (1.67, 2.98)   (1.10, 2.29)

2.6.3 Overall Model Comparison

For both the constant and linear influence models, we calculated the PPL criterion (described in Section 2.5.1) corresponding to different trajectory intervals and numbers of knots. These values are given in Table 2-3. The PPL values for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across almost all trajectory lengths; in Table 2-3 the only exception is the linear influence model on $(-2, 0)$. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. A further increase in the number of knots tends to worsen the fit; this agrees with the findings of Ruppert (2002). The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. Among the trajectories ending at diagnosis, the best fitting constant influence model seems to be the one with exposure trajectory $(-10, 0)$ and 3 knots.
For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures results in an improvement of model fit. This may be indicative of the fact that past exposure observations contain a significant amount of information about the current disease status. In addition, for the trajectory interval $I = (-10, -5)$, the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (it has the lowest PPL criterion among all the models considered).

Table 2-3. Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots

Knots   Model      (-2,0)   (-5,0)   (-8,0)   (-10,0)   (-10,-5)
0       Constant   47.54    47.02    47.20    47.65     47.81
        Linear     43.61    43.32    43.17    43.33     43.82
1       Constant   46.61    46.65    46.77    46.57     45.29
        Linear     42.80    42.83    42.91    42.90     42.94
2       Constant   45.83    45.50    45.72    46.32     44.69
        Linear     43.20    43.01    42.74    42.66     43.33
3       Constant   45.47    45.23    45.24    44.82     45.17
        Linear     43.47    43.05    42.72    42.73     43.43
4       Constant   45.35    45.67    45.27    45.31     45.54
        Linear     43.70    43.13    42.56    42.61     43.47
5       Constant   46.67    46.06    45.42    45.75     46.01
        Linear     43.91    43.20    43.12    42.93     43.48

2.6.4 Model Assessment

As mentioned before, the number of knots and the length of the exposure trajectory tend to interact in influencing the fit of the constant and linear influence models. Thus, for a fixed trajectory length, the optimal model can be selected as the one with the lowest value of the PPL criterion across all the knot choices. For the linear influence model, the lowest PPL value was recorded for $I = (-8, 0)$ and 4 knots. So, we perform our model assessment procedure on this model.
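The selection rule can be checked mechanically against the PPL values reported in Table 2-3. The sketch below is only an illustration: the dictionary layout and the function name `best_model` are hypothetical, and the interval signs assume that all exposure windows precede diagnosis. It picks out the linear influence model with $I = (-8, 0)$ and 4 knots.

```python
# Intervals in years relative to diagnosis (signs are an assumption of this
# sketch: every window is taken to lie before diagnosis).
INTERVALS = [(-2, 0), (-5, 0), (-8, 0), (-10, 0), (-10, -5)]

# PPL values transcribed from Table 2-3, indexed by model and number of knots.
PPL = {
    "constant": {
        0: [47.54, 47.02, 47.20, 47.65, 47.81],
        1: [46.61, 46.65, 46.77, 46.57, 45.29],
        2: [45.83, 45.50, 45.72, 46.32, 44.69],
        3: [45.47, 45.23, 45.24, 44.82, 45.17],
        4: [45.35, 45.67, 45.27, 45.31, 45.54],
        5: [46.67, 46.06, 45.42, 45.75, 46.01],
    },
    "linear": {
        0: [43.61, 43.32, 43.17, 43.33, 43.82],
        1: [42.80, 42.83, 42.91, 42.90, 42.94],
        2: [43.20, 43.01, 42.74, 42.66, 43.33],
        3: [43.47, 43.05, 42.72, 42.73, 43.43],
        4: [43.70, 43.13, 42.56, 42.61, 43.47],
        5: [43.91, 43.20, 43.12, 42.93, 43.48],
    },
}


def best_model(model, ending_at_diagnosis=False):
    """Return (ppl, knots, interval) with the smallest PPL for a model class.

    If ending_at_diagnosis is True, only windows of the form (-c, 0) are
    considered.
    """
    candidates = []
    for knots, row in PPL[model].items():
        for interval, value in zip(INTERVALS, row):
            if ending_at_diagnosis and interval[1] != 0:
                continue
            candidates.append((value, knots, interval))
    return min(candidates)


best_linear = best_model("linear")
best_constant_to_diagnosis = best_model("constant", ending_at_diagnosis=True)
```

Restricting the constant influence models to windows that end at diagnosis recovers the $(-10, 0)$, 3-knot model, while the overall minimum for the linear class is the model carried forward to the assessment step.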
For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis. We deleted each subject (with all of his observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case-deleted posterior means and 95% credible intervals for $\beta_1$, $\phi_0$ and $\phi_1$. (In these figures, the solid and dashed horizontal lines respectively indicate the estimated posterior mean and 95% credible interval of the respective parameter based on the full data posterior. The solid points denote the importance weighted case-deleted posterior means while the vertical line segments are the 95% case-deleted posterior intervals.) None of the subjects seems to be very influential on the parameter estimates. For every subject, we also looked at the difference between the predicted probability of disease based on the full data and that with the subject deleted. Figure 2-2 (d) shows the plot of the posterior means of these differences and the corresponding posterior intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case-deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences.) Surprisingly, the observation for case number 108 departs markedly from the rest. On analyzing this subject, it was found that he had the unique combination of very high age and very high values of PSA. In fact he had the highest mean age in the sample, the highest age at diagnosis and the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease. We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components.
None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory $I = (-8, 0)$ and 4 knots seems to fit the observed data relatively well.

[Figure 2-2. Sensitivity of $\beta_1$, $\phi_0$, $\phi_1$ and disease probability estimates to case deletions. Panels A to C show the three parameters and panel D the disease probability; the horizontal axis in each panel is the deleted case (0 to 140).]

2.7 Conclusion and Discussion

Case-control studies have witnessed a wide variety of research over the years. Fundamental and far reaching contributions have been made both in the frequentist and Bayesian domains. Generally, the bulk of research has dealt with the standard situation where exposure observations for cases and controls are collected at a single time point in the past. Some medical studies, however, have suggested that it may be worthwhile to take into account an entire exposure history, if available, in assessing the disease-exposure relationship. Case-control studies involving longitudinal exposure trajectories are a relatively unexplored area. At the same time, it is a promising one given the wide variety of longitudinal data analytic tools that are now available. Moreover, recent developments in the area of semiparametric and nonparametric regression analysis have added more flexibility in this direction, especially when exposure trajectories have complicated and unknown functional forms. In this work, we have applied semiparametric regression techniques in analyzing longitudinal case-control studies.
We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e. when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a P-spline to account for any time varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way which allowed us to consider differential lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the mixture of normals approximation to the logit link (Albert and Chib, 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the dataset was collected retrospectively. The analysis has been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case-control study dealing with the association between prostate specific antigen (PSA) and prostate cancer. We analyzed our model using differential lengths of exposure trajectories. In doing so, we have concluded that past exposure observations do provide significant information towards predicting the current disease status of a subject. Specifically, we have shown that for most age-at-diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones.
We also observed that for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh, 1998). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case deletion diagnostics. Both these tools suggested that our model fits the data relatively well. Some interesting extensions can be made to our setup. For richer datasets, it will be interesting to model the subject-specific deviation functions as P-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger data set, a P-spline formulation can also be used for the influence function, which may bring out any underlying nonlinear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multicategory disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

CHAPTER 3
ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS: A BAYESIAN SEMIPARAMETRIC APPROACH

3.1 Introduction

Sample survey methodologies are widely used for collecting relevant information about a population of interest over time. Apart from providing population level estimates, surveys are also designed to estimate various features of subpopulations or domains.
Domains may be geographic areas like a state or province, county, school district etc., or can even be identified by a particular socio-demographic characteristic like a specific age-sex group. Sometimes, the domain-specific sample size may be too small to yield direct estimates of adequate precision. This led to the development of small area estimation procedures, which specifically deal with the estimation of various features of small domains. Generally, observations on various characteristics of small areas are collected over time, and thus may possess a complicated underlying time-varying pattern. It is likely that models which exploit the time-varying pattern in the observations may perform better than classical small area models which do not utilize this feature. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal pattern in the response and the covariates.

3.1.1 SAIPE Program and Related Methodology

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. There are also many state and local programs that depend on these estimates. Prior to the creation of the SAIPE program, the decennial census was the only source of income and poverty statistics for households, families and individuals related to small geographic areas like counties, cities and other substate areas. Due to the ten-year lag in the release of successive census values, there was a large gap in information concerning fluctuations in the economic situation of the country in general and local areas in particular.
The establishment of the SAIPE program has largely mitigated this issue. The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot, 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from the ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS).

Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from the ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999. This is because the most recent census estimates correspond to the year 1999, and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 uses as covariates the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the March 2000 CPS.
Bayesian techniques are used to weigh the contributions of the CPS median income estimates and the regression predictions of the median income based on their relative precision. The standard deviations of the error terms are estimated by fitting a model to the estimates of sampling error covariance matrices of the CPS median household income estimates for several years. The mean function in this model is referred to as a "generalized variance function" (Bell, 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, both in the 1989 and 1999 models.

3.1.2 Related Research

Estimation of median income for small areas contributes to the policy making process of many Federal and State agencies. Before the establishment of the SAIPE program, the estimation of median income for four-person families was of general interest. The Census Bureau used the ideas suggested by Fay (1987) in this regard. Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time-specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four- and five-person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates.
In general, the time series model always performed better than its non-time series counterpart.

Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008), in which they combined small area random effects with a smooth, nonparametrically specified trend using penalized splines. In doing so, they expressed the nonparametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple nonparametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the north eastern states of the U.S.

3.1.3 Motivation and Overview

The motivation of our work also originates from the repetitive nature of the CPS median income estimates. But, in contrast to the approach of Ghosh et al. (1996), we have viewed the state-specific annual household median income values as longitudinal profiles or "income trajectories". This gained more ground because we used the statewide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that the CPS median income may have an underlying nonlinear pattern with respect to the IRS mean income, especially for large values of the latter.
The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx, 1996), which is a commonly used but powerful function estimation tool in nonparametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and number of knots, although other types of basis functions like B-splines or thin plate splines can also be used. We have worked with two types of models, viz. a regular semiparametric model and a semiparametric random walk model. For each of these models, the analysis has been carried out using a hierarchical Bayesian approach. Since we chose noninformative improper priors for the regression parameters, propriety of the posterior has been proved before proceeding with the computations. A Markov chain Monte Carlo methodology, specifically Gibbs sampling (Gelfand and Ghosh, 1998), has been used to obtain the parameter estimates. We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test for their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on.

Figure 3-1. Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS mean income; 2nd column: IRS median income.)
We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the "gold standard" since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state level models. The fact that our estimates still improve on the SAIPE model based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data analysis with regard to the median household income dataset. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness-of-fit of our models. We end with a discussion in Section 3.6.

Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. A) IRS mean income plot. B) IRS median income plot.
The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2 Model Specification

3.2.1 General Notation

Let $Y_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the $i$th small area at the $j$th time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$). The target of inference is generally $\theta_{ij}$ or some function of it. Specifically, in our context, $\theta_{ij}$ is a scalar which denotes the median household income of the $i$th state at the $j$th year. We are interested in estimating $(\theta_{1u}, \ldots, \theta_{mu})'$, i.e., the median household income for all the states at time $u$. We may also want to estimate the difference in median household incomes at times $v$ and $u$, i.e., $(\theta_{1v} - \theta_{1u}, \ldots, \theta_{mv} - \theta_{mu})'$. We denote by $X_{ij}$ the covariate corresponding to the $i$th state and $j$th year.

3.2.2 Semiparametric Income Trajectory Models

We assume the following two semiparametric models:

3.2.2.1 Model I: Basic Semiparametric Model (SPM)

Let $Y_{ij}$ and $x_{ij}$ denote the CPS median household income and the IRS mean (or median) income recorded for the $i$th state at the $j$th year. The basic semiparametric model can be expressed as

\[ Y_{ij} = f(x_{ij}) + b_i + u_{ij} + e_{ij} \tag{3-1} \]

where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate relationship. We approximate $f(x_{ij})$ using a P-spline and rewrite (3-1) as

\[ Y_{ij} = \beta_0 + \beta_1 x_{ij} + \cdots + \beta_p x_{ij}^p + \sum_{k=1}^{K} \gamma_k (x_{ij} - \tau_k)_+^p + b_i + u_{ij} + e_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \tag{3-2} \]

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$ is our target of inference. Here $X_{ij} = (1, x_{ij}, \ldots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \ldots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \ldots, \beta_p)'$ is the vector of regression coefficients, while $\gamma = (\gamma_1, \ldots, \gamma_K)'$ is the vector of spline coefficients. The above spline model with degree $p$ can adequately approximate any unspecified smooth function. Typically, linear ($p = 1$) or quadratic ($p = 2$) splines serve most practical purposes since they ensure adequate smoothness in the fitted curve.
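To make the design vectors $X_{ij}$ and $Z_{ij}$ in (3-2) concrete, the following is a minimal sketch of how the truncated polynomial basis could be built with NumPy. The function name `truncated_power_basis`, the covariate values and the knot locations are illustrative assumptions and are not taken from the thesis data.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=1):
    """Return (X, Z) for a degree-p truncated polynomial P-spline.

    X has columns 1, x, ..., x^p          (fixed effects, coefficients beta).
    Z has columns (x - tau_k)_+^p per knot (spline coefficients gamma).
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Polynomial part: powers of x in increasing order
    X = np.vander(x, N=degree + 1, increasing=True)
    # Truncated part: positive part of (x - tau_k), raised to the degree
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    return X, Z

# Hypothetical covariate values and knots (not from the thesis data)
x = np.array([30000.0, 40000.0, 50000.0, 60000.0])
knots = np.array([35000.0, 45000.0, 55000.0])
X, Z = truncated_power_basis(x, knots, degree=1)
```

With a linear spline ($p = 1$), each column of `Z` is zero to the left of its knot and grows linearly to the right, so each $\gamma_k$ acts as a change in slope at $\tau_k$.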
$m$ and $t$ respectively denote the number of small areas and the number of time points at which the response and covariates are measured. Thus, in our case, $m = 51$, for all the 50 states of the U.S. and the District of Columbia, and $t = 5$ for the years 1995-1999. $b_i$ is a state-specific random effect while $u_{ij}$ represents an interaction effect between the $i$th state and the $j$th year. We assume $b_i \overset{iid}{\sim} N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls the amount of smoothing of the underlying income trajectory. Moreover, it is assumed that $u_{ij}$ and $e_{ij}$ are mutually independent with $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$, where the $\sigma_{ij}$ are the sampling standard deviations corresponding to the CPS direct median income estimates obtained using the "generalized variance function" technique mentioned in Section 3.1.1. In the datasets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

From (3-1) and (3-2), we have $\theta_{ij} = f(x_{ij}) + b_i + u_{ij}$, which reflects our basic assumption that the true unknown household median income may have an unspecified variational pattern with the IRS mean (or median) income. Thus, the covariate effect is expressed by the unspecified nonparametric function $f(x_{ij})$, which reflects the possible nonlinear effect of $x_{ij}$ on $\theta_{ij}$.

3.2.2.2 Model II: Semiparametric Random Walk Model (SPRWM)

Since, for each state, the response and the covariates are collected over time, there may be a definite trend in their behavior. Thus, we added a time-specific random component to (3-1) and modeled it as a random walk as follows:

\[ Y_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \tag{3-3} \]

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$. Here, $v_j$ denotes the time-specific random component. We assume that $(v_j \mid v_{j-1}, \sigma_v^2) \sim N(v_{j-1}, \sigma_v^2)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \sim N(0, \sigma_v^2)$. This is the so-called random walk model and is similar to the system equations used in dynamic linear models.
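The random walk assumption on $v_j$ can be illustrated with a short simulation. The sketch below is only an illustration under assumed values: the function name `simulate_random_walk` and the value of `sigma_v` are hypothetical. It draws $v_j = v_{j-1} + w_j$ with $v_0 = 0$ as a cumulative sum of independent normal increments, and checks by Monte Carlo that the variance of $v_j$ grows roughly like $j \sigma_v^2$.

```python
import numpy as np

def simulate_random_walk(t, sigma_v, rng):
    """Draw v_1, ..., v_t with v_j = v_{j-1} + w_j, w_j ~ N(0, sigma_v^2), v_0 = 0."""
    w = rng.normal(loc=0.0, scale=sigma_v, size=t)
    return np.cumsum(w)

rng = np.random.default_rng(0)
t = 5              # five time points, as for the years 1995-1999
sigma_v = 500.0    # assumed value, purely illustrative

v = simulate_random_walk(t, sigma_v, rng)

# Monte Carlo check: each row is one simulated path v_1, ..., v_t
paths = np.cumsum(rng.normal(0.0, sigma_v, size=(20000, t)), axis=1)
var_ratio = paths.var(axis=0) / sigma_v**2   # roughly proportional to j
```

Because the variance accumulates over time, later years are allowed to drift further from the starting level than earlier ones, which is the behaviour the time-specific component is meant to capture.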
Before proceeding to the next section, we may note that unlike the models of Ghosh et al. (1996), the models given in (3-2) and (3-3) incorporate state-specific random effects ($b_i$). This rectifies a limitation of the former as pointed out in Rao (2003).

3.3 Hierarchical Bayesian Inference

3.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $X_i = (X_{i1}, \ldots, X_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$ be the covariates for the $i$th state. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \psi^2, \sigma_b^2, \sigma_\gamma^2)$ be the parameter space corresponding to the $i$th state, where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$ and $\psi^2 = (\psi_1^2, \ldots, \psi_t^2)'$. Thus, the full parameter space will be given by $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_m$. For the $i$th state, the likelihood corresponding to Model I (SPM) can be written as

\[ L(Y_i, X_i, Z_i \mid \Omega_i) \propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \psi^2, X_i, Z_i)\, L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2) = \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \sigma_{ij}^2)\, L(\theta_{ij} \mid X_{ij}'\beta + Z_{ij}'\gamma + b_i, \psi_j^2) \right\} L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2) \tag{3-4} \]

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal densities with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.

For the random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \psi^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$, where $v = (v_1, \ldots, v_t)'$ is the vector of time-specific random effects. Thus, the likelihood function for the $i$th state will have an extra component corresponding to $v$ as follows:

\[ L(Y_i, X_i, Z_i \mid \Omega_i) \propto \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \sigma_{ij}^2)\, L(\theta_{ij} \mid X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \psi_j^2)\, L(v_j \mid v_{j-1}, \sigma_v^2) \right\} \times L(b_i \mid \sigma_b^2)\, L(\gamma \mid \sigma_\gamma^2) \tag{3-5} \]

where $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal density with mean $v_{j-1}$ and variance $\sigma_v^2$, and $v_0 = 0$.

3.3.2 Prior Specification

To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume a noninformative improper uniform prior for the polynomial coefficients (or fixed effects) $\beta$ and proper conjugate gamma priors on the inverses of the variance components $(\psi_1^2, \ldots, \psi_t^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$. The prior distributions are assumed to be mutually independent.
We choose small values (0.001) for the gamma shape and rate parameters to make the priors diffuse in nature so that inference is mainly controlled by the data distribution. Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$, $(\sigma_b^2)^{-1} \sim G(c, d)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the density $f(x) \propto x^{a-1} \exp(-bx)$, $x > 0$.

Since we have chosen an improper prior for $\beta$, propriety of the full posterior has been shown. We have the following theorem.

Theorem 1. Let $\psi_{\max}^2 = \max(\psi_1^2, \ldots, \psi_t^2) = \psi_k^2$, say, for some $k \in [1, \ldots, t]$. Then, posterior propriety holds if the following conditions are satisfied:
1. $(m - p - 5)/2 + c_k > 0$ and $d_k > 0$
2. $m/2 + c_j - 2 > 0$ and $d_j > 0$, $j = 1, \ldots, t$; $j \neq k$

3.3.3 Posterior Distribution and Inference

The full posterior of the parameters given the data is obtained in the usual way by combining the likelihood and the prior distribution as follows:

\[ p(\Omega \mid Y, X, Z) \propto \prod_{i=1}^{m} L(Y_i, X_i, Z_i \mid \Omega_i)\, \pi(\beta)\, \pi(\sigma_b^2)\, \pi(\sigma_\gamma^2) \prod_{j=1}^{t} \pi(\psi_j^2) \tag{3-6} \]

For the random walk model, there will be an additional term $\pi(\sigma_v^2)$. By the conditional independence properties, we can factorize the full posterior as

\[ [\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\, [\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\, [b \mid \sigma_b^2] \times [\gamma \mid \sigma_\gamma^2]\, [\beta]\, [\sigma_b^2]\, [\sigma_\gamma^2] \prod_{j=1}^{t} [\psi_j^2] \]

Our target of inference is $\{\theta_{ij}, i = 1, \ldots, m; j = 1, \ldots, t\}$, the true median household income of all the states. Since the marginal posterior distribution of $\theta_{ij}$ is analytically intractable, high dimensional integration needs to be carried out in a theoretical framework. However, this task can be easily accomplished in an MCMC framework by using the Gibbs sampler to sample from the full conditionals of $\theta_{ij}$ and other relevant parameters. In implementing the Gibbs sampler, we follow the recommendation of Gelman and Rubin (1992) and run $n$ ($> 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution.
To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

3.4 Data Analysis

We applied the semiparametric models in Section 3.2.2 to analyze the median household income dataset referred to in Section 3.1.3. The response variable $Y_{ij}$ and the covariate $X_{ij}$ respectively denote the CPS median household income estimate and the corresponding IRS mean (or median) income estimate for the $i$th state at the $j$th year ($i = 1, \ldots, 51$; $j = 1, \ldots, 5$). The state-specific mean or median income figures are obtained from IRS tax return data. The Census Bureau gets files of individual tax return data from the IRS for use in specifically approved projects such as SAIPE. For each state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI) across all the tax returns in that state. Like other SAIPE model covariates obtained from administrative records data, these variables do not exactly measure the median income across all households in the state. One reason is that the AGI is not necessarily the same as the exact income figure. Another is that the tax return universe does not cover the entire population, i.e., some households do not need to file tax returns, and those that do not are likely to differ with regard to income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not necessarily to be the same thing. Specifically for this study, we have used the IRS mean income as our covariate. This is because it seems to possess an underlying nonlinear relationship with the CPS median income (Figure 3-2A), and so it is more suited to a semiparametric analysis.
3.4.1 Comparison Measures and Knot Specification

Our dataset originally contained the median household income of all the states of the U.S. and the District of Columbia for the years 1995-2004. However, we only used the information for the five-year period 1995-1999 since our targets of inference are the state-specific median household incomes for 1999. We evaluated the performance of our estimates by comparing them to the corresponding census figures for 1999. This is because, in small area estimation problems, the census estimates are often treated as the "gold standard" against which all other estimates are compared. However, such a comparison is only possible for those years which immediately precede the census year, e.g., 1969, 1979, 1989 and 1999. In order to check the performance of our estimates, we use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i| / c_i$
* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|^2 / c_i^2$
* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$
* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

Here $c_i$ and $e_i$ respectively denote the census and model based estimates of median household income for the $i$th state ($i = 1, \ldots, 51$). Clearly, lower values of these measures would imply a better model based estimate.

The basic structure of our models remains the same as in Section 3.2.2. We have used a truncated polynomial basis for the P-spline component in both the models. Since Figure 3-2A does not indicate a high degree of nonlinearity, we have restricted ourselves to a linear spline ($p = 1$). The selection of knots is always a subjective and tricky issue in this kind of problem.
Sometimes experience with the subject matter may be a guiding force in placing the knots at the "optimum" locations where a sharp change in the curve pattern can be expected. Too few or too many knots generally worsen the fit. If too few knots are used, the complete underlying pattern may not be captured properly, thus resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may even degrade its quality (Ruppert, 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

3.4.2 Computational Details

We implemented and monitored the convergence of the Gibbs sampler following the general guidelines given in Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and with a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots, and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the smallest of those as the final measures presented in the tables that follow.
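As an illustration of the convergence check, the following is a minimal sketch of the basic Gelman-Rubin potential scale reduction factor for a single scalar parameter, computed from parallel chains after burn-in. It uses the simple between/within variance form without the degrees-of-freedom correction and is not the code used for the analysis; the function name `gelman_rubin` and the simulated chains are assumptions made for the example.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: array of shape (n_chains, n_draws), burn-in already discarded.
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    B = n_draws * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # average within-chain variance
    var_hat = (n_draws - 1) / n_draws * W + B / n_draws  # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
# Three well-mixed chains of 10,000 draws (simulated, not real sampler output)
good = rng.normal(0.0, 1.0, size=(3, 10000))
# Three chains stuck around different levels, mimicking poor convergence
bad = good + np.array([[0.0], [3.0], [6.0]])

rhat_good = gelman_rubin(good)
rhat_bad = gelman_rubin(bad)
```

Values close to 1 suggest that the chains have mixed, while values well above 1 indicate that the chains still disagree about the location of the posterior.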
3.4.3 Analytical Results

Data on CPS median income and IRS mean income were available for 50 states and the District of Columbia for the time span 1995-2004. CPS median income ranged from $24,879.68 to $52,778.94 with a mean of $36,868.48 and a standard deviation of $5,954.94, while IRS mean annual income ranged from $27,910 to $72,769.38 with a mean of $41,133.45 and a standard deviation of $7,196.56.

We fitted Model I (SPM) with all possible knot choices from 0 to 40, but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all the four comparison measures. Addition of more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model based estimates were slightly superior to the SPM estimates.

Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory, so the random walk component does not lead to any further improvement. Lastly, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots. It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit to the CPS median income observations. The same is true for the coefficients of the SPM.

Table 3-1. Parameter estimates of SPRWM with 5 knots

Parameter     Mean       Median     95% CI
$\beta_0$     4677.71    4660.08    (4633.31, 4758.7)
$\beta_1$     0.8156     0.816      (0.814, 0.817)
$\gamma_1$    -0.154     -0.154     (-0.158, -0.149)
$\gamma_2$    0.02       0.024      (-0.016, 0.040)
$\gamma_3$    0.008      0.016      (-0.056, 0.066)
$\gamma_4$    -0.093     -0.119     (-0.127, -0.037)
$\gamma_5$    -0.165     -0.173     (-0.187, -0.139)

3.4.4 Knot Realignment

As mentioned in Section 3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the possibility that the semiparametric models have room for improvement. In this section, we look for possible deficiencies in our models and try to come up with improvements, if there are any.

As mentioned in Section 3.4.1, the selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern. Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both the cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both the figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure.
Thus, in both the figures, the coverage area of the knots (i.e., the part of the observational pattern which is captured by the knots) is the region to the left of the dotted vertical lines. On the other hand, the nonlinear pattern is tangible only in the low density area of the plot, i.e., the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both the cases (5 and 7 knots), the underlying nonlinear observational pattern is not being adequately captured.

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom. A) Positioning of 5 knots. B) Positioning of 7 knots.

As a natural solution to this issue, we decided to place half of the knots in the low density region of the graph and the other half in the high density region. The exact boundary line between the high density and low density regions is hard to determine. We tested different alternatives and came up with IRS mean = 47000 as a tentative boundary because it gave the best results. In both the regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots. It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones.
The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e., beyond IRS mean = 56000), possibly due to the very low density of the observations in that area.

Figure 3-4. Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
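The two knot placement schemes discussed in this section can be sketched as follows. This is only an illustration on simulated right-skewed covariate values, not the IRS data: the helper names `quantile_knots` and `realigned_knots` are hypothetical, and the way an odd number of knots is divided between the two regions (three in the dense region, two in the sparse region for 5 knots) is an assumption, since the text only states that the knots were split roughly in half.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Knots at equally spaced interior sample quantiles of x."""
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    return np.quantile(x, probs)

def realigned_knots(x, n_knots, boundary):
    """Place about half the knots on each side of the boundary,
    using equally spaced sample quantiles within each region."""
    x = np.asarray(x, dtype=float)
    n_sparse = n_knots // 2            # knots in the low density region (assumed split)
    n_dense = n_knots - n_sparse       # knots in the high density region
    dense, sparse = x[x <= boundary], x[x > boundary]
    return np.concatenate([quantile_knots(dense, n_dense),
                           quantile_knots(sparse, n_sparse)])

rng = np.random.default_rng(2)
# Simulated right-skewed covariate (51 areas x 5 years), standing in for IRS mean income
x = 28000.0 + rng.gamma(shape=3.0, scale=4500.0, size=255)

old = quantile_knots(x, 5)                        # original quantile-based placement
new = realigned_knots(x, 5, boundary=47000.0)     # realigned placement
```

In this sketch the realigned knots reach further into the upper tail of the covariate than the purely quantile-based ones, which mirrors the additional coverage described above.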
Overall, it seems that the new knots can capture some of the underlying nonlinear pattern in the dataset, which the old knots failed to do. We also experimented by placing all the knots in the low density region (beyond IRS mean = 47000), but the results were not satisfactory. This indicates that the knots should be placed uniformly throughout the range of the independent variable to get an optimal fit. We have worked with 5 knots because this choice performed consistently well for both the SPM and SPRWM models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 32 reports the comparison measures for the raw CPS estimates, the SAIPE estimates and the semiparametric estimates with the knot realignment, while Table 33 gives the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5)* and SPRWM(5)* respectively denote the semiparametric models with the realigned 5 knots.

Table 32. Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010

Table 33. Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates

Estimate   Model        ARB      ASRB     AAB      ASD
SAIPE      SPM(5)*      14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)*    9.51%    13.33%   11.78%   12.37%
CPS        SPM(5)*      32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)*    28.92%   51.85%   28.36%   48.17%

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that, with the realigned knots, the semiparametric model estimates perform better than the SAIPE estimates.
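The entries of Table 33 can be checked directly from Table 32. The short Python sketch below assumes that "percentage improvement" means the relative reduction of a comparison measure with respect to the baseline estimate. That definition is an assumption, but it reproduces the tabulated percentages up to rounding. The dictionary layout and the function name are hypothetical.

```python
# Comparison measures copied from Table 32.
measures = {
    "CPS":       {"ARB": 0.0415, "ASRB": 0.0027, "AAB": 1753.33, "ASD": 5300023},
    "SAIPE":     {"ARB": 0.0326, "ASRB": 0.0015, "AAB": 1423.75, "ASD": 3134906},
    "SPM(5)*":   {"ARB": 0.0280, "ASRB": 0.0012, "AAB": 1173.71, "ASD": 2334379},
    "SPRWM(5)*": {"ARB": 0.0295, "ASRB": 0.0013, "AAB": 1256.08, "ASD": 2747010},
}


def pct_improvement(baseline, model):
    """Relative reduction (in percent) of each measure with respect to the baseline."""
    return {name: 100.0 * (baseline[name] - model[name]) / baseline[name]
            for name in baseline}


for base in ("SAIPE", "CPS"):
    for model in ("SPM(5)*", "SPRWM(5)*"):
        gains = pct_improvement(measures[base], measures[model])
        row = "  ".join(f"{name} {value:6.2f}%" for name, value in gains.items())
        print(f"{model:10s} over {base:5s}: {row}")
```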
This improvement is apparently due to the additional coverage of the observational pattern that is achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment and with different numbers of knots. Table 34 and Table 35 report the posterior mean, median and 95% CI for the parameters in SPM(5)* and SPRWM(5)* respectively.

Table 34. Parameter estimates of SPM(5)*

Parameter    Mean      Median    95% CI
$\beta_0$    4767.48   4769.04   (4743.33, 4791.67)
$\beta_1$    0.811     0.810     (0.809, 0.812)
$\gamma_1$   0.189     0.191     (0.198, 0.180)
$\gamma_2$   0.0389    0.0395    (0.0189, 0.059)
$\gamma_3$   0.104     0.102     (0.099, 0.126)
$\gamma_4$   0.240     0.253     (0.305, 0.179)
$\gamma_5$   0.127     0.155     (0.181, 0.081)

Table 35. Parameter estimates of SPRWM(5)*

Parameter    Mean      Median    95% CI
$\beta_0$    4826.28   4824.39   (4806.77, 4860.56)
$\beta_1$    0.806     0.809     (0.801, 0.810)
$\gamma_1$   0.159     0.156     (0.183, 0.151)
$\gamma_2$   0.014     0.012     (0.004, 0.039)
$\gamma_3$   0.08      0.08      (0.027, 0.123)
$\gamma_4$   0.237     0.244     (0.369, 0.125)
$\gamma_5$   0.225     0.183     (0.538, 0.085)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e., the $\gamma$'s) are significant for both SPM and SPRWM. For the old configuration, some of the knot coefficients were not significant. This corroborates the fact that, with the knot realignment, all five knots contribute significantly to the curve fitting process in terms of capturing the true underlying nonlinear pattern in the observations.

3.4.5 Comparison with an Alternate Model

We also compared the semiparametric models (with 5 knots) with the model proposed by Ghosh et al. (1996), henceforth referred to as the GNK model. Their univariate model is as follows:

\[ Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_j + u_{ij} + e_{ij} \tag{37} \]

where $b_j \sim N(0, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.
One of the major qualitative differences between the above model and our semiparametric models is that the former does not have a state specific random effect. In fact, it would also be interesting to compare the above model with the basic semiparametric model (SPM) with 0 knots, i.e.,

\[ Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_i + u_{ij} + e_{ij} \tag{38} \]

where $b_i \overset{i.i.d.}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distributions as above. Clearly, the only difference between (37) and (38) is that the former contains a time specific random component while the latter contains an area specific random component. Ghosh et al. (1996) showed that the estimates from the bivariate version of the GNK model (37) perform much better than the Census Bureau estimates in estimating the median household income of four-person families in the United States. Table 36 gives the comparison measures corresponding to the above models.

Table 36. Comparison measures for time series and other model estimates

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
GNK          0.0397   0.0025   1,709.58   5,229,869
SPM(0)       0.0337   0.0017   1,408.7    3,137,978
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010

It is clear that, although the estimates from the GNK model perform slightly better than the CPS estimates, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area specific random effect is much more critical than a time specific random component in this situation.

3.5 Model Assessment

To examine the goodness-of-fit of the semiparametric models, we used a Bayesian chi-square goodness-of-fit statistic (Johnson, 2004).
This is essentially an extension of the classical chi-square goodness-of-fit test in which the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained, which can be used for constructing global goodness-of-fit diagnostics. To construct this statistic, we form 10 equally spaced bins $((k-1)/10, k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde{\theta})$ to be random, where $\tilde{\theta}$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e., $Y_{ij}$ is allocated to the $k$th bin if $F(Y_{ij} \mid \tilde{\theta}) \in ((k-1)/10, k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

\[ R^B(\tilde{\theta}) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde{\theta}) - n p_k}{\sqrt{n p_k}} \right]^2 . \]

For the purpose of model assessment, two summary measures can be used, both derived from the posterior distribution of $R^B(\tilde{\theta})$. The first is the proportion of times the generated values of $R^B$ exceed the 0.95 quantile of a $\chi^2_9$ distribution. Values quite close to 0.05 would suggest a good fit. The second diagnostic is the probability that $R^B(\tilde{\theta})$ exceeds a $\chi^2_9$ deviate, i.e.,

\[ A = P\left(R^B(\tilde{\theta}) > X\right), \qquad X \sim \chi^2_9 . \]

Since the nominal value of this probability is 0.5, values close to 0.5 would suggest a good fit. The only assumptions for this statistic to work are that the observations should be conditionally independent and that the parameter vector should be finite dimensional.

[Figure 35. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. A) Basic semiparametric model. B) Semiparametric RW model. The X-axis depicts the expected order statistics (theoretical quantiles) from a $\chi^2$ distribution with 9 degrees of freedom.]

The second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between them. Thus, instead of taking all the observations (i.e., the CPS median income values), we decided to use only the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5, while for the random walk model (SPRWM) they were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 35A and 35B show the quantile-quantile plots of the $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2_9$ random variable. Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values, we think that our models provide a satisfactory fit to the data set and also that there are no coding errors.

3.6 Discussion

The proper estimation of median household income for different small areas is one of the principal goals of the U.S. Census Bureau. These estimates are frequently used by the Federal Government for the administration and maintenance of different federal programs and also for the allotment of federal grants to local jurisdictions. Although these estimates are available annually for every state, the U.S.
Census Bureau generally uses a non-longitudinal approach in its estimation procedure, based on the Fay-Herriot model (Fay and Herriot, 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an "income trajectory" using penalized splines (Eilers and Marx, 1996). We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. As our covariate, we have used the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. The analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures. Our analysis has shown that information on the past median income levels of different states does lend strength to the estimation of state specific median incomes for the current period. In fact, if there is an underlying nonlinear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use it in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting. The quality of the estimates (in terms of their "closeness" to the census estimates) tended to improve as the knots were positioned more uniformly throughout the range of the independent variable. It became apparent that the contribution of the knots towards deciphering the underlying observational pattern improved substantially when they were properly placed with an optimal coverage area.
This in turn improved the approximation of the curve vis-à-vis the true unknown observational pattern. This proved interesting because, as yet, there is no absolute rule which controls the positioning of knots. Our final estimates proved to be superior not only to the raw CPS estimates but also to the current U.S. Census Bureau (SAIPE) estimates. Although the basic semiparametric model performed much better than the semiparametric random walk model with 5 knots, more experiments need to be done with different knot positions and numbers of knots before anything conclusive can be said about their relative performance as a whole. It seems, however, that if enough knots are used and if they are placed uniformly throughout the range of the independent variable, then a random walk component may not improve the fit any further, provided there is no strong trend in the income levels. The main advantage of our modeling procedure is that it can be used for any possible pattern in the response (income, poverty, etc.) observations of small areas. In a subsequent work related to the estimation of median incomes of four-person families, we have shown that the multivariate version of the basic semiparametric model performs quite well too and provides estimates which are consistently superior to the U.S. Census Bureau estimates. The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the dataset. Some obvious extensions are as follows:

(1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel.
A more flexible extension would be to model the state-specific deviations as unspecified nonparametric functions as follows:

\[ Y_{ij} = f(x_{ij}) + g_i(x_{ij}) + u_{ij} + e_{ij}, \quad \text{where} \quad g_i(x_{ij}) = b_{i1} + b_{i2} x_{ij} + \sum_{k=1}^{K^*} w_{ik} (x_{ij} - \kappa_k)_+ \tag{39} \]

Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2} x$, and a nonlinear part, $\sum_{k=1}^{K^*} w_{ik} (x - \kappa_k)_+$, thus allowing for more flexibility. Both these components are random, with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly relevant in situations where the state-specific income trajectories are quite distinct from the population mean curve and thus need to be modeled explicitly. We plan to pursue this extension if we can procure a richer dataset with longer state specific income trajectories.

(2) Sometimes the function to be estimated (here the median income pattern) may have varying degrees of smoothness in different regions. In that case, a single smoothing parameter may not be appropriate and a spatially adaptive smoothing procedure can be used (Ruppert and Carroll, 2000).

(3) We used the truncated polynomial basis function to model the income trajectory, but other types of bases like B-splines, radial basis functions, etc. can also be used.

(4) Although we used a parametric normal distributional assumption for the random state and time specific effects, a broader class of distributions like mixtures of Dirichlet processes (MacEachern and Muller, 1998) or Polya trees (Hanson and Johnson, 2000) may be tested.

Last but not least, we think that the semiparametric modeling approach holds a lot of promise for small domain problems, especially when observations for each domain are collected over time. The associated class of semiparametric models may well be an attractive alternative to the models generally employed by the U.S. Census Bureau.
CHAPTER 4
ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES: A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH

4.1 Introduction

Small area estimation techniques have been widely used for estimating various features of small domains, that is, domains for which the sample size is prohibitively small for the application of direct survey based estimation procedures. Small domains can be specific regions like a state, county or school district, or can even be identified by a particular sociodemographic characteristic like a specific ethnic group. The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low income families. Since income characteristics for small areas are generally collected over time, there may well be a time varying pattern in those observations. Neglecting such patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of the median income of four-person families for the different states of the U.S. while explicitly accommodating the time varying pattern in the observations.

4.1.1 Census Bureau Methodology

The estimation of median incomes for different family sizes used to be carried out by the U.S. Census Bureau until a few years ago. More recently, the Bureau has established the Small Area Income and Poverty Estimates (SAIPE) program, which deals exclusively with the estimation of median household income and poverty for small areas across the United States.
But the estimation of the median income of four-person families remains interesting nevertheless. We now briefly discuss the estimation procedure that the U.S. Census Bureau used to follow towards that end. In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e., 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce. Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, the decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and of small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993). The Census Bureau based its estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, it used median income observations for three and five person families in addition to those of four person families.
The basic dataset for each state was a bivariate random vector, with one component being the CPS median income estimate of four person families and the other component being the weighted average of the CPS median incomes of three and five person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year census median (b) and the adjusted census median (c) corresponding to four person families and to the weighted average of three and five person families as covariates. The base year census median denotes the median income estimate obtained from the most recent decennial census, while the adjusted census median (c) for the current year is obtained by the relation

\[ \text{Adjusted census median (c)} = \frac{\text{PCI (c)}}{\text{PCI (b)}} \times \text{census median (b)} \]

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay, 1987; Fay et al., 1993) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

4.1.2 Related Literature

The estimation of median incomes for small areas has received sustained attention over the years. Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure.
They also performed a univariate and a multivariate hierarchical Bayesian (HB) analysis of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard errors and coefficients of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median incomes of four-person families for 1989 using 1979 as the base year. They compared their estimates with the CPS median income estimates and the Bureau of Census estimates by treating the decennial census values as the "gold standard". They used both univariate and bivariate model formulations. In all the cases, the time series model with the adjusted census median income as covariate performed better than the ones with either the base year census median as covariate or both the base year and adjusted census medians as covariates. In all the cases, the time series model also performed better than the non-time-series one, which only utilized the census median income figures for 1979, the CPS median income estimates for 1989 and the per capita incomes for 1979 and 1989. Last but not least, the bivariate time series model using the median incomes of four and five person families performed the best and outperformed both the CPS and Bureau of Census estimates of median income. Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation.
The pioneering contribution in this regard is the work by Opsomer et al. (2008), in which they combined small area random effects with a smooth, nonparametrically specified trend using penalized splines (Eilers and Marx, 1996). In doing so, they expressed the nonparametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple nonparametric bootstrap approach. They applied their model to a non-longitudinal, spatial dataset concerning the estimation of the mean acid neutralizing capacity (ANC) of lakes in the north eastern states of the U.S.

4.1.3 Motivation and Overview

The motivation of our work also originates from the longitudinal nature of the CPS median income estimates. However, instead of viewing the observations as a time series as in Ghosh et al. (1996), we have treated the state specific median income observations as longitudinal profiles or "income trajectories". As with any longitudinally varying observations, the income profiles (both state-specific and overall) may have a nonlinear pattern over time. Moreover, the successive income observations may be unbalanced in nature. These features motivated us to use a semiparametric regression approach in our modeling framework. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline), which is a commonly used but powerful function estimation tool in nonparametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and numbers of knots, although other types of basis functions like B-splines or thin plate splines can also be used. As covariates, we have used the adjusted census median incomes, since this was found to be the most effective covariate by Ghosh et al. (1996).
We tested four different regression models, viz. (1) a univariate model with only the CPS median income of four-person families as the response variable; (2) a bivariate model with the CPS median incomes of three and four person families as the response variables; (3) a bivariate model with the CPS median incomes of four and five person families as the response variables; and lastly (4) a bivariate model with the CPS median income of four person families and the weighted average of the CPS median incomes of three and five person families (with weights 0.75 and 0.25) as the response variables. In all the cases, our primary objective has been the estimation of the median incomes of four-person families for all the 50 U.S. states and the District of Columbia for 1989. For each of these models, the analysis has been carried out using a hierarchical Bayesian approach. Since we chose noninformative improper priors for the regression parameters, propriety of the posterior has been rigorously proved before proceeding with the computations (see Theorem 3 in the appendix). Markov chain Monte Carlo methodologies, specifically Gibbs sampling (Gelfand and Smith, 1990), have been used to obtain the parameter estimates. We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states.
Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time. The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some pointers towards future work in Section 4.5. The appendix contains the proofs of posterior propriety and the expressions of the full conditional distributions for our models.

4.2 Model Specification

4.2.1 Notation

Let $Y_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the $i$th small area at the $j$th time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$). In this study, we are concerned with the estimation of $\theta_{ij}$ or some function of it. For example, $\theta_{ij1}$ may be the median income of four-person families for the $i$th state at the $j$th year. In that case, we may be interested in estimating $(\theta_{1u1}, \ldots, \theta_{mu1})'$, the median incomes of four-person families for all the states at time $u$. We may also want to estimate the difference in median incomes of four-person families at times $v$ and $u$, i.e., $(\theta_{1v1} - \theta_{1u1}, \ldots, \theta_{mv1} - \theta_{mu1})'$. Correspondingly, let $X_{ij} = (X_{ij1}, \ldots, X_{ijs})'$ be the predictors corresponding to the $i$th state and $j$th year.

4.2.2 Semiparametric Modeling Framework

We consider both univariate and bivariate income trajectory models for the family-size dataset. The univariate modeling framework is exactly the same as explained in Chapter 3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2.
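4.2.2.1 Simple bivariate model

The bivariate non-random-walk model is given by

\[
\begin{aligned}
Y_{ij1} &= \beta_{01} + \beta_{11} x_{ij1} + \cdots + \beta_{p1} x_{ij1}^{p} + \sum_{k=1}^{K_1} \gamma_{k1} (x_{ij1} - \tau_{k1})_+^{p} + b_{i1} + u_{ij1} + e_{ij1} \\
Y_{ij2} &= \beta_{02} + \beta_{12} x_{ij2} + \cdots + \beta_{q2} x_{ij2}^{q} + \sum_{k=1}^{K_2} \gamma_{k2} (x_{ij2} - \tau_{k2})_+^{q} + b_{i2} + u_{ij2} + e_{ij2}
\end{aligned}
\tag{41}
\]

This is the most general structure, since the degrees of the spline as well as the number and positions of the knots are different for the two models. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\tau_{k1} = \tau_{k2}$, $k = 1, 2, \ldots, K_1\ (= K_2)$. Equation (41) can be rewritten as

\[ Y_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \tag{42} \]

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$. Here $\theta_{ij} = (\theta_{ij1}, \theta_{ij2})'$, $u_{ij} = (u_{ij1}, u_{ij2})'$, $e_{ij} = (e_{ij1}, e_{ij2})'$, $b_i = (b_{i1}, b_{i2})'$, $\beta = (\beta_{01}, \ldots, \beta_{p1}, \beta_{02}, \ldots, \beta_{q2})'$, $\gamma = (\gamma_{11}, \ldots, \gamma_{K_1 1}, \gamma_{12}, \ldots, \gamma_{K_2 2})'$,

\[
U_{ij}' = \begin{pmatrix} 1 & x_{ij1} & \cdots & x_{ij1}^{p} & 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 1 & x_{ij2} & \cdots & x_{ij2}^{q} \end{pmatrix}
\]

and

\[
Z_{ij}' = \begin{pmatrix} (x_{ij1} - \tau_{11})_+^{p} & \cdots & (x_{ij1} - \tau_{K_1 1})_+^{p} & 0 & \cdots & 0 \\ 0 & \cdots & 0 & (x_{ij2} - \tau_{12})_+^{q} & \cdots & (x_{ij2} - \tau_{K_2 2})_+^{q} \end{pmatrix}
\]

Analogous to the univariate case, we assume $b_i \overset{ind}{\sim} N(0, \Sigma_0)$ and $\gamma \sim N(0, \Sigma_\gamma)$. $e_{ij}$ and $u_{ij}$ are mutually independent, with $e_{ij} \overset{ind}{\sim} N(0, \Sigma_{ij})$ and $u_{ij} \overset{ind}{\sim} N(0, \Psi_j)$. For simplification purposes, we assume that $\Sigma_0$ and $\Sigma_{ij}$ are diagonal, where the diagonal elements of $\Sigma_{ij}$ are assumed to be known and are estimated from the data as in the univariate framework. The above bivariate model can easily be generalized to a multivariate framework if the need arises.

4.2.2.2 Bivariate random walk model

In order to model any conspicuous trend in the income observations for a specific family size and/or a specific state, we add a time specific random component to the simple bivariate model (42) as follows:

\[ Y_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \tag{43} \]

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$. As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{i.i.d.}{\sim} N(0, \Sigma_v)$.

4.3 Hierarchical Bayesian Analysis

In this section, the notation and expressions correspond to the bivariate setup. The expressions for the univariate setup are analogous and are given in detail in Chapter 3.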
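Before turning to the likelihood, the block structure of the design matrices in the simple bivariate model (Section 4.2.2.1) can be made concrete with a short Python sketch. It builds, for a single state and year, the two rows of the polynomial part and of the truncated polynomial part of the two response equations. The function names and the numerical inputs are hypothetical, and the sketch assumes spline degrees of at least 1. It is an illustration of the truncated polynomial basis, not the code used for the analysis.

```python
def truncated_power_row(x, degree, knots):
    """Return ([1, x, ..., x^degree], [(x - tau)_+^degree for each knot tau])."""
    poly = [x ** d for d in range(degree + 1)]
    trunc = [max(x - tau, 0.0) ** degree for tau in knots]
    return poly, trunc


def bivariate_design(x1, x2, p, q, knots1, knots2):
    """Block-diagonal rows: U is 2 x (p + q + 2), Z is 2 x (K1 + K2)."""
    poly1, trunc1 = truncated_power_row(x1, p, knots1)
    poly2, trunc2 = truncated_power_row(x2, q, knots2)
    U = [poly1 + [0.0] * len(poly2), [0.0] * len(poly1) + poly2]
    Z = [trunc1 + [0.0] * len(trunc2), [0.0] * len(trunc1) + trunc2]
    return U, Z


# Hypothetical covariate values, degrees and knots for one state and year.
U, Z = bivariate_design(x1=2.0, x2=3.0, p=1, q=2, knots1=[1.0, 2.5], knots2=[2.0])
print(U)  # first row: polynomial terms of x1; second row: polynomial terms of x2
print(Z)  # first row: truncated terms of x1; second row: truncated terms of x2
```

Multiplying the first row of each matrix by the coefficient vectors gives the mean structure of the first response only, which is why the off-diagonal blocks are zero.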
4.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $U_i = (U_{i1}, \ldots, U_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$ be the covariate vectors corresponding to the $i$th state. Here, $Y_{ij} = (Y_{ij1}, Y_{ij2})'$ and the expressions for $U_{ij}$ and $Z_{ij}$ are given above. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$ be the parameter space corresponding to the $i$th state, where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$. Thus, the full parameter space is given by $\Omega = \Omega_1 \times \cdots \times \Omega_m$. For the bivariate non-random-walk model, the likelihood function for the $i$th state is given by

\[
\begin{aligned}
L(Y_i, U_i, Z_i \mid \Omega_i) &\propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, U_i, Z_i)\, L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma) \\
&= \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \Sigma_{ij})\, L(\theta_{ij} \mid U_{ij}'\beta + Z_{ij}'\gamma + b_i, \Psi_j) \right\} L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma)
\end{aligned}
\tag{44}
\]

Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance covariance matrix $\Sigma$. For the bivariate random walk model, the parameter space for the $i$th state is $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$, where $v = (v_1, \ldots, v_t)'$ is the vector of time specific random effects. The hierarchical Bayesian framework is given by

1. $(Y_{ij} \mid \theta_{ij}) \sim N(\theta_{ij}, \Sigma_{ij})$
2. $(\theta_{ij} \mid \beta, \gamma, b_i, v_j, \Psi_j) \sim N(U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \Psi_j)$
3. $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$, assuming $v_0 = 0$
4. $(b_i \mid \Sigma_0) \sim N(0, \Sigma_0)$
5. $\gamma \sim N(0, \Sigma_\gamma)$

Thus, the likelihood function (44) for the $i$th state will have an extra component corresponding to $v$, given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which is a normal density with mean $v_{j-1}$ and covariance matrix $\Sigma_v$.

4.3.2 Prior Specification

To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume a noninformative improper uniform prior for the polynomial coefficients (or fixed effects) $\beta$ and proper conjugate inverse Wishart priors on the variance covariance matrices $(\{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$. The prior distributions are assumed to be mutually independent. We choose the inverse Wishart parameters in such a way that the priors are diffuse, so that inference is mainly controlled by the data distribution.
Thus, we have the following priors: $\beta \sim \mathrm{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$, $\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$ denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$, having the density $f(X) \propto |X|^{-(b+p+1)/2}\exp\{-\mathrm{tr}(AX^{-1})/2\}$, $p$ being the order of $A$.

4.3.3 Posterior Distribution and Inference

The full posterior of the parameters given the data is obtained in the usual way by combining the likelihood and the prior distributions as follows:

\[
p(\Omega \mid Y, U, Z) \propto \prod_{i=1}^{m} L(Y_i, U_i, Z_i \mid \Omega_i)\, \pi(\beta)\, \pi(\Sigma_0)\, \pi(\Sigma_\gamma) \prod_{j=1}^{t} \pi(\Psi_j).
\tag{4-5}
\]

For the random walk model there is an additional term $\pi(\Sigma_v)$. By conditional independence properties, we can factorize the full posterior as

\[
\begin{aligned}
[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto{}& [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, U, Z]\\
&\times [b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_0]\,[\Sigma_\gamma] \prod_{j=1}^{t} [\Psi_j].
\end{aligned}
\]

Our target of inference is $\{\theta_{ij}, i = 1, \ldots, m; j = 1, \ldots, t\}$, the true median incomes of four-person families for all the states. Since the marginal posterior distribution of $\theta_{ij}$ is analytically intractable, high dimensional integration would have to be carried out to obtain it in closed form. However, this task can be accomplished easily in an MCMC framework by using the Gibbs sampler to sample from the full conditionals of $\theta_{ij}$ and the other relevant parameters. In implementing the Gibbs sampler, we follow the recommendation of Gelman and Rubin (1992) and run $n (\geq 2)$ parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated from the remaining $d$ iterates. The full conditionals for both models are given in the appendix.
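The multiple-chain scheme described above (several parallel chains of length $2d$, with the first $d$ draws discarded) is usually paired with the potential scale reduction factor of Gelman and Rubin (1992). The sketch below is a simplified version for a single scalar quantity, assuming the draws are stored as an array with one row per chain; it omits the degrees-of-freedom correction of the original diagnostic, and the function names are hypothetical rather than taken from the dissertation.

```python
import numpy as np

def discard_burn_in(draws):
    """Keep the second half of each chain: 2d iterations in, d iterations out."""
    draws = np.asarray(draws, dtype=float)
    d = draws.shape[1] // 2
    return draws[:, d:]

def gelman_rubin(chains):
    """Simplified potential scale reduction factor for one scalar quantity.
    chains has shape (number of chains, number of retained draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

# Illustrative (simulated) draws: three chains of 2d = 2000 iterations each.
rng = np.random.default_rng(0)
retained = discard_burn_in(rng.normal(size=(3, 2000)))
r_hat = gelman_rubin(retained)
```

Values of the factor close to 1 suggest that the chains have mixed; values well above 1 suggest that longer runs or better starting points are needed.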
Once posterior samples are generated from the full conditionals of the parameters, Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$, where the subscript $kl$ denotes the $l$th iterate of the $k$th chain:

\[
E(\theta_{ij} \mid Y) \approx (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} \left(\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1}\right)^{-1} \left(\Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1}\left(U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl}\right)\right)
\tag{4-6}
\]

and, writing $V_{ijkl} = (\Sigma_{ij}^{-1} + \Psi_{jkl}^{-1})^{-1}$ and $a_{ijkl} = \Sigma_{ij}^{-1} Y_{ij} + \Psi_{jkl}^{-1}(U_{ij}'\beta_{kl} + Z_{ij}'\gamma_{kl} + b_{ikl})$ for brevity,

\[
\begin{aligned}
V(\theta_{ij} \mid Y) \approx{}& (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} V_{ijkl} + (nd)^{-1} \sum_{k=1}^{n} \sum_{l=d+1}^{2d} V_{ijkl}\, a_{ijkl}\, a_{ijkl}'\, V_{ijkl}\\
&- (nd)^{-2} \left[\sum_{k=1}^{n} \sum_{l=d+1}^{2d} V_{ijkl}\, a_{ijkl}\right] \left[\sum_{k=1}^{n} \sum_{l=d+1}^{2d} V_{ijkl}\, a_{ijkl}\right]'.
\end{aligned}
\tag{4-7}
\]

4.4 Data Analysis

We applied the semiparametric models of Section 4.2.2 to analyze the median income dataset referred to in Section 4.1.3. The basic dataset for our problem is the triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance covariance matrix $\Sigma_{ij}$ $(i = 1, \ldots, 51; j = 1, \ldots, 11)$. Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of four, three and five person families for the ith state and the jth year. $Y_{iju}$ is assumed to estimate the true unknown median income $\theta_{iju}$ $(u = 1, 2, 3)$. The corresponding adjusted census medians are denoted by $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$. The years correspond to 1979, ..., 1989. For the univariate setup, the response and covariate are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a pair whose first component is $Y_{ij1}$ and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75Y_{ij2} + 0.25Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our target of inference is the set of state specific median incomes of four-person families for 1989.

4.4.1 Comparison Measures and Knot Specification

In this study, our target of inference is the state specific median income of four-person families for the year 1989. We judged our estimates by comparing them to the corresponding census figures for 1989. In small area estimation problems, the census estimates are often treated as the "gold standard" against which all other estimates are compared.
However, such a comparison is only possible for those years which immediately precede the census year, i.e., 1969, 1979, 1989 and 1999. In order to check the performance of our estimates, we use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i| / c_i$
* Average Squared Relative Bias (ASRB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|^2 / c_i^2$
* Average Absolute Bias (AAB) $= (51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$
* Average Squared Deviation (ASD) $= (51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

Here $c_i$ and $e_i$ respectively denote the census and model based estimates of median income for the ith state $(i = 1, \ldots, 51)$. Clearly, lower values of these measures imply a better model based estimate.

The basic structure of our models remains the same as in Section 4.2.2. We have used linear truncated polynomial basis functions for the P-spline component in our models, since the median income profiles did not exhibit a high degree of nonlinearity. For highly nonlinear profiles a quadratic or cubic polynomial basis function representation can be used. In nonparametric regression problems, the proper selection of knots plays a critical role. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable so that the underlying observational pattern is properly captured. Too few or too many knots generally degrade the quality of the fit. If too few knots are used, the complete underlying pattern may not be captured properly, resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may lead to overparametrization (Ruppert, 2002).
Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (adjusted census median income).

4.4.2 Computational Details

We implemented the Gibbs sampler and monitored its convergence following the general guidelines given in Gelman and Rubin (1992). We ran three parallel chains, with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate t-distributions with 2 df having the same location and scale matrices as the corresponding multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chains from overdispersed distributions. Once initialized, the successive samples of the $\theta_{ij}$'s are generated from the regular multivariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. To diminish the effect of the starting distributions, the first $d$ iterations of each chain are discarded and the posterior summaries are based on the subsequent iterates. The comparison measures deviated slightly for different initial values. We chose the smallest of those as the final measures presented in the tables that follow.

4.4.3 Analytical Results

Data on CPS median incomes and adjusted census median incomes were available for the 50 states and the District of Columbia for the time span 1979-1989. CPS median income ranged from \$24,879.68 to \$52,778.94 with a mean of \$36,868.48 and a standard deviation of \$5,954.94, while adjusted census median income ranged from \$27,910 to \$72,769.38 with a mean of \$41,133.45 and a standard deviation of \$7,196.56. We fitted both the univariate and bivariate models to the median income dataset. In doing so, we worked with all possible knot choices from 0 to 40.
Here, we show only the results corresponding to the best performing model, i.e., the model with the lowest values of the comparison measures. In the univariate framework, the model with 3 knots in the income trajectory performed the best. Table 4-1 reports the comparison measures for this model (denoted as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates (Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS) estimates. Table 4-2 reports the percentage improvement of the time series, non-time series and semiparametric estimates over the Census Bureau estimates; a negative entry indicates that the estimate is worse than the Bureau estimate on that measure. From Table 4-1, it is clear that the semiparametric estimates significantly improve upon the CPS, time series and non-time series estimates with respect to all the comparison measures. In fact, the semiparametric estimates perform slightly better than the bivariate Census Bureau estimates too with respect to ARB and AAB. This is also reflected in Table 4-2, where the semiparametric estimates marginally improve upon the Bureau estimates for the above two comparison measures. Overall, the degree of dominance of the Bureau estimates over the time series and non-time series estimates is much larger than that over the semiparametric estimates. These results indicate that, in the univariate framework, the semiparametric model with 3 knots performs significantly better than the time series and non-time series models of Ghosh et al. (1996).

Table 4-1. Comparison measures for univariate estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0735   0.0084   2,928.82   13,811,122.39
Bureau      0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS      0.0338   0.0018   1,351.67    3,095,736.14
GNK.NTS     0.0363   0.0021   1,457.47    3,468,496.61
USPM(3)     0.0289   0.0014   1,169.74    2,549,698.26

Table 4-2. Percentage improvements of univariate estimates over Census Bureau estimates

Estimate    ARB       ASRB      AAB       ASD
GNK.TS      -14.19%   -38.46%   -14.17%   -43.90%
GNK.NTS     -22.64%   -61.54%   -23.11%   -61.22%
USPM(3)       2.37%    -7.69%     1.2%    -18.52%
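For readers who wish to reproduce this kind of comparison, the following Python sketch computes the four comparison measures of Section 4.4.1 and the percentage improvement over the Bureau estimate. It is a minimal illustration rather than the code used in the study: the function names are hypothetical, and the sign convention for the percentage improvement (positive when the competing estimate has the smaller measure) is an assumption that is consistent with the reported values.

```python
import numpy as np

def comparison_measures(census, estimate):
    """Return ARB, ASRB, AAB and ASD for census values c_i and estimates e_i."""
    c = np.asarray(census, dtype=float)
    e = np.asarray(estimate, dtype=float)
    abs_dev = np.abs(c - e)
    return {
        "ARB": float(np.mean(abs_dev / c)),
        "ASRB": float(np.mean(abs_dev ** 2 / c ** 2)),
        "AAB": float(np.mean(abs_dev)),
        "ASD": float(np.mean((c - e) ** 2)),
    }

def percentage_improvement(bureau_value, model_value):
    """Percentage by which a measure improves on the Bureau value.
    Assumed convention: positive means the model measure is smaller (better)."""
    return 100.0 * (bureau_value - model_value) / bureau_value

# Toy illustration with two made-up areas (not the study data).
toy = comparison_measures([100.0, 200.0], [110.0, 190.0])

# ARB values reported above for the Bureau (0.0296) and GNK.TS (0.0338).
arb_change = percentage_improvement(0.0296, 0.0338)
```

Because the published measures are rounded, percentages recomputed from them can differ in the last digit from those reported in the tables.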
Now, we move on to the bivariate nonrandom walk setup. First, we consider the model whose response vector consists of the CPS median incomes of 4 and 3 person families, i.e., $(Y_{ij1}, Y_{ij2})$. The covariates are the corresponding adjusted census medians. Since we assumed inverse Wishart priors for the variance covariance matrices, the values of the comparison measures depended on the degrees of freedom of the inverse Wishart distribution and the number of knots in the income trajectory. We worked with different combinations of the two in fitting these models. The best results (lowest comparison measures) were obtained for two models, both with 6 knots but with 7 and 9 degrees of freedom respectively. These models are denoted by BSPM(1)(4,3) and BSPM(2)(4,3) respectively.

Table 4-3. Comparison measures for bivariate nonrandom walk estimates

Estimate          ARB      ASRB     AAB        ASD
CPS               0.0735   0.0084   2,928.82   13,811,122.39
Bureau            0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS(4,3)       0.0295   0.0013   1,171.71    2,194,553.67
GNK.NTS(4,3)      0.0323   0.0016   1,287.78    2,610,249.94
BSPM(1)(4,3)      0.0274   0.0013   1,079.63    2,182,669.56
BSPM(2)(4,3)      0.0286   0.0011   1,131.61    1,880,089.29
GNK.TS(4,5)       0.0230   0.0009     932.51    1,618,025.33
GNK.NTS(4,5)      0.0295   0.0013   1,179.94    2,216,738.06
BSPM(4,5)         0.0255   0.0010   1,033.12    1,859,373.98
GNK.TS(4,3+5)     0.0287   0.0013   1,150.24    2,116,692.71
GNK.NTS(4,3+5)    0.0324   0.0015   1,297.12    2,530,938.06
BSPM(1)(4,3+5)    0.0271   0.0012   1,078.5     2,128,679.65
BSPM(2)(4,3+5)    0.0289   0.0012   1,132.10    1,838,598.30

When we consider the median incomes of 4 and 5 person families, the lowest comparison measures were obtained for the model with 4 knots in the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5). Lastly, for the model with the median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for two models, both with 6 knots and with 7 and 9 degrees of freedom respectively.
We denote these models as BSPM(1)(4,3+5) and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these models along with those of CPS, Bureau, and the corresponding bivariate GNK time series and non-time series estimates. Table 4-4 reports the percentage improvement of the above estimates over the Census Bureau estimates. From Table 4-3 and Table 4-4, it is clear that both the BSPM(4,3) and BSPM(4,3+5) estimates improve upon the bivariate time series and non-time series estimates with respect to nearly all of the four comparison measures. The semiparametric estimates also improve upon the Census Bureau estimates and the raw CPS estimates. For the model with the median incomes of four and five person families as response, the semiparametric estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but significantly improve upon the CPS and Census Bureau estimates.

Table 4-4. Percentage improvements of bivariate nonrandom walk estimates over Census Bureau estimates

Estimate          ARB      ASRB      AAB      ASD
GNK.TS(4,3)        0.48%     2.52%    1.03%    -2.01%
GNK.NTS(4,3)      -8.99%   -22.45%   -8.77%   -21.33%
BSPM(1)(4,3)       7.43%     0.00%    8.81%    -1.46%
BSPM(2)(4,3)       3.38%    15.38%    4.42%    12.61%
GNK.TS(4,5)       22.19%    30.52%   21.23%    24.79%
GNK.NTS(4,5)       0.31%     0.18%    0.33%    -3.04%
BSPM(4,5)         13.85%    23.08%   12.74%    13.57%
GNK.TS(4,3+5)      2.94%     3.56%    2.84%     1.61%
GNK.NTS(4,3+5)    -9.36%   -17.18%   -9.56%   -17.64%
BSPM(1)(4,3+5)     8.45%     7.69%    8.90%     1.05%
BSPM(2)(4,3+5)     2.37%     7.69%    4.37%    14.54%

Now let us consider the bivariate random walk model. For the case with 4 and 3 person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3, 6), (5, 6) and (9, 1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al. (1996) (GNK).
The random walk estimates also seem to improve marginally over those corresponding to the nonrandom walk semiparametric model. When we consider the median income estimates of 4 and 5 person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. Its comparison measures are significantly better than those of the CPS, the Bureau and the non-time series model of GNK. They fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the nonrandom walk model (BSPM(4,5)). We denote this model as BRWM(4,5). Lastly, for the model with the median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory. The comparison measures were significantly better than those of the CPS, Bureau and GNK (both time series and non-time series), and the model also improved upon the nonrandom walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Table 4-5. Comparison measures for bivariate random walk model

Estimate         ARB      ASRB     AAB        ASD
BRWM(1)(4,3)     0.0261   0.0011   1,043.33   1,902,416.1
BRWM(2)(4,3)     0.0274   0.0010   1,094.25   1,804,969.06
BRWM(3)(4,3)     0.0258   0.0012   1,037.03   2,114,599.65
BRWM(4,5)        0.0245   0.0010     978.12   1,672,183.6
BRWM(4,3+5)      0.0244   0.0011     990.50   1,941,833.29

4.5 Conclusion and Discussion

Estimates of various characteristics of small areas are frequently used by the federal government for formulating important policy decisions and for providing developmental funds to different states and local jurisdictions. These are also used by various local agencies to formulate business policies and other important decisions.
Often observations on these characteristics (for example, income and poverty estimates) are available at multiple time points in the past, thus resulting in a longitudinal profile or trajectory. Taking proper account of these time varying profiles may result in a significant improvement in the estimates of the same characteristics at some current or future time points. In this scenario, spline based semiparametric procedures have a clear edge over the usual parametric procedures, since the former can take into account virtually any possible pattern in the underlying profile and can also handle unbalanced observations with ease. Estimation of median incomes of four-person families for different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3, 4 and 5 person families for all the states and the District of Columbia for every year. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.

In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median incomes of four-person families for the U.S. states while explicitly accommodating the time varying pattern in the income observations. We used a bivariate semiparametric modeling framework in which we modeled the median incomes of a given pair of family sizes (for every state) as penalized splines. In doing so, we came up with estimates of median incomes of 4 person families which were significantly better than those obtained by the U.S. Bureau of the Census and were comparable to those obtained by the time series methodology of Ghosh et al. (1996). We also extended the basic semiparametric framework by incorporating a time series (random walk) component to account for the within state dependence in the successive income observations.
The class of random walk models seemed to improve upon their nonrandom walk counterparts, but more studies are required before reaching a definite conclusion about their relative performance. Overall, we strongly think that semiparametric procedures hold a lot of promise for small area estimation problems, specifically in situations where multiple time varying observations of some characteristic are available for the small areas.

CHAPTER 5
CONCLUSION AND FUTURE RESEARCH

In my dissertation, I have concentrated on the application of semiparametric methodologies in analyzing unorthodox data scenarios originating in diverse fields like case control studies and small area estimation. In the former scenario, I have used penalized splines to model longitudinal exposure profiles and their influence pattern on the current disease status for a group of cases and controls. In doing so, I have come to the conclusion that past exposure observations may have a significant effect on the present disease status. Our modeling framework is quite general and flexible in the sense that it can be used to model any possible pattern of exposure profiles and can also capture complex time varying patterns of influence of the exposure history on the current disease status. We applied our modeling framework to a nested case control study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA). In the second scenario, we have used semiparametric procedures to model the income trajectories of different small areas and have used that information to estimate the median incomes of those small areas at a given time point in the future. Our model based estimates seemed to perform better than the usual Bureau of the Census estimates, which are based on the income observations from a particular time point and hence are nonlongitudinal in nature.
We have also extended the semiparametric modeling framework to the bivariate scenario in estimating the median incomes of varying family sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above. The flexibility and power of the nonparametric and semiparametric procedures imply that a multitude of interesting and useful extensions can be carried out over what has already been done above. I will briefly go over some of the possible extensions below. These extensions are independent of the specific area or setting where they are applied, i.e., they apply equally to the case control and small area scenarios we have mentioned before.

5.1 Adaptive Knot Selection

As mentioned before, we have used penalized splines to model the exposure and influence profiles in the case control framework and the income trajectories in the small area estimation problem. As explained in Section 1.4, selection and proper positioning of knots is a vital aspect of any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variable, and that is what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low density region of the data pattern where the nonlinearity was mostly concentrated. This was probably because of the quantile dependent placement of the knots.
Recently, there has been some research on data-driven or "adaptive" knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being prespecified. The advantage of this procedure is that fewer knots are required, and these are placed in "optimal" locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the "best" model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatments of this problem revolve around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include Denison et al. (1998), who placed priors on the number and locations of the knots. They then sampled from the full posteriors of the parameters (including knot locations and numbers) using reversible jump MCMC methods (Green, 1995). However, they restricted the knots to be located only at the design points of the independent variable. DiMatteo et al. (2001) followed the same basic procedure as Denison et al. (1998) but did not restrict the knots to be located only at the design points of the experiment. They also penalized models with an unnecessarily large number of knots. Botts and Daniels (2008) proposed a flexible approach for fitting multiple curves to sparse functional data. In doing so, they treated the numbers and locations of the knots of the population averaged and subject specific curves as distinct random variables and sampled from their posterior distributions using reversible jump MCMC methods.
They used free-knot B-splines to model the population averaged and subject specific curves. In all the above contributions, Poisson priors are placed on the knot numbers while flat priors are placed on the knot positions. The usefulness and flexibility of the Bayesian approach lies in the fact that the number and locations of the knots are automatically determined by the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive Regression Splines. However, the sampling procedure is quite intensive since the parameter dimension varies at every iteration. Botts and Daniels substantially reduced the computational burden by dealing with the approximate posterior distribution of only the number and positions of the knots, integrating out the other parameters by using Laplace transformations.

An immediate but worthwhile extension to what we have already done would be to incorporate an adaptive knot selection scheme into both the case control and small area modeling frameworks. For the former setup, this would correspond to deciphering the optimal number of knots for the population mean PSA trajectory and the influence function. So, depending on the particular study or the dataset at hand, any underlying pattern in the influence profile (of the exposure trajectory on the disease state) can be automatically captured. For the second framework, an adaptive knot selection scheme would result in a spline fit that would adequately reflect any underlying heterogeneity in the time varying income trajectories of the different small areas. Some other interesting extensions to our work are:

1. Incorporating informative (nonignorable) missingness (Little and Rubin, 1987) in the longitudinal exposure (case control) or income (small area) profiles.
2. Incorporating nonparametric distributional structures like mixtures of Dirichlet processes (MacEachern and Muller, 1998) or Polya trees (Hanson and Johnson, 2000) on the subject (or area) specific random effects.
3. Extending the semiparametric case control modeling framework to situations involving multiple (> 2) or even categorical disease states.

Now, I briefly explain some work in which we are currently engaged.

5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent Class and Transitional Modelling

5.2.1 Introduction and Brief Literature Review

Longitudinal studies deal with repeated measurements of individuals over time. As a result, missingness is an integral part of these studies. Missingness can result from different causes like dropout or withdrawal from the course of treatment, intermittent absence from a visit, death due to unrelated causes, etc. In this study we will only consider missingness induced by dropout. Depending on the precise nature or causes of dropout, different missingness (or dropout) mechanisms have been formulated (Little and Rubin, 1987). Broadly these are of three types:

1. Missing completely at random (MCAR): Missingness induced by dropout is said to be MCAR if it is completely independent of the response.
2. Missing at random (MAR): Missingness induced by dropout is said to be MAR if dropout depends only on the observed data, i.e., dropout is unrelated to the unobserved data conditional on the observed data.
3. Missing not at random (MNAR): This occurs when missingness depends on the unobserved response at the time of dropout or at future times, even after conditioning on the observed data. This type of missingness is also known as nonignorable or informative dropout.

Nonignorable missingness can be handled by two distinct classes of models, viz. pattern-mixture and selection models, first formulated by Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response. In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns.
On the other hand, the selection modelling approach first models the hypothetical complete data, and then a model for the missing data process (conditional on the hypothetical complete data) is appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach. Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the ith subject. $D_i$ is such that

\[
D_i = \begin{cases} t & \text{if the ith subject drops out between the } (t-1)\text{th and } t\text{th observation times,}\\ T + 1 & \text{if the ith subject is a completer.} \end{cases}
\]

Here we assume that a subject is first measured at baseline $(t = 0)$. Thus, there will be $T$ unique dropout times. In the PM approach, it is assumed that subjects with different dropout times have different response distributions, i.e.,

\[
f(y_i \mid D_i) \neq f(y_i) \;\Longrightarrow\; f(y_i, D_i) \neq f(y_i)\, f(D_i).
\tag{5-1}
\]

So, for the ith subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on $f(y) = \sum_{D} f(y \mid D) P(D)$.

An important but realistic situation that may arise in longitudinal studies is that the number of unique dropout times $T$ (vis-a-vis the number of times a subject is measured) may be large. As a result, the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse patterns, which may result in unstable parameter estimates in those patterns since some of the parameters may be unidentifiable. There are different ways to get around this problem. Hogan and Laird (1998) suggested that parameters be shared across patterns. Hogan et al. (2004) suggested ways to group the $T$ dropout times into $m < T$ groups in an ad hoc fashion. Roy (2003) proposed an automated mechanism to do the above grouping using a latent variable approach within the context of normal models for continuous data.
This approach assumes the existence of a discrete latent variable that explains the dependence between the response vector and the dropout time and allows incorporation of uncertainty about the groupings, conditional on a fixed number of groups. Roy and Daniels (2008) extended the above approach by incorporating uncertainty in the number of classes through approximate Bayesian model averaging. In their approach, the marginal mean is assumed to follow a generalized linear model, while the mean conditional on the latent class and random effects is specified separately. Since the dimension of the parameter vector of interest (the marginal regression coefficients) does not depend on the assumed number of latent classes, they treat the number of latent classes as a random variable. A prior distribution is assumed for the number of classes and approximate posterior model probabilities are calculated. In order to avoid the complications of implementing a fully Bayesian model, they propose a simple approximation to these posterior probabilities. Lastly, they apply their methodology to a dataset dealing with the longitudinal study of depression in HIV-infected women.

Heagerty (1999) proposed marginally specified logistic normal models for longitudinal binary data. In doing so, he proposed an alternative parametrization of the logistic normal random effects model and studied both likelihood and estimating equation approaches to parameter estimation. A notable feature of his approach was that the marginal regression parameters still permit individual level predictions or contrasts. Heagerty (2002) also proposed a general parametric class of serial
dependence models which permit likelihood based marginal regression analysis of binary response data. These are known as marginalized transition models. Basically, such a model is a combination of a marginal regression model, used to characterize the dependence of the response on covariates, and a conditional regression model or transition model (Diggle et al., 2002), used to capture the serial dependence in the response process. There exists another class of models, known as marginalized latent variable models, which takes care of the exchangeable or nondiminishing dependence pattern among the repeated response observations using random intercepts. Schildcrout and Heagerty (2007) combined the marginalized transition and latent variable models by proposing a unifying model that takes into account both serial and long range dependence among the response observations. Their model can be used in situations with a moderate to large number of repeated measurements per subject, where both serial (short range) and exchangeable (long range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002), Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new model which accounts for both serial (short term) and long-range dependence among the response observations in situations where the number of unique dropout times is large. We group the dropout times using a latent variable approach, taking into account the uncertainty in the number of groups. We also model the marginal covariate effects of interest.

5.2.2 Modeling Framework

Longitudinal observations collected on an individual over multiple time points are always correlated since they correspond to the same subject. An established way of accounting for this dependence (in the response vector $Y_i$ of the ith subject) is to introduce subject specific random effects, say $b_i$. In a typical longitudinal study the principal aim of the researcher is to model the marginal covariate effects using the marginal regression model.
With a nonlinear link function, however, this goal cannot be achieved directly through a random effects model, since the regression coefficients of the conditional model no longer carry a marginal interpretation. Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \dots, N$; $t = 1, 2, \dots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is specified as

$$\text{logit}\,[E(Y_{it} \mid X_{it}, \beta)] = X_{it}'\beta \qquad (5\text{-}2)$$

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \dots, Y_{iT})$, the following conditional model is specified:

$$\text{logit}\,[E(Y_{it} \mid X_{it}, b_i)] = \Delta_{it} + b_i \qquad (5\text{-}3)$$

where $b_i \sim N(0, \sigma^2)$. $\Delta_{it}$ can be computed by solving the following convolution equation:

$$P(Y_{it} = 1 \mid X_{it}) = \int P(Y_{it} = 1 \mid X_{it}, b_i)\, dF(b_i) \qquad (5\text{-}4)$$

Thus $\Delta_{it}$ is a function of $\beta$ and $\sigma$. In this study we propose a model which marginalizes over the random effects and the dropout distribution to directly model the marginal covariate effects of interest, taking into account both the serial and the exchangeable dependence structure among the $Y_{it}$'s.

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \dots, Y_{iT})'$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \dots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \dots, m$ ($m < T$), such that

$$S_{ij} = \begin{cases} 1 & \text{if the } i\text{th subject is in class } j \\ 0 & \text{otherwise.} \end{cases}$$

We assume that, conditional on the past observations, $Y_{it}$ depends only on the previous $p$ observations, i.e. $(Y_{i,t-1}, Y_{i,t-2}, \dots, Y_{i,t-p})$. Here we have to deal with the following three types of dependence structures:

1. Dependence between response and dropout time, modeled by the latent classes.
2. Short-range (serial) dependence between $Y_{it}$ and $(Y_{i,t-1}, \dots, Y_{i,t-p})$, modeled by an MTM($p$).
3. Long-range or non-diminishing dependence among the $Y_{it}$'s, modeled by the subject-specific random effects $b_i$, $i = 1, \dots, N$.
We first specify the marginal model as

$$E(Y_{it} \mid X_{it}, \beta) = g^{-1}(X_{it}'\beta) \qquad (5\text{-}5)$$

The above model marginalizes over the subject-specific random effects and over the latent class distribution (and thus implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditional on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S_i$, since the latent class to which a subject belongs depends only on his or her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D_i$ itself.

Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e. effects conditional on the random effects) or class-specific covariate effects (i.e. effects of covariates on $Y_{it}$ given a particular dropout class). Moreover, the conditional model must be specified so that it is compatible with the marginal model (5-5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model is nevertheless necessary in order to account for the three types of dependence mentioned above.

We assume that $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, comes from an exponential family with distribution

$$f(y_{it} \mid y_{ik}, k < t, b_i, S_i) = \exp\left[\{y_{it}\eta_{it} - \psi(\eta_{it})\}/(m_i\phi) + h(y_{it}, \phi)\right]$$

where $E(Y_{it} \mid y_{ik}, k < t, b_i, S_i) = g^{-1}(\eta_{it}) = \psi'(\eta_{it})$. Here $\eta_{it}$ is the linear predictor, $\psi(\cdot)$ is a known function, $\phi$ is a scale parameter and $m_i$ is the prior weight.
We next specify the conditional model as

$$g\{E(Y_{it} \mid y_{ik}, k < t, b_i, S_i)\} = \Delta_{it} + \sum_{j=1}^{m} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k} + b_i \qquad (5\text{-}6)$$

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\delta_k^{(j)}$ for $j = 1, 2, \dots, m$ and $k = 1, 2, \dots, p$, where $V_{it,k}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $\delta^{(j)} = (\delta_1^{(j)}, \delta_2^{(j)}, \dots, \delta_p^{(j)})$ determines how the dependence between $Y_{it}$ and $Y_{i,t-k}$ varies as a function of the covariates $V_{it,k}$, conditional on the latent classes. We also impose the sum-to-zero constraint $\alpha^{(m)} = -\sum_{j=1}^{m-1} \alpha^{(j)}$ for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\alpha^{(j)}$.

The probabilities of the latent classes given the dropout times are specified through a proportional odds model (Agresti, 2002) given by

$$\text{logit}\, P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big) = \lambda_{0k} + \lambda_1 D_i, \qquad k = 1, \dots, m-1 \qquad (5\text{-}7)$$

where $\lambda_{01} < \lambda_{02} < \dots < \lambda_{0,m-1}$ and $\lambda_1$ are unknown parameters. Thus the cumulative class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).

Instead of the proportional odds model, we can also assume a proportional hazards model, i.e.

$$\log\Big[-\log\Big\{1 - P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big)\Big\}\Big] = \lambda_{0k} + \lambda_1 D_i, \qquad k = 1, \dots, m-1$$

The other option would be to assume an ordinal probit formulation for the probabilities of the latent classes, given by

$$\Phi^{-1}\Big\{P\Big(\sum_{j=1}^{k} S_{ij} = 1 \,\Big|\, D_i\Big)\Big\} = \lambda_{0k} + \lambda_1 D_i, \qquad k = 1, \dots, m-1$$

The predicted probabilities obtained from the ordinal probit model are similar to those obtained from the proportional odds model. Moreover, an advantage of the ordinal probit model is that sampling from its posterior distribution is particularly efficient. For this reason, the ordinal probit model is sometimes preferred when a Bayesian analysis is to be performed. Lastly, the dropout times $D_i$ are assumed to follow a multinomial distribution with mass at each possible dropout time, parameterized by $\varphi$.
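As an illustration of the proportional odds specification for the latent class probabilities described above, the following minimal Python sketch turns a set of ordered cutpoints and a slope into the $m$ individual class probabilities for a given dropout time, by differencing the cumulative probabilities. The function name `class_probabilities`, its arguments `lam0`, `lam1`, `d`, and the numerical values are hypothetical and chosen only for illustration; they are not part of the original analysis.

```python
import numpy as np

def class_probabilities(lam0, lam1, d):
    """Return P(S_ij = 1 | D_i = d), j = 1..m, under a cumulative-logit model.

    lam0 : increasing cutpoints (length m - 1), hypothetical values
    lam1 : common slope on the dropout time
    d    : dropout time of the subject
    """
    lam0 = np.asarray(lam0, dtype=float)
    if np.any(np.diff(lam0) <= 0):
        raise ValueError("cutpoints must be strictly increasing")
    # Cumulative probabilities P(class <= k | d) for k = 1..m-1
    cum = 1.0 / (1.0 + np.exp(-(lam0 + lam1 * d)))
    # Pad with 0 and 1, then difference to get the individual class probabilities
    cum = np.concatenate(([0.0], cum, [1.0]))
    return np.diff(cum)

# Three latent classes (two cutpoints), evaluated at dropout time 3
probs = class_probabilities([-1.0, 0.5], lam1=0.2, d=3)
print(probs, probs.sum())
```

Requiring increasing cutpoints guarantees that every class receives positive probability; swapping the logistic function for the complementary log-log or normal distribution function would give the proportional hazards and ordinal probit variants in the same way.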
Here we make the important assumption that $Y_{it}$ is independent of $D_i$ given $S_i$. Our main targets of inference are the covariate effects averaged over the classes, that is, the marginal coefficients $\beta$ averaged over the number of classes. The intercept $\Delta_{it}$ in (5-6) is determined by the following relationship between the marginal and conditional models:

$$E(Y_{it} \mid X_{it}) = \sum_{D_i}\sum_{S_i} p(S_i \mid D_i)\, p(D_i) \int \Big\{\sum_{A} E(Y_{it} \mid y_{i,t-1}, \dots, y_{i,t-p}, b_i, S_i)\, p(y_{i,t-1}, \dots, y_{i,t-p} \mid b_i, S_i)\Big\}\, p(b_i \mid S_i)\, db_i$$

where $A = \{y_{i,t-1}, \dots, y_{i,t-p}\}$.

5.2.3 Likelihood, Priors and Posteriors

Let the set of all parameters be denoted by $\omega = (\beta, \alpha, \sigma_1^2, \dots, \sigma_m^2, \delta)$. We partition the complete response data for subject $i$, $Y_i^c$, into observed components (values of $Y_i^c$ prior to dropout), denoted by $Y_i$, and missing components (response observations after dropout), denoted by $Y_i^m$. Since the subjects are independent of one another, the likelihood for the parameters is the product of the individual contributions from each subject. Once $\Delta_i = (\Delta_{i1}, \Delta_{i2}, \dots, \Delta_{iT})$ has been calculated, the evaluation of the individual contribution from subject $i$ becomes straightforward. In the following expressions, $m$ (the number of latent classes) will be implicitly conditioned upon. Thus we have

$$L(\omega \mid Y, X, D) = \prod_{i=1}^{N} L_i(\omega \mid Y_i, X_i, D_i)$$

where

$$L_i(\omega \mid Y_i, X_i, D_i) \propto \sum_{j=1}^{m} \int L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, p(S_{ij} = 1 \mid D_i; \lambda)\, p(D_i \mid \varphi)\, dF(b_i \mid S_{ij} = 1, \sigma_j) \qquad (5\text{-}8)$$

Here,

$$L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = L_i(Y_{i1} \mid S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, L_i(Y_{i2} \mid Y_{i1}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) \times \dots \times L_i(Y_{iT} \mid Y_{i,T-1}, \dots, Y_{i,T-p}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)$$

Proportionality in (5-8) holds because we assume that the missing and observed responses from subject $i$ are independent, given $S_i$ and $b_i$ (i.e. $[Y_i^m \mid Y_i, b_i, S_i] = [Y_i^m \mid b_i, S_i]$).
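Following the one-parameter exponential family (OPEF) formulation, we have

$$L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \exp\left[\Big\{\sum_{t=1}^{T} y_{it}\eta_{it} - \sum_{t=1}^{T} \psi(\eta_{it})\Big\}/(m_i\phi) + \sum_{t=1}^{T} h(y_{it}, \phi)\right]$$

where

$$\eta_{i1} = g\{E(Y_{i1} \mid b_i, S_{ij} = 1)\} = \Delta_{i1} + b_i + \sum_{j=1}^{m} S_{ij} Z_{i1}'\alpha^{(j)}$$

$$\eta_{i2} = g\{E(Y_{i2} \mid y_{i1}, b_i, S_{ij} = 1)\} = \Delta_{i2} + b_i + \sum_{j=1}^{m} S_{ij} Z_{i2}'\alpha^{(j)} + \gamma_{i2,1}\, y_{i1}$$

$$\vdots$$

$$\eta_{iT} = g\{E(Y_{iT} \mid y_{i,T-1}, \dots, y_{i,T-p}, b_i, S_{ij} = 1)\} = \Delta_{iT} + b_i + \sum_{j=1}^{m} S_{ij} Z_{iT}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{iT,k}\, y_{i,T-k}$$

Since the number of latent classes $m$ is itself treated as a random variable, we assume a prior for $m$ along with $\omega$. Let the priors be respectively denoted by $\pi(m)$, $\pi(\beta)$, $\pi(\alpha)$, $\{\pi(\sigma_j^2), j = 1, 2, \dots, m\}$, $\pi(\lambda)$, $\pi(\varphi)$, $\pi(\phi)$ and $\pi(\delta)$. The full posterior of $m$ and $\omega$ is then given by

$$\pi(\omega, m \mid Y, X, D) \propto \prod_{i=1}^{N} L_i(\omega \mid Y_i, X_i, D_i)\, \pi(m)\, \pi(\beta)\, \pi(\alpha)\, \pi(\lambda)\, \pi(\varphi)\, \pi(\phi)\, \pi(\delta) \Big\{\prod_{j=1}^{m} \pi(\sigma_j^2)\Big\} \qquad (5\text{-}9)$$

We can avoid the integral (with respect to $b_i$) in (5-8) if we also sample the $b_i$'s along with the other parameters from the full posterior (5-9). In that case, the full posterior may be rewritten as

$$\pi(\omega, m \mid Y, X, D) \propto \prod_{i=1}^{N} L_i^*(\omega \mid Y_i, X_i, D_i)\, \pi(m)\, \pi(\beta)\, \pi(\alpha)\, \pi(\lambda)\, \pi(\varphi)\, \pi(\phi)\, \pi(\delta) \Big\{\prod_{j=1}^{m} \pi(\sigma_j^2)\Big\} \qquad (5\text{-}10)$$

where

$$L_i^*(\omega \mid Y_i, X_i, D_i) = \sum_{j=1}^{m} L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi)\, p(S_{ij} = 1 \mid D_i; \lambda)\, p(D_i \mid \varphi)\, p(b_i \mid S_{ij} = 1, \sigma_j^2) \qquad (5\text{-}11)$$

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution:

$$L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \prod_{t=1}^{T} (\mu_{it}^{c})^{y_{it}} (1 - \mu_{it}^{c})^{1 - y_{it}} \qquad (5\text{-}12)$$

where

$$\mu_{it}^{c} = E(Y_{it} \mid y_{i,t-1}, y_{i,t-2}, \dots, y_{i,t-p}, b_i, S_{ij} = 1) = g^{-1}\Big(\Delta_{it} + b_i + Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k}\Big)$$

Since $\text{logit}\, P\big(\sum_{l=1}^{j} S_{il} = 1 \mid D_i\big) = \lambda_{0j} + \lambda_1 D_i$, we have

$$P(S_{ij} = 1 \mid D_i) = P(S_{i1} + \dots + S_{ij} = 1 \mid D_i) - P(S_{i1} + \dots + S_{i,j-1} = 1 \mid D_i) = \frac{e^{\lambda_1 D_i}\,(e^{\lambda_{0j}} - e^{\lambda_{0,j-1}})}{(1 + e^{\lambda_{0j} + \lambda_1 D_i})(1 + e^{\lambda_{0,j-1} + \lambda_1 D_i})} \qquad (5\text{-}13)$$

Now, as mentioned earlier, $D_i$ is the dropout time for the $i$th subject, and there are $T$ unique dropout times. For $t = 1, 2, \dots, T$, let

$$\psi_{it} = \begin{cases} 1 & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times} \\ 0 & \text{otherwise.} \end{cases}$$

Thus $\psi_i = (\psi_{i1}, \psi_{i2}, \dots, \psi_{iT}) = (0, 0, \dots, 0)$ would imply that the $i$th subject is a completer. So, $D_i = t \Leftrightarrow \psi_{it} = 1$ and $D_i = T + 1 \Rightarrow (\psi_{i1}, \psi_{i2}, \dots, \psi_{iT}) = (0, 0, \dots, 0)$.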
Let $\varphi_t$ denote the probability of dropping out between times $t-1$ and $t$, $t = 1, 2, \dots, T$. So, for the $i$th subject, the density of $D_i$ would be multinomial, i.e.

$$P(D_i = d_i) = \varphi_1^{\psi_{i1}} \cdots \varphi_T^{\psi_{iT}}\, (1 - \varphi_1 - \dots - \varphi_T)^{1 - \sum_{t=1}^{T} \psi_{it}}, \qquad d_i = 1, 2, \dots, T+1 \qquad (5\text{-}14)$$

5.2.4 Specification of Priors

We assume that the number of latent classes $m$ follows a truncated Poisson distribution with rate parameter $\xi$, truncated at an integer between 1 and $T$ (the number of unique dropout times), i.e.

$$p(m) \propto \frac{e^{-\xi}\xi^{m}}{m!}, \qquad m = 0, 1, \dots, s$$

where $1 \le s \le T$. For the other parameters, we assume the following priors:

1. Let $\beta \sim N_q(\beta_0, \Sigma_{\beta})$, assuming that $X_{it}$ is $q$-dimensional for all $i = 1, 2, \dots, N$ and $t = 1, 2, \dots, T$.
2. Let $\alpha^{(1)}, \alpha^{(2)}, \dots, \alpha^{(m)} \overset{iid}{\sim} N_r(\alpha_0, \Sigma_{\alpha})$, where $r \le q$ since $Z_{it} \subset X_{it}$ for all $i = 1, 2, \dots, N$ and $t = 1, 2, \dots, T$.
3. Let $\sigma_1, \sigma_2, \dots, \sigma_m \overset{iid}{\sim} U(a, b)$ where $0 < a < b < \infty$.
4. $\lambda \sim N_m(\lambda_0, \Sigma_{\lambda})$.
5. $(\varphi_1, \varphi_2, \dots, \varphi_T) \sim \text{Dirichlet}(\tau_1, \tau_2, \dots, \tau_T)$.
6. $\delta^{(1)}, \dots, \delta^{(m)} \overset{iid}{\sim} N_r(\delta_0, \Sigma_{\delta})$, for the same reasons as in (2).
7. For the time being we keep the prior of $\phi$, $\pi(\phi)$, unspecified.

Now, combining (5-10) to (5-14) and the priors specified above, we can write down the full posterior distribution of $m$ and $\omega$, $\pi(\omega, m \mid Y, X, D)$, up to a constant. Thus, we can obtain the full conditional distributions of all the relevant parameters and proceed with sample generation using MCMC.

The assumption of conditional independence between $Y_{it}$ and $D_i$ given $S_i$ and the covariates can be checked by performing a likelihood ratio test (frequentist) or by using Bayes factors (Bayesian). The null model is given by (5-6) and the alternative model may be written as

$$g\{E(Y_{it} \mid y_{ik}, k < t, b_i, S_i, D_i)\} = \Delta_{it} + \sum_{j=1}^{m} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k} + b_i + f(D_i) \qquad (5\text{-}15)$$

where $f(D_i)$ may be a smooth but unspecified function of $D_i$. Thus, the null hypothesis of conditional independence (between $Y_{it}$ and $D_i$ given $S_i$ and $X_i$) would simply be $f(D_i) = 0$. The test can be carried out by first fitting the null model (5-6).
Then, the posterior probability of class membership for each subject can be estimated by

$$\hat{P}(S_{ij} = 1 \mid D_i, Y_i, X_i, \hat{\omega}) = \frac{\int L_i(Y_i \mid Y_{\{i-\}}, S_{ij} = 1, b_i, \hat{\alpha}^{(j)}, \hat{\phi})\, p(S_{ij} = 1 \mid D_i; \hat{\lambda})\, p(D_i \mid \hat{\varphi})\, dF(b_i \mid S_{ij}, \hat{\sigma}_j^2)}{L_i(D_i, Y_i, \hat{\omega})}$$

where $\hat{\omega}$ is obtained by performing a full Bayesian analysis based on the full conditionals of $\omega$. The likelihood ratio test (LRT) is then performed by fitting models (5-6) and (5-15) using a weighted likelihood, the weights being the above posterior probabilities of class membership. An alternative way of carrying out the above conditional independence test would be to use score tests based on smoothing splines, as used in proportional hazards models by Lin et al. (2006).

The model proposed in (5-15) has the most general form. We can simplify it by assuming a parametric effect of dropout time, in which case the alternative (simpler) model would be

$$g\{E(Y_{it} \mid y_{ik}, k < t, b_i, S_i, D_i)\} = \Delta_{it} + \sum_{j=1}^{m} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{i,t-k} + b_i + \sum_{j=1}^{J} h_j(D_i)\,\theta_j \qquad (5\text{-}16)$$

where each $h_j(\cdot)$ is a known function and the $\theta_j$'s are parameters. The null hypothesis would be $H_0: \theta_1 = \dots = \theta_J = 0$. A linear dropout effect would imply $J = 1$ and $h(D_i) = D_i$. The LRT can then be performed as before by fitting models (5-6) and (5-16) using the same weights given above. We can also use Bayes factors for carrying out these analyses.

Note I: The above methodology was based on the fundamental assumption that the dropout time is discrete in nature; we assumed that there are $T$ possible dropout times and then modeled the dropout distribution as a multinomial. A possible and interesting extension of the above methodology would be to assume that the dropout distribution is continuous in nature, i.e. the $i$th individual can drop out at any time point within an interval. In that case, we do not have to introduce latent classes to summarize the dropout times. We believe that the proposed methodology can be modified to accommodate this situation.
Note II: As mentioned before, Heagerty (1999) proposed marginally specified logistic-normal models for longitudinal binary data. He proposed two models. The first is a marginal logistic regression model which links the average response to the covariates by the following equation:

$$\text{logit}\, E(Y_{ij} \mid X_{ij}) = X_{ij}'\beta \qquad (5\text{-}17)$$

Here $Y_{ij}$ and $X_{ij}$ respectively denote the binary response and the exogenous covariate vector recorded at time $j$ for the $i$th subject, $i = 1, 2, \dots, N$; $j = 1, 2, \dots, n_i$. The second model is a conditional model which explains the within-subject dependence among the longitudinal measurements. This is achieved by conditioning on a vector of latent variables (or random effects) $b_i$ such that

$$\text{logit}\, E(Y_{ij} \mid b_i, X_i) = \Delta_{ij} + b_{ij} \qquad (5\text{-}18)$$

An important assumption is that, conditional on $b_i = (b_{i1}, b_{i2}, \dots, b_{in_i})$, the components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$, where $\Sigma_i$ models the dependence among the $b_{ij}$'s (and thus, indirectly, among the $Y_{ij}$'s) and can be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \dots, t_{in_i})$ and a parameter vector $\alpha$. Heagerty (1999) referred to the models given in (5-17) and (5-18) as the marginally specified logistic-normal models.

Under the above modeling framework, the parameter $\Delta_{ij}$ can be expressed as a function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the following convolution equation:

$$h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz \qquad (5\text{-}19)$$

where $h(\cdot)$ is the inverse of the logit link and $\phi(\cdot)$ is the standard normal density function. Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and Newton-Raphson iteration. The $\Delta_{ij}$ thus obtained from (5-19) is a function of the marginal mean parameters $\beta$ and the random effects covariance parameters $\alpha$, and must be computed for both the maximum likelihood and the estimating equation methodology (Heagerty, 1999).
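The solution of the convolution equation described above can be sketched numerically. The following minimal Python example assumes Gauss-Hermite quadrature as the numerical integration rule and a plain Newton-Raphson update started at the marginal linear predictor; the function names `expit` and `solve_delta`, the number of quadrature nodes and the input values are illustrative assumptions, not the implementation used in the cited work.

```python
import numpy as np

def expit(x):
    """Inverse logit link h(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def solve_delta(eta, sigma, n_nodes=40, tol=1e-10, max_iter=50):
    """Solve h(eta) = integral of h(delta + sigma * z) * phi(z) dz for delta.

    eta   : marginal linear predictor (hypothetical input)
    sigma : standard deviation of the random effect
    """
    # Probabilists' Gauss-Hermite rule: nodes/weights for weight exp(-z^2 / 2)
    z, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)  # rescale so the rule integrates against N(0, 1)
    target = expit(eta)
    delta = float(eta)  # starting value; exact when sigma = 0
    for _ in range(max_iter):
        p = expit(delta + sigma * z)
        f = np.sum(w * p) - target          # residual of the convolution equation
        fprime = np.sum(w * p * (1.0 - p))  # derivative of the integral w.r.t. delta
        step = f / fprime
        delta -= step                       # Newton-Raphson update
        if abs(step) < tol:
            break
    return delta

# The conditional intercept is larger in magnitude than the marginal predictor
print(solve_delta(eta=1.0, sigma=2.0))
```

Because the marginal mean is attenuated toward 1/2 relative to the conditional mean, the solved value is pushed away from zero as the random effect standard deviation grows, and it coincides with the marginal linear predictor when that standard deviation is zero.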
For maximum likelihood estimation, the contribution of the $i$th subject to the observed data likelihood is obtained by first assuming a linear transformation of the form $b_i = C_i z_i$, where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. This transformation links $b_i$ to a lower-dimensional random effect $z_i$. The contribution of the $i$th subject to the observed data likelihood can now be expressed as a mixture over the random effects distribution:

$$L_i(\beta, \alpha) = \int \cdots \int \prod_{j=1}^{n_i} P(Y_{ij} = y_{ij} \mid b_i, X_i)\, f(b_i \mid X_i)\, db_i = \int \cdots \int \prod_{j=1}^{n_i} h(\Delta_{ij} + C_{ij} z_i)^{y_{ij}} \{1 - h(\Delta_{ij} + C_{ij} z_i)\}^{1 - y_{ij}}\, \phi_q(z_i)\, dz_i \qquad (5\text{-}20)$$

where $C_{ij}$ denotes the $j$th row of $C_i$ and $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since $L_i(\beta, \alpha)$ cannot be evaluated analytically, numerical procedures are required to find its value. Heagerty (2002) used Gauss-Hermite quadrature to perform the calculation but assumed $q = 1$. As $q$ increases, the computational burden grows exponentially and the calculation quickly becomes infeasible.

We are currently trying to develop alternative and less computationally intensive methodologies to accomplish the above objectives. We are working with multivariate logistic and multivariate $t$ distributions within a Bayesian framework, as in O'Brien and Dunson (2004). We hope that this methodology will provide a better alternative to the arduous numerical methods mentioned above.

APPENDIX A
PROOF OF BAYESIAN EQUIVALENCE RESULTS

Proof of Theorem 1.
Let Ydj (d = 0, 1;j = 1,..., J) be independently distributed as Poisson(Adj) where logAd = log/ + dlog9 + logj + d4' / Z(t)(t)dt (A1) Thus, the likelihood will be 1 J L(ji, O,i6, f fA= i ( )exp(Ad,) d and hence the log likelihood will be 1 J 1(p,0,, 6)= {ydjlog(Ad) Ad} d= Oj 1 Now, replacing the expression of logAdj from (A1) we have 1(p, ) = yyj (log+( dlog,' ogJ dq3' Zj(t)W(t)dt d=0j=1l c 1 J 0 ddjexpp (d' Zj(t)W (t)dt) (A2) d=Oj=1 c Differentiating (A2) w.r.t p and 0 and solving the resulting equations we have = EYyoj/CE and 0= J >yoj 5jexp (q' Zj(t)(J(t)dt) J J Replacing the above expressions in (A2) and then exponentiating, we obtain the expression of L(6, 4) in (28). Again, differentiating (A2) w.r.t 6j, we have J = d j 1...J (A3) 5 Odexp d Zj(t)xW(t)dt) d Jc It is easy to show that if we replace (A3) in (A2) and then exponentiate, we get the expression for L(O, 4) in (29). Since the order of maximization is immaterial, it follows that, L(6, 4) and L(O, 4), once maximized over the nuisance parameters (0 and 6 122 respectively) yield the same profile likelihood of 0. Thus, inferences about the parameter of interest 4 can be obtained using the prospective likelihood which has fewer nuisance parameters than the retrospective one. Proof of Theorem 2.(i) The posterior density of (0, 6, 4) is (A4) J 1 J p(0'6, 6,y) o p(4)fj 6 i1 If (AdJ)~exp(A,) j 1 d Oj 1 Replacing the expression of Adj from (210), we have p(O, 6,41y) oc x P() {Oexp Zjy(t)M (t)dt)} y+a1 j1 exp ([I +exp (1 Z' Z,(t)(t))dt)) 6 Integrating out 6, from the above expression, we have p(0, y) oc o ) F(y+j + aj) exp Z Wt) j1 1 + exp(0/ Z (t)x(t) dt) d j Now, performing the transformation from 0 to w yields expression (211). J (ii) First, we perform the transformation from 6 to (0, b), where = yj. Thus, j=1 6j = Ojy, j = 1,...,J. The jacobian of transformation will be J1. 
Using this transformation in (A4) and after some manipulation, we have J p(iO, 0, 4,\ly) oc p(a)Y+++1yI'+1I ojY a'1exp ('YJ ZJ(t)W(t)dt j 1 c x exp[ (A5) 123 j= 1 ,'exp Z (t) (t)dt) \ c / Now, integrating (A5) w.r.t 0 we obtain J NJ p(e, O, rly) ox p() J y, j 1 c x exp (' yi Zt(t)(dt 0 Y (A6) j 1 J c j= 1 Integration of (A6) w.r.t b yields (212) after some minor manipulation. (iii) The order in which p(O, 6, 0y) is integrated w.r.t the parameters does not make any difference in the marginal posterior density of p(0). Thus, integration of p(w, 01y) w.r.t w or p(O, 01y) w.r.t 0 will yield the same marginal posterior density p(0y) of 0. Remarks : 1. As in Seaman and Richardson (2004), the assumption of existence and finiteness of E (04' J Zq(t)W(t)dt and E 4' Z,(t)V(t)dt is automatically satisfied provided the prior density p(O) ensures that E(O) exists and is finite. 2. The posterior propriety of p(O, 6, 0 y) in (A 10) can be shown in a similar way to that in Seaman and Richardson (2001). 3. The prior distribution p(O) of 0 induces a prior distribution on the "influence function" {1(t), c < t < 0} in the logistic casecontrol model in (2 3) since 7(t) = O'(t), c < t < 0. Proof of Theorem 3. Let D denotes the disease status with r + 1 categories. As before, let {X(t), c < t < 0} be the exposure trajectory with support S = {Z(t), ..., Zj(t), c < t < 0}, the set of all exposure trajectories. Let P(D = dlX(t) = Zk(t), c < t < 0) = Pdk, (d = 0,1, ..., r; k = 1,..., K) and P(X(t) = Zk(t), c < t < 0D = 0) = k/ 11. Let ndk be the number of individuals with D = d and X(t) = Zk(t), c < t < 0}. It can be shown that 6kPdk/POk P(X(t) = Zk(t), c < t < OD = d) = k pk S1PdI/Po 1=1 The prospective likelihood will be given by Lp likelihood will be r K I n Pdk while the retrospective d=Ok=1 Sn, K K no r K LR 6k 1/11 H 1 6_kPdk POk k=l /=1 d=l k=l /= 1 Let Pdk/POk rewritten as K idldk. Assuming Y Td/ 1, we have Od K Sd/. 
Thus, Lp can be = 1 r since 1 = Pdk d=O rK K L, = fn (Pdk POknd f(POk)=O n d lk 1 k 1 = n f)(l Odldk)n1d (+ oedritdk') d=1k=1 k=l \ d=l Pok (1 + Y dildk LR can also be written as d=1 \nd1 K K no r K LR (kE n/ 1kldk k= 1 /=1 d=1 k=l 1 ~ T r since Pdk/POk = drldk. The augmented model is given by Zdk Adk poisson(Adk) where log(Adk) = Ig(d) + Iog(ldk)+ log(1 k), log(AOk) = log(6k), d = 1 ..., r; k = 1 ..., K. The prior distribution on the parameters is assumed to be O( r \J ( K \ d1 k1 (A7) 125 n7 The likelihood for the augmented model will be LA =exp( Adk hAdk/ndk1) d=O k=l d=O k=l K r r K cx exp ( 6k 1(i +zd)dk HH( ,'/:5). ndk k=1 dl d=0 k=1 exp k d (+ dldk (6k)E0 d ndk fJ(od)1 kndk dk)n Sk=1 d=1 k=l d=l d=l k=l The posterior based on the augmented likelihood will be n) r K ( ') , 6, Idn) N LA k = d 1 7r ) (d1 kl (A8) r \ Noting that /exp (k ((1 Odldk) (k)2= Oc Sd ) we have, by integrating out 6 in (A8), ld6k OC ( r K K r Yd0 do 7(7, dn) x n (Oddk) ndk i ddk dN wk1 k=1 d= 1 S(rd=l 1 Now, integrating out (1, ..., r,) from (A8), we have r 0o n  Z rdrldk d= 1 (A9) ( K K r K r '1 n, T(7q,6 n) oc exp Z j k H(k)EO nd1 H( )(dk) ndk f N 6kd Sk=1 k=l d=lk=l d=l k=l Next, we make the transformation 6k = Ok and o the prior distribution in (A7) becomes K 6i having jacobian '1. Hence = 1 126 (A1 0) ( r ( K d=1 k=1 Using the above transformation, (A10) can be rewritten as d= Ok=1 d= lk=1 r K k= ndkK z [exp(_ )(5)Z H ) 'o1 ( i (]fio" K no( 1) ( Hifio )k Sr K r kd 7Onk 1 \x nd d=1 k=) I k=1 d=1k= r K k n ndkK r ld=1 \k=l \1 d LR ( 1 71(k ) (A13) Integrating out (/ from (A11), we have From (A9) and (A12), it is clear that posterior inference for the parameter of interest, ir remains the same under either the prospective likelihood L, or the retrospective likelihood LR as long as the posterior is proper. It can be shown that the posterior will be proper for any proper prior for 1n if nok > 1 V k = 1,..., K. 
127 APPENDIX B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS B.1 Univariate Small Area Model The proof of posterior propriety for the basic univariate semiparametric model (Model I) is outlined below. The necessary changes to the proof for the random walk model are mentioned at the end. Proof of Theorem : The basic parameter space is Q = (0, 0, 7, b, o, o {,...,) where 0 = (0'i,..., 0')' and b = (bl,..., b,)'. Let S= ... / p(Y, X, Z)dQ = ... {L(Yi j i)L(0i j,7, bi, d Xi, Zi)L(bi b )I L(, i7 (0)7(07b) ( 7) 7 ( j)df i=l j=1 (B1) We have to show that / < M where M is any finite positive constant. Integrating first w.r.t 3, we have / = w(/3) [L(O /3, b ,2, Xi, Zi)d/ = exp[ (Oi Xi/3 Z  bil)'W (Oi Xi3 Zi7 bil)] d i = X we Xi exp W'Wi + (B2) where Q = wlx ) xl 1X ) X(Z Xl ~ W Wi = 0 Z,7 b1l and V1 = diag(b2, .b2 .. 2). Now, W 1'Wi, = W '1/21/2Wi = S'S, where Si = X1/2W,. Similarly, W/ 1Xi = S'Ti, X'VJWi = TtSi and X'V Xi = T'T, where T, = X1/2X,. 128 On replacing these, the expression in the exponent of (B2) becomes S S, S T;) T T;) ( T'iSi) 2 I i i i A 2 [S'S S'ST(T'T)T'S] 1 = S'[I T(T'T)T']S= Q, say where S = (S',..., S')' and T = (T',..., T)'. Since (I T(T'T)IT') is idempotent, S' [I T(T'T)IT']S is nonnegative, implying Q < 0 and thus exp(Q) < 1. Next, we consider integration w.r.t b2 i.e x/ = / 1 I/2 1 ( :)"n/2c exp(dj/)d d ... b j=1 t "'. I Y X.' ,2X' 11/2  (,b2)m/2cj lexp(dJ/ 2)d 2 ..."d 2 ij j=1 (B3) Assuming ,,x = max( ,..., t), we have, Vj 1,..., t 2 > mx X .,'2X, > X Y,' X X 2X> x i J XyX' and thus (B4) I. X X, 112 _< )(p+l)P /2 Xj X 11/2 ,J iJ Combining (B3) and (B4), we have Assuming 2x, S< '7 for some k e [1,..., t], we have, X 11/ ... [ (,') P ck exp( )d'7] 2.. .d , Sck) t F(m/2 c 2) S/ 2 = W, sa(B5) j=1,jk jk 129 x () )3rc~+lexp d)d j=ljk k S12 ((m p 5)/2 I(mp5)/2+ck 'J "/< / < i xx 11/2 ... ( ax)( 2 ( )m/2cj+exp(dj/ )d ...d ; ; j ... j W 1 where W is finite if (m p 5)/2 + ck > 0, dk > 0, m/2+ cj 2 > 0 and dj > 0 for j = 1, ..., t;j / k. 
Combining (B1) and (B5), we have / < W I ... Jf {L{(Y, ,)L(ba) } L(i) ~) ()di2* (B6) where f* = (0 3 b). Since all the components of the integrand in (B5) have proper distributions, the above integral would be finite thus proving posterior propriety. For the random walk model, the integrand in (B1) will have an additional likelihood term nli L(vI vjv_, oi) and a prior term 7(ao2). The derivation would then proceed exactly as above and the integrand in (B5) will also contain these additional terms. But since both of these are proper distributions (normal and inverse gamma respectively), I will still be finite under the conditions stated in the theorem. B.2 Bivariate Small Area Model The proof of posterior propriety for the bivariate semiparametric model is outlined below. Proof of Theorem : Here, the parameter space is Q* = (0, 0, 7, b, Zo, 7, {( ,.... }). Here also, due to the same logic as in the univariate case, we just need to show I p()p(0 y, b, { l,..., J})d3 < oo / ( ,, ,/ ,6 or, J exp( (, X.3 Z71 b),) (0 X.3 Z>7 bi) df < oo (B7) in order to prove posterior propriety. Using the same type of algebraic manipulations as in the univariate case, the L.H.S of (B7) can be shown to be  X X WX/ exp W./WW (B8) 2J /J 130 where Q = ( WId'X X ( XyWX. 'x ( XyJi VW, and W, = O, Z  ij ij j bi. As before, the expression within the exponent in (B8) can be rewritten as K* = \ ( STS( U) (5 TTU) (5T/Su) iJ j j i,j S S' [I T(T'T)T']S < 0. 2 where S = (S' ..., S')', T = (T ,..., T')', S, = V1/2W, and T, = V/2X,. Thus, exp W.' W 1 + < 1 (B9) /,J So, in order to prove posterior propriety, we have to show ./ /. t (m d r 1) [ V trace J /)1l / = ... I Xolx. 1/2 1 f' 1 2 exp trace 2 d 1...d 1 ij j=j 1 < oo (B10) Here r is the order of j,j = 1, 2,..., t. (r = 2 in our case). Let A1, Aj2,..., Ajr be the distinct eigen values of .l,j = 1, 2,..., t. Since ,j is a variance covariance matrix, it is positive definite and symmetric. Hence, W1 also has the same properties. 
Thus, Ajk > 0, Vk = 1, 2,... r. Now, Vj = 1,2, ..., r, 1JX 1 > Aj YIr where Aj'" = min(A, A2..., Ar). y XuW JfX' > Y A7in X isrXje yx,,J u l'X. > Am/in5X .X where Am1n = min(A'. SX,,,'X Amin Y XX is nonnegative definite. ij ij Thus, we have  XV Y 1X' I > I A min XyUX  = Z> X' , < I(AmnZ xJ x I =11. x, xx/ 11/2 < min pxq2 ^ ^V<(A) 2 YLXUXW Since I W 1 = J Ajk, V 1... t k=1 (m+d,r1) (m+d,r1) 1 1 2 I (A k) 2 k=1 Now, replacing (B11) and (B12) S< xx  .('n) /  XUX'y  .. (A in where T denotes "trace". Let Am"n Then, I < /1 x 2I where in the expression of I in (B10), we have p+q+2 t r (mdjr1) V1 J1 2 H (,Ajk) 2 exp T 2( d ...d2 j1( k=1 (B13) = Aim, / [1 ..., t]; m [1 ..., r]. i f (m +dr1) (m+dppq2)r l1 = I> XyX 2 (Ak) 2 (AIn) 2 i, {k= 1,k m} Sp q 2 (m+d pq2)r1 =  XyX'  n (A/,k) 2 1 1 2 ij {k=l,k m} and 1exp T(V) 2 ) d exp T (V2)] d1 t m .n dj r1 /2 { 2 Td md(VJ Fr1d..} 2r 2d 2 2 f= 1J7i} which is finite. Thus, in order to show posterior propriety, we have to prove that /2 < oo. 132 (B11) (B12) (B14) Let us consider the integral exp T ( 2 d1 (B15) p q 2 (m+d/pq2)r1 /*= = (A k) 2 l1 2 {k=1,k m} By the AMGM inequality, we have, k=km {kZl, k4m} Ak < (r A/1,km} < (r 1 r 1 l {k=l,kjm} )r1 r l 1 {k =,k4m} ( k=k) S{k l,k4m} r r k A/k < km Ik {k=1,k m} k=1 p+q+2 2 < r1 r (r1)(p+q+2) k=1 (r1)(p+q+2) trace(2 tr race(1 1) (r rl 1 r 1 1 k1 k=l where 'm denotes the kth diagonal element of I 1. Since v 1 has a Wishart distribution, ,'m o kkX< ,, (k b g 1 a 1 w Combining (B1 5) and (B1 6), we have, 1 r k= 1 (r1)(p+q+2) ,() 2 /lii~l (r1)(p+q+2) 2 (B16) 1,..., r) implying that (m+d pq2)r 1  71  2 exp (r1)(p+ q2) 1 ) 2 S l v k=l 1 (r1)(p+q+2) 2 (m+dipq2)r1 Il 1 1 2 exp T (V/, ) dW r (r1)(p q+ 2) C E (l) 2 k=l where C = ( G 1 1 (r1)(p+q+2) 2 133 => ({k Since V k 1)(p+q+2) 2 1, ..., r, A/k > 0, S< G r( T( 2 d 2 the expectation being taken over the Wishart pdf. 
2 S (r1)(p q 2) k=1 (r1)(p+q+2) bk) 2 (r < ((r q2) r q+2) ) q+2) r S ,M (r1)(p+q+2) k=1 (E ( () (r1)(p q 2) k= k 2 ( 17) k  which is finite because (r1)(p+q+2) ! k ) 2 < 00 V kr (r(p+q2) k1 SE(r1)(p q 2) < 00 (k= (r1)(p+ q2) S2) Thus /* is finite implying posterior propriety. Now, '< 00 k = 1, ..., r < 00 < 00 (B18) r 4 E k=1 r > E ( k=1 APPENDIX C FULL CONDITIONAL DISTRIBUTIONS C.1 Semiparametric Case Control Model The full conditional distribution of the parameters for the semiparametric case control model are as follows : 1. [/pa, A, b, A,, a,, Y, D, a] ~ N(M3, V) where /j + N n N 1 V ( YZZ Pp,(a,)cp, (a +.)' I AIM 'M:) , e l j1 =i=1 N n, N M3 = Vp ( Y ,,(au)(yd q(aP)b,) + AMc(Zi a bQ), Je j=1 i= 1 and Zp is the p + K + 1 order prior variancecovariance matrix of 3. 2. [Zia, 4, bi, A,, Di] ~ N(a + O'MI + b'Qi, A, ) truncated at the left (right) by 0 if Di = 1(Di = 0), i = 1,..., N. 3. [bil, a, ,, A, ao, ao, Y, D, a] N(Mb, b)( 1,..., N) where v = (zb + a (q,(a,).q,(a,)'+ AQ,/')Q , e =1 Mb = vb V q,,(a,)(y. _.(a)') + Ai (Z a 'Mi) , and Zb is the q + M + 1 order variancecovariance matrix of b. 4. [0* b, A, ao, Y,D,a] ~ N(Ma,, V,) where* = (a, 4)', / N 1 V. = + (1, 0'M + b ,Q,)A,(1, 0'M, + b/Q,) , i 1 N Mo* = A i (1, 'Mi + yQj)/Zi and Z,* is the r + K* + 2 order variancecovariance matrix of (a, 0). 5. [Ail, a, b,Y, D,a] ~ G( +, v + (Z, a 'M bQ ) where 6. [(r)11, a, b, Y, D, a] G( 1, b) j= 0 q. (i 12 135 7. [(,7a)11,, ,b,Y,D,a] G nI i= 1 q,,(ay)'bi) . N ni 1, (yui p,,(aij)' i= 1 j=1 8. [(,72)110, a, (K + 1 K ) 8. [(V ) 4, b, Y, D, a] G 2+1, Yr * k=1 9_ 1M N q+M 2 9. [() bY, D,a]G +1,+ Y b . C.2 Semiparametric Small Area Models i=1 j= q1 K' 10. [(r )ll, ,0, bY, D,a] G +1,2Y k=1 Here, G(x, y) denotes a Gamma density with shape parameter x and rate parameter y respectively. 
C.2 Semiparametric Small Area Models C.2.1 Semiparametric Univariate Small Area Model The full conditional distributions of the parameters for the univariate semiparametric small area model are as follows : 1. [O, 3 2,7 2,b, X, Z] ~ N(Mb, V) where ( 1 1a 1 1 Y (X  Z  hi b ) V = + and Mo + 6 6 2. [bi /3, b, N(, 2a, X, Z] N(Mb, Vb) where + = and M = (0 X'.. Z' .) . b j= 1 j= 1 j= 1 3. [37, 0, b, b2, X, Z] ~ N(M3, V3) where V,3= ( and M,3= ( (m X.)) i 1= 1 i=i 1 = 1 i= 1 j= 1 4. [y/, b, b,2, 2 X, Z] ~ N(M7, V.) where V= +'/) and M. = ( ,Z 11 d). 136 5. [(a,) G7] ~ G c 7 /7+ d 6. [0(@ ), ,y ,b, X, Z] G cji (2 (Ou X Z' b) ) i 1 7. [()b] G c, Y d+ i=1 Here G(a, b) denotes a gamma distribution with shape = a and rate = b. C.2.2 Univariate Random Walk Model The full conditional distribution of the parameters for the semiparametric random walk model will follow similarly as above. In this case, v and ao will have normal and inverse gamma full conditionals respectively while the full conditionals of the other parameters will depend on v. C.2.3 Bivariate Random Walk Model The full conditional distribution of the parameters for the bivariate randomwalk model are as follows : 1. [0o ,b,X, Z] ~ N(M,, Vj) (i = l... m,j = 1,..., t) where V = ( u I,1) and M = (,1 + 1 (Y + J(X + Z(X 0 + bi + v,)). 2. [bi, 3,y, 0, '1, ao, X, Z] ~ N(M Vib) where V=b _(_ 1 0) and J )1 J J 3. [/3P 0, b, 1, X, Z] N(M3, V3) where V3( = XIY;) and 1 4, 8, b, X,) 4. [7y3, 0, b, 11, E7, X, Z] ~ N(M V1) where Z7 bi v). 137 My (Zzu 6zs7Z" MVY = 6Z> Zy U and Zu q 1(06  E zi/ '; X'/ b v,) 5. [vl, /3,, 0, b, I, Ev, v2, X, Z] ~ N(MA ZE) where 1 and Mv m\q!1 M~ V ') 6. [vjlP,7, 0, b, Zv, vj_, vj+, X,Z] ~ N(M,Z) (j = 2,... t 1) where  = (m l  M = (m1j 2) and 2 Z V 7. [vt /3,7, I, b, v, Vt, X, Z] ~ N(Mtv, Z) where v m\q1 t t Mv m\q!1 t t an1 Zv1) and 1) (q i tt A = S,+ (0, X' Z bi 9. [ZEv] ~/W(S 10. [Zo b]~ /W so 11. [ZE,7]~ /W(S,. v)(0e, x,  1,..., t) where Z b ,v)' assuming vo  bib', do 77', d. 
+ 1) "(q jOil  X Zi,"7 bi) + EvV . q( (06 X'/ Z'7 bi) + l(v,+ + v,)). X/ Zt, bi) + 8. [\ 7, b, V, vt_, X, Z] ~ /W(Aj, d +m) (j (v v1)(v viy1)', dv + t) I

REFERENCES

Agresti, A. (2002). Categorical data analysis. Wiley.
Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669-679.
Altham, P. (1971). The analysis of matched proportions. Biometrika 58, 561-576.
Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for case-controlled studies in cancer epidemiology. Statistician 42, 385-389.
Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83, 28-36.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. Bulletin of the International Statistical Institute.
Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting. Computational Statistics and Data Analysis 52, 5100-5120.
Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics 6, 314-331.
Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1. International Agency for Research on Cancer, Lyon.
Breslow, N. E., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299-307.
Breslow, N. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14-28.
Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case-control studies. Journal of the American Statistical Association 90, 157-169.
Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998).
Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: A prospective multicenter clinical trial. Journal of the American Medical Association 19, 1542-1547.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11, 1269-1275.
Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for experimentally uncontrolled variables. Bulletin of the International Statistical Institute 38, 97-115.
Datta, G., Ghosh, M., Nangia, N., and Natarajan, K. (1993). Estimation of median income of four-person families: A Bayesian approach, in W.A. Berry, K.M. Chaloner and J.K. Geweke (Eds), Bayesian Analysis in Statistics and Econometrics, pages 129-140.
Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333-350.
Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal data, 2nd Edition. New York: Oxford University Press.
Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched case-control data. Biostatistics 1, 89-109.
DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free knot splines. Biometrika 88, 1055-1071.
Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject specific curves for longitudinal data. Statistics in Medicine 00, 1-24.
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science 11, 89-121.
Ericksen, E. and Kadane, J. (1985). Estimating the population in census year: 1980 and beyond (with discussion). Journal of the American Statistical Association 80, 98-131.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242-251.
Eubank, R. (1988). Spline smoothing and nonparametric regression. New York: Marcel Dekker.
Eubank, R. (1999). Nonparametric regression and spline smoothing. New York: Marcel Dekker.
Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman and Hall.
Fay, R. (1987). Application of multivariate regression to small domain estimation, in R. Platek, J.N.K. Rao, C.E. Särndal, and M.P. Singh (Eds), Small Area Statistics.
Fay, R. and Herriot, R. (1979). Estimation of income from small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269-277.
Fay, R., Nelson, C., and Litow, L. (1993). Estimation of median income of four-person families by state, in Statistical Policy Working Paper 21, Indirect Estimators in Federal Programs.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1-141.
Gelfand, A. and Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1-11.
Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398-409.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7, 457-511.
Ghosh, M. and Chen, M.H. (2002). Bayesian inference for matched case-control studies. Sankhya, B 64, 107-127.
Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association 91, 1423-1431.
Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9, 55-76.
Godambe, V. P. (1976).
Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277-284.
Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732.
Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman and Hall/CRC.
Gustafson, P., Le, N., and Valle, M. (2002). A Bayesian approach to case-control studies with errors in covariables. Biostatistics 3, 229-243.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics: The approach based on influence functions. Wiley.
Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205-224.
Heagerty, P. (1999). Marginally specified logistic normal models for longitudinal binary data. Biometrics 55, 688-698.
Heagerty, P. (2002). Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 58, 342-351.
Henderson, C. (1950). Estimation of genetic parameters (abstract). Annals of Mathematical Statistics 21, 309-310.
Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16, 239-257.
Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics: Handling dropout in longitudinal studies. Statistics in Medicine 23, 1455-1497.
Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test 15, 1-96.
Johnson, V. (2004). A Bayesian χ² test for goodness-of-fit. Annals of Statistics 32, 2361-2384.
Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The increased risk of venous thromboembolism and the use of third generation progestagens: Role of bias in observational research. Contraception 54, 5-13.
Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for proportional hazards models.
Biometrics 62, 803-812.
Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of Computational and Graphical Statistics 8, 333-352.
Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic regression with missing covariates. Biometrics 54, 295-303.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.
MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 2, 223-238.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719-748.
Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7, 1223-1230.
Morris, C. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78, 47-54.
Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55, 858-866.
Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika 84, 523-537.
Nurminen, M. and Mutanen, P. (1987). Exact Bayesian analysis of two proportions. Scandinavian Journal of Statistics 14, 67-77.
O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics 60, 739-746.
Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Nonparametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265-286.
Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing covariates. Applied Statistics 49, 145-156.
Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies. Biometrika 91, 321-330.
Prentice, R. L. and Pyke, R. (1979).
Logistic disease incidence models and case-control studies. Biometrika 66, 403-411.
Rao, J. N. K. (2003). Small Area Estimation. Wiley-Interscience, New York.
Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 89, 905-916.
Robinson, G. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15-31.
Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722-732.
Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Statistics in Medicine 59, 829-836.
Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for nonignorable dropouts with many possible dropout times. Biometrics 64, 538-545.
Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130-134.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735-757.
Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205-224.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge, U.K.
Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, 384-388.
Satten, G. and Kupper, L. (1993). Inferences about exposure-disease associations using probability-of-exposure information. Journal of the American Statistical Association 88, 200-208.
Schildcrout, J. and Heagerty, P. (2007). Marginalized models for moderate to long series of longitudinal binary response data. Biometrics 63, 322-331.
Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies with categorical covariates.
Biometrika 88, 1073-1088.
Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies. Biometrika 91, 15-25.
Sinha, S., Mukherjee, B., and Ghosh, M. (2004). Bayesian semiparametric modeling for matched case-control studies with multiple disease states. Biometrics 60, 41-49.
Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B., and Carroll, R. (2005). Semiparametric Bayesian analysis of matched case-control studies with missing exposure. Journal of the American Statistical Association 100, 591-601.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics 25, 1371-1470.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics.
Wand, M. (2003). Smoothing and mixed models. Computational Statistics 18, 223-249.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall.
Zelen, M. and Parker, R. (1986). Case-control studies and Bayesian inference. Statistics in Medicine 5, 261-269.
Zhang, D., Lin, X., and Sowers, M. (2007). Two stage functional mixed models for evaluating the effect of longitudinal covariate profiles on a scalar outcome. Biometrics 63, 351-362.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247-259.

BIOGRAPHICAL SKETCH

Dhiman Bhadra received his Bachelor of Science in statistics from Presidency College, Calcutta (India) in 2002 and Master of Science in statistics from Calcutta University in 2004. He joined the Department of Statistics at the University of Florida in January 2005 to pursue a PhD in statistics. He plans to graduate in August 2010.
LIST OF TABLES

Table
1-1 A typical 2×2 table
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model
2-2 Posterior means and 95% confidence intervals of odds ratio for I=(10,5) for the linear influence model
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
3-1 Parameter estimates of SPRWM with 5 knots
3-2 Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment
3-3 Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates
3-4 Parameter estimates of SPM(5)
3-5 Parameter estimates of SPRWM(5)
3-6 Comparison measures for time series and other model estimates
4-1 Comparison measures for univariate estimates
4-2 Percentage improvements of univariate estimates over Census Bureau estimates
4-3 Comparison measures for bivariate non-random walk estimates
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
4-5 Comparison measures for bivariate random walk model

LIST OF FIGURES

Figure
2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.
2-2 Sensitivity of 1, 0, 1 and disease probability estimates to case-deletions.
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999.
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom.
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a χ² distribution with 9 degrees of freedom.

Case-control studies and small area estimation are two distinct areas of modern Statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease, with the aim of capturing disease-exposure association, especially for rare diseases. The latter area is concerned with the measurement of characteristics of small domains, that is, regions whose sample size is so small that the usual survey-based estimation procedures cannot be applied in the inferential routines. Both these areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health-related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by Federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation.
The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e. case-control studies for which time-varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history,

The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time-varying income pattern and also a state-specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model-based estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time-varying pattern in the income observations. Our estimates tend to have better performances than those provided by the Census Bureau and also have

(Eilers and Marx 1996).
In Chapter 2, I present an analysis of a case-control study when longitudinal, time-varying exposure observations are available for the cases and controls. Semiparametric regression procedures are used to flexibly model the subject-specific exposure profiles and also the influence pattern of the exposure profiles on the disease status. This enables us to analyze whether past exposure observations affect the current disease status of a subject conditional on his/her current exposure condition. The proposed methodology is motivated by and applied to a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. We also show the details of the hierarchical Bayesian implementation of our models and some equivalence results that have enabled us to use a prospective modeling framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for estimating the median household income of small areas when area-specific longitudinal income observations are available. Our models include a nonparametric functional

Chapter 4 deals with an extension of the methodology in Chapter 3 where a bivariate semiparametric procedure has been used to estimate the median income of families of varying sizes across small areas. This can also be seen as an extension of the time series modeling framework of Ghosh et al. (1996). We show that the semiparametric models generally have better performance than their time series counterparts and in a few situations, the performances are comparable. We want to convey the message that semiparametric regression methodology can provide an attractive alternative to the traditional modeling techniques, especially when time-varying information is available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some interesting open problems and areas for future research that may be worth pursuing.
Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years. Notable work in the frequentist domain includes Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let D and E be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between D and E is the (disease) odds ratio

\[ \frac{P(D=1 \mid E=1)\, P(D=0 \mid E=0)}{P(D=1 \mid E=0)\, P(D=0 \mid E=1)}. \]

By applying the Bayes theorem, the above expression can be rewritten as

\[ \frac{P(E=1 \mid D=1)\, P(E=0 \mid D=0)}{P(E=1 \mid D=0)\, P(E=0 \mid D=1)}, \]

which is the exposure odds ratio. Another well-known measure of association is the relative risk (RR) of disease for different exposure values, given by P(D=1|E=1)/P(D=1|E=0). For rare diseases, both P(D=0|E=0) and P(D=0|E=1) are close to one and the disease odds ratio is approximately equal to the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of 2×2 tables, indexed by i = 1, ..., I, as in Table 1-1.

Table 1-1. A typical 2×2 table

Disease Status   Exposed    Not Exposed   Total
Case             n_{11i}    n_{10i}       n_{1i}
Control          n_{01i}    n_{00i}       n_{0i}
Total            e_{1i}     e_{0i}        N_i

The Mantel-Haenszel estimator of the common odds ratio across the I tables is

\[ \frac{\sum_{i=1}^{I} n_{11i}\, n_{00i} / N_i}{\sum_{i=1}^{I} n_{01i}\, n_{10i} / N_i}. \tag{1} \]

It may be of interest to test for the equality of the odds ratios across the I tables, i.e.

which follows an approximate χ² distribution with I - 1 degrees of freedom under the null hypothesis. The derivation of the variance of the above estimator initially posed some challenge but was eventually addressed in several subsequent papers (Breslow 1996). Breslow and Day (1980) marked the development of likelihood-based inference methods for the odds ratio. Methods to evaluate the simultaneous effects of multiple quantitative risk factors on disease rates were pioneered in the 1960's.

In a case-control study, the appropriate likelihood is the retrospective likelihood of exposure given the disease status. Cornfield et al.
(1961) noted that if the exposure distributions in the case and control populations are normal with different means but a common covariance matrix, then the prospective probability of disease (D) given the exposure (X) has the logistic form, i.e.

\[ P(D = 1 \mid X = x) = L(\alpha + \beta' x), \]

where L(u) = 1/{1 + exp(-u)}. However, there is a conceptual complication in using a prospective likelihood based on P(D|X) whereas a case-control sampling

Prentice and Pyke (1979) who showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from the retrospective likelihood are the same as those obtained from a prospective likelihood under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to the situation of missing data and measurement error in the exposure variables.

In a case-control setup, matching is often used for selecting comparable controls to eliminate bias due to confounding. Statistical techniques for analyzing matched case-control data were first developed by Breslow et al. (1978). In the simplest setting, the data consist of m matched sets, say, S_1, ..., S_m, with M_i controls matched with a case in each set or stratum. A prospective stratified logistic disease incidence model given by

\[ P(D = 1 \mid X = x, S_i) = L(\alpha_i + \beta' x) \]

is assumed. The α_i's are the stratum-specific intercept terms, treated as nuisance parameters, and are eliminated by conditioning on the number of cases in each stratum. The generated conditional logistic likelihood yields the optimum estimating function (Godambe 1976) for estimating β. The classical methods for analyzing unmatched and matched studies suffer from loss of efficiency when the exposure variable is partially missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric method of estimation which took into account missing exposures in matched case-control studies.
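As a rough illustration of two of the estimation ideas reviewed above, the following Python sketch computes the Mantel-Haenszel pooled odds ratio from a series of 2×2 tables and evaluates the conditional logistic log-likelihood for matched sets containing one case each. It is a minimal sketch under stated assumptions, not the implementation used in this dissertation: the function names, the tuple layout (n11, n10, n01, n00) and the restriction to a single scalar exposure are all illustrative choices.

```python
import math


def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio over a series of 2x2 tables.

    Each table is a tuple (n11, n10, n01, n00): exposed cases,
    unexposed cases, exposed controls, unexposed controls.
    (Hypothetical layout chosen for this sketch.)
    """
    numerator = 0.0
    denominator = 0.0
    for n11, n10, n01, n00 in tables:
        total = n11 + n10 + n01 + n00
        numerator += n11 * n00 / total
        denominator += n01 * n10 / total
    return numerator / denominator


def conditional_logistic_loglik(beta, matched_sets):
    """Conditional log-likelihood for 1:M matched sets, scalar exposure.

    Each matched set is (x_case, [x_control_1, ..., x_control_M]).
    Conditioning on one case per set removes the stratum intercept,
    so only the slope beta appears.
    """
    loglik = 0.0
    for x_case, x_controls in matched_sets:
        scores = [beta * x_case] + [beta * x for x in x_controls]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        loglik += scores[0] - log_denominator
    return loglik


# Illustrative (made-up) inputs
example_tables = [(10, 20, 5, 40), (8, 12, 6, 30)]
example_sets = [(1.2, [0.4, 0.9]), (0.7, [0.3])]
print(mantel_haenszel_odds_ratio(example_tables))
print(conditional_logistic_loglik(0.5, example_sets))
```

With a single table the pooled estimate reduces to the ordinary cross-product ratio, and at beta = 0 each matched set contributes minus the log of its size, which gives two quick sanity checks on the sketch.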
Satten and Kupper (1993), Paik and Sacco (2000) and Satten and Carroll (2000) addressed the problem of missing exposure from a full likelihood approach by assuming a distribution of the exposure variable in the control population.

Althman (1971) is probably the first Bayesian work which considered several 2×2 contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case-control model with a single binary exposure. These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case-control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating knowledge available at each stage. Muller and Roeder (1997) introduced a novel aspect to Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the nonparametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West 1995). The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters. This paper pioneered the use of continuous covariates, measurement error and flexible nonparametric modeling of exposures in a Bayesian setting and brought to light the tremendous possibility of modern Bayesian computational techniques in solving complex data scenarios in case-control studies. Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical

Muller et al.
(1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin 1981). They assigned a Dirichlet(1, 1, ..., 1) prior on the probability vector corresponding to the grid points. Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data was generated prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies when cases are individually matched to controls. They introduced nuisance parameters

Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al.
(2000), they based their analysis on the unconditional rather than the conditional likelihood after elimination of the nuisance parameters. Their framework included a wide variety of links like complementary log links and some symmetric and skewed links in addition to the usual logit and probit links. Recently, Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a nonparametric Dirichlet process prior on the stratum-specific effects in the distribution of the exposure variable and parametric priors on all other parameters. The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and to more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. Unfortunately, rigorous statistical methods for incorporating longitudinally varying exposure information inside the case-control framework have not yet been properly developed. In this work, [...]

Ghosh and Rao (1994) provide a nice review of the different types of estimators and inferential procedures used in survey sampling and small area estimation.
Since sample surveys are generally designed for large areas, the estimates of means or totals obtained from them are reliable for large domains. Direct survey-based estimators for small domains often yield large standard errors due to the small sample size of the concerned area. This is due to the fact that the original survey was designed to provide accuracy at a much higher level of aggregation than for local areas. This makes it a necessity to borrow strength from adjacent or related areas to find indirect estimators that increase the effective sample size and thus increase the precision of the resulting estimate for a given small area. Broadly speaking, a small area model has a generalized linear form with a mean term, a random area-specific effect term and a measurement error term which reflects the noise for not sampling the entire domain.

During the last 10 to 15 years, model-based inference has been widely used in the small area context. This is mainly due to the wide range of functionalities that comes with the linear mixed effects modeling framework. Some of the main advantages of this framework are:

(i) Random area-specific effects account for between-area variation above and beyond that explained by auxiliary variables in the model.
(ii) Different variations like nonlinear mixed effects models, logistic regression models and generalized linear models can be entertained.
(iii) Area-specific measures of precision can be associated with each small area estimate, unlike the global measures.
(iv) Complex data structures like spatial dependence, time series structures and longitudinal measurements can be explored.
(v) Recent methodological developments for random effects models can be utilized to achieve accurate small area inferences.

Generally, there are two kinds of small area models depending on whether the response is observed at the area or the unit level.

1. Area (or aggregate) level models relate small area means to area-specific auxiliary variables.
2. Unit level models relate the unit values of the study variable to unit-specific auxiliary variables.
The basic area level model is given by

$$\theta_i = z_i'\beta + b_i v_i.$$

Here $\theta_i$ is often assumed to be a function of the population mean $\bar{Y}_i$ of the $i$th small area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, $v_i$'s are area-specific random [...]

In order to infer about the small area means $\bar{Y}_i$, direct estimators $\hat{\bar{Y}}_i$ are assumed to be known and available. The linear model

$$\hat{\theta}_i = \theta_i + e_i$$

is assumed, where the sampling errors $e_i$ are independent with $E_p(e_i \mid \theta_i) = 0$ and $V_p(e_i \mid \theta_i) = \psi_i$, with $\psi_i$ known, which implies that the $\hat{\theta}_i$ are design unbiased. By setting $\sigma_v^2 = 0$ in the linking model, we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining the linking model and the sampling model, we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i,$$

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent.

Fay and Herriot (1979) studied the above area level model in the context of estimating the per capita income (PCI) for small places in the United States and proposed an empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model-based county estimates of poor school-age children in the United States.

In the unit level model, it is assumed that unit-specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element $j$ in each small area $i$. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}.$$

Battese et al. (1988) studied the nested error regression model in estimating the area under corn and soybeans for counties in North Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means.

Over the years, numerous extensions have been proposed for the above modeling frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random effects structures. Rao (2003) presented a nice overview of the different estimation methods while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context.
A proper review of model-based small area estimation will be incomplete without explaining the EBLUP, EB and HB approaches that are being widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. So, they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between-area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown parameters with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications. In an EB approach, the posterior distribution of the parameters of [...] Morris (1983). Last but not least, in the HB approach, a prior distribution is specified on the model parameters and the posterior distribution of the parameter of interest is obtained. Inferences about the parameters are based on the posterior distribution. The parameter of interest is estimated by its posterior mean while its precision is estimated by its posterior variance. Recent advances in Markov chain Monte Carlo techniques, specifically Gibbs and Metropolis-Hastings samplers, have considerably simplified the computational aspect of HB procedures.
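As a rough illustration of how a BLUP balances between-area variation against the precision of the direct estimator, the sketch below computes the shrinkage estimator for an area level model with $b_i = 1$, an intercept-only mean and a between-area variance treated as known. The function name `fay_herriot_blup`, the intercept-only simplification and the toy numbers are assumptions made for this example; they are not taken from the applications cited above.

```python
def fay_herriot_blup(direct, psi, sigma2_v):
    """BLUP of small area means under theta_hat_i = mu + v_i + e_i.

    direct   : direct survey estimates theta_hat_i
    psi      : known sampling variances psi_i
    sigma2_v : between-area variance, treated as known (an assumption)
    """
    # Weighted least squares estimate of the common mean mu.
    weights = [1.0 / (sigma2_v + p) for p in psi]
    mu = sum(w * d for w, d in zip(weights, direct)) / sum(weights)
    # Weight placed on the direct estimate: sigma2_v / (sigma2_v + psi_i).
    shrink = [sigma2_v / (sigma2_v + p) for p in psi]
    # Each BLUP is a convex combination of the direct and synthetic estimates.
    blup = [g * d + (1.0 - g) * mu for g, d in zip(shrink, direct)]
    return mu, shrink, blup


# Hypothetical toy data: three areas with increasingly noisy direct estimates.
direct = [10.0, 14.0, 30.0]
psi = [1.0, 4.0, 25.0]
mu, shrink, blup = fay_herriot_blup(direct, psi, sigma2_v=4.0)
```

Areas with a large sampling variance are pulled strongly toward the synthetic estimate, while precisely measured areas stay close to their direct estimates. Replacing `sigma2_v` by a consistent estimate would give an EBLUP in the sense described above.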
The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. The SAIPE program also provides annual state and county level estimates of median household income. Generally, observations on various characteristics of small areas that are collected over time may possess a complicated underlying time-varying pattern. It is likely that models which take into account this longitudinal pattern in the observations may perform better than classical small area models which do not utilize this information. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal time-varying pattern in the response and the covariates.

Suppose the response $y$ and the covariate $x$ are related as

$$y_i = f(x_i) + e_i,$$

where $f(x)$ is an unknown and unspecified smooth function of $x$ and $e_i \sim N(0, \sigma_e^2)$. The basic problem of nonparametric regression is to estimate the function $f(\cdot)$ using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate $f(\cdot)$, viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions while smoothing splines may be better for sparse regions. Here, we will briefly review the main characteristics of splines in general and penalized splines in particular.
The basic idea behind splines is to express the unknown function $f(x)$ using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of $x$ known as knots. The knots, say $(\kappa_1, \ldots, \kappa_K)$, partition the range of $x$ into $K$ distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of certain degree is defined. A polynomial spline of degree $p$ has $(p-1)$ continuous derivatives and a discontinuous $p$th derivative at any interior knot. The $p$th derivative reflects the jump of the splines at the knots. Thus, a spline of degree 0 is a [...]

Here $(x - \kappa_k)_+^p$ is the function $(x - \kappa_k)^p I\{x > \kappa_k\}$. Using the above basis, a spline of degree $p$ can be expressed as

$$f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x - \kappa_k)_+^p.$$

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. $p = 1, 2, 3$ corresponds to a linear, quadratic or cubic spline respectively. The above basis constitutes one of the most commonly used bases, while other bases like radial bases or B-splines can also be used. It can be shown that there exists a very rich class of spline generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of the splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as sensors in relaying information about the underlying true observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting through overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be placed uniformly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of $x$ and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert 2002). Recently, there have been interesting contributions on knot [...] (Friedman (1991); Stone et al. (1997); Denison et al. (1998); Lindstrom (1999); DiMatteo et al.
(2001); Botts and Daniels (2008)). The flexibility and wide applicability of splines is due to the fact that, provided the knots are evenly spread out over the range of $x$, $f(x \mid \beta, \gamma)$ can accurately estimate a very large class of smooth functions $f(\cdot)$ even if the degree of the spline is kept relatively low (say, 1 or 2).

The spline coefficients $(\gamma_1, \ldots, \gamma_K)$ correspond to the discontinuous $p$th derivative of the spline; thus, they measure the jumps of the spline at the knots $(\kappa_1, \ldots, \kappa_K)$ and contribute to the roughness of the resulting spline. In order to smooth out the fit, a roughness penalty is placed on these parameters. This is often done by minimizing the expression

$$\sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda\, \gamma'\gamma,$$

where $\lambda$ is known as the smoothing parameter. This is synonymous with minimizing the first part of the expression subject to a constraint of the form $\gamma'\gamma \le \delta$. $\lambda$ plays a crucial role in the smoothing process since it controls the goodness of fit and roughness of the fitted model. Decreasing $\lambda$, the spline will tend to overfit, becoming an interpolating curve as $\lambda \to 0$. Increasing $\lambda$, the spline will become smoother and will tend to the least squares fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, like cross-validation, generalized cross-validation, Mallows' $C_p$ criterion etc.

Broadly speaking, there are three main types of splines: regression splines, smoothing splines and penalized splines (or P-splines). All of them are based on the same principle as detailed above but differ in the specific manner in which smoothing is done or the knots are selected. In regression splines, smoothing is achieved by the deletion of nonessential knots or, equivalently, by setting the jumps at those knots to zero while keeping the jumps at the other knots undisturbed. In smoothing and penalized splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using [...]. A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots but in the latter the number of knots is much smaller, resulting in more flexibility. In fact, penalized splines can be seen as a generalization of regression and smoothing splines.
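The role of the smoothing parameter can be seen in a small numerical sketch. The code below fits a linear penalized spline with a truncated polynomial basis by solving the penalized normal equations, penalizing only the coefficients of the truncated terms. The helper names (`truncated_power_basis`, `fit_pspline` and so on), the noise-free sine data and the two values of `lam` are all assumptions chosen for illustration, and the hand-written Gaussian elimination merely stands in for a proper linear algebra library.

```python
import math


def truncated_power_basis(x, knots, p=1):
    """Row [1, x, ..., x^p, (x-k_1)_+^p, ..., (x-k_K)_+^p]; assumes p >= 1."""
    poly = [x ** d for d in range(p + 1)]
    trunc = [max(x - k, 0.0) ** p for k in knots]
    return poly + trunc


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_pspline(xs, ys, knots, lam, p=1):
    """Minimize sum (y - f(x))^2 + lam * (sum of squared spline coefficients)."""
    C = [truncated_power_basis(x, knots, p) for x in xs]
    m = p + 1 + len(knots)
    A = [[sum(row[i] * row[j] for row in C) for j in range(m)] for i in range(m)]
    for j in range(p + 1, m):  # the polynomial part is left unpenalized
        A[j][j] += lam
    rhs = [sum(row[i] * y for row, y in zip(C, ys)) for i in range(m)]
    return solve(A, rhs)


def predict(theta, x, knots, p=1):
    return sum(t * b for t, b in zip(theta, truncated_power_basis(x, knots, p)))


def rss(theta, xs, ys, knots, p=1):
    return sum((y - predict(theta, x, knots, p)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: a sine curve on [0, 1] with 9 equally spaced knots.
xs = [i / 49 for i in range(50)]
ys = [math.sin(2 * math.pi * x) for x in xs]
knots = [0.1 * k for k in range(1, 10)]

theta_rough = fit_pspline(xs, ys, knots, lam=1e-6)   # nearly unpenalized
theta_smooth = fit_pspline(xs, ys, knots, lam=1e8)   # jumps shrunk toward zero
```

With the very large penalty the spline coefficients are shrunk essentially to zero and the fit collapses to the straight-line least squares fit, whereas the small penalty lets the piecewise linear spline follow the curve closely.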
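The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite the penalized criterion as

$$\sum_{i=1}^{n} \{y_i - f(x_i \mid \theta)\}^2 + \lambda\, \theta' D\, \theta,$$

where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)'$ and $D$ is a known positive semidefinite penalty matrix such that

$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1}_{K\times K} \end{pmatrix}.$$

The penalty $\lambda\,\gamma'\gamma$ corresponds to setting $\Sigma = I$.

Let $X$ be the matrix with the $i$th row $X_i = (1, x_i, \ldots, x_i^p)$ and $Z$ be the matrix with the $i$th row $Z_i = \{(x_i - \kappa_1)_+^p, \ldots, (x_i - \kappa_K)_+^p\}$. Using this formulation with the truncated polynomial basis and dividing by the error variance $\sigma_e^2$, we have

$$\frac{1}{\sigma_e^2}\,\|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_e^2}\,\|\gamma\|^2.$$

By treating $\gamma$ as a vector of random effects with $\mathrm{Cov}(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_e^2/\lambda$, and $\beta$ as the set of fixed effects parameters, the above penalized spline framework is equivalent to the linear mixed model

$$y = X\beta + Z\gamma + e,$$

where $\mathrm{Cov}(e) = \sigma_e^2 I$ and $\gamma$ and $e$ are independent.

Bayesian P-splines have recently become popular because they combine the flexibility of nonparametric models and the exact inference provided by the Bayesian inferential procedure. This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand 2003) as shown above. This equivalence also carries over to the manner in which smoothing is done. Smoothing can be achieved by imposing penalties on the spline coefficients, as in the penalized criterion, or by assuming a distributional form for $\gamma$, for example $\gamma \sim N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian context, priors are placed on $\sigma_\gamma^2$ and the other parameters and usual posterior sampling is carried out. Since samples are generated from the smoothing parameter alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above.

[...] (Lewis et al.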
1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951), Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and far-reaching contributions have been made in virtually every aspect of the field. Some of the notable ones are equivalence of prospective and retrospective likelihoods (Prentice and Pyke 1979), measurement error in exposures (Roeder et al. 1996) and matched case-control studies (Breslow et al. 1978). Important contributions in the Bayesian paradigm include binary exposures (Zelen and Parker 1986), continuous exposures (Muller and Roeder 1997), categorical exposures (Seaman and Richardson 2001), equivalence (Seaman and Richardson 2004) and matching (Diggle et al. (2000); Ghosh and Chen (2002)).

The analysis of complex data scenarios in a case-control framework is a relatively new area of research. Specifically, analysis of longitudinal case-control studies has only [...]

Park and Kim (2004) are among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case-control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independent correlation structure fail in this framework.

In view of the above challenges, we propose to use functional data analytic techniques, specifically nonparametric regression methodology, to model both the time [...] (Eilers and Marx (1996); Ruppert et al.
(2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time-varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible nonlinear time-varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and Ghosh, 1998). Once the optimal model is identified, model assessment is carried out using case-deletion diagnostics (Bradlow and Zaslavsky 1997).

[...] (Etzioni et al.
1999). This dataset is based on a biomarker-based screening procedure for prostate cancer to elucidate the association between prostate cancer and prostate-specific antigen (PSA). The effectiveness of biomarker-based screening procedures for prostate cancer is currently a topic of intense debate and investigation in the realms of health care practice, policy and research. Since the discovery of prostate-specific antigen (PSA) and the observation that serum PSA levels may be significantly increased in prostate cancer patients, a lot of effort has been dedicated to identifying effective PSA-based testing programs with favorable diagnostic properties.

In this study, the levels of free and total PSA were measured in the sera of 71 prostate cancer cases and 70 controls. Participants in this study included men aged 50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or beta-carotene and retinol. The intervention had no noticeable effect on the incidence of prostate cancer, with similar numbers of cases observed in the intervention and control arms. Several PSA measurements recorded for the cases were taken as long as 10 years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between September 1988 and September 1995 inclusive. The individuals deemed controls were selected among individuals not yet diagnosed as having cancer by the time of analysis. As the exposure variable, we use the natural logarithm of the total PSA (Ptotal), although the negative logarithm of the ratio of free to total PSA (Pratio) can also be considered. In addition to the above measurements, observations were collected on time (years) relative to prostate cancer diagnosis and age at blood draw for the cases [...] Figure 2-1 shows the PSA trajectory against age for some randomly chosen cases and controls.

Etzioni et al. (1999) analyzed this dataset by modeling the receiver operating characteristic (ROC) curves associated with both the biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis. They observed that although the two markers performed similarly eight years prior to diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.
The rest of the chapter is organized as follows. In Section 2.2, we introduce the semiparametric modeling framework. Section 2.3 describes the details of posterior inference. In Section 2.4, we discuss relevant Bayesian equivalence results for our framework. Section 2.5 outlines the model comparison and model assessment procedures we performed. We describe the data analysis results based on the prostate cancer data set in Section 2.6 and end with a discussion in Section 2.7.

2.2.1 Notation

[...]

Figure 2-1. Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.

Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two-stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for modeling the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both the stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key difference between their framework and ours is that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models. Moreover, instead of a linear modeling framework, we use a combination of linear and logistic models since our response is binary.

[...]

$$Y_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij},$$

where $e_{ij} \sim N(0, \sigma_e^2)$, $f(a)$ is the population mean function modeling the overall PSA trend as a function of age for all the subjects, while $g_i(a)$ is the subject-specific deviation function reflecting the deviation of the $i$th subject-specific profile from the mean population profile.
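The reason for modeling exposure as a function of age is that for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point controlling for the time with respect [...]

We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using P-splines as follows

$$f(a_{ij}) = \Phi_{p,\tau}(a_{ij})'\beta, \qquad g_i(a_{ij}) = \Phi_{q,\kappa}(a_{ij})'b_i,$$

where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij} - \tau_1)_+^p, \ldots, (a_{ij} - \tau_K)_+^p]'$ and $\Phi_{q,\kappa}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij} - \kappa_1)_+^q, \ldots, (a_{ij} - \kappa_M)_+^q]'$ are truncated polynomial basis functions of degrees $p$ and $q$ with knots $(\tau_1, \ldots, \tau_K)$ and $(\kappa_1, \ldots, \kappa_M)$ respectively (Durban et al. 2004). Generally, $M \le K$.

[...]

$$P(D_i = 1 \mid X_i(t + a_{di}),\ t \in [-c, 0]) = L\Big(\alpha + \int_{-c}^{0} X_i(t + a_{di})\,\gamma(t + a_{di})\,dt\Big),$$

where $L(\cdot)$ is the logistic distribution function, $X_i(t + a_{di})$ is the true, error-free unobserved subject-specific exposure profile modeled as $f(t + a_{di}) + g_i(t + a_{di})$, while $\gamma(t + a_{di})$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the $i$th subject. In the disease model, we use the relation $a_{ij} = t_{ij} + a_{di}$ to model the exposure trajectory $X(\cdot)$ and the influence function $\gamma(\cdot)$ as a function of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. $c$ is the time by which we go back in the past to record the exposure history for the $i$th subject; e.g. $c = 8$ would imply that, for the $i$th subject, the exposure observations recorded since eight years prior to diagnosis are being considered for analysis. Thus, by changing the value of $c$, the effect of different lengths of PSA trajectories on the current disease status can be studied.

[...]

$$\gamma(t + a_{di}) = \Phi_{r,\xi}(t + a_{di})'\gamma,$$

where $\Phi_{r,\xi}(t + a_{di}) = [1, (t + a_{di}), \ldots, (t + a_{di})^r, (t + a_{di} - \xi_1)_+^r, \ldots, (t + a_{di} - \xi_K)_+^r]'$, $\gamma = (\gamma_0, \ldots, \gamma_{K+r})'$ and $(\xi_1, \ldots, \xi_K)$ are the knots.

As special cases of the disease model, we may consider $\gamma(t + a_{di}) = \gamma_0$, in which case the covariate is the area under the PSA process $\{X_i(t + a_{di}), -c \le t \le 0\}$ and $\gamma_0$ is its effect on the disease probability (or logit of the disease probability). We can also assume $\gamma(t + a_{di}) = \gamma_0 + \gamma_1 (t + a_{di})$, which signifies a linear pattern of the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages.

Replacing the spline representations of the exposure profile and the influence function in the R.H.S. of the disease model, we have

$$P(D_i = 1 \mid X_i(t + a_{di}),\ t \in [-c, 0]) = L(\alpha + \beta' M_i \gamma + b_i' Q_i \gamma),$$

where $M_i = \int_{-c}^{0} \Phi_{p,\tau}(t + a_{di})\,\Phi_{r,\xi}(t + a_{di})'\,dt$ and $Q_i = \int_{-c}^{0} \Phi_{q,\kappa}(t + a_{di})\,\Phi_{r,\xi}(t + a_{di})'\,dt$.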
For prechosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are matrices and are available in closed form. We assume normal distributional forms for the spline coefficients of the exposure and influence functions in order to penalize the jumps of the spline at the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$ and $\gamma_{k+r} \sim N(0, \sigma_\gamma^2)$ $(k = 1, \ldots, K)$. Finally, the random subject-specific deviation function $g_i(a_{ij})$ is modeled as $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\ j = 0, \ldots, q)$.

2.3.1 Likelihood Function

The likelihood for the $i$th subject, conditional on the random effects, is given by

[...]

where $p(Y_i \mid \beta, a_i, b_i, \sigma_e^2)$ is the probability distribution corresponding to the trajectory model, $p(D_i \mid \alpha, \beta, \gamma)$ denotes the logistic distribution corresponding to the disease model, while the rest deals with the distributional structures on the spline coefficients and random effects.

Since the trajectory model has a normal distributional structure while the disease model has a logistic structure, the likelihood function and hence the posterior have a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well-known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 3.3.

Likelihood Approximation

[...] Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture of normal regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.
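The closeness of the logit link and a Student-t link with 8 degrees of freedom can be checked numerically. The sketch below compares the logistic distribution function with a rescaled t distribution function on a grid. The rescaling constant is obtained here by matching variances, which is an assumption of this example rather than a constant quoted from Albert and Chib (1993), and the t distribution function is computed by a simple Simpson rule so that only the standard library is needed.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def t_pdf(x, df):
    """Density of a standard Student-t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1.0 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, n=2000):
    """Student-t cdf via composite Simpson's rule on [0, |x|] (n must be even)."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / n
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3.0
    return 0.5 + half if x > 0 else 0.5 - half


DF = 8
# Assumed scale: match Var(logistic) = pi^2 / 3 with Var(t_8) = 8 / 6,
# so that a t_8 variable is roughly SCALE times a logistic variable.
SCALE = math.sqrt((DF / (DF - 2)) / (math.pi ** 2 / 3))

grid = [-6 + 0.1 * i for i in range(121)]
max_gap = max(abs(logistic_cdf(x) - t_cdf(SCALE * x, DF)) for x in grid)
```

Here `max_gap` is the largest absolute difference between the two distribution functions over the grid; a small value is what justifies replacing the logistic disease model by a latent t model, which in turn admits a normal scale-mixture representation.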
As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that $D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let $Z_i$ be independently distributed from a t distribution with location $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$, scale parameter 1 and $\nu$ degrees of freedom. Equivalently, with the introduction of the additional random variable $\lambda_i$, the distribution of $Z_i$ can be expressed as a scale mixture of normal distributions

$$Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \qquad \lambda_i \sim \mathrm{Gamma}(\nu/2, \nu/2).$$

[...]

Since the marginal posterior distribution of the parameters is analytically intractable, we construct an MCMC algorithm to sample from its full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin 1992).

[...] (1.2), Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model can be shown to be equivalent to that based on a retrospective one. As a result, a prospective modeling framework can be used to analyze case-control data, which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section 2.2.2) to analyze the PSA dataset.

Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations are available. We use this exposure trajectory or exposure profile in analyzing the present [...] Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e. $\{X_i(t), -c \le t \le 0\}$ can be assumed to be a discrete random variable with support $\{Z_1(t), \ldots, Z_J(t), -c \le t \le 0\}$, the set of all observable exposure trajectories, where $\{Z_j(t), -c \le t \le 0, j = 1, \ldots, J\}$ is a finite collection of elements in the support of the $X_{ij}$'s. Let $Y_{0j}$ and $Y_{1j}$ be the number of controls and cases having exposure profile $\{Z_j(t), -c \le t \le 0\}$. We denote the null or baseline trajectory as $\{X(t) = 0, -c \le t \le 0\}$.
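The odds ratio of disease corresponding to $\{Z_j(t), -c \le t \le 0\}$ with respect to baseline exposure is $\exp\{\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\}$. Assuming that a control has exposure profile $\{Z_j(t), -c \le t \le 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that

$$P(X(t) = Z_j(t),\ -c \le t \le 0 \mid D = 1) = \frac{\delta_j \exp\{\int_{-c}^{0} Z_j(t)\gamma(t)\,dt\}}{\sum_{k=1}^{J} \delta_k \exp\{\int_{-c}^{0} Z_k(t)\gamma(t)\,dt\}}.$$

[...] since $\gamma(t) = \Phi_{r,\xi}(t)'\gamma = \gamma'\Phi_{r,\xi}(t)$ by the spline representation of the influence function. We assume $\delta_1 = 1$ for identifiability. Here $d = 0$ and $1$ stand for controls and cases respectively. Assuming $\vartheta$ to be the baseline odds of disease, the prospective likelihood is given by

[...]

Based on the above setup, we have the following equivalence results:

[...] with respect to $\delta$ is the same as that obtained by maximizing $L(\vartheta, \gamma)$ with respect to $\vartheta$.

(ii) Assuming $\delta = (\delta_1, \ldots, \delta_J)$ and $\theta_j = \delta_j / \sum_{k=1}^{J} \delta_k$, the posterior density of $(\theta, \gamma)$ is [...]

(iii) The marginal posterior densities of $\gamma$ obtainable from $p(w, \gamma \mid y)$ and $p(\theta, \gamma \mid y)$ are the same.

The proofs of the above theorem are similar in nature to those in Seaman and Richardson (2004) and are given in Appendix A. Since we have considered a near uniform prior for the relevant parameter, and our priors ensure the existence and finiteness of the required expectation, the conditions of Theorem 2 are essentially satisfied for our framework.

Based on the above results, it can be concluded that the marginal posterior distribution of the parameter of interest, $\gamma$, will be the same regardless of whether we fit a prospective or retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general case of a multicategory case-control setup, i.e. when there are multiple (> 2) disease states. We have the following result

[...]

The proof of the above theorem is given in Appendix A.

2.5.1 Posterior Predictive Loss

[...] Gelfand and Ghosh (1998). This criterion is based on the idea that an optimal model should provide accurate prediction of a replicate of the observed data.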
Gelfand and Ghosh (1998) obtained this criterion by minimizing the posterior loss for a given model and then, for all models under consideration, selecting the one which minimizes this criterion. For a general loss function, this criterion can be expressed as a linear combination of two distinct parts, i.e. a goodness-of-fit part and a penalty part. For our framework, the posterior predictive loss can be written as

[...] $k + 1$ [...] $\sum_{i=1}^{N} \mathrm{Var}(\hat{D}_i)$

where $\hat{D}_i = E(D_i^{rep} \mid y, D)$ and $\mathrm{Var}(\hat{D}_i) = \mathrm{Var}(D_i^{rep} \mid y, D) = E(D_i^{rep} \mid y, D) - (E(D_i^{rep} \mid y, D))^2$. For our framework, $D^{rep} = (D_1^{rep}, \ldots, D_N^{rep})$ is the replicated disease status vector for all the subjects. It is straightforward to calculate the expected value of the above criterion using the posterior samples obtained from the Gibbs sampler. Lower values of this criterion would imply a better model fit. We assume $k = 1$ and obtain the values of posterior predictive loss for different lengths of exposure trajectories and different numbers of knots. The results are given in Table 2-3 and explained in Section 2.6. For the optimal model selected using the posterior predictive loss criterion, model assessment was performed using kappa measures of agreement and case-deletion diagnostics. The methodology is described below.

[...] (Agresti 2002), which compares agreement against that which might be expected by chance. The value of $\kappa$ ranges from $-1$ to $1$; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies complete disagreement. A value of 0 indicates no agreement above and beyond that expected by chance.

The observed disease status (i.e. case or control status) of a subject is obtained from the dataset while the predicted disease status is calculated from the posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can calculate the quantity $\hat{p}_i^{(n)} = \hat{P}^{(n)}(D_i = 1 \mid X_i(t + a_{di}), t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \gamma + b_i' Q_i \gamma)$, where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign

$$\hat{D}_i^{(n)} = \begin{cases} 1 & \text{if } \hat{p}_i^{(n)} > 0.5 \\ 0 & \text{if } \hat{p}_i^{(n)} \le 0.5 \end{cases}$$

[...] (Hampel et al. 1987). These diagnostics can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors.
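The kappa measure of agreement between observed and predicted disease status, computed at a single iteration of the sampler, can be sketched as follows. The function names `classify` and `cohen_kappa` and the small vectors of fitted probabilities are hypothetical; the formula is the usual chance-corrected agreement, $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed proportion of agreement and $p_e$ is the proportion expected by chance.

```python
def classify(probabilities, cutoff=0.5):
    """Predicted disease status: 1 if the fitted probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]


def cohen_kappa(observed, predicted):
    """Chance-corrected agreement between two label vectors of equal length."""
    n = len(observed)
    p_obs = sum(o == p for o, p in zip(observed, predicted)) / n
    labels = set(observed) | set(predicted)
    # Agreement expected by chance, from the marginal label frequencies.
    p_exp = sum((observed.count(l) / n) * (predicted.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1.0 - p_exp)  # undefined when p_exp == 1


# Hypothetical fitted probabilities for eight subjects at one iteration.
status = [1, 1, 1, 1, 0, 0, 0, 0]
fitted = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.6, 0.1]
kappa = cohen_kappa(status, classify(fitted))
```

Repeating this calculation at every retained iteration yields a posterior sample of kappa values, from which a posterior mean and credible interval can be read off.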
Bradlow and Zaslavsky (1997) applied case influence tools in [...]

Let $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$ and $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\kappa}(a_{ij})'b_i$. Suppose $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ is the density function corresponding to the trajectory model, while $L(D_i \mid H_i)$ is the one for the disease model. We worked with the following three types of weighting schemes based on those proposed by Bradlow and Zaslavsky (1997)

[...]

Here $n$ denotes the $n$th iteration of the Gibbs sampler, the subscript $i$ denotes the deletion of $y_i$ and the superscript marks unnormalized weights. In the last weighting scheme, $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ and $L(D_i \mid H_i)$ are the usual likelihoods with the population-level parameters, i.e. $(\alpha, \beta, \gamma, \sigma_e^2)$, replaced by the full-data posterior medians. Here the full-data posterior is the posterior distribution obtained from the complete dataset, i.e. the one having all the subjects.

[...] 2.2 to analyze the prostate cancer dataset described in Section 2.1.2. Multiple observations on free and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some subjects, observations were collected as far as 10 years prior to diagnosis. We use the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal [...]

For the purpose of our analysis, we have used a linear P-spline ($p = 1$) with a subject-specific slope parameter to model the exposure trajectory as follows

[...]

For the prospective disease model, we considered two specific scenarios, viz. constant influence, $\gamma(t + a_{di}) = \gamma_0$, and linear influence, $\gamma(t + a_{di}) = \gamma_0 + \gamma_1 (t + a_{di})$. The results for these two cases are summarized below.

On fitting the above model, we observed that for all trajectory lengths, $\gamma_0$ is significant (its 95% credible interval does not contain 0). For any particular interval (i.e. choice of $c$), the posterior means and 95% credible intervals of $\gamma_0$ do not change much with the number of knots ($K$). In addition, $\gamma_0$ increases as the trajectory length decreases, i.e. as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process, but it also seems to support the well-known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al.
1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval $I = (-10, -5)$ and 3 knots in the trajectory. The posterior mean of $\gamma_0$ is 0.298 [...]

2.6.3 [...]

Parameterizing $\{Z_i(t + a_{di}), -c \le t \le 0\}$ as $\Phi_{p,\tau}(t + a_{di})'\beta^{*} + \Phi_{q,\kappa}(t + a_{di})'d_i$, as in the exposure model, we can rewrite the odds ratio as

$$\exp\Big\{(\beta^{*} - \beta)' \int_{-c}^{0} \Phi_{p,\tau}(t + a_{di})\,\Phi_{r,\xi}(t + a_{di})'\,dt\;\gamma\Big\}\, \exp\Big\{(d_i - b_i)' \int_{-c}^{0} \Phi_{q,\kappa}(t + a_{di})\,\Phi_{r,\xi}(t + a_{di})'\,dt\;\gamma\Big\}.$$

[...]

$$\exp\Big\{m \int_{-c}^{0} \gamma(t + a_{di})\,dt\Big\} = \exp\{c\,m\,(\gamma_0 + (a_{di} - c/2)\,\gamma_1)\}.$$

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and age at diagnosis when $m = 0.5$. For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al. 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e. as past exposure observations are taken into account. However, the rate of increase is higher for lower age at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios decrease as longer exposure trajectories are considered. This may imply that for a subject with very high age at diagnosis, his/her past exposure observations may not contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

Age    (-3, 0)    (-5, 0)    (-8, 0)    (-10, 0)

As before, we fitted the disease model on the interval $I = (-10, -5)$. The posterior mean and 95% credible interval of $\gamma_0$ and $\gamma_1$ are respectively 1.24 (0.29, 2.19) and $-0.015$ ($-0.029$, $-0.003$), implying that exposure observations recorded 5 to 10 years prior to diagnosis also have a significant effect on the current disease status. The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.
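The closed form for the odds ratio under a vertical shift of the exposure trajectory can be checked with a short calculation. The sketch below evaluates it for a linear influence function over a general window, which reduces to the expression above when the window is $(-c, 0)$, and compares it with a numerical integral. The names `gamma0` and `gamma1` for the intercept and slope of the influence function are labels chosen for this example, the slope is assumed to be negative (consistent with odds ratios that decrease with age at diagnosis), and plugging in posterior means gives only a rough guide because the posterior mean of an odds ratio is not the odds ratio at the posterior means.

```python
import math


def shift_odds_ratio(m, gamma0, gamma1, age_dx, lo, hi):
    """Odds ratio for a vertical shift m of the exposure trajectory over the
    window lo <= t <= hi (years relative to diagnosis), assuming the linear
    influence function gamma0 + gamma1 * (t + age_dx)."""
    width = hi - lo
    midpoint = 0.5 * (lo + hi)
    return math.exp(m * width * (gamma0 + gamma1 * (age_dx + midpoint)))


def shift_odds_ratio_numeric(m, gamma0, gamma1, age_dx, lo, hi, n=10000):
    """The same quantity by the midpoint rule, as a check on the closed form."""
    h = (hi - lo) / n
    integral = h * sum(
        gamma0 + gamma1 * (lo + (i + 0.5) * h + age_dx) for i in range(n)
    )
    return math.exp(m * integral)


# Plug-in values: reported posterior means on I = (-10, -5), slope sign assumed.
G0, G1 = 1.24, -0.015
plug_in = {
    age: shift_odds_ratio(0.5, G0, G1, age, -10.0, -5.0) for age in (50, 60, 70, 80)
}
```

The plug-in values decrease with age at diagnosis, in line with the pattern described for the linear influence model, although they need not coincide with the posterior means of the odds ratios.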
Table 2-2. Posterior means and 95% confidence intervals of odds ratio for $I = (-10, -5)$ for the linear influence model

Age at Diagnosis   50              60             70             80
Mean               4.99            3.27           2.22           1.56
95% C.I.           (1.96, 10.41)   (1.91, 5.36)   (1.67, 2.98)   (1.10, 2.29)

2-3

The PPL values for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across all trajectory lengths. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. A further increase in the number of knots tends to worsen the fit; this agrees with the findings of Ruppert (2002). The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. The best-fitting constant influence model seems to be the one with exposure trajectory $(-10, 0)$ and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures results in an improvement of model fit. This may be indicative of the fact that past exposure observations contain a significant amount of information about the current disease status. In addition, for the trajectory interval $I = (-10, -5)$, the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (has the lowest PPL criterion among all the models considered).

Table 2-3. Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
Knots   Model   (-2,0)   (-5,0)   (-8,0)   (-10,0)   (-10,-5)
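For readers unfamiliar with the posterior predictive loss criterion, the following is a minimal sketch of how it is commonly computed under squared error loss from replicated data drawn at each MCMC iteration: a penalty term (the sum of predictive variances) plus a weighted goodness-of-fit term (the sum of squared deviations of the predictive means from the observations). The function name, the argument layout and the default weight are assumptions of this illustration; the text does not state which weight or which replicate scheme was used.

```python
from statistics import fmean, pvariance


def posterior_predictive_loss(y_obs, y_rep_draws, k=float("inf")):
    """Return (D_k, P, G) for squared error loss.

    y_obs       : list of observed responses, length n
    y_rep_draws : list of MCMC draws, each a list of n replicated responses
    k           : weight on the goodness-of-fit term; k = infinity gives D = P + G
    """
    n_obs = len(y_obs)
    # Collect the replicates for each observation across all draws.
    columns = [[draw[l] for draw in y_rep_draws] for l in range(n_obs)]
    penalty = sum(pvariance(col) for col in columns)  # P: predictive variances
    goodness = sum((fmean(col) - y) ** 2 for col, y in zip(columns, y_obs))  # G
    weight = 1.0 if k == float("inf") else k / (k + 1.0)
    return penalty + weight * goodness, penalty, goodness
```

Smaller values indicate a better trade-off between fit and predictive variability, which is how the comparison across knots and trajectory lengths above should be read.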
For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis. We deleted each subject (with all the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case-deleted posterior means and 95% credible intervals for selected model parameters.

Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case-deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences.) Surprisingly, the observation for case number 108 has a significant departure from the rest. On analyzing this subject, it was found that it had the unique combination of very high age and very high values of PSA. In fact, it had the highest mean age in the sample and the highest age at diagnosis, along with the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease.

We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components. None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory $I = (-8, 0)$ and 4 knots seems to fit the observed data relatively well.

Figure 2-2. Sensitivity of parameter estimates and disease probability estimates to case deletions.
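The agreement measure reported at the start of this assessment can be illustrated with a short sketch. It assumes that the statistic is Cohen's kappa computed between observed and predicted binary disease status, and that it would be evaluated once per MCMC draw to obtain a posterior distribution; both points, and the function name, are assumptions of this illustration rather than details given in the text.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for two binary (0/1) label sequences of equal length.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from the two marginal rates.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    agree = sum(1 for o, p in zip(observed, predicted) if o == p) / n
    rate_obs = sum(observed) / n
    rate_pred = sum(predicted) / n
    chance = rate_obs * rate_pred + (1 - rate_obs) * (1 - rate_pred)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (agree - chance) / (1 - chance)
```

A value near 0 means agreement no better than chance and a value near 1 means near-perfect agreement, which is the scale on which the reported posterior mean of about 0.6 is interpreted.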
In this work, we have applied semiparametric regression techniques in analyzing longitudinal case-control studies. We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e., when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a P-spline to account for any time-varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way, which allowed us to consider differential lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the logit-mixture of normals approximation (Albert and Chib 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the data set was collected retrospectively. The analysis has been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case-control

We analyzed our model using differential lengths of exposure trajectories. In doing so, we have concluded that past exposure observations do provide significant information towards predicting the current disease status of a subject. Specifically, we have shown that across all age-at-diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones. We also observed that for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh 1998
). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case-deletion diagnostics. Both these tools suggested that our model fits the data relatively well.

Some interesting extensions can be made to our setup. For richer data sets, it will be interesting to model the subject-specific deviation functions as P-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger data set, a P-spline formulation can also be used for the influence function, which may bring out any underlying nonlinear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multicategory disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS).
Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999. This is because the most recent census estimates correspond to the year 1999 and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 uses as covariates the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the

Bell 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, both in the 1989 and 1999 models.

Fay (1987) in this regard. Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time-specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four- and five-person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates. In general, the time series model always performed better than its non-time-series counterpart.

Opsomer et al.
(2008) in which they combined small area random effects with a smooth, nonparametrically specified trend using penalized splines. In doing so, they expressed the nonparametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple nonparametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial data set concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have viewed the state-specific annual household median income values as longitudinal profiles or income trajectories. This gained more ground because we used the statewide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying nonlinear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx 1996), which is a commonly used but powerful function estimation tool in nonparametric inference. The P-spline is

Figure 3-1. Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).

Gelfand and Ghosh 1998) has been used to obtain the parameter estimates.
We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test for their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on. We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the gold standard since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state level models. The fact that our estimates still improve on the SAIPE model-based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data analysis with regard to the median household income data set. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness of fit of our models. We end with a discussion in Section 3.6. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. (Panel B: IRS median income plot.)

3.2.1 General Notation

where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate relationship.

We approximate $f(x_{ij})$ using a P-spline and rewrite ( 3 ) as

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$ is our target of inference.
Here $X_{ij} = (1, x_{ij}, \dots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \dots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \dots, \beta_p)'$ is the vector of regression coefficients, while $\gamma = (\gamma_1, \dots, \gamma_K)'$ is the vector of spline coefficients. The above spline model with degree $p$ can adequately approximate any unspecified smooth function. Typically, linear ($p = 1$) or quadratic ($p = 2$) splines serve most practical purposes since they ensure adequate smoothness in the fitted curve. Here $m$ and $t$ respectively denote the number of small areas and the number of time points at which the response and covariates are measured. Thus, in our case, $m = 51$, for all the 50 states of the U.S. and the District of Columbia, and $t = 5$ for the years 1995-1999. $b_i$ is a state-specific random effect, while $u_{ij}$ represents an interaction effect between the $i$th state and the $j$th year. We assume $b_i \overset{i.i.d.}{\sim} N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls the amount of smoothing of the underlying income trajectory. Moreover, it is assumed

3.1.1. In the data sets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots $(\tau_1, \dots, \tau_K)$ are usually placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

From ( 3 ) and ( 3 ), we have

3 ) and modeled it as a random walk as follows

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Before proceeding to the next section, we may note that unlike the models of Ghosh et al. (1996), the models given in ( 3 ) and ( 3 ) incorporate state-specific random effects ($b_i$). This rectifies a limitation of the former, as pointed out in Rao (2003).

3.3.1 Likelihood Function

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal distributions with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.

For the random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \psi^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$, where $v = (v_1, \dots, v_t)$ is the vector of time-specific random effects. Thus, the likelihood function for the $i$th state will have an extra component corresponding to $v$ as follows

where $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal distribution with mean $v_{j-1}$ and variance $\sigma_v^2$, where $v_0 = 0$.
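The truncated polynomial design vectors described in the general notation can be written down directly. The sketch below builds the polynomial part and the truncated power part for a single covariate value and evaluates the resulting spline; the function names and the use of plain Python lists are choices made for this illustration, and the coefficient values in any call are hypothetical.

```python
def truncated_power_row(x, knots, degree=1):
    """Return (X, Z) for one covariate value x.

    X = (1, x, ..., x**degree) is the polynomial part and
    Z = ((x - knot)_+ ** degree for each knot) is the truncated power part.
    """
    poly = [x ** d for d in range(degree + 1)]
    trunc = [max(x - knot, 0.0) ** degree for knot in knots]
    return poly, trunc


def spline_value(x, knots, beta, gamma, degree=1):
    """Evaluate X'beta + Z'gamma, the spline part of the linear predictor."""
    poly, trunc = truncated_power_row(x, knots, degree)
    if len(beta) != len(poly) or len(gamma) != len(trunc):
        raise ValueError("coefficient vectors do not match the basis size")
    return (sum(b * v for b, v in zip(beta, poly))
            + sum(g * v for g, v in zip(gamma, trunc)))
```

Each spline coefficient changes the slope (for a linear spline) of the fitted curve to the right of its knot, which is why shrinking those coefficients toward zero through their common variance acts as a smoothing penalty.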
Thus, we have the following priors: $\beta \sim \mathrm{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \dots, t)$, $(\sigma_b^2)^{-1} \sim G(c, d)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the expression $f(x) \propto x^{a-1} \exp(-bx)$, $x \ge 0$. Since we have chosen an improper prior for $\beta$, propriety of the full posterior has been shown. We have the following theorem

For the random walk model, there will be an additional term for $\sigma_v^2$. By the conditional independence properties, we can factorize the full posterior as

$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \dots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \dots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2]\,[\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_\gamma^2]\,[\sigma_b^2] \prod_{j=1}^{t} [\psi_j^2]$

Gelman and Rubin (1992) and run $n$ ($\ge 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

3.2.2 to analyze the median household income data set referred to in Section 3.1.3. The response variable $Y_{ij}$ and the covariates $X_{ij}$ respectively denote the CPS median household income estimate and the corresponding IRS mean (or median) income estimate for the $i$th state at the $j$th year ($i = 1, \dots, 51$; $j = 1, \dots, 5$). The state-specific mean or median income figures are obtained from IRS tax return data. The Census Bureau gets files of individual tax return data from the IRS for use in specifically approved projects such as SAIPE. For each state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI) across all the tax returns in that state. Like other SAIPE model covariates obtained from administrative records data, these variables do not exactly measure the median income across all households in the state. One of the reasons for this is that the AGI would not necessarily be the same as the exact income figure and the tax return universe does not cover the entire population, i.e., some households do not need to file tax returns, and those that do not are likely to differ in regard to income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not necessarily to be the same thing. Specifically for this study, we have used IRS mean income as our covariate. This is because it seems to possess

3-2A), and so it is more suited to a semiparametric analysis.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 3.2.2. We have used the truncated polynomial basis for the P-spline component in both the models. Since Figure 3-2A does not indicate a high degree of nonlinearity, we have restricted

Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and with a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.

We fitted Model I (SPM) with all possible knot choices from 0 to 40, but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all the four comparison measures. Addition of more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model-based estimates were slightly superior to the SPM estimates.
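The definitions of the four comparison measures did not survive in this copy of the text, but the column labels ARB, ASRB, AAB and ASD used in the later tables match the measures commonly used in the small area literature: average relative bias, average squared relative bias, average absolute bias and average squared deviation, each taken over the areas with the census value as the benchmark. The sketch below assumes those definitions; the function name and argument names are hypothetical.

```python
def comparison_measures(census, estimates):
    """Return a dict with ARB, ASRB, AAB and ASD.

    Assumed definitions, with c_i the census value and e_i the estimate:
      ARB  = mean(|c_i - e_i| / c_i)
      ASRB = mean((c_i - e_i)**2 / c_i**2)
      AAB  = mean(|c_i - e_i|)
      ASD  = mean((c_i - e_i)**2)
    """
    n = len(census)
    if n == 0 or n != len(estimates):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [c - e for c, e in zip(census, estimates)]
    return {
        "ARB": sum(abs(d) / c for d, c in zip(diffs, census)) / n,
        "ASRB": sum(d * d / (c * c) for d, c in zip(diffs, census)) / n,
        "AAB": sum(abs(d) for d in diffs) / n,
        "ASD": sum(d * d for d in diffs) / n,
    }
```

Under these definitions smaller values are better for every measure, and the two relative measures are unit-free while AAB and ASD are on the dollar and squared-dollar scales.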
Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory and the random walk component does not lead to any further improvement. Last but not least, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots.

It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit on the CPS median income observations. The same is true for the coefficients of the SPM.

Table 3-1. Parameter estimates of SPRWM with 5 knots
Parameter   Mean   Median   95% CI

3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the fact that the semiparametric models have room for improvement. In this section, we will look for any possible deficiencies in our models and will try to come up with some improvements, if there are any.

As mentioned in Section 3.4.1, selection and proper positioning of knots plays a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.
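Because knots placed at sample quantiles follow the density of the covariate, they cluster where observations are dense and can leave a sparse region uncovered. The sketch below contrasts plain quantile knots with a split placement that reserves part of the knots for the region above a chosen boundary, which is the kind of remedy the discussion of knot positioning goes on to consider. The function names, the linear-interpolation quantile rule and the convention that the sparse region receives the smaller share when the knot count is odd are all assumptions of this illustration.

```python
def sample_quantile_knots(values, count):
    """Return `count` knots at equally spaced sample quantiles of `values`."""
    ordered = sorted(values)
    if count > 0 and not ordered:
        raise ValueError("no observations available in this region")
    knots = []
    for k in range(1, count + 1):
        # Fractional index of the k/(count + 1) quantile, linearly interpolated.
        position = (len(ordered) - 1) * k / (count + 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        knots.append(ordered[lower]
                     + (position - lower) * (ordered[upper] - ordered[lower]))
    return knots


def split_region_knots(values, count, boundary):
    """Place knots by quantiles separately on each side of `boundary`."""
    dense = [v for v in values if boundary >= v]
    sparse = [v for v in values if v > boundary]
    sparse_count = count // 2  # assumed split; the odd knot goes to the dense side
    return (sample_quantile_knots(dense, count - sparse_count)
            + sample_quantile_knots(sparse, sparse_count))
```

With a covariate that is dense at low values and sparse at high values, the plain quantile rule can put every knot below the boundary, whereas the split rule guarantees coverage on both sides.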
Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure. Thus, in both figures, the coverage area of the knots (i.e., the part of the observational pattern which is captured by the knots) is the region to the left of the dotted vertical lines. On the other hand, the nonlinear pattern is tangible only in the low density area of the plot, i.e., the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying nonlinear observational pattern is not being adequately captured.

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the boldfaced triangles at the bottom. (Panel B: Positioning of 7 knots.)

As a natural solution to this issue, we decided to place half of the knots in the low density region of the graph and the other half in the high density region. The exact boundary line between the high density and low density regions is hard to determine. We tested different alternatives and came up with IRS mean = 47000 as a tentative boundary because it gave the best results. In both regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots.

Figure 3-4. Positions of 5 knots after realignment. The knots are the boldfaced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.

It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot
rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e., beyond an IRS mean of 56000), possibly due to the very low density of the observations in that area. Overall, it seems that the new knots can capture some of the underlying nonlinear pattern in the data set which the old knots failed to capture. We also experimented by placing all the knots in the low density region (beyond IRS mean = 47000), but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.

We have worked with 5 knots because this choice performed consistently well for both the SPM and SPRWM models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports the comparison measures and Table 3-3 depicts the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5) and SPRWM(5) respectively denote the semiparametric models with the realigned 5 knots.

Table 3-2. Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0415   0.0027   1,753.33   5,300,023
SAIPE       0.0326   0.0015   1,423.75   3,134,906
SPM(5)      0.0280   0.0012   1,173.71   2,334,379
SPRWM(5)    0.0295   0.0013   1,256.08   2,747,010

Table 3-3. Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates

Estimate   Model       ARB      ASRB     AAB      ASD
SAIPE      SPM(5)      14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)    9.51%    13.33%   11.78%   12.37%
CPS        SPM(5)      32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)    28.92%   51.85%   28.36%   48.17%

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially so for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that the semiparametric model estimates perform better than the SAIPE estimates with the realigned knots. This improvement is apparently due to the additional coverage of the observational pattern that is being achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment

Table 3-4 and Table 3-5 report the posterior mean, median and 95% CI for the parameters in SPM(5) and SPRWM(5) respectively.

Table 3-4. Parameter estimates of SPM(5)

Table 3-5. Parameter estimates of SPRWM(5)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e., the $\gamma$'s) are significant for both SPM and SPRWM. For the old configuration, some of the knot coefficients were not significant for the models. This corroborates the fact that, with the knot realignment, all the five knots are significantly contributing to the curve fitting process in terms of capturing the true underlying nonlinear pattern in the observations.

Ghosh et al. (1996), henceforth referred to as the GNK model. Their univariate model is as follows

where $(b_j \mid b_{j-1}) \sim N(0, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.

where $b_i \overset{i.i.d.}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distributions as above. Clearly, the only difference between ( 3 ) and ( 3 ) is that the former contains a time-specific random component while the latter contains an area-specific random component.

Ghosh et al.
(1996) showed that the estimates from the bivariate version of the GNK model ( 3 ) perform much better than the Census Bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 depicts the comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates

Estimate    ARB      ASRB     AAB        ASD
CPS         0.0415   0.0027   1,753.33   5,300,023
SAIPE       0.0326   0.0015   1,423.75   3,134,906
GNK         0.0397   0.0025   1,709.58   5,229,869
SPM(0)      0.0337   0.0017   1,408.70   3,137,978
SPM(5)      0.0280   0.0012   1,173.71   2,334,379
SPRWM(5)    0.0295   0.0013   1,256.08   2,747,010

It is clear that, although the estimates from the GNK model perform slightly better than the CPS estimates, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state-specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area-specific random effect is much more critical than a time-specific random component in this situation.

Johnson (2004). This is essentially an extension of the classical chi-square goodness-of-fit test where the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained, which can be used for constructing global goodness-of-fit diagnostics.
To construct this statistic, we form 10 equally spaced bins $((k-1)/10,\ k/10)$, $k = 1, \dots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde{\theta})$ to be random, where $\tilde{\theta}$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e., $Y_{ij}$ would be allocated to the $k$th bin if $F(Y_{ij} \mid \tilde{\theta}) \in ((k-1)/10,\ k/10)$, $k = 1, \dots, 10$. The Bayesian chi-square statistic is then calculated as

$R^B(\tilde{\theta}) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde{\theta}) - n p_k}{\sqrt{n p_k}} \right]^2$

The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between those. Thus, instead of taking all the observations (i.e., the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5, while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Figure 3-5. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom. (Panel B: Semiparametric RW model.)
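The construction of the Bayesian chi-square statistic for a single posterior draw can be sketched as follows. The sketch assumes that, given the drawn parameters, each observation has a normal conditional distribution with its own mean and standard deviation; that choice, the function name and the argument names are assumptions of this illustration, and in practice the function would be called once per Gibbs iteration to build up the posterior distribution of the statistic.

```python
from statistics import NormalDist


def bayes_chi_square(y, means, sds, num_bins=10):
    """Bayesian chi-square statistic for one posterior draw.

    y      : observed values
    means  : conditional means of the observations under the drawn parameters
    sds    : conditional standard deviations under the drawn parameters
    Each observation is assigned to an equal-probability bin according to its
    conditional CDF value, and the usual Pearson form is applied to the counts.
    """
    n = len(y)
    counts = [0] * num_bins
    for obs, mu, sd in zip(y, means, sds):
        u = NormalDist(mu, sd).cdf(obs)
        # A CDF value of exactly 1.0 is placed in the last bin.
        k = min(int(u * num_bins), num_bins - 1)
        counts[k] += 1
    expected = n / num_bins
    return sum((m - expected) ** 2 / expected for m in counts)
```

With 10 bins the reference distribution has 9 degrees of freedom, so draws of the statistic can be compared against the corresponding chi-square quantiles, as in the quantile-quantile plots described above.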
Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values,

Fay and Herriot 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an income trajectory using penalized splines (Eilers and Marx 1996). We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. We have used as our covariate the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. The analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures.
Our analysis has shown that information on past median income levels of different states does provide strength towards the estimation of state-specific median incomes for the current period. In fact, if there is an underlying nonlinear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use it in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting.

The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the data set. Some obvious extensions are given as follows: (1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible

Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2} x$, and a nonlinear one, $\sum_{k=1}^{K} w_{ik} (x - \tau_k)_+$, thus allowing for more flexibility. Both these components are random, with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly relevant in situations where the state-specific income trajectories are quite distinct from the population mean curve and thus need to be modeled explicitly. We plan to pursue this extension if we can procure a richer data set with longer state-specific income trajectories. (2) Sometimes the function to be estimated (here the median income pattern) may have varying degrees of smoothness in different regions. In that case, a single smoothing parameter may not be proper and a spatially adaptive smoothing procedure can be used (Ruppert and Carroll 2000
). (3) We used the truncated polynomial basis functions to model the income trajectory, but other types of bases like B-splines, radial basis functions, etc. can also be used. (4) Although we used a parametric normal distributional assumption for the random state- and time-specific effects, a broader class of distributions like the mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) may be tested.

Last but not least, we think that the semiparametric modeling approach holds a lot of promise for small domain problems, especially when observations for each domain are collected over time. The associated class of semiparametric models can well be an attractive alternative to the models generally employed by the U.S. Census Bureau.

The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low income families. Since income characteristics for small areas are generally collected over time, there may well be a time-varying pattern in those observations. Neglecting those patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median income of four-person families for the different states of the U.S. while explicitly accommodating the time-varying pattern in the observations.
In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state-specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e. 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce.

Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993).
The Census Bureau based their estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, they used median income observations for three- and five-person families in addition to those of four-person families. The basic dataset for each state was a bivariate random vector with one component being the CPS median income estimate of four-person families and the other component being the weighted average of CPS median incomes of three- and five-person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year

$$\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)$$

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay (1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure. They also performed a univariate and multivariate hierarchical Bayesian (HB) analysis of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard error and coefficient of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al.
(1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median income

Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008), in which they combined small area random effects with a smooth, nonparametrically specified trend using penalized splines (Eilers and Marx 1996). In doing so, they expressed the nonparametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple nonparametric bootstrap approach. They applied their model to a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have treated the state-specific median income observations as longitudinal profiles or income trajectories. As with any longitudinally varying observations, the income profiles (both state-specific and overall) may have a nonlinear pattern over time. Moreover, the successive income observations may be unbalanced in nature. These features motivated us to use a semiparametric regression approach in our modeling framework. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline), which is a commonly used but powerful function estimation tool in nonparametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and number of knots, although other types of basis functions like B-splines or thin plate splines can also be used. As covariates, we have used the adjusted census median incomes since this was found to be the most effective covariate by Ghosh et al.
(1996). We tested four different regression models, viz. (1) a univariate model with only the CPS median income of four-person families as the response variable; (2) a bivariate model with the CPS median incomes of three- and four-person families as the response variables; (3) a bivariate model with the CPS median incomes of four- and five-person families as the response variables; and lastly (4) a bivariate model with the CPS median income of four-person families and the weighted average of the CPS median incomes of three- and five-person families (with weights 0.75 and 0.25) as the response variables. In all the cases, our primary objective has been the estimation of median incomes of four-person families of all the 50 U.S. states and the District of Columbia for 1989. For each of these models, analysis has been carried out using a hierarchical Bayesian approach. Since we chose noninformative improper priors for the regression parameters, propriety of the posterior has been rigorously proved before proceeding with the computations (see Theorem 3 in

Gelfand and Smith 1990) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states. Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time.
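The covariate and the composite response that enter model (4) are simple arithmetic on quantities defined earlier in this chapter. The following is only a minimal sketch: the function names and the dollar figures are hypothetical and are not taken from the CPS, census or BEA data.

```python
def adjusted_census_median(census_median_base, pci_current, pci_base):
    """Scale the base-year census median by the growth in per capita income.

    Implements: adjusted census median(c) = PCI(c) / PCI(b) * census median(b).
    """
    if pci_base <= 0:
        raise ValueError("base-year PCI must be positive")
    return (pci_current / pci_base) * census_median_base


def weighted_three_five(median_three, median_five, w_three=0.75, w_five=0.25):
    """Weighted average of three- and five-person median incomes (weights 0.75, 0.25)."""
    return w_three * median_three + w_five * median_five


# Hypothetical illustration for a single state (made-up numbers).
covariate = adjusted_census_median(census_median_base=20000.0,
                                   pci_current=18000.0,
                                   pci_base=12000.0)
second_response = weighted_three_five(median_three=30000.0, median_five=34000.0)
print(covariate, second_response)
```

With these made-up inputs the PCI grew by a factor of 1.5, so the base-year census median is scaled up by the same factor.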
The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some references towards future work in Section 4.5. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions for our models.

4.2.1 Notation

3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2.

This is the most general structure since the degrees of the spline as well as the number and position of the knots are different for the two models. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\kappa_{k1} = \kappa_{k2}$, $k = 1, 2, \ldots, K_1 (= K_2)$.

Equation (4) can be rewritten as

4) as follows

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{i.i.d.}{\sim} N(0, \Sigma_v)$.

3 Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance-covariance matrix $\Sigma$.

For the bivariate random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$ where $v = (v_1', \ldots, v_t')'$ is the vector of time-specific random effects. The hierarchical Bayesian framework is given by

1.

4) will have an extra component corresponding to $v$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which has a normal distribution with mean $v_{j-1}$ and covariance matrix $\Sigma_v$.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$, $\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$ denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$ having the expression $f(X) \propto |X|^{-(b+p+1)/2} \exp(-\mathrm{tr}(AX^{-1})/2)$, $p$ being the order of $A$.
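To make the inverse Wishart expression concrete, the kernel $|X|^{-(b+p+1)/2}\exp(-\mathrm{tr}(AX^{-1})/2)$ can be evaluated on the log scale. The sketch below is restricted to $2 \times 2$ matrices (the bivariate case) so that it needs no external libraries; the function name and the example matrices are hypothetical, and the normalizing constant is deliberately omitted.

```python
import math


def inv_wishart_log_kernel(X, A, b):
    """Log of |X|^{-(b+p+1)/2} * exp(-tr(A X^{-1}) / 2) for 2x2 matrices (p = 2).

    X is the symmetric positive definite argument, A the scale matrix and
    b the degrees of freedom. The normalizing constant is not included.
    """
    p = 2
    (x11, x12), (x21, x22) = X
    det = x11 * x22 - x12 * x21
    if x11 <= 0 or det <= 0:
        raise ValueError("X must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[x22 / det, -x12 / det],
           [-x21 / det, x11 / det]]
    # tr(A X^{-1}) = sum over i, k of A[i][k] * inv[k][i].
    trace = sum(A[i][k] * inv[k][i] for i in range(p) for k in range(p))
    return -0.5 * (b + p + 1) * math.log(det) - 0.5 * trace


identity = [[1.0, 0.0], [0.0, 1.0]]
print(inv_wishart_log_kernel(identity, identity, b=5))
```

Working on the log scale avoids underflow when the kernel is used inside a sampler, which is the usual reason for writing it this way.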
For the random walk model there will be an additional term $\pi(\Sigma_v)$. By conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, X, Z]\,[b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_\gamma]\,[\Sigma_0] \prod_{j=1}^{t} [\Psi_j]$$

Gelman and Rubin (1992) and run $n$ ($\geq 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

Once posterior samples are generated from the full conditionals of the parameters, Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$

and

4.2.2. to analyze the median income dataset referred to in Section 4.1.3. The basic dataset for our problem is the triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance-covariance matrix $\Sigma_{ij}$ $(i = 1, \ldots, 51;\ j = 1, \ldots, 11)$. Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of

For the univariate setup, the response and covariate are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a duplet whose first component is $Y_{ij1}$ and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75Y_{ij2} + 0.25Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our target of inference is the state-specific median incomes of four-person families for 1989.
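The convergence monitoring scheme described in this section (several parallel chains of length $2d$, the first $d$ iterates discarded) can be sketched as follows. This is a simplified version of the potential scale reduction factor, without the degrees-of-freedom correction of the original Gelman and Rubin (1992) proposal; the function names are hypothetical and the chains are assumed to be scalar and of equal length.

```python
from statistics import mean, variance


def discard_burn_in(chain):
    """Drop the first half of a chain of length 2d, keeping the last d iterates."""
    d = len(chain) // 2
    return chain[d:]


def potential_scale_reduction(chains):
    """Simplified Gelman-Rubin statistic for a scalar quantity.

    `chains` is a list of equal-length sequences, each of length 2d.
    Values close to 1 suggest that the chains have mixed.
    """
    kept = [discard_burn_in(c) for c in chains]
    m, n = len(kept), len(kept[0])
    if m < 2 or n < 2 or any(len(c) != n for c in kept):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(c) for c in kept]
    within = mean(variance(c) for c in kept)   # W: average within-chain variance
    between = n * variance(chain_means)        # B: between-chain variance
    if within == 0:
        raise ValueError("chains have zero within-chain variance")
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5
```

In practice the statistic would be computed for every monitored scalar (for instance each $\theta_{ij}$), and sampling continued until all values are near 1.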
In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 4.2.2. We have used linear truncated polynomial basis functions for the P-spline component in our models since the median income profiles did not exhibit a high degree of nonlinearity. For highly nonlinear profiles a quadratic or cubic polynomial basis function representation can be used. In nonparametric regression problems, the proper selection of knots plays a critical role. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable so that the underlying observational pattern is properly captured. Too few or too many knots generally degrade the quality of the fit. This is because, if too few knots are used, the complete underlying pattern may not be captured properly, thus resulting in a biased fit. On the other hand, once there are enough knots to fit important features of the data, a further increase in the number of knots has little effect on the fit and may lead to overparametrization (Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (adjusted census median income).

Gelman and Rubin (1992). We ran three parallel chains, with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate t distributions with 2 df having the same location and scale matrices as the corresponding multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chain at overdispersed distributions. However, once initialized, the

We fitted both the univariate and bivariate models to the median income dataset. In doing so, we worked with all possible knot choices from 0 to 40. Here, we show only the results corresponding to the best performing model, i.e. the model with the lowest values of the comparison measures.
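The knot placement rule and the linear truncated polynomial basis described above can be sketched in a few lines. The function names and the toy covariate values are hypothetical; the sketch assumes that "equally spaced sample quantiles" means the $k/(K+1)$ quantiles for $k = 1, \ldots, K$, which is what `statistics.quantiles` returns when called with `n = K + 1`.

```python
from statistics import quantiles


def quantile_knots(x, num_knots):
    """Place `num_knots` knots at equally spaced sample quantiles of x."""
    if num_knots == 0:
        return []
    # quantiles(x, n=K+1) returns the K interior cut points.
    return quantiles(x, n=num_knots + 1)


def truncated_linear_row(x_value, knots):
    """One design-matrix row: [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    return [1.0, x_value] + [max(0.0, x_value - k) for k in knots]


# Toy covariate values (not the adjusted census medians).
x = [float(v) for v in range(1, 12)]
knots = quantile_knots(x, num_knots=3)
design = [truncated_linear_row(v, knots) for v in x]
print(knots)
print(design[6])
```

The first two columns carry the linear part of the trajectory and the remaining columns carry the spline coefficients that the penalty (here, the random effects distribution) shrinks towards zero.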
In the univariate framework, the model with 3 knots in the income trajectory performed the best. Table 4-1 reports the comparison measures for this model (denoted as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates (Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS) estimates. Table 4-2 reports the percentage improvement of the time series, non-time series and semiparametric estimates over the Census Bureau estimates.

From Table 4-1, it is clear that the semiparametric estimates significantly improve upon the CPS, time series and non-time series estimates with respect to all the comparison measures. In fact, the semiparametric estimates perform slightly better than the bivariate Census Bureau estimates too with respect to ARB and AAB. This is also reflected in Table 4-2, where the semiparametric estimates marginally improve upon the Bureau estimates for the above two comparison measures. Overall, the degree of dominance of the Bureau estimates over the time series and non-time series estimates is much larger compared to that over the semiparametric estimates. These results indicate that, in the univariate framework, the semiparametric model with 3 knots performs significantly better than the time series and non-time series models of Ghosh et al. (1996).

Table 4-1. Comparison measures for univariate estimates

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0735   0.0084   2,928.82   13,811,122.39
Bureau       0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS       0.0338   0.0018   1,351.67    3,095,736.14
GNK.NTS      0.0363   0.0021   1,457.47    3,468,496.61
USPM(3)      0.0289   0.0014   1,169.74    2,549,698.26

Table 4-2. Percentage improvements of univariate estimates over Census Bureau estimates

Estimate     ARB      ASRB     AAB      ASD
GNK.TS       14.19%   38.46%   14.17%   43.90%
GNK.NTS      22.64%   61.54%   23.11%   61.22%
USPM(3)       2.37%    7.69%    1.2%    18.52%
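The relationship between the two tables can be checked directly. The sketch below assumes, as a working definition not stated explicitly in the text, that the percentage improvement on a measure is $100 \times (\text{Bureau} - \text{estimate}) / \text{Bureau}$, so that a positive value means the estimate beats the Bureau estimate; the function name is hypothetical, while the numbers are the Bureau and USPM(3) rows of Table 4-1.

```python
def percent_improvement(bureau_value, estimate_value):
    """Relative reduction (in percent) of a comparison measure versus the Bureau."""
    return 100.0 * (bureau_value - estimate_value) / bureau_value


bureau = {"ARB": 0.0296, "ASRB": 0.0013, "AAB": 1183.90, "ASD": 2151350.18}
uspm3 = {"ARB": 0.0289, "ASRB": 0.0014, "AAB": 1169.74, "ASD": 2549698.26}

improvement = {name: percent_improvement(bureau[name], uspm3[name])
               for name in bureau}
for name, value in improvement.items():
    print(f"{name}: {value:+.2f}%")
```

Under this assumed definition the USPM(3) values are positive for ARB and AAB and negative for ASRB and ASD, which agrees with the discussion of Table 4-1; the magnitudes agree with the USPM(3) row of Table 4-2 up to rounding of the tabulated inputs.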
Now, we move on to the bivariate non-random walk setup. First, we consider the model whose response vector consists of the CPS median incomes of 4- and 3-person families, i.e. ($Y_{ij1}$ and $Y_{ij2}$). The covariates are the corresponding adjusted census medians. Since we assumed inverse Wishart priors for the variance-covariance matrices, the values of the comparison measures were dependent on the degrees of freedom of the inverse Wishart distribution and the number of knots in the income trajectory. We worked with different combinations of the two in fitting these models. The best results (lowest comparison measures) were obtained for two models, both with 6 knots but with degrees of freedom 7 and 9 respectively. These models are denoted by BSPM(1)(4,3) and BSPM(2)(4,3) respectively. When we consider the median incomes of 4- and 5-person families, the lowest comparison measures were obtained for the model with 4 knots in the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5).

Table 4-3. Comparison measures for bivariate non-random walk estimates

Estimate          ARB      ASRB     AAB        ASD
CPS               0.0735   0.0084   2,928.82   13,811,122.39
Bureau            0.0296   0.0013   1,183.90    2,151,350.18
GNK.TS(4,3)       0.0295   0.0013   1,171.71    2,194,553.67
GNK.NTS(4,3)      0.0323   0.0016   1,287.78    2,610,249.94
BSPM(1)(4,3)      0.0274   0.0013   1,079.63    2,182,669.56
BSPM(2)(4,3)      0.0286   0.0011   1,131.61    1,880,089.29
GNK.TS(4,5)       0.0230   0.0009     932.51    1,618,025.33
GNK.NTS(4,5)      0.0295   0.0013   1,179.94    2,216,738.06
BSPM(4,5)         0.0255   0.0010   1,033.12    1,859,373.98
GNK.TS(4,3+5)     0.0287   0.0013   1,150.24    2,116,692.71
GNK.NTS(4,3+5)    0.0324   0.0015   1,297.12    2,530,938.06
BSPM(1)(4,3+5)    0.0271   0.0012   1,078.5     2,128,679.65
BSPM(2)(4,3+5)    0.0289   0.0012   1,132.10    1,838,598.30
Lastly, for the model with median incomes of 4-person families and the weighted average incomes of 3- and 5-person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for two models, both with 6 knots and with degrees of freedom 7 and 9 respectively. We denote these models as BSPM(1)(4,3+5) and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these models along with those of CPS, Bureau, and the corresponding bivariate GNK time series and non-time series estimates. Table 4-4 reports the percentage improvement of the above estimates over the Census Bureau estimates.

From Table 4-3 and Table 4-4, it is clear that both the BSPM(4,3) and BSPM(4,3+5) estimates improve upon the bivariate time series and non-time series estimates with respect to nearly all four comparison measures. The semiparametric estimates also improve upon the Census Bureau estimates and the raw CPS estimates. For the model with median incomes of four- and five-person families as response, the semiparametric estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but significantly improve upon the CPS and Census Bureau estimates.

Table 4-4. Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

Estimate          ARB      ASRB     AAB      ASD
GNK.TS(4,3)        0.48%    2.52%    1.03%    2.01%
GNK.NTS(4,3)       8.99%   22.45%    8.77%   21.33%
BSPM(1)(4,3)       7.43%    0.00%    8.81%    1.46%
BSPM(2)(4,3)       3.38%   15.38%    4.42%   12.61%
GNK.TS(4,5)       22.19%   30.52%   21.23%   24.79%
GNK.NTS(4,5)       0.31%    0.18%    0.33%    3.04%
BSPM(4,5)         13.85%   23.08%   12.74%   13.57%
GNK.TS(4,3+5)      2.94%    3.56%    2.84%    1.61%
GNK.NTS(4,3+5)     9.36%   17.18%    9.56%   17.64%
BSPM(1)(4,3+5)     8.45%    7.69%    8.90%    1.05%
BSPM(2)(4,3+5)     2.37%    7.69%    4.37%   14.54%

Now let us consider the bivariate random walk model. For the case with 4- and 3-person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3,6), (5,6) and (9,1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al.
(1996) (GNK). The random walk estimates also seem to improve marginally over those corresponding to the non-random walk semiparametric model. When we consider the median income estimates of 4- and 5-person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. The comparison measures are significantly better than those of the CPS, Bureau and the non-time series model of GNK. However, they fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the non-random walk model (BSPM(4,5)). We denote this model as BRWM(4,5). Lastly, for the model with median incomes of 4-person families and the weighted average incomes of 3- and 5-person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory. The comparison measures were significantly better than those of the CPS, Bureau and GNK (both time series and non-time series), while the model also improved upon the non-random walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Table 4-5. Comparison measures for bivariate random walk model

Estimate         ARB      ASRB     AAB        ASD
BRWM(1)(4,3)     0.0261   0.0011   1,043.33   1,902,416.1
BRWM(2)(4,3)     0.0274   0.0010   1,094.25   1,804,969.06
BRWM(3)(4,3)     0.0258   0.0012   1,037.03   2,114,599.65
BRWM(4,5)        0.0245   0.0010     978.12   1,672,183.6
BRWM(4,3+5)      0.0244   0.0011     990.50   1,941,833.29

Estimation of median incomes of four-person families for different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3-, 4- and 5-person families for all the states and the District of Columbia for every year. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.

Ghosh et al.
(1996). We also extended the basic semiparametric framework by incorporating a time series (random walk) component to account for the within-state dependence in the successive income observations. The class of random walk models seemed to improve upon their non-random walk counterparts, but more studies are required before reaching a definite conclusion about their relative performance. Overall, we strongly think that semiparametric procedures hold a lot of promise for small area estimation problems, specifically in situations where multiple time-varying observations of some characteristic are available for the small areas.

In my dissertation, I have concentrated on the application of semiparametric methodologies in analyzing unorthodox data scenarios originating in diverse fields like case-control studies and small area estimation. In the former scenario, I have used penalized splines to model longitudinal exposure profiles and their influence pattern on the current disease status for a group of cases and controls. In doing so, I have come to the conclusion that past exposure observations may have a significant effect on the present disease status. Our modeling framework is quite general and flexible in the sense that it can be used to model any possible pattern of exposure profiles and can also capture complex time-varying patterns of influence of the exposure history on the current disease status. We applied our modeling framework to a nested case-control study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA). In the second scenario, we have used semiparametric procedures to model the income trajectories of different small areas and have used that information to estimate the median incomes of those small areas at a given time point in the future. Our model-based estimates seemed to perform better than the usual Bureau of the Census estimates, which are based on the income observations from a particular time point and hence are non-longitudinal in nature. We have also extended the semiparametric modeling framework to the bivariate scenario in estimating the median income of varying family sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above.

The flexibility and power of the nonparametric and semiparametric procedures immediately imply that a multitude of interesting and useful extensions can be carried

1.4, selection and proper positioning of knots is a vital aspect in any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variables and that is what we have done in both the case-control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low-density region of the data pattern where the nonlinearity was mostly concentrated. This was probably because of the quantile-dependent placement procedure of the knots.

Recently, there has been some research on data-driven or adaptive knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being prespecified. The advantage of this procedure is that fewer knots would be required, and they would be placed in optimal locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the best model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatment of this problem revolves around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include Denison et al. (1998), who placed priors on the number and locations of the knots. Then they sampled from the full posteriors of the parameters (including knot locations and numbers) using reversible jump MCMC methods (Green 1995). However, they restricted the knots to be located only at the design points of the independent variable.
DiMatteo et al. (2001) followed the same basic procedure as Denison et al. (1998) but they did not restrict the knots to be located only at the design points of the experiment. They also penalized models with an unnecessarily large number of knots. Botts and Daniels (2008) proposed a flexible approach for fitting multiple curves to sparse functional data. In doing so, they treated the numbers and locations of knots of the population-averaged and subject-specific curves as distinct random variables and sampled from their posterior distributions using reversible jump MCMC methods. They used free-knot B-splines to model the population-averaged and subject-specific curves. In all the above contributions, Poisson priors are placed on the knot numbers while flat priors are placed on the knot positions. The usefulness and flexibility of the Bayesian approach lies in the fact that the number and locations of knots are automatically determined from the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive Regression Splines. However, the sampling procedure is quite intensive since the parameter dimension varies at every iteration. Botts and Daniels substantially reduced the computational burden by dealing with the approximate posterior distribution of only the number and positions of the knots, integrating out the other parameters by using Laplace approximations.

An immediate but worthwhile extension to what we have already done would be to incorporate an adaptive knot selection scheme into both the case-control and small area modeling frameworks. For the former setup, this would correspond to deciphering the optimal number of knots for the population mean PSA trajectory and the influence function. So, depending on the particular study or the dataset at hand, any underlying pattern in the influence profile (of the exposure trajectory on the disease state) can be

Some other interesting extensions to our work can be

1. Incorporating informative (nonignorable) missingness (Little and Rubin 1987) in the longitudinal exposure (case-control) or income (small area) profiles.

2. Incorporating nonparametric distributional structures like mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) on the subject (or area) specific random effects.

3.
Extending the semiparametric case-control modeling framework to situations involving multiple (>2) or even categorical disease states.

Now, I briefly explain some work that we are currently engaged in.

5.2.1 Introduction and Brief Literature Review

Little and Rubin 1987). Broadly these are of three types, viz.:

1.

2.

3.

Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response. In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns. On the other hand, the selection modelling approach first models the hypothetical complete data and then a model for the missing data process (conditional on the hypothetical complete data) is appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach.

Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the $i$th subject. $D_i$ is such that

$$D_i = \begin{cases} t & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ T + 1 & \text{if the } i\text{th subject is a completer.} \end{cases}$$

So, for the $i$th subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on $f(y) = \sum_D f(y \mid D) P(D)$.

An important but realistic situation that may arise in longitudinal studies is that the number of unique dropout times $T$ (vis-a-vis the number of times a subject is measured) may be large. As a result, the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse

Hogan and Laird (1998) suggested parameters to be shared across patterns. Hogan et al. (2004) suggested ways to group the $T$ dropout times into $m$

Diggle et al. 2002) used to capture the serial dependence in the response process.

There exists another class of models known as marginalized latent variable models, which take care of the exchangeable or non-diminishing dependence pattern among the repeated response observations using random intercepts.
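The pattern mixture identity $f(y) = \sum_D f(y \mid D)P(D)$ stated earlier in this section implies that any marginal summary is a probability-weighted average of its pattern-specific counterparts. The sketch below illustrates this for a marginal mean; the function name, the dropout patterns and all numbers are hypothetical.

```python
def marginal_mean(pattern_means, pattern_probs):
    """Average pattern-specific means E[Y | D = d] over the dropout distribution P(D = d)."""
    if set(pattern_means) != set(pattern_probs):
        raise ValueError("means and probabilities must cover the same patterns")
    if abs(sum(pattern_probs.values()) - 1.0) > 1e-9:
        raise ValueError("dropout probabilities must sum to 1")
    return sum(pattern_means[d] * pattern_probs[d] for d in pattern_means)


# Hypothetical study with T = 3 visits: D = 2 or 3 are dropouts, D = 4 is a completer.
means = {2: 0.2, 3: 0.3, 4: 0.5}
probs = {2: 0.25, 3: 0.25, 4: 0.5}
print(marginal_mean(means, probs))
```

When the number of distinct dropout times is large, many of the pattern-specific means are estimated from very few subjects, which is the sparsity problem that motivates grouping the dropout times into classes.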
Schildcrout and Heagerty (2007) combined the marginalized transition and latent variable models by proposing a unifying model that takes into account both serial and long-range dependence among the response observations. Their model can be used in situations with a moderate to large number of repeated measurements per subject where both serial (short-range) and exchangeable (long-range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002), Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new model which accounts for both serial (short-term) and long-range dependence among the response observations in situations where the number of unique dropout times is large. We group the dropout times using a latent variable approach, taking into account the uncertainty in the number of groups. We also model the marginal covariate effects of interest.

Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta)$ be the marginal mean of $Y_{it}$. It is specified as

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ the following conditional model is specified

where $b_i \sim N(0, \sigma^2)$. $\Delta_{it}$ can be computed by solving the following convolution equation

Thus $\Delta_{it}$ is a function of $\beta$ and $\sigma$. In this study we will be proposing a model which will marginalize over the random effects and the dropout distribution to directly model the marginal covariate effects of interest, taking into account both the serial and exchangeable dependence structure among the $Y_{it}$'s.

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \ldots, m$ ($m$

1. Dependence between response and dropout time modeled by the latent classes.

2. Short-range (serial) dependence between $Y_{it}$ and $(Y_{it-1}, \ldots, Y_{it-p})$ modelled by an MTM($p$).

3.
Long-range or non-diminishing dependence among the $Y_{it}$'s modelled by the subject-specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

The above model marginalizes over the subject-specific random effects and over the latent class distribution (implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S$. This is because the specific latent class a subject belongs to would depend solely on his/her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D$ itself.

Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e. effects conditional on the random effects) or class-specific covariate effects (i.e. effects of covariates on $Y$ given a particular dropout class). Moreover, the conditional model should be specified so that it is compatible with the marginal model (5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model

We assume that $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, are from an exponential family with distribution

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\alpha_{jk}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, p$, where $V_{it}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $(\alpha_{1k}, \alpha_{2k}, \ldots, \alpha_{mk})$ determines how the dependence between $Y_{it}$ and $Y_{it-k}$ varies as a function of the covariates $V_{it,k}$ conditional on the latent classes. We also impose a sum-to-zero constraint across the $m$ latent classes for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\alpha^{(j)}$.
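The probabilities of the latent classes given the dropout times are specified as a proportional odds model (Agresti 2002) given by

where $\lambda_{0,1} \leq \lambda_{0,2} \leq \ldots \leq \lambda_{0,M-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).

Lastly, the dropout times $D_i$ are assumed to follow a multinomial distribution with mass at each possible dropout time, parameterized by $\varphi$. Here we make the important assumption that $Y_{it}$ is independent of $D_i$ given $S_i$. Our main target of inference is the covariate effects averaged over the classes. The intercept $\Delta_{it}$ in (5) is determined by the following relationship between the marginal and conditional models

$$E(Y_{it} \mid \beta) = \sum_{D} \sum_{S} p(S_i \mid D_i) P(D_i) \int \sum_{A} \left\{ E(Y_{it} \mid y_{it-1}, \ldots, y_{it-p}, b_i, S_i)\, p(y_{it-1}, \ldots, y_{it-p} \mid b_i, S_i) \right\} p(b_i \mid S_i)\, db_i$$

Proportionality in (5) holds because we assume that the missing and observed responses from subject $i$ are independent, given $S_i$ and $b_i$ (i.e. $[Y_i^m \mid Y_i, b_i, S_i] = [Y_i^m \mid b_i, S_i]$). Following the OPEF formulation, we have

$$L_i(Y_i \mid Y_{\{i\}}, S_{ij} = 1, b_i, \alpha^{(j)}, \phi) = \exp\left\{ \left[ \sum_{t=1}^{T} y_{it}\theta_{it} - \sum_{t=1}^{T} \psi(\theta_{it}) \right] \Big/ (m_i \phi) + \sum_{t=1}^{T} h(Y_{it}, \phi) \right\}$$

We can avoid the integral (w.r.t. $b_i$) in (5) if we also sample the $b_i$'s along with the other parameters from the full posterior (5). In that case, the full posterior may be rewritten as

where

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution, i.e.

where $\mu^c_{it} = E(Y_{it} \mid y_{it-1}, y_{it-2}, \ldots, y_{it-p}, b_i, S_{ij} = 1) = g^{-1}\left\{ \Delta_{it} + b_i + \sum_{j=1}^{M} S_{ij} Z_{it}'\alpha^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{it-k} \right\}$.

$1 + e^{\lambda_{0j} + \lambda_1 D_i}$ $1 + e^{\lambda_{0,j-1} + \lambda_1 D_i}$

Now, as mentioned earlier, $D_i$ is the dropout time for the $i$th subject. Also, there are $T$ unique dropout times. Let, for $t = 1, 2, \ldots, T$,

$$\delta_{it} = \begin{cases} 1 & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ 0 & \text{otherwise.} \end{cases}$$

1. Let $\beta \sim N_q(\beta_0, \Sigma_0)$, assuming that for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$, $X_{it}$ is $q$-dimensional.

2. Let $\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(m)} \overset{iid}{\sim} N_r(\alpha_0, \Omega_0)$, where $r \leq q$ since $Z_{it} \subseteq X_{it}$ for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$.

3. Let $\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2 \overset{iid}{\sim} U(a, b)$ where $0$

7. For the time being we keep the prior of the remaining parameter unspecified.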
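The proportional odds specification for the latent class probabilities described in this section can be sketched as follows. The display equation is not reproduced above, so the code assumes the standard cumulative logit form $\text{logit}\, P(S_i \leq j \mid D_i) = \lambda_{0j} + \lambda_1 D_i$ for $j = 1, \ldots, M-1$, with class probabilities obtained by differencing; the function names and parameter values are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def latent_class_probs(cutpoints, slope, dropout_time):
    """Class probabilities P(S = j | D) under an assumed cumulative logit model.

    `cutpoints` are the M - 1 ordered intercepts and `slope` is the common
    coefficient of the dropout time (the proportional odds assumption).
    """
    if any(a > b for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be non-decreasing")
    cumulative = [expit(c + slope * dropout_time) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)  # P(S = j) = P(S <= j) - P(S <= j - 1)
        previous = value
    return probs


# Hypothetical example with M = 3 classes.
early = latent_class_probs(cutpoints=[-1.0, 1.0], slope=0.5, dropout_time=1)
late = latent_class_probs(cutpoints=[-1.0, 1.0], slope=0.5, dropout_time=6)
print(early, late)
```

Because a single slope is shared by all cut points, the cumulative probabilities move monotonically with the dropout time, which is the property noted in the text.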
Now, combining ( 5 5 ) and the priors specified above, we can write down the full posterior distribution of $\beta_m$ and $w$, $\pi(w, \beta_m \mid Y, X, D)$, up to a constant. Thus, we can obtain the full conditional distributions of all the relevant parameters and proceed with sample generation using MCMC.

The assumption of conditional independence between $Y_i$ and $D_i$ given $S_i$ and the covariates can be verified by performing a likelihood ratio test (frequentist) or by using Bayes factors (Bayesian). The null model is given by ( 5 ), and the alternative model adds a term $f(D_i)$, where $f(D_i)$ may be a smooth but unspecified function of $D_i$. Thus, the null hypothesis of conditional independence (between $Y_i$ and $D_i$ given $S_i$ and $X_i$) is simply $f(D_i) = 0$. The test can be carried out by first fitting the null model (??). Then, the posterior probability of class membership for each subject can be estimated by

$\hat{P}(S_{ij} = 1 \mid D_i, Y_i, X_i, \hat{w}) \propto \int L_i(Y_i \mid Y_{\{i\}}, S_{ij} = 1, b_i, \hat{\beta}_j, \hat{\alpha})\; p(S_{ij} = 1 \mid D_i; \hat{\lambda})\; p(D_i \mid \hat{\varphi})\; dF(b_i \mid S_{ij}, \hat{\sigma}_j^2),$

normalized over the classes $j$. The model ( 5 ) is then fitted using a weighted likelihood (the weights being the above posterior probabilities of class membership). An alternative way of carrying out the above conditional independence tests would be to use score tests based on smoothing splines, as used in proportional hazards models by Lin et al. (2006).

The alternative model ( 5 ) has the most general form. We can simplify it by assuming a linear effect of dropout time, in which case the alternative (simpler) model replaces $f(D_i)$ by $\sum_{j=1}^{J} \psi_j h_j(D_i)$, where each $h_j(\cdot)$ is a known function and the $\psi_j$'s are parameters. The null hypothesis would be $H_0: \psi_1 = \ldots = \psi_J = 0$. The linear dropout effect would imply $J = 1$ and $h(D_i) = D_i$. The LRT can then be performed as before by fitting models ( 5 ) and ( 5 ) using the same weights given above. We can also use Bayes factors for carrying out these analyses.
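The weighting step in the test described above can be sketched as follows. This is a minimal illustration, assuming the per-class numerators of the posterior class membership probability have already been computed on the log scale; the function names and the numbers are hypothetical, and the integration over $b_i$ is not shown.

```python
import math


def class_membership_weights(log_numerators):
    """Normalize per-class log numerators into probabilities summing to one.

    Each entry is assumed to be the log of
    (integrated likelihood given class j) * P(class j | D_i) * P(D_i),
    already evaluated at the fitted parameters.
    """
    if not log_numerators:
        raise ValueError("need at least one latent class")
    peak = max(log_numerators)
    # Subtract the maximum before exponentiating to avoid underflow.
    scaled = [math.exp(v - peak) for v in log_numerators]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_loglik(weights, class_logliks):
    """One subject's contribution to the weighted log-likelihood."""
    return sum(w * ll for w, ll in zip(weights, class_logliks))


if __name__ == "__main__":
    w = class_membership_weights([-1000.0, -1001.0])  # illustrative values
    print([round(x, 4) for x in w])
    print(weighted_loglik(w, [-4.0, -8.0]))
```

Working on the log scale matters here because the likelihood of a long response sequence can be far smaller than the smallest positive floating-point number.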
Heagerty (1999) proposed marginally specified logistic-normal models for longitudinal binary data. He proposed two models. The first is a marginal logistic regression model which links the average response to the covariates by the following equation:

$\mathrm{logit}\, E(Y_{ij} \mid X_{ij}) = X_{ij}'\beta$

Here $Y_{ij}$ and $X_{ij}$ respectively denote the binary response and the exogenous covariate vector recorded at time $j$ for the $i$th subject, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, n_i$. The second model is a conditional model which explains the within-subject dependence among the responses:

$\mathrm{logit}\, E(Y_{ij} \mid b_i, X_i) = \Delta_{ij} + b_{ij}$

An important assumption is that, conditional on $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, the components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$, where $\Sigma_i$ models the dependence among the $b_{ij}$'s (and thus, indirectly, among the $Y_{ij}$'s) and can be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \ldots, t_{in_i})$ and a parameter vector $\alpha$. Heagerty (1999) referred to the models given in ( 5 ) and ( 5 ) as the marginally specified logistic-normal models.

Under the above modelling framework, the parameter $\Delta_{ij}$ can be expressed as a function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the following convolution equation:

$h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz$

where $h(\cdot)$ is the inverse of the logit link and $\phi(\cdot)$ is the standard normal density function. Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and Newton-Raphson iteration.

The $\Delta_{ij}$ in ( 5 ) will be a function of the marginal mean parameters and the random effects covariance parameters, and should be computed for both the maximum likelihood and the estimating equation methodology (Heagerty 1999). For maximum likelihood estimation, the contribution of the $i$th subject to the observed data likelihood is ascertained by first assuming a linear transformation of the form $b_i = C_i z_i$, where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. This transformation effectively links $b_i$ to a lower-dimensional random effect $z_i$. The contribution of the $i$th subject (to the observed data likelihood) can now be expressed as a mixture over the random effects, where $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since $L_i(\beta, \alpha)$ cannot be evaluated analytically, numerical procedures are required to find its value.
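The convolution equation linking the conditional intercept to the marginal linear predictor can be solved numerically along the lines described above. The sketch below is an illustration, not Heagerty's implementation. It assumes a scalar random effect, uses a composite Simpson rule on $[-8, 8]$ in place of Gauss-Hermite quadrature, and starts Newton-Raphson at `delta = eta`. All function names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit h(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def simpson(f, a, b, n=400):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def marginal_mean(delta, sigma):
    """Integral of h(delta + sigma * z) * phi(z) over z."""
    return simpson(lambda z: expit(delta + sigma * z) * normal_pdf(z), -8.0, 8.0)


def marginal_slope(delta, sigma):
    """Derivative of marginal_mean with respect to delta."""
    def g(z):
        p = expit(delta + sigma * z)
        return p * (1.0 - p) * normal_pdf(z)
    return simpson(g, -8.0, 8.0)


def solve_delta(eta, sigma, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of h(eta) = marginal_mean(delta, sigma)."""
    target = expit(eta)
    delta = eta  # the marginal predictor is a natural starting value
    for _ in range(max_iter):
        step = (marginal_mean(delta, sigma) - target) / marginal_slope(delta, sigma)
        delta -= step
        if abs(step) < tol:
            return delta
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    print(solve_delta(1.0, 2.0))  # conditional intercept is larger in magnitude than eta
```

The solution is larger in magnitude than `eta` whenever `sigma > 0`, reflecting the familiar attenuation of marginal effects relative to conditional ones in logistic-normal models.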
Heagerty (2002) used Gauss-Hermite quadrature to perform the calculation but assumed $q = 1$. As $q$ increases, the computational burden grows exponentially and quickly becomes infeasible. We are currently trying to develop alternative, less computationally intensive methodologies to accomplish the above objectives. We are working with multivariate logistic and multivariate $t$ distributions within a Bayesian framework, as in O'Brien and Dunson (2004). We hope that this methodology will provide a better alternative to the arduous numerical methods mentioned above.

$\log \lambda_{dj} = \log \gamma + d \log \vartheta + \log \delta_j + d\, \beta' \int_{-c}^{0} Z_j(t)\, \psi(t)\, dt$

Differentiating ( A ) w.r.t. $\gamma$ and $\vartheta$ and solving the resulting equations, substituting the solutions in ( A ) and then exponentiating, we obtain the expression of $L(\beta, \delta)$ in ( 2 ).

Again, we differentiate ( A ) w.r.t. $\delta_j$. It is easy to show that if we substitute ( A ) in ( A ) and then exponentiate, we get the expression for $L(\vartheta, \beta)$ in ( 2 ). Since the order of maximization is immaterial, it follows that $L(\beta, \delta)$ and $L(\vartheta, \beta)$, once maximized over the nuisance parameters, yield the same profile likelihood for $\beta$. Replacing the expression of $\lambda_{dj}$ from ( 2 ), we obtain ( 2 ).

(ii) First, we perform the transformation from $\delta$ to $(\theta, \tau)$, where $\tau = \sum_{j=1}^{J} \delta_j$. Thus, $\delta_j = \tau\theta_j$, $j = 1, \ldots, J$. The Jacobian of the transformation is $\tau^{J-1}$. We then apply this transformation in ( A ) and, after some manipulation, integrate ( A ) w.r.t. $\vartheta$. Integration of ( A ) w.r.t. $\tau$ yields ( 2 ) after some minor manipulation.

(iii) The order in which $p(\vartheta, \beta, \delta \mid y)$ is integrated w.r.t. the parameters does not make any difference to the marginal posterior density of $\beta$. Thus, integration of $p(w, \beta \mid y)$ w.r.t. $w$ or of $p(\beta, \delta \mid y)$ w.r.t. $\delta$ will yield the same marginal posterior density $p(\beta \mid y)$ of $\beta$.

1. As in Seaman and Richardson (2004), the assumption of existence and finiteness of $E\left(\beta' \int_{-c}^{0} Z_q(t)\, \psi(t)\, dt\right)$ and $E\left(\beta' \int_{-c}^{0} Z_r(t)\, \psi(t)\, dt\right)$ is automatically satisfied provided the prior density $p(\beta)$ ensures that $E(\beta)$ exists and is finite.
2. The posterior propriety of $p(\vartheta, \beta, \delta \mid y)$ in ( ) can be shown in a similar way to that in Seaman and Richardson (2001).
3. The prior distribution $p(\beta)$ of $\beta$ induces a prior distribution on the influence function $\{\gamma(t), -c \le t \le 0\}$ in the logistic case-control model in ( 2 3 ), since $\gamma(t) = \beta'\psi(t)$, $-c \le t \le 0$.
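Let $P(D = d \mid X(t) = Z_k(t), -c \le t \le 0) = p_{dk}$ $(d = 0, 1, \ldots, r;\ k = 1, \ldots, K)$ and $P(X(t) = Z_k(t), -c \le t \le 0 \mid D = 0) = \gamma_k / \sum_{l=1}^{K} \gamma_l$. Let $n_{dk}$ be the number of individuals with $D = d$ and $\{X(t) = Z_k(t), -c \le t \le 0\}$. It can be shown that

$P(X(t) = Z_k(t), -c \le t \le 0 \mid D = d) = \frac{\gamma_k\, p_{dk}/p_{0k}}{\sum_{l=1}^{K} \gamma_l\, p_{dl}/p_{0l}}$

so that the retrospective likelihood is

$L_R = \prod_{d=0}^{r} \prod_{k=1}^{K} \left( \frac{\gamma_k\, p_{dk}/p_{0k}}{\sum_{l=1}^{K} \gamma_l\, p_{dl}/p_{0l}} \right)^{n_{dk}} = \prod_{d=0}^{r} \prod_{k=1}^{K} \left( \frac{\gamma_k\, \rho_{dk}}{\sum_{l=1}^{K} \gamma_l\, \rho_{dl}} \right)^{n_{dk}}.$

The augmented model is given by $Z_{dk} \mid \lambda_{dk} \sim \mathrm{Poisson}(\lambda_{dk})$ where $\log(\lambda_{dk}) = \log(\vartheta_d) + \log(\rho_{dk}) + \log(\gamma_k)$, $\log(\lambda_{0k}) = \log(\gamma_k)$, $d = 1, \ldots, r$; $k = 1, \ldots, K$.

Noting that

$\int_0^\infty \exp\left\{ -\gamma_k \left( 1 + \sum_{d=1}^{r} \vartheta_d \rho_{dk} \right) \right\} \gamma_k^{\sum_{d=0}^{r} n_{dk} - 1}\, d\gamma_k \propto \left( 1 + \sum_{d=1}^{r} \vartheta_d \rho_{dk} \right)^{-\sum_{d=0}^{r} n_{dk}},$

we integrate out $\gamma$ in ( A ) and then integrate out $(\vartheta_1, \ldots, \vartheta_r)$ from ( A ).

Next, we make the transformation $\gamma_k = \varphi\theta_k$ and $\varphi = \sum_{l=1}^{K} \gamma_l$, which has Jacobian $\varphi^{K-1}$. Hence the prior distribution in ( A ) becomes

$\pi(\theta, \vartheta, \varphi, \beta) \propto \left( \prod_{d=1}^{r} \vartheta_d^{-1} \right) \varphi^{-1} \left( \prod_{k=1}^{K} \theta_k^{-1} \right) \pi(\beta).$

( A ) can then be rewritten accordingly, and $\varphi$ is integrated out from ( A ).

From ( A ) and ( A ), it is clear that posterior inference for the parameter of interest, $\beta$, remains the same under either the prospective likelihood $L_P$ or the retrospective likelihood $L_R$, as long as the posterior is proper. It can be shown that the posterior will be proper for any proper prior for $\beta$ if $n_{0k} \ge 1$ for all $k = 1, \ldots, K$.

We have to show that $I \le M$, where $M$ is a finite positive constant.

Integrating first w.r.t. $\beta$, we have

$\int \exp\left\{ -\frac{1}{2} \sum_i (\theta_i - X_i\beta - Z_i b_i 1)'\, \Psi^{-1} (\theta_i - X_i\beta - Z_i b_i 1) \right\} d\beta \propto \left| \sum_i X_i'\Psi^{-1}X_i \right|^{-1/2} \exp\left\{ -\frac{1}{2} \sum_i W_i'\Psi^{-1}W_i + \frac{1}{2} \left( \sum_i W_i'\Psi^{-1}X_i \right) \left( \sum_i X_i'\Psi^{-1}X_i \right)^{-1} \left( \sum_i X_i'\Psi^{-1}W_i \right) \right\},$

where $W_i = \theta_i - Z_i b_i 1$ and $\Psi = \mathrm{diag}(\psi_1^2, \psi_2^2, \ldots, \psi_t^2)$.

Now, $W_i'\Psi^{-1}W_i = W_i'\Psi^{-1/2}\Psi^{-1/2}W_i = S_i'S_i$ where $S_i = \Psi^{-1/2}W_i$. Similarly, $W_i'\Psi^{-1}X_i = S_i'T_i$, $X_i'\Psi^{-1}W_i = T_i'S_i$ and $X_i'\Psi^{-1}X_i = T_i'T_i$ where $T_i = \Psi^{-1/2}X_i$.

The exponent in ( B ) becomes

$-\frac{1}{2} \left[ \sum_i S_i'S_i - \sum_i S_i'T_i \left( \sum_i T_i'T_i \right)^{-1} \sum_i T_i'S_i \right] = -\frac{1}{2} \left[ S'S - S'T(T'T)^{-1}T'S \right] = -\frac{1}{2} S'\left[ I - T(T'T)^{-1}T' \right] S = -Q, \text{ say,}$

where $S = (S_1', \ldots, S_m')'$ and $T = (T_1', \ldots, T_m')'$. Since $I - T(T'T)^{-1}T'$ is idempotent, $S'[I - T(T'T)^{-1}T']S$ is nonnegative, implying $Q \ge 0$ and thus $\exp(-Q) \le 1$.

Next, we consider integration w.r.t. $\psi^2$. Assuming $\psi_{\max} = \max(\psi_1, \ldots, \psi_t)$, we have, for all $j = 1, \ldots, t$,

$\psi_j^2 \le \psi_{\max}^2 \;\Rightarrow\; X_{ij}\psi_j^{-2}X_{ij}' \ge X_{ij}\psi_{\max}^{-2}X_{ij}' \;\Rightarrow\; \sum_{i,j} X_{ij}\psi_j^{-2}X_{ij}' \ge \psi_{\max}^{-2} \sum_{i,j} X_{ij}X_{ij}'.$

Combining ( B ) and ( B ), we have

$I \le \left| \sum_{i,j} X_{ij}X_{ij}' \right|^{-1/2} \int \cdots \int (\psi_{\max}^2)^{(p+1)/2} \prod_{j=1}^{t} (\psi_j^2)^{-(m/2 + c_j + 1)} \exp(-d_j/\psi_j^2)\, d\psi_1^2 \cdots d\psi_t^2 \quad (B)$

Combining ( B ) and ( B ) gives the final bound on $I$. Since all the components of the integrand in ( B ) have proper distributions, the above integral is finite, thus proving posterior propriety.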
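The nonnegativity of the quadratic form used earlier in this proof follows from a standard property of projection matrices. A short derivation is sketched below, assuming $T'T$ is invertible; the symbols $H$ and $P$ are shorthand introduced only for this note.

```latex
\begin{align*}
H &= T(T'T)^{-1}T', \qquad P = I - H, \\
H^2 &= T(T'T)^{-1}\,(T'T)\,(T'T)^{-1}T' = H, \qquad H' = H, \\
P^2 &= I - 2H + H^2 = I - H = P, \qquad P' = P, \\
S'PS &= S'P'PS = (PS)'(PS) = \lVert PS \rVert^2 \;\ge\; 0 .
\end{align*}
```

Hence $Q = \tfrac{1}{2} S'PS \ge 0$ and $\exp(-Q) \le 1$, which is the bound the proof relies on.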
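For the random walk model, the integrand in ( B ) will have an additional likelihood term $\prod_{j=1}^{t} L(v_j \mid v_{j-1}, \sigma_v^2)$ and a prior term $\pi(\sigma_v^2)$. The derivation would then proceed exactly as above, and the integrand in ( B ) will also contain these additional terms. But since both of these are proper distributions (normal and inverse gamma respectively), $I$ will still be finite under the conditions stated in the theorem.

We have to show that

$\int \cdots \int \exp\left\{ -\frac{1}{2} \sum_{i,j} (\theta_{ij} - X_{ij}'\beta - Z_{ij}'b_i)'\, \Psi_j^{-1} (\theta_{ij} - X_{ij}'\beta - Z_{ij}'b_i) \right\} \cdots\, d\beta \cdots < \infty \quad (B)$

in order to prove posterior propriety.

Using the same type of algebraic manipulations as in the univariate case, the integral w.r.t. $\beta$ on the L.H.S. of ( B ) can be shown to be proportional to

$\left| \sum_{i,j} X_{ij}\Psi_j^{-1}X_{ij}' \right|^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{i,j} W_{ij}'\Psi_j^{-1}W_{ij} + \frac{1}{2} Q \right\}.$

As before, the expression within the exponent in ( B ) can be rewritten as

$K = -\frac{1}{2} \left[ \sum_{i,j} S_{ij}'S_{ij} - \sum_{i,j} S_{ij}'T_{ij} \left( \sum_{i,j} T_{ij}'T_{ij} \right)^{-1} \sum_{i,j} T_{ij}'S_{ij} \right] = -\frac{1}{2} S'\left[ I - T(T'T)^{-1}T' \right] S \le 0.$

Thus,

$\exp\left\{ -\frac{1}{2} \sum_{i,j} W_{ij}'\Psi_j^{-1}W_{ij} + \frac{1}{2} Q \right\} \le 1.$

So, in order to prove posterior propriety, it remains to bound the integral w.r.t. the $\Psi_j$'s. Here $r$ is the order of $\Psi_j$, $j = 1, 2, \ldots, t$ ($r = 2$ in our case).

Let $\lambda_{j1}, \lambda_{j2}, \ldots, \lambda_{jr}$ be the distinct eigenvalues of $\Psi_j^{-1}$, $j = 1, 2, \ldots, t$. Since $\Psi_j$ is a variance-covariance matrix, it is positive definite and symmetric. Hence, $\Psi_j^{-1}$ also has the same properties. Thus, $\lambda_{jk} > 0$ for all $k = 1, 2, \ldots, r$.

Since $|\Psi_j^{-1}| = \prod_{k=1}^{r} \lambda_{jk}$ for all $j = 1, \ldots, t$,

$|\Psi_j^{-1}|^{(m + d_j - r - 1)/2} = \prod_{k=1}^{r} (\lambda_{jk})^{(m + d_j - r - 1)/2}.$

Now, replacing ( B ) and ( B ) in the expression of $I$ in ( B ), we obtain a bound whose integrand involves $(\lambda_{\min})^{-(p+q+2)/2}$, the product $\prod_{j=1}^{t} \prod_{k=1}^{r} (\lambda_{jk})^{(m + d_j - r - 1)/2}$ and the factors $\exp\left\{ -\frac{1}{2} T(V_j^{-1}\Psi_j^{-1}) \right\}$, where $T$ denotes trace. Let $\lambda_{\min} = \lambda_{lm}$, $l \in [1, \ldots, t]$; $m \in [1, \ldots, r]$.

Then $I \le I_1 I_2$, where $I_1$ collects the terms with $j \ne l$ and is finite. Thus, in order to show posterior propriety, we have to prove that $I_2 < \infty$.

By the AM-GM inequality, the product of the eigenvalues $\lambda_{lk}$, $k \ne m$, is bounded in terms of $\sum_{k=1}^{r} (\psi^l)_{kk}$, where $(\psi^l)_{kk}$ denotes the $k$th diagonal element of $\Psi_l^{-1}$.

Since $\Psi_l^{-1}$ has a Wishart distribution, $(\psi^l)_{kk} \sim \kappa_{kk}\, \chi^2_{d_l}$ $(k = 1, \ldots, r)$, implying that $\sum_{k=1}^{r} (\psi^l)_{kk} < \infty$.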
Combining ( B ) and ( B ), $I_2$ is bounded by a constant $C$ times an expectation, under the Wishart distribution of $\Psi_l^{-1}$, of a power of $\sum_{k=1}^{r} (\psi^l)_{kk}$. This expectation is finite because

$E\left[ ((\psi^l)_{kk})^{(r-1)(p+q+2)/2} \right] < \infty \;\; \forall\, k = 1, \ldots, r \;\Rightarrow\; E \sum_{k=1}^{r} ((\psi^l)_{kk})^{(r-1)(p+q+2)/2} < \infty \;\Rightarrow\; E \left( \sum_{k=1}^{r} (\psi^l)_{kk} \right)^{(r-1)(p+q+2)/2} < \infty.$

Thus $I$ is finite, implying posterior propriety.

1. ... is the $(p + K + 1)$-order prior variance-covariance matrix of $\beta$.
3. ... is the $(q + M + 1)$-order variance-covariance matrix of $b$.
4. ... is the $(r + K + 2)$-order variance-covariance matrix of ...
6. ... $\frac{1}{2} \sum_{i=1}^{N} b_{ij}^2$, $j = 0, \ldots, q$.
8. ... $\frac{1}{2} \sum_{k=1}^{K} \beta_{p+k}^2$.
9. ... $\frac{1}{2} \sum_{i=1}^{N} \sum_{j=q+1}^{q+M} b_{ij}^2$.
10. ... $\frac{1}{2} \sum_{k=1}^{K} \nu_{r+k}^2$.

Here, $G(x, y)$ denotes a Gamma density with shape parameter $x$ and rate parameter $y$.

C.2.1 Semiparametric Univariate Small Area Model

Agresti, A. (2002). Categorical data analysis. Wiley.
Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669.
Altham, P. (1971). The analysis of matched proportions. Biometrika 58, 561.
Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for case-controlled studies in cancer epidemiology. Statistician 42, 385.
Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83, 28.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. Bulletin of the International Statistical Institute.
Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting. Computational Statistics and Data Analysis 52, 5100.
Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics 6, 314.
Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1. International Agency for Research on Cancer, Lyon.
Breslow, N. E., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299.
Breslow, N. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14.
Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case-control studies. Journal of the American Statistical Association 90, 157.
Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: A prospective multicenter clinical trial. Journal of the American Medical Association 19, 1542.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11, 1269.
Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for experimentally uncontrolled variables. Bulletin of the International Statistical Institute 38, 97.
Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333.
Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal data, 2nd Edition. New York: Oxford University Press.
Diggle, P., Morris, S., and Wakefield, J. (2000). Point-source modeling using matched case-control data. Biostatistics 1, 89.
DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free-knot splines. Biometrika 88, 1055.
Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine 00, 1.
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science 11, 89.
Ericksen, E. and Kadane, J. (1985). Estimating the population in census year: 1980 and beyond (with discussion). Journal of the American Statistical Association 80, 98.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242.
Eubank, R. (1988). Spline smoothing and nonparametric regression. New York: Marcel Dekker.
Eubank, R. (1999). Nonparametric regression and spline smoothing. New York: Marcel Dekker.
Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman and Hall.
Fay, R. (1987). Application of multivariate regression to small domain estimation, in R. Platek, J. N. K. Rao, C. E. Särndal, and M. P. Singh (Eds). Small Area Statistics.
Fay, R. and Herriot, R. (1979). Estimation of income from small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1.
Gelfand, A. and Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1.
Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7, 457.
Ghosh, M. and Chen, M. H. (2002). Bayesian inference for matched case-control studies. Sankhya, B 64, 107.
Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association 91, 1423.
Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9, 55.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277.
Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711.
Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman and Hall/CRC.
Gustafson, P., Le, N., and Vallée, M. (2002). A Bayesian approach to case-control studies with errors in covariables. Biostatistics 3, 229.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics: The approach based on influence functions. Wiley.
Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Heagerty, P. (1999). Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55, 688.
Heagerty, P. (2002). Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 58, 342.
Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16, 239.
Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics: Handling dropout in longitudinal studies. Statistics in Medicine 23, 1455.
Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test 15, 1.
Johnson, V. (2004). A Bayesian $\chi^2$ test for goodness-of-fit. Annals of Statistics 32, 2361.
Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The increased risk of venous thromboembolism and the use of third generation progestagens: Role of bias in observational research. Contraception 54, 5.
Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline-based score tests for proportional hazards models. Biometrics 62, 803.
Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of Computational and Graphical Statistics 8, 333.
Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic regression with missing covariates. Biometrics 54, 295.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.
MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 2, 223.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719.
Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7, 1223-1230.
Morris, C. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78, 47.
Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55, 858.
Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika 84, 523.
O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics 60, 739.
Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Nonparametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265.
Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing covariates. Applied Statistics 49, 145.
Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies. Biometrika 91, 321.
Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403.
Rao, J. N. K. (2003). Small Area Estimation. Wiley-Interscience, New York.
Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 89, 905.
Robinson, G. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15.
Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722.
Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Statistics in Medicine 59, 829.
Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for nonignorable dropouts with many possible dropout times. Biometrics 64, 538.
Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735.
Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge, U.K.
Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, 384.
Schildcrout, J. and Heagerty, P. (2007). Marginalized models for moderate to long series of longitudinal binary response data. Biometrics 63, 322.
Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies with categorical covariates. Biometrika 88, 1073.
Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies. Biometrika 91, 15.
Sinha, S., Mukherjee, B., and Ghosh, M. (2004). Bayesian semiparametric modeling for matched case-control studies with multiple disease states. Biometrics 60, 41.
Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B., and Carroll, R. (2005). Semiparametric Bayesian analysis of matched case-control studies with missing exposure. Journal of the American Statistical Association 100, 591.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics 25, 1371.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics.
Wand, M. (2003). Smoothing and mixed models. Computational Statistics 18, 223.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall.
Zelen, M. and Parker, R. (1986). Case-control studies and Bayesian inference. Statistics in Medicine 5, 261-269.
Zhang, D., Lin, X., and Sowers, M. (2007). Two-stage functional mixed models for evaluating the effect of longitudinal covariate profiles on a scalar outcome. Biometrics 63, 351.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247.

Dhiman Bhadra received his Bachelor of Science in statistics from Presidency College, Calcutta (India) in 2002 and his Master of Science in statistics from Calcutta University in 2004. He joined the Department of Statistics at the University of Florida in January 2005 to pursue a PhD in statistics. He plans to graduate in August 2010.