
- Permanent Link:
- http://ufdc.ufl.edu/UFE0041832/00001
## Material Information
- Title:
- Bayesian Semiparametric Regression and Related Applications
- Creator:
- Bhadra, Dhiman
- Place of Publication:
- [Gainesville, Fla.]
- Publisher:
- University of Florida
- Publication Date:
- 2010
- Language:
- english
- Physical Description:
- 1 online resource (145 p.)
## Thesis/Dissertation Information
- Degree:
- Doctorate ( Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Statistics
- Committee Chair:
- Ghosh, Malay
- Committee Co-Chair:
- Daniels, Michael J.
- Committee Members:
- Agresti, Alan G.
- Andresen, Elena M.
- Graduation Date:
- 8/7/2010
## Subjects
- Subjects / Keywords:
- Case control studies ( jstor )
- Diseases ( jstor )
- Income estimates ( jstor )
- Median income ( jstor )
- Modeling ( jstor )
- School dropouts ( jstor )
- Semiparametric modeling ( jstor )
- Statistical estimation ( jstor )
- Statistics ( jstor )
- Trajectories ( jstor )
- Statistics -- Dissertations, Academic -- UF
- bayesian, case, current, mcmc, odds, penalized, random, semiparametric
- Genre:
- Electronic Thesis or Dissertation
- bibliography ( marcgt )
- theses ( marcgt )
- government publication (state, provincial, territorial, dependent) ( marcgt )
- Statistics thesis, Ph.D.
## Notes
- Abstract:
- Case-control studies and small area estimation are two distinct areas of modern statistics. The former deals with the comparison of diseased and healthy subjects with respect to the risk factor(s) of a disease, with the aim of capturing the disease-exposure association, especially for rare diseases. The latter is concerned with measuring characteristics of small domains, i.e., regions whose sample size is so small that the usual survey-based estimation procedures cannot be applied. Both areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in health-related issues, especially those involving rare diseases like cancer. On the other hand, estimates of characteristics for small areas are widely used by federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions, and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case-control studies and small area estimation. The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e., case-control studies for which time-varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status.
This enables us to analyze how the present disease status of a subject is influenced by his/her past exposure conditions, conditional on the current ones. Analysis is carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) algorithms. The proposed methodology is motivated by, and applied to, a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time-varying income pattern and also a state-specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model-based estimates of household median income of small areas. In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time-varying pattern in the income observations.
Our estimates tend to perform better than those provided by the Census Bureau and perform comparably to some established methodologies, especially those involving time series modeling techniques. Based on our findings in parts two and three, we conclude that semiparametric and nonparametric regression models can be an attractive alternative to the more traditional modeling frameworks, especially in situations where information on different characteristics of small areas is available at multiple time points in the past. ( en )
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2010.
- Local:
- Adviser: Ghosh, Malay.
- Local:
- Co-adviser: Daniels, Michael J.
- Electronic Access:
- RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-08-31
- Statement of Responsibility:
- by Dhiman Bhadra.
## Record Information
- Source Institution:
- UFRGP
- Rights Management:
- Applicable rights reserved.
- Embargo Date:
- 8/31/2011
- Resource Identifier:
- 004979600 ( ALEPH )
- 769020145 ( OCLC )
- Classification:
- LD1780 2010 ( lcc )
Full Text |

estimates of four-person families for 1989 using 1979 as the base year. They compared their estimates with the CPS median income estimates and the Bureau of the Census estimates, treating the decennial census values as the "gold standard". They used both univariate and bivariate model formulations. In all cases, the time series model with the adjusted census median income as covariate performed better than the ones with either the base year census median as covariate or both the base year and adjusted census medians as covariates. In all cases, the time series model also performed better than the non-time series one, which only utilized the census median income figures for 1979, the CPS median income estimates for 1989 and the per capita incomes for 1979 and 1989. Last but not least, the bivariate time series model using the median incomes of four- and five-person families performed the best and outperformed both the CPS and Bureau of the Census estimates of median income.

Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008), in which they combined small area random effects with a smooth, nonparametrically specified trend using penalized splines (Eilers and Marx, 1996). In doing so, they expressed the nonparametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple nonparametric bootstrap approach.
They applied their model to a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Datta, G., Ghosh, M., Nangia, N., and Natarajan, K. (1993). Estimation of median income of four-person families: A Bayesian approach. In W.A. Berry, K.M. Chaloner and J.K. Geweke (Eds), Bayesian Analysis in Statistics and Econometrics, pages 129-140.

Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333-350.

Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal data, 2nd Edition. New York: Oxford University Press.

Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched case-control data. Biostatistics 1, 89-109.

DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free knot splines. Biometrika 88, 1055-1071.

Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject specific curves for longitudinal data. Statistics in Medicine 00, 1-24.

Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science 11, 89-121.

Ericksen, E. and Kadane, J. (1985). Estimating the population in census year: 1980 and beyond (with discussion). Journal of the American Statistical Association 80, 98-131.

Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588.

Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242-251.

Eubank, R. (1988). Spline smoothing and nonparametric regression. New York: Marcel Dekker.

Eubank, R. (1999). Nonparametric regression and spline smoothing. New York: Marcel Dekker.

Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman and Hall.

Fay, R. (1987). Application of multivariate regression to small domain estimation. In R. Platek, J.N.K. Rao, C.E. Särndal, and M.P. Singh (Eds), Small Area Statistics.

Fay, R. and Herriot, R. (1979). Estimation of income from small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269-277.

patterns which may result in unstable parameter estimates in those patterns since some of the parameters may be unidentifiable. There are different ways to get around this problem. Hogan and Laird (1998) suggested that parameters be shared across patterns. Hogan et al. (2004) suggested ways to group the T dropout times into m < T groups in an ad hoc fashion. Roy (2003) proposed an automated mechanism for this grouping using a latent variable approach within the context of normal models for continuous data. This approach assumes the existence of a discrete latent variable that explains the dependence between the response vector and the dropout time, and allows incorporation of uncertainty about the groupings, conditional on a fixed number of groups. Roy and Daniels (2008) extended the above approach by incorporating uncertainty in the number of classes through approximate Bayesian model averaging. In their approach, the marginal mean is assumed to follow a generalized linear model, while the mean conditional on the latent class and random effects is specified separately. Since the dimension of the parameter vector of interest (the marginal regression coefficients) does not depend on the assumed number of latent classes, they treat the number of latent classes as a random variable. A prior distribution is assumed for the number of classes and approximate posterior model probabilities are calculated.
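In order to avoid the complications of implementing a fully Bayesian model, they propose a simple approximation to these posterior probabilities. Lastly, they apply their methodology to a dataset from a longitudinal study of depression in HIV-infected women. Heagerty (1999) proposed marginally specified logistic-normal models for longitudinal binary data. In doing so, he proposed an alternative parametrization of the logistic-normal random effects model and studied both likelihood and estimating equation approaches to parameter estimation. A notable feature of his approach was that the marginal regression parameters still permit individual level predictions or contrasts. Heagerty (2002) also proposed a general parametric class of serial

[...]

5. \([v_1 \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_2, X, Z] \sim N(M_1^v, \Sigma_1^v)\), where \((\Sigma_1^v)^{-1} = m\Psi_1^{-1} + 2\Sigma_v^{-1}\) and \(M_1^v = \Sigma_1^v\left\{\Psi_1^{-1}\sum_{i=1}^{m}(\theta_{i1} - X_{i1}'\beta - Z_{i1}'\gamma - b_i) + \Sigma_v^{-1}v_2\right\}\).

6. \([v_j \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_{j-1}, v_{j+1}, X, Z] \sim N(M_j^v, \Sigma_j^v)\) \((j = 2, \ldots, t-1)\), where \((\Sigma_j^v)^{-1} = m\Psi_j^{-1} + 2\Sigma_v^{-1}\) and \(M_j^v = \Sigma_j^v\left\{\Psi_j^{-1}\sum_{i=1}^{m}(\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i) + \Sigma_v^{-1}(v_{j+1} + v_{j-1})\right\}\).

7. \([v_t \mid \beta, \gamma, \theta, b, \Psi, \Sigma_v, v_{t-1}, X, Z] \sim N(M_t^v, \Sigma_t^v)\), where \((\Sigma_t^v)^{-1} = m\Psi_t^{-1} + \Sigma_v^{-1}\) and \(M_t^v = \Sigma_t^v\left\{\Psi_t^{-1}\sum_{i=1}^{m}(\theta_{it} - X_{it}'\beta - Z_{it}'\gamma - b_i) + \Sigma_v^{-1}v_{t-1}\right\}\).

8. \([\Psi_j \mid \beta, \gamma, \theta, b, v, X, Z] \sim IW(A_j, d + m)\) \((j = 1, \ldots, t)\), where \(A_j = S_j + \sum_{i=1}^{m}(\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i - v_j)(\theta_{ij} - X_{ij}'\beta - Z_{ij}'\gamma - b_i - v_j)'\).

9. \([\Sigma_v \mid v] \sim IW\left(S_v + \sum_{j=1}^{t}(v_j - v_{j-1})(v_j - v_{j-1})',\ d_v + t\right)\), assuming \(v_0 = 0\).

10. \([\Sigma_0 \mid b] \sim IW\left(S_0 + \sum_{i=1}^{m} b_i b_i',\ d_0 + m\right)\).

11. \([\Sigma_\gamma \mid \gamma] \sim IW\left(S_\gamma + \gamma\gamma',\ d_\gamma + 1\right)\).

Suppose \(n_{ab}\) is the number of subjects for whom \(D = a\) and \(\hat{D} = b\) \((a = 0, 1;\ b = 0, 1)\), with \(D\) and \(\hat{D}\) being the observed and predicted disease status for a particular subject. Then,

\[
\kappa = \frac{\dfrac{n_{00} + n_{11}}{n} - \left(\dfrac{n_{00} + n_{01}}{n}\cdot\dfrac{n_{00} + n_{10}}{n} + \dfrac{n_{10} + n_{11}}{n}\cdot\dfrac{n_{01} + n_{11}}{n}\right)}{1 - \left(\dfrac{n_{00} + n_{01}}{n}\cdot\dfrac{n_{00} + n_{10}}{n} + \dfrac{n_{10} + n_{11}}{n}\cdot\dfrac{n_{01} + n_{11}}{n}\right)}
\]

where \(n = n_{00} + n_{01} + n_{10} + n_{11}\). The observed disease status (i.e., case or control status) of a subject is obtained from the dataset, while the predicted disease status is calculated from the posterior estimates of the parameters.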
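The agreement measure above depends only on the four cell counts, so it can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's code: the function names `cohen_kappa` and `kappa_from_draw` are hypothetical, and assigning a predicted probability of exactly 0.5 to the control class is an assumption, since the text leaves that case unspecified.

```python
def cohen_kappa(n00, n01, n10, n11):
    """Kappa for a 2 x 2 table; n_ab counts subjects with observed status a and predicted status b."""
    n = n00 + n01 + n10 + n11
    if n == 0:
        raise ValueError("empty table")
    p_obs = (n00 + n11) / n  # observed agreement
    # chance agreement from the row (observed) and column (predicted) margins
    p_exp = ((n00 + n01) * (n00 + n10) + (n10 + n11) * (n01 + n11)) / n**2
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def kappa_from_draw(observed, probs, threshold=0.5):
    """Kappa for one posterior draw: threshold predicted probabilities, then tabulate."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for d, p in zip(observed, probs):
        predicted = 1 if p > threshold else 0  # ties go to 0 (assumption)
        counts[(d, predicted)] += 1
    return cohen_kappa(counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)])
```

Calling `kappa_from_draw` once per Gibbs iteration would give a sample of kappa values whose mean and quantiles summarize the agreement.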
At iteration n of the Gibbs sampler, we can calculate the quantity \(\hat{p}_i^{(n)} = P^{(n)}(D_i = 1 \mid X_i(t + a_{id}),\ t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i\phi + b_i' Q_i\phi)\), where \(L(\cdot)\) can be either the exact logistic cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of \(\hat{p}_i^{(n)}\), we can assign \(\hat{D}_i^{(n)} = 1\) if \(\hat{p}_i^{(n)} > 0.5\) and \(\hat{D}_i^{(n)} = 0\) if \(\hat{p}_i^{(n)} < 0.5\). Based on the values of \(\{(D_i, \hat{D}_i^{(n)});\ i = 1, \ldots, N\}\), we can form a 2 x 2 table and hence calculate a value of kappa, say \(\kappa^{(n)}\), at iteration n of the Gibbs sampler. The posterior mean and 95% credible interval of \(\kappa\) provide a measure of the amount of agreement that our model provides.

2.5.3 Case Influence Analysis

Case influence (or case deletion) diagnostics are often used as a tool for model assessment in various statistical problems. The procedure hinges on the idea that the influence of a particular observation on a parameter can be measured by the difference between the parameter estimate based on the full data and the estimate based on the data with that observation deleted (Hampel et al., 1987). These diagnostics can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors. Bradlow and Zaslavsky (1997) applied case influence tools in

- 5 CONCLUSION AND FUTURE RESEARCH
- 5.1 Adaptive Knot Selection
- 5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent Class and Transitional Modelling
- 5.2.1 Introduction and Brief Literature Review
- 5.2.2 Modeling Framework
- 5.2.3 Likelihood, Priors and Posteriors
- 5.2.4 Specification of Priors
- APPENDIX
- A PROOF OF BAYESIAN EQUIVALENCE RESULTS
- B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS
- B.1 Univariate Small Area Model
- B.2 Bivariate Small Area Model
- C FULL CONDITIONAL DISTRIBUTIONS
- C.1 Semiparametric Case Control Model
- C.2 Semiparametric Small Area Models
- C.2.1 Semiparametric Univariate Small Area Model
- C.2.2 Univariate Random Walk Model
- C.2.3 Bivariate Random Walk Model
- REFERENCES
- BIOGRAPHICAL SKETCH

that \(u_{ij}\) and \(e_{ij}\) are mutually independent with \(u_{ij} \sim N(0, \psi_j^2)\) and \(e_{ij} \sim N(0, \sigma_{ij}^2)\), where the \(\sigma_{ij}\) are the sampling standard deviations corresponding to the CPS direct median income estimates, obtained using the "generalized variance function" technique mentioned in Section 3.1.1. In the datasets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots \((\tau_1, \ldots, \tau_K)\) are usually placed on a grid of equally spaced sample quantiles of the \(x_{ij}\)'s. From (3-1) and (3-2), we have \(\theta_{ij} = f(x_{ij}) + b_i + u_{ij}\), which reflects our basic assumption that the true unknown household median income may have an unspecified pattern of variation with the IRS mean (or median) income. Thus, the covariate effect is expressed by the unspecified nonparametric function \(f(x_{ij})\), which reflects the possible nonlinear effect of \(x_{ij}\) on \(\theta_{ij}\).

3.2.2.2 Model II: Semiparametric Random Walk Model (SPRWM)

Since, for each state, the response and the covariates are collected over time, there may be a definite trend in their behavior. Thus, we added a time-specific random component to (3-1) and modeled it as a random walk as follows:

\[
Y_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij} + e_{ij} = \theta_{ij} + e_{ij} \qquad \text{(3-3)}
\]

where \(\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}\). Here, \(v_j\) denotes the time-specific random component. We assume that \((v_j \mid v_{j-1}, \sigma_v^2) \sim N(v_{j-1}, \sigma_v^2)\) with \(v_0 = 0\). Alternatively, we may write \(v_j = v_{j-1} + w_j\) where \(w_j \sim N(0, \sigma_v^2)\). This is the so-called random walk model and is similar to the system equations used in dynamic linear models. Before proceeding to the next section, we may note that, unlike the models of Ghosh et al. (1996), the models given in (3-2) and (3-3) incorporate state-specific random effects \((b_i)\).
This rectifies a limitation of the former as pointed out in Rao (2003).

3.4.3 Analytical Results

Data on CPS median income and IRS mean incomes were available for 50 states and the District of Columbia for the time span 1995-2004. CPS median income ranged from $24,879.68 to $52,778.94 with a mean of $36,868.48 and a standard deviation of $5,954.94, while IRS mean annual income ranged from $27,910 to $72,769.38 with a mean of $41,133.45 and a standard deviation of $7,196.56. We fitted Model I (SPM) with all possible knot choices from 0 to 40, but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all four comparison measures. Adding more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model-based estimates were slightly superior to the SPM estimates. Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory, so that the random walk component does not lead to any further improvement. Last but not least, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates on all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots. It is of interest that the 95% CIs for \(\gamma_1\), \(\gamma_4\) and \(\gamma_5\) do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit to the CPS median income observations. The same is true for the coefficients of SPM.
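The knot counts compared above refer to a spline expressed in truncated polynomial basis functions, with knots placed at equally spaced sample quantiles of the covariate. The following is a minimal sketch of how such a design matrix could be built, using only the Python standard library. The function names are hypothetical, the "inclusive" quantile rule is an assumption (the text does not say which quantile definition was used), and the penalty on the knot coefficients is not shown.

```python
from statistics import quantiles


def quantile_knots(x, num_knots):
    """Knots at equally spaced sample quantiles of x (interior cut points only)."""
    if num_knots < 1:
        return []
    # n = num_knots + 1 groups gives exactly num_knots cut points
    return quantiles(x, n=num_knots + 1, method="inclusive")


def truncated_power_row(x, knots, degree=1):
    """Basis row [1, x, ..., x^p, (x - k_1)_+^p, ..., (x - k_K)_+^p] for one covariate value."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    poly = [x ** d for d in range(degree + 1)]
    trunc = [max(x - k, 0.0) ** degree for k in knots]  # zero to the left of each knot
    return poly + trunc


def design_matrix(xs, knots, degree=1):
    """Stack one basis row per observation."""
    return [truncated_power_row(x, knots, degree) for x in xs]
```

With 5 knots and a linear basis, each row has 2 polynomial columns and 5 truncated columns; the coefficients of the truncated columns play the role of the knot coefficients whose credible intervals are discussed above.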
In the most general case, \(\gamma(t + a_{id})\) can also be modeled as a P-spline, i.e.,

\[
\gamma(t + a_{id}) = \phi_0 + \phi_1(t + a_{id}) + \ldots + \phi_r(t + a_{id})^r + \sum_{k=1}^{K^*}\phi_{k+r}(t + a_{id} - \xi_k)_+^r = \Phi_{rc}(t + a_{id})'\phi \qquad \text{(2-4)}
\]

where \(\Phi_{rc}(t + a_{id}) = [1, (t + a_{id}), \ldots, (t + a_{id})^r, (t + a_{id} - \xi_1)_+^r, \ldots, (t + a_{id} - \xi_{K^*})_+^r]'\), \(\phi = (\phi_0, \ldots, \phi_{K^*+r})'\) and \((\xi_1, \ldots, \xi_{K^*})\) are the knots. As special cases of (2-4), we may consider \(\gamma(t + a_{id}) = \phi_0\), in which case the covariate is the area under the PSA process \(\{X_i(t + a_{id}),\ -c \le t \le 0\}\) and \(\phi_0\) is its effect on the disease probability (or the logit of the disease probability). We can also assume \(\gamma(t + a_{id}) = \phi_0 + \phi_1(t + a_{id})\), which signifies a linear pattern in the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages. Replacing (2-2) and (2-4) in the R.H.S. of (2-3), we have

\[
P(D_i = 1 \mid X_i(t + a_{id}),\ -c \le t \le 0) = L\left(\alpha + \int_{-c}^{0}\left(\Phi_p(t + a_{id})'\beta + \Phi_q(t + a_{id})'b_i\right)\left(\Phi_{rc}(t + a_{id})'\phi\right)dt\right) = L(\alpha + \beta' M_i\phi + b_i' Q_i\phi) \qquad \text{(2-5)}
\]

where \(M_i = \int_{-c}^{0}\Phi_p(t + a_{id})\,\Phi_{rc}(t + a_{id})'\,dt\) and \(Q_i = \int_{-c}^{0}\Phi_q(t + a_{id})\,\Phi_{rc}(t + a_{id})'\,dt\). For pre-chosen degrees of the basis functions and the knots, both \(M_i\) and \(Q_i\) are matrices and are available in closed form. We assume normal distributional forms for the spline coefficients in (2-2) and (2-4) in order to penalize the jumps of the spline at the knots. Thus, we have \(\beta_{p+k} \sim N(0, \sigma_\beta^2)\) \((k = 1, \ldots, K)\); \(b_{i,q+m} \sim N(0, \sigma_b^2)\) \((m = 1, \ldots, M)\) and \(\phi_{k+r} \sim N(0, \sigma_\phi^2)\) \((k = 1, \ldots, K^*)\). Finally, the random subject-specific deviation function \(g_i(a_{ij})\) is modeled as \(b_{ij} \sim N(0, \sigma_j^2)\) \((i = 1, \ldots, N;\ j = 0, \ldots, q)\).

[Figure panels: A, Basic Semiparametric Model; B, Semiparametric RW Model. X-axis: Theoretical Quantiles of Chi-Square (9).]

Figure 3-5. Quantile-quantile plot of \(R^B\) values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models.
The X-axis depicts the expected order statistics from a \(\chi^2\) distribution with 9 degrees of freedom.

second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between them. Thus, instead of taking all the observations (i.e., the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5, while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of \(R^B\) values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of \(R^B\) and that of a \(\chi^2(9)\) random variable. Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of \(R^B\) deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values,

Although a great amount of work has been done in the frequentist domain, Bayesian modeling for case-control studies did not really start until the late 1980s. The development of Markov chain Monte Carlo techniques led to rapid progress on this front. Altham (1971) is probably the first Bayesian work which considered several 2 x 2 contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case-control model with a single binary exposure.
These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case-control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating the knowledge available at each stage. Muller and Roeder (1997) introduced a novel aspect to the Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the nonparametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West, 1995). The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters \(\beta\). This paper pioneered the use of continuous covariates, measurement error and flexible nonparametric modeling of exposures in a Bayesian setting, and brought to light the tremendous potential of modern Bayesian computational techniques for handling complex data scenarios in case-control studies. Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical

APPENDIX A
PROOF OF BAYESIAN EQUIVALENCE RESULTS

Proof of Theorem 1.
Let \(Y_{dj}\) \((d = 0, 1;\ j = 1, \ldots, J)\) be independently distributed as Poisson\((\lambda_{dj})\), where

\[
\log\lambda_{dj} = \log\mu + d\log\theta + \log\delta_j + d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt \qquad \text{(A-1)}
\]

Thus, the likelihood will be

\[
L(\mu, \theta, \delta, \phi) \propto \prod_{d=0}^{1}\prod_{j=1}^{J}\lambda_{dj}^{y_{dj}}\exp(-\lambda_{dj})
\]

and hence the log likelihood will be

\[
l(\mu, \theta, \delta, \phi) = \sum_{d=0}^{1}\sum_{j=1}^{J}\left\{y_{dj}\log(\lambda_{dj}) - \lambda_{dj}\right\}
\]

Now, replacing the expression of \(\log\lambda_{dj}\) from (A-1), we have

\[
l(\mu, \theta, \delta, \phi) = \sum_{d=0}^{1}\sum_{j=1}^{J} y_{dj}\left(\log\mu + d\log\theta + \log\delta_j + d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right) - \mu\sum_{d=0}^{1}\sum_{j=1}^{J}\theta^d\delta_j\exp\left(d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right) \qquad \text{(A-2)}
\]

Differentiating (A-2) w.r.t. \(\mu\) and \(\theta\) and solving the resulting equations, we have

\[
\hat{\mu} = \frac{\sum_{j} y_{0j}}{\sum_{j}\delta_j} \quad \text{and} \quad \hat{\theta} = \frac{\sum_{j} y_{1j}\,\sum_{j}\delta_j}{\sum_{j} y_{0j}\,\sum_{j}\delta_j\exp\left(\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}
\]

Replacing the above expressions in (A-2) and then exponentiating, we obtain the expression of \(L(\delta, \phi)\) in (2-8). Again, differentiating (A-2) w.r.t. \(\delta_j\), we have

\[
\hat{\delta}_j = \frac{\sum_{d} y_{dj}}{\mu\sum_{d}\theta^d\exp\left(d\,\phi'\int_{-c}^{0} Z_j(t)\Psi(t)\,dt\right)}, \quad j = 1, \ldots, J \qquad \text{(A-3)}
\]

It is easy to show that if we replace (A-3) in (A-2) and then exponentiate, we get the expression for \(L(\theta, \phi)\) in (2-9). Since the order of maximization is immaterial, it follows that \(L(\delta, \phi)\) and \(L(\theta, \phi)\), once maximized over the nuisance parameters (\(\theta\) and \(\delta\)

families remains interesting nevertheless. Now, we will briefly discuss the estimation procedure that the U.S. Census Bureau used to follow towards that end. In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state-specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e., 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce.
Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993). The Census Bureau based their estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, they used median income observations for three- and five-person families in addition to those of four-person families. The basic dataset for each state was a bivariate random vector, with one component being the CPS median income estimate of four-person families and the other being the weighted average of CPS median incomes of three- and five-person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year

expressed using truncated polynomial basis functions with varying degrees and numbers of knots, although other types of basis functions like B-splines or thin plate splines can also be used. We have worked with two types of models, viz. a regular semiparametric model and a semiparametric random walk model. For each of these models, analysis has been carried out using a hierarchical Bayesian approach. Since we chose non-informative improper priors for the regression parameters, propriety of the posterior has been proved before proceeding with the computations.
Markov chain Monte Carlo methodologies, specifically Gibbs sampling (Gelfand and Ghosh, 1998), have been used to obtain the parameter estimates. We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on. We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the "gold standard" since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state-level models. The fact that our estimates still improve on the SAIPE model-based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S. The rest of the chapter is organized as follows. In Section 3.2, we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data

population (or cohort) over time is often impractical. Thus, case-control studies are generally retrospective in nature. Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years.
Notable work in the frequentist domain includes that of Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed, and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let D and E be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between D and E is the (disease) odds ratio

\[
\frac{P(D = 1 \mid E = 1)/P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0)/P(D = 0 \mid E = 0)}
\]

By applying Bayes' theorem, the above expression can be rewritten as

\[
\frac{P(E = 1 \mid D = 1)/P(E = 0 \mid D = 1)}{P(E = 1 \mid D = 0)/P(E = 0 \mid D = 0)} \qquad \text{(1-2)}
\]

which is the exposure odds ratio. Another well-known measure of association is the relative risk (RR) of disease for different exposure values, given by \(P(D = 1 \mid E = 1)/P(D = 1 \mid E = 0)\). For rare diseases, both \(P(D = 0 \mid E = 0)\) and \(P(D = 0 \mid E = 1)\) are close to one and the disease odds ratio is approximately equal to the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of 2 x 2 tables as in Table 1-1.

Table 1-1. A typical 2 x 2 table

| Disease Status | Exposed | Not Exposed | Total |
|---|---|---|---|
| Case | \(n_{11i}\) | \(n_{10i}\) | \(n_{1i}\) |
| Control | \(n_{01i}\) | \(n_{00i}\) | \(n_{0i}\) |
| Total | \(e_{1i}\) | \(e_{0i}\) | \(N_i\) |

3.5 Model Assessment

To examine the goodness-of-fit of the semiparametric models, we used a Bayesian chi-square goodness-of-fit statistic (Johnson, 2004). This is essentially an extension of the classical chi-square goodness-of-fit test in which the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained, which can be used for constructing global goodness-of-fit diagnostics.
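The per-iteration calculation can be sketched as follows: for one posterior draw, each observation is assigned to an equal-probability bin according to the value of its conditional cdf, and the bin counts are compared with their expected values. This is a sketch under stated assumptions, not the dissertation's code: the function names are hypothetical, a normal conditional distribution is assumed for each observation, and the default cutoff 16.919 is the 0.95 quantile of a chi-square distribution with 9 degrees of freedom, which corresponds to 10 bins.

```python
from statistics import NormalDist


def bayes_chisq(y, means, sds, num_bins=10):
    """Chi-square discrepancy for one posterior draw, using equal-probability bins."""
    n = len(y)
    counts = [0] * num_bins
    for yi, mu, sd in zip(y, means, sds):
        u = NormalDist(mu, sd).cdf(yi)  # conditional cdf at the observed value
        k = min(int(u * num_bins), num_bins - 1)  # bin index, guarding against u == 1
        counts[k] += 1
    expected = n / num_bins  # n * p_k with p_k = 1 / num_bins
    return sum((m - expected) ** 2 / expected for m in counts)


def exceedance_rate(rb_draws, cutoff=16.919):
    """Proportion of posterior draws whose statistic exceeds the chi-square cutoff."""
    return sum(1 for r in rb_draws if r > cutoff) / len(rb_draws)
```

Here `means` and `sds` stand for the conditional mean and standard deviation of each observation under the current draw; an exceedance rate near 0.05 would be read as evidence of adequate fit.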
To construct this statistic, we form 10 equally spaced bins $((k-1)/10,\ k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde{\Omega})$ to be random, where $\tilde{\Omega}$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e. $Y_{ij}$ is allocated to the $k$th bin if $F(Y_{ij} \mid \tilde{\Omega}) \in ((k-1)/10,\ k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

\[ R^B(\tilde{\Omega}) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde{\Omega}) - n p_k}{\sqrt{n p_k}} \right]^2 \]

where $n$ is the total number of observations. For the purpose of model assessment, two summary measures can be used, both derived from the posterior distribution of $R^B(\tilde{\Omega})$. The first is the proportion of times the generated values of $R^B$ exceed the 0.95 quantile of a $\chi^2_9$ distribution. Values quite close to 0.05 would suggest a good fit. The second diagnostic is the probability that $R^B(\tilde{\Omega})$ exceeds a $\chi^2_9$ deviate, i.e.

\[ A = P\left(R^B(\tilde{\Omega}) > X\right), \qquad X \sim \chi^2_9 \]

Since the nominal value of this probability is 0.5, values close to 0.5 would suggest a good fit. The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The

We assume that, conditional on the past observations, $Y_{it}$ depends only on the previous $p$ observations, i.e. $(Y_{i,t-1}, Y_{i,t-2}, \ldots, Y_{i,t-p})$. Here we have to deal with the following three types of dependence structures:
1. Dependence between response and dropout time, modeled by the latent classes.
2. Short range (serial) dependence between $Y_{it}$ and $(Y_{i,t-1}, \ldots, Y_{i,t-p})$, modeled by an MTM(p).
3. Long range or non-diminishing dependence among the $Y_{it}$'s, modeled by the subject specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

\[ E(Y_{it} \mid X_{it}, \beta) = g^{-1}(X_{it}'\beta) \qquad (5\text{-}5) \]

The above model marginalizes over the subject specific random effects and over the latent class distribution (implicitly over the dropout distribution) as well.
In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned over the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable S this is obvious because the specific latent class a subject would belong to would solely depend on his/her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over D itself. Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e. effects conditional on the random effects) or class-specific covariate effects (i.e. effects of covariates on Y given a particular dropout class). Moreover, the conditional model should be so specified that it is compatible with the marginal model (5-5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model 112 priors on the inverse of the variance components ( ..... o-, a ,o- ). The prior distributions are assumed to be mutually independent. We choose small values (0.001) for the gamma shape and rate parameters to make the priors diffuse in nature so that inference is mainly controlled by the data distribution. Thus, we have the following priors : 3 ~ uniform(RP++), (pj)-1 ~ G(cj, d)(j = 1 ... t), (j)-1 ~ G(c, d), (7)-1 G(c,, d,) and (o)-1 ~ G(cv, dv). Here X ~ G(a, b) denotes a gamma distribution with shape parameter a and rate parameter b having the expression f(x) oc xa-lexp(-bx), x > 0. Since we have chosen improper priors for 0, posterior propriety of the full posterior have been shown. We have the following theorem Theorem 1. Let 2x = max(, ...,.2) = '.7. say, for some k e [1,..., t]. 
Then, posterior propriety holds if the following conditions are satisfied 1. (m p 5)/2 + ck > 0 and dk > 0 2. m/2 + cj 2 > 0 and dj >0,j = 1,..., t;j 4 k 3.3.3 Posterior Distribution and Inference The full posterior of the parameters given the data is obtained in the usual way by combining the likelihood and the prior distribution as follows m t p(f2Y, X, Z) x H L(Yi, Xi, Zili)7(/3)7(o)7(o) () (3-6) i=1 j=1 For the random walk model, there will be an additional term 7r(a2). By the conditional independence properties, we can factorize the full posterior as [0, ,b, a a2, { ..., }Y, X,Z] o [Ylo ][0|/3,, b,{ ..., X, Z][b|l ] x t [7 1- 1 [/3]7[- [ ]nb] j= 1 Our target of inference is {0,, i = 1,..., m;j = 1, ...t}, the true median household income of all the states. Since the marginal posterior distribution of 0, is analytically intractable, high dimensional integration needs to be carried out in a theoretical Now, integrating (A-5) w.r.t 0 we obtain J NJ p(e, O, rly) ox p(-) J y, j 1 c x exp (' yi Zt(t)(dt 0 Y (A-6) j 1 J -c j= 1 Integration of (A-6) w.r.t b yields (2-12) after some minor manipulation. (iii) The order in which p(O, 6, |0y) is integrated w.r.t the parameters does not make any difference in the marginal posterior density of p(0). Thus, integration of p(w, 01y) w.r.t w or p(O, 01y) w.r.t 0 will yield the same marginal posterior density p(0|y) of 0. Remarks : 1. As in Seaman and Richardson (2004), the assumption of existence and finiteness of E (04' J Zq(t)W(t)dt and E 4' Z,(t)V(t)dt is automatically satisfied provided the prior density p(O) ensures that E(O) exists and is finite. 2. The posterior propriety of p(O, 6, 0 y) in (A -10) can be shown in a similar way to that in Seaman and Richardson (2001). 3. The prior distribution p(O) of 0 induces a prior distribution on the "influence function" {1(t), -c < t < 0} in the logistic case-control model in (2 -3) since 7(t) = O'(t), -c < t < 0. Proof of Theorem 3. 
Let D denotes the disease status with r + 1 categories. As before, let {X(t), -c < t < 0} be the exposure trajectory with support S = {Z(t), ..., Zj(t), -c < t < 0}, the set of all exposure trajectories. Let P(D = dlX(t) = Zk(t), -c < t < 0) = Pdk, (d = 0,1, ..., r; k = 1,..., K) and P(X(t) = Zk(t), -c < t < 0|D = 0) = k/ 11. Let ndk be the number of individuals with D = d and X(t) = Zk(t), -c < t < 0}. It can be shown that 6kPdk/POk P(X(t) = Zk(t), -c < t < OD = d) = k pk S1PdI/Po 1=1 Here By = (01, 0y2), uy = (u6i, U2)', ey = (edl, e2)', bi = (bil, bi2)', = (/01, ... /pl, 02, q2 = (711, ... 7K1i, 712 ... 7K22), x1 ... x 0 0 ... 0 0 ... 0 1 Xi2 ... XU and Z ((X Tr )P ... (X TK11)p 0 ... 0 0 ... 0 (X2 712)q ... (X2 TK22) Analogous to the univariate case, we assume bi i'nd N(0, Xo), and 7 ~ N(0, 1,). e. and u. are mutually independent with e. 'ind N(0, :y) and u- ~ind N(0, qIj). For simplification purposes, we assume that Yo = diag(o-7, o-,), and 1E = diag(o-71, o-,) where o- is assumed to be known and is estimated from the data as in the univariate framework. The above bivariate model can easily be generalized to a multivariate framework if the need arise. 4.2.2.2 Bivariate random walk model In order to model any conspicuous trend in the income observations for a specific family size and/or a specific state, we add a time specific random component to the simple bivariate model (4-2) as follows Yu = U' Z'7 + b + v + u + e = o,+ e (4-3) where 0y = U 0/ + Z'y bi + vj + uy. As in Section 3.2.2.2, we assume that (v jvj_ Ev) N(vj-_, Ev) with vo = 0. Alternatively, we may write vj = vj-_ + wj where wj /i.i.d N(0, Iv). Table 4-4. 
Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

Estimate           ARB       ASRB      AAB       ASD
GNK.TS(4,3)        -0.48%    -2.52%    1.03%     -2.01%
GNK.NTS(4,3)       -8.99%    -22.45%   -8.77%    -21.33%
BSPM(1)(4,3)       7.43%     0.00%     8.81%     -1.46%
BSPM(2)(4,3)       3.38%     15.38%    4.42%     12.61%
GNK.TS(4,5)        22.19%    30.52%    21.23%    24.79%
GNK.NTS(4,5)       0.31%     -0.18%    0.33%     -3.04%
BSPM(4,5)          13.85%    23.08%    12.74%    13.57%
GNK.TS(4,3+5)      2.94%     3.56%     2.84%     1.61%
GNK.NTS(4,3+5)     -9.36%    -17.18%   -9.56%    -17.64%
BSPM(1)(4,3+5)     8.45%     7.69%     8.90%     1.05%
BSPM(2)(4,3+5)     2.37%     7.69%     4.37%     14.54%

Now let us consider the bivariate random walk model. For the case with 4 and 3 person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3, 6), (5, 6) and (9, 1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to improve marginally over those corresponding to the non-random walk semiparametric model. When we consider the median income estimates of 4 and 5 person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. The comparison measures are significantly better than those of the CPS, the Bureau and the non-time series model of GNK. However, they fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the non-random walk model (BSPM(4, 5)). We denote this model as BRWM(4, 5). Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory.
The comparison measures were significantly better than the CPS, where W is finite if (m p 5)/2 + ck > 0, dk > 0, m/2+ cj 2 > 0 and dj > 0 for j = 1, ..., t;j / k. Combining (B-1) and (B-5), we have / < W I ... Jf {L{(Y, ,)L(ba) } L(i) ~) ()di2* (B-6) where f* = (0 3 b). Since all the components of the integrand in (B-5) have proper distributions, the above integral would be finite thus proving posterior propriety. For the random walk model, the integrand in (B-1) will have an additional likelihood term nli L(vI vjv_, oi) and a prior term 7(ao2). The derivation would then proceed exactly as above and the integrand in (B-5) will also contain these additional terms. But since both of these are proper distributions (normal and inverse gamma respectively), I will still be finite under the conditions stated in the theorem. B.2 Bivariate Small Area Model The proof of posterior propriety for the bivariate semiparametric model is outlined below. Proof of Theorem : Here, the parameter space is Q* = (0, 0, 7, b, Zo, -7, {( ,.... }). Here also, due to the same logic as in the univariate case, we just need to show I p()p(0| y, b, { l,..., J})d3 < oo / ( ,,- ,/ ,6 or, J exp( (, X.3 Z71 b),) (0 X.3 Z>7 bi) df < oo (B-7) in order to prove posterior propriety. Using the same type of algebraic manipulations as in the univariate case, the L.H.S of (B-7) can be shown to be | X -X WX/- exp W./WW (B-8) 2J /J 130 Thus, the likelihood function for the ith state, (4-4) will have an extra component corresponding to v given by L(v lvj_ v,) which has a normal distribution with mean vj_1 and covariance matrix 1,. 4.3.2 Prior Specification To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume noninformative improper uniform prior for the polynomial coefficients (or fixed effects) 3 and proper conjugate Inverse Wishart priors on the variance covariance matrices ({f1,..., q}, 01, ). 
The prior distributions are assumed to be mutually independent. We choose the inverse Wishart parameters in such a way that the priors are diffuse in nature so that inference is mainly controlled by the data distribution. Thus, we have the following priors : 3 ~ uniform(RP ++2), v ~_ IW(Sj, dj)(j = 1, ... t~, IW(S, d7), 1o IW(So, do) and I, IW(S,, d,) Here X ~ IW(A, b) denotes a inverse Wishart distribution with scale matrix A and degrees of freedom b having the expression f(X) oc IXI-(b+p+1)/2exp(-tr(AX-1)/2), p being the order of A. 4.3.3 Posterior Distribution and Inference The full posterior of the parameters given the data is obtained in the usual way by combining the likelihood and the prior distribution as follows m t p(f|Y, U, Z) oc n L(Yi, Ui, Zi|n,)7(0)7r(o)7(I ) [H (q) (4-5) i=1 j= 1 For the random walk model there will be an additional term 7r(,). By conditional independence properties, we can factorize the full posterior as [0, 3, 7, b, o {, i1, .... 't} Y, U, Z] oc [Y le][el 3, 7, b, { Wi,..., W }, X, Z] t x [bl E][7l E[/3][E0] f[5o[[L] j= 1 Our target of inference is {06,, i = 1,..., m;j = 1, ...t}, the true median income for of four-person families for all the states. Since the marginal posterior distribution To my mother and to the memory of my father comparable performances to some established methodologies specially those involving time series modeling techniques. Based on our findings in parts two and three, we come to the conclusion that semiparametric and nonparametric regression models can be a attractive alternative to the more traditional modeling frameworks specially in situations where information on different characteristics of small areas are available at multiple time points in the past. REFERENCES Agresti, A. (2002). Categorical data analysis. Wiley. Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669-679. Althman, P. (1971). 
The analysis of matched proportions. Biometrika 58, 561-576.
Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for case-controlled studies in cancer epidemiology. Statistician 42, 385-389.
Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83, 28-36.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. Bulletin of the International Statistical Institute.
Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting. Computational Statistics and Data Analysis 52, 5100-5120.
Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics 6, 314-331.
Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1. International Agency for Research on Cancer, Lyon.
Breslow, N. E., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299-307.
Breslow, N. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14-28.
Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case control studies. Journal of the American Statistical Association 90, 157-169.
Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: A prospective multicenter clinical trial. Journal of the American Medical Association 19, 1542-1547.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11, 1269-1275.
Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for experimentally uncontrolled variables. Bulletin of the International Statistical Institute 38, 97-115.

when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status. This enables us to analyze how the present disease status of a subject is influenced by his/her past exposure conditions, conditional on the current ones. Analysis is carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) algorithms. The proposed methodology is motivated by, and applied to, a case-control study of prostate cancer where longitudinal biomarker information is available for the cases and controls. The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also a state specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of household median income of small areas.
In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating for the time varying pattern in the income observations. Our estimates tend to have better performances than those provided by the Census Bureau and also have marginal regression model. But this goal cannot be achieved using a non-linear link function since it doesn't hold for the marginal covariate effects. Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let Y, and Xit respectively be the response observation and the covariate vector corresponding to the ith individual at the tth time point, i = 1, 2,..., N ; t = 1, 2,..., T. Let E(YtXit, /) be the marginal mean of Y,. It is specified as logit [E(Y tX,t,/3)] = X/3 (5-2) The above structure is the marginal regression model. Now, in order to specify the dependence among (Y,1, Y2,..., -T) the following conditional model is specified logit [E( YXit, bi)] = At + bi (5-3) where bi N(0, 0). Ai, can be computed by solving the following convolution equation P(Yt = 1)= P(Y,t Xit, bi)dF(bi) (5-4) Thus A is a function or / and 0. In this study we will be proposing a model which will marginalize over the random effects and the drop-out distribution to directly model the marginal covariate effects of interest taking into account both the serial and exchangeable dependence structure among the Yi's. Let us briefly go over the necessary notations with respect to subject i. Let Y = (Y,, Y, ..., YT) be the response vector. Let the T unique dropout times be grouped into m classes by the latent indicators Si = (Si, ..., Sim). Here S is an indicator for class j,j = 1,..., m (m < T) such that S { 1 if the ith subject is in class Otherwise. 0 otherwise. 
[Figure 2-2. Sensitivity of parameter estimates and disease probability estimates to case-deletions. Panels A to D; horizontal axis: deleted case (0 to 140); panel D shows the disease probability.]

d trajectory on the binary disease outcome. Inference on these two models will be done simultaneously and is described in Section 3. Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for modeling the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key difference between their framework and ours is that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models. Moreover, instead of a linear modeling framework, we use a combination of linear and logistic models since our response is binary.
Exposure Trajectory Model The exposure trajectory model is given by vy = Xi(ay) + e- = f(ay) + gi(ay) + e- (2-1) where e ~- N(0, o-2), f(a) is the population mean function modeling the overall PSA trend as a function of age for all the subjects while gi(a) is the subject specific deviation function reflecting the deviation of the ith subject specific profile from the mean population profile. The reason for modeling exposure as a function of age is that for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point controlling for the time with respect Thus, we have | XV Y -1X' I > I A min XyUX | =| Z> X' |-, < I(AmnZ xJ- x I =11. -x, xx/ 1-1/2 < min pxq2 ^ ^-V<(A) 2 YLXUXW Since I W 1 = J Ajk, V 1... t k=1 (m+d,-r-1) (m+d,-r-1) 1 1 2 I (A k) 2 k=1 Now, replacing (B-11) and (B-12) S< xx | .('-n)- / | XUX'y | .. (A in where T denotes "trace". Let Am"n Then, I < /1 x 2I where in the expression of I in (B-10), we have p+q+2 t r (mdj-r-1) V1 J-1 2 H (,Ajk) 2 exp -T 2( d ...d2 j1( k=1 (B-13) = Aim, / [1 ..., t]; m [1 ..., r]. i f (m +d-r-1) (m+dp-p-q-2)-r- l1 = I> XyX 2 (Ak) 2 (AIn) 2 i, {k= 1,k m} Sp q 2 (m+d -p-q-2)-r-1 = | XyX'- | n (A/,k) 2 1 |--1 2 ij {k=l,k m} and 1exp -T(V) 2 ) d exp -T (V2)] d1 t m .n dj -r-1 /2 { 2 -Td md(VJ Fr-1d-..} 2r 2-d 2 2 f= 1J7i} which is finite. Thus, in order to show posterior propriety, we have to prove that /2 < oo. 132 (B-11) (B-12) (B-14) to represent the separate effect of matching in each matched set. Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000), they based their analysis on unconditional rather than the conditional likelihood after elimination of the nuisance parameters. 
Their framework included a wide variety of links, such as complementary log links and some symmetric and skewed links, in addition to the usual logit and probit links. Recently, Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a nonparametric Dirichlet process prior on the stratum specific effects in the distribution of the exposure variable and parametric priors on all other parameters. The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states. In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al., 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and thereby to more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is influenced by past exposure conditions, conditional on the current ones. Unfortunately, rigorous statistical methods for incorporating longitudinally varying exposure information inside the case-control framework have not yet been properly developed. In this work,

Non-ignorable missingness can be handled by two distinct classes of models, viz. pattern-mixture and selection models, first formulated by Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response.
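As a brief illustration in generic notation (the symbols $y$ for the response and $r$ for the missing data indicator, and the labels below, are assumptions of this sketch and not the notation of the source), the two factorizations of the joint distribution can be written as:

```latex
\begin{align*}
\text{Pattern-mixture:}\quad f(y, r) &= f(y \mid r)\, f(r), \\
\text{Selection:}\quad       f(y, r) &= f(y)\, f(r \mid y).
\end{align*}
```

Both products describe the same joint distribution; they differ only in which conditional distribution is modeled directly.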
In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns. On the other hand, the selection modeling approach first models the hypothetical complete data, and then a model for the missing data process (conditional on the hypothetical complete data) is appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach. Suppose our study consists of N subjects, each of whom can be measured at T time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the ith subject. $D_i$ is such that

\[ D_i = \begin{cases} t & \text{if the $i$th subject drops out between the $(t-1)$th and $t$th observation times,} \\ T + 1 & \text{if the $i$th subject is a completer.} \end{cases} \]

Here we assume that a subject is first measured at baseline (t = 0). Thus, there are T unique dropout times. In the PM approach, it is assumed that subjects with different dropout times have different response distributions, i.e.

\[ f(y_i \mid D_i) \neq f(y_i) \;\Rightarrow\; f(y_i, D_i) \neq f(y_i)\, f(D_i) \qquad (5\text{-}1) \]

So, for the ith subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on $f(y) = \sum_{D} f(y \mid D) P(D)$. An important but realistic situation that may arise in longitudinal studies is that the number of unique dropout times T (equivalently, the number of times a subject is measured) may be large. As a result, the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse

[Figure 3-4 plot; horizontal axis: IRS Mean Income, 30000 to 70000]
Figure 3-4. Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.

rearrangement.
Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e beyond IRS mean 56000) possibly due to the very low density of the observations in that area. Overall, it seems that, the new knots can capture some of the underlying non-linear pattern in the dataset which the old knots failed to achieve. We also experimented by placing all the knots in the low density region (beyond IRS mean = 47000) but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.
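The effect of knot placement can be explored with a small sketch like the one below. It is a hedged illustration, assuming a truncated polynomial spline basis with columns $(x - \tau_k)_+^p$; the covariate values are hypothetical numbers on the scale of the IRS mean income axis, and the function names `equally_spaced_knots`, `truncated_power_basis` and `share_between_knots` are illustrative rather than taken from the thesis.

```python
import numpy as np

def equally_spaced_knots(x, n_knots):
    """Interior knots spread evenly over the observed range of x."""
    lo, hi = float(np.min(x)), float(np.max(x))
    return np.linspace(lo, hi, n_knots + 2)[1:-1]

def truncated_power_basis(x, knots, degree):
    """Matrix with columns (x - tau_k)_+^degree, one column per knot tau_k."""
    x = np.asarray(x, dtype=float)[:, None]
    knots = np.asarray(knots, dtype=float)[None, :]
    return np.maximum(x - knots, 0.0) ** degree

def share_between_knots(x, knots):
    """Fraction of observations lying between the first and last knot."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x >= np.min(knots)) & (x <= np.max(knots))))

# Hypothetical covariate values (not the thesis data).
x = np.array([30000, 35000, 38000, 41000, 44000,
              47000, 52000, 56000, 63000, 70000], dtype=float)
knots = equally_spaced_knots(x, 5)
Z = truncated_power_basis(x, knots, degree=1)
print(knots)
print(Z.shape, share_between_knots(x, knots))
```

Comparing `share_between_knots` for alternative knot vectors gives a rough, data-based way to see how much of the sample a given placement covers before any model is fitted.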
We have worked with 5 knots because this choice performed consistently well for both the SPM and SPRW models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports

like counties, cities and other substate areas. Due to the ten year lag in the release of successive census values, there was a large gap in information concerning fluctuations in the economic situation of the country in general and local areas in particular. The establishment of the SAIPE program has largely mitigated this issue. The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot, 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income Recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS). Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999.
This is because, the most recent census estimates correspond to the year 1999 and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 use as covariates, the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the Since logit P( S =1 D), AOk A1D,,we have, P(S = )= P(S, +...+ S = D,) P(S, + ...+ S,_ = D,) eAD O (0e)- 1 e= i(eAj- eA/jl) (5-13) 1 eAo + A1D,) + e-1 + A1D,) Now, as mentioned earlier, D, is the dropout time for the ith subject. Also, there are T unique dropout times. Let, for t = 1,2,..., T 1 if the ith subject drops out between the (t l)th and tth observation times 0 otherwise. Thus 1b, = ('il, i,' ., ... iT) = (0, 0, ..., 0) would imply that the ith subject is a complete. So, D, = t <> = 1 and D, = T+ 1 => (',i, ',, ..., ,iT) = (0, 0, ... 0). Let pt denote the probability of dropping out between times t 1 and t, t = 1, 2,... T. So, for the ith subject, the density of D, would be Multinomial i.e P(D = d) = ... ( ... r)1- d, = 1, 2,..., T+ 1 (5-14) 5.2.4 Specification of Priors We assume that the number of latent classes m follows a truncated Poisson distribution with rate parameter j, truncated at an integer between 1 and T (the number of unique dropout times) i.e p(m) oc m =0,1,...,s where 1 < s < R For the other parameters, we assume the following priors 1. Let 0 ~ Nq(30, Z/o) assuming that Vi = 1,2,..., N and t = 1,2,... T, Xit is q dimensional. 2. Let all), a(2) ..., a(m) -"d Nr((ao, Zao). where r < q since Zt C Xit Vi = 1,2,..., N and t= 1, 2, ..., T. 3. Let o1, ,..., -ld U(a, b) where 0 < a < b < oo. 117 Posterior Sampling The full posterior of the parameters is given by m q p(2Y, D, a) N x L(Yi, Di, ai, l )() ( )( ) (-) where P(1) = ( /3 ...j3p) and (1) = ( ,0 ...., r). 
The full posterior can be factorized as [Q|2Y,D,a] oc [Y|3, b, a] [f / {A[Z Di, Ai,,/,a,f ,bi,][Ail,]}dzidA] [(2) ] =1. " N q q ] H\-}[b b^[bUlj2][0(1)][ 1 ]- ]- ] 3-e] i lj O j=O where 0 is the entire parameter space. Our main parameter of interest is 0 in (2-5). Since, the marginal posterior distribution of 0 is analytically intractable, we construct an MCMC algorithm to sample from its full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin, 1992). 2.4 Bayesian Equivalence As mentioned in Section (1.2), Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model can be shown to be equivalent to that based on a retrospective one. As a result, a prospective modeling framework can be used to analyze case-control data which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section (2.2.2)) to analyze the PSA dataset. Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations are available. We use this "exposure trajectory" or "exposure profile" in analyzing the present 4. A ~ Nm(Ao, Zo) 5. (Oi1, 2, ...., (7-) ~ Dirichlet(71 r2, ..., rT) 6. 6, ..., 6 ii"d Nr(60, ) for the same reasons as in (3). 7. For the time being we keep the prior of 4, 7r(4) unspecified. Now, combining (5-10 5-14) and the priors specified above, we can write down the full posterior distribution of m and w, 7r(w, mlY, X, D) upto a constant. Thus, we can get the full conditional distribution of all the relevant parameters and proceed with sample generation using MCMC. 
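When several chains are run for a sampler of this kind, the Gelman and Rubin (1992) diagnostic mentioned above compares between-chain and within-chain variability for each scalar parameter. The following is a minimal sketch of that computation, not the code used in this work; the function name `gelman_rubin` and the simulated toy chains are assumptions made purely for illustration.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of posterior draws, one per chain.
    (Hypothetical helper; the name and interface are illustrative only.)
    """
    n = len(chains[0])                          # draws per chain
    chain_means = [mean(c) for c in chains]
    w = mean([variance(c) for c in chains])     # average within-chain variance
    b = n * variance(chain_means)               # between-chain variance
    var_plus = (n - 1) / n * w + b / n          # pooled variance estimate
    return (var_plus / w) ** 0.5


random.seed(0)

# Toy chains that all target the same distribution: the factor should be near 1.
chains_ok = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
r_ok = gelman_rubin(chains_ok)

# Toy chains stuck around different values: the factor should be well above 1.
chains_bad = [[random.gauss(2.0 * k, 1.0) for _ in range(2000)] for k in range(4)]
r_bad = gelman_rubin(chains_bad)

print(round(r_ok, 3), round(r_bad, 3))
```

A value close to 1 suggests that the chains have mixed for that parameter; a markedly larger value suggests that more iterations (or a different sampler) are needed.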
The assumption of conditional independence between Y, and Di given 5, and the covariates can be verified by performing a likelihood ratio test (Frequentist) or using Bayes factors (Bayesian). The null model is given by (5-6) and the alternative model may be written as m p g{E(Yt Yk, k < t, b,, S,, Di)} = At + SUyZ + 7itkYt-k + b f(D,) (5-15) j=1 k=1 where f(Di) maybe a smooth but unspecified function of Di. Thus, the null hypothesis of conditional independence (between Y, and Di given 5, and Xi) would be simply f(Di) = 0. The test can be carried out by first fitting the null model (??). Then, the posterior probability of class membership for each subject can be estimated by f Li(YilY{-i S, = 1, bi, & ~)p(S -= l|Di; )p(Dil,|)dF(b|S,, 2) P(5, = IDi, Yi, Xi, al) = Li(Di, Yi, CV) Li(Di, Yi, w) where w is obtained by performing a full Bayesian analysis on the full conditionals of w. The Likelihood Ratio test (LRT) is then performed by fitting models (??) and (5-14) using a weighted likelihood (the weights being the above posterior probability of class membership). An alternative way of doing the above conditional independence tests would be to use score tests based on smoothing splines as used in proportional hazards models by Lin et al. (2006). 118 Since the number of latent classes m is treated as a random variable itself, we assume a prior for m along with w. Let the priors be respectively denoted by 7(m), 7r(0), r(a), {7(o ), = 1, 2..., m)}, r(A), r(p), Tr(), and 7(6). So the full posterior of m and w is given by N m 7(w, m Y, X, D) = J Li(w Y,, X,, Di)(m)r(/3)7(a)7(A)7(p)7r()r(5){f 7(u72)} i= / 1 (5-9) We can avoid the integral (w.r.t b,) in (5-8) if we also sample the big's along with the other parameters from the full posterior (5-9). 
In that case, the full posterior may be rewritten as 7(w, mlY, X, D)= ere L*(w Yi, Xi, N m [ L*(wlYi, Xi, Di) 7(m) 7(0)X)7(a) 7(A) 7(p) 7(0)) (6){ nT7(72)} i=1 /= 1 (5-10) m Di) = Li(Yi|Y{_i,}, Sy= 1, b,, a ), )p(5y= 1|D,; A) (5-11) For the most general case, we have assumed an OPEF structure for each Y, conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution (5-12) where p = E(Ytlyt-1, Yt-2 ,.... Yit-p, bi, Sy 1) = g- Air bi p S- it, kYit-k k k= 1 116 wh M j=1 x p(Dily)p(bilSy = 1, ~72) Li(Yi|Y{-_i, SU = 1, bi, al), 0) H G'c) (I g)(1- out over what has been already done above. I will briefly go over some of the possible extensions below. These extensions are independent of the specific area or setting where they are applied i.e these equally apply to the case control and small area scenarios we have mentioned before. 5.1 Adaptive Knot Selection As mentioned before, we have used penalized splines to model the exposure and influence profiles in the case control framework and the income trajectories in the small area estimation problem. As explained in Section 1.4, selection and proper positioning of knots is a vital aspect in any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variables and that's what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks it was evident in the univariate small area problem where the original placement of the knots failed to account for the low density region of the data pattern where the non-linearity was mostly concentrated. This was probably because of the quantile dependent placement procedure of the knots. Recently, there has been some research on data-driven or "adaptive" knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being pre-specified. 
The advantage of this procedure is that fewer knots would be required, and these would be placed in "optimal" locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the "best" model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatment of this problem revolves around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include

Table 2-2. Posterior means and 95% confidence intervals of odds ratio for I = (-10, -5) for the linear influence model

| Age at Diagnosis | 50 | 60 | 70 | 80 |
|---|---|---|---|---|
| Mean | 4.99 | 3.27 | 2.22 | 1.56 |
| 95% C.I. | (1.96, 10.41) | (1.91, 5.36) | (1.67, 2.98) | (1.10, 2.29) |

2.6.3 Overall Model Comparison

For both the constant and linear influence models, we calculated the PPL criterion (described in Section 5.1) corresponding to different trajectory intervals and number of knots. These values are given in Table 2-3. The PPL values for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across all trajectory lengths. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. Further increases in the number of knots tend to worsen the fit; this agrees with the findings of Ruppert (2002).
The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. The best fitting constant influence model seem to be the one with exposure trajectory (-10, 0) and 3 knots. For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures result in an improvement of model fit. This may be indicative of the fact that past exposure observations contain significant amount of information about the current disease status. In addition, for the trajectory interval / = (-10, -5), the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions One of the major qualitative difference between the above model and our semiparametric models is that the former doesn't have a state specific random effect. In fact, it would also be interesting to compare the above model with the basic semiparametric model (SPM) with 0 knots i.e Y, = 3o + ixj + bi + u. + ey (3-8) where bi ~i.i.d N(0, o-) while ud and ed have the same distribution as above. Clearly, the only difference between (3-7) and (3-8) is that the former contains a time specific random component while the latter contains a area specific random component. Ghosh et al. (1996) showed that the estimates from the bivariate version of the GNK model (3-7) performs much better than the census bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 depicts the comparison measures corresponding to the above models. Table 3-6. 
Comparison measures for time series and other model estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| CPS | 0.0415 | 0.0027 | 1,753.33 | 5,300,023 |
| SAIPE | 0.0326 | 0.0015 | 1,423.75 | 3,134,906 |
| GNK | 0.0397 | 0.0025 | 1,709.58 | 5,229,869 |
| SPM(0) | 0.0337 | 0.0017 | 1,408.7 | 3,137,978 |
| SPM(5)* | 0.028 | 0.0012 | 1,173.71 | 2,334,379 |
| SPRWM(5)* | 0.0295 | 0.0013 | 1,256.08 | 2,747,010 |

It is clear that, although the estimates from the GNK model perform slightly better than the CPS, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area specific random effect is much more critical than a time specific random component in this situation.

we think that our models provide a satisfactory fit to the data set and also that there are no coding errors.

3.6 Discussion

The proper estimation of median household income for different small areas is one of the principal goals of the U.S. Census Bureau. These estimates are frequently used by the Federal Government for the administration and maintenance of different federal programs and also for the allotment of federal grants to local jurisdictions. Although these estimates are available annually for every state, the U.S. Census Bureau generally uses a non-longitudinal approach in their estimation procedure based on the Fay-Herriot model (Fay and Herriot, 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an "income trajectory" using penalized splines (Eilers and Marx, 1996).
We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. We have used as our covariate the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. Analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures. Our analysis has shown that information on past median income levels of different states does provide strength towards the estimation of state specific median incomes for the current period. In fact, if there is an underlying non-linear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use that in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting.

3 ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A BAYESIAN SEMIPARAMETRIC APPROACH
3.1 Introduction
3.1.1 SAIPE Program and Related Methodology
3.1.2 Related Research
3.1.3 Motivation and Overview
3.2 Model Specification
3.2.1 General Notation
3.2.2 Semiparametric Income Trajectory Models
3.2.2.1 Model I : Basic Semiparametric Model (SPM)
3.2.2.2 Model II : Semiparametric Random Walk Model (SPRWM)
3.3 Hierarchical Bayesian Inference
3.3.1 Likelihood Function
3.3.2 Prior Specification
3.3.3 Posterior Distribution and Inference
3.4 Data Analysis
3.4.1 Comparison Measures and Knot Specification
3.4.2 Computational Details
3.4.3 Analytical Results
3.4.4 Knot Realignment
3.4.5 Comparison with an Alternate Model
3.5 Model Assessment
3.6 Discussion
4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH
4.1 Introduction
4.1.1 Census Bureau Methodology
4.1.2 Related Literature
4.1.3 Motivation and Overview
4.2 Model Specification
4.2.1 Notation
4.2.2 Semiparametric Modeling Framework
4.2.2.1 Simple bivariate model
4.2.2.2 Bivariate random walk model
4.3 Hierarchical Bayesian Analysis
4.3.1 Likelihood Function
4.3.2 Prior Specification
4.3.3 Posterior Distribution and Inference
4.4 Data Analysis
4.4.1 Comparison Measures and Knot Specification
4.4.2 Computational Details
4.4.3 Analytical Results
4.5 Conclusion and Discussion

BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS
By DHIMAN BHADRA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA 2010

aim is to examine whether past exposure observations can contribute significantly towards predicting the current disease status of a subject given his/her current exposure information. In doing so, we will also test how differential lengths of the PSA trajectories affect the current probability of disease for a particular individual.
For the purpose of our analysis, we have used a linear p-spline (p = 1) with a subject specific slope parameter to model the exposure trajectory as follows K Y =/3o + 31(t + ai) + -/3,k+(td + ad 7 )+ + bi(tu + a) + e, (2-15) k=l For the prospective disease model (2-3), we considered two specific scenarios viz. constant influence, 7(t + af) = 0o and linear influence, 7(t + af) = Oo + 0l(t + af). The results for these two cases are summarized below. 2.6.1 Constant Influence Model In this parametrization, the area under the PSA process, {X,(t + af), -c t < 0} acts as the covariate and 0o signifies its effect on the disease probability. We have used different values of "c" (time, in years, by which we go back in the past to record the exposure history of a subject) to analyze the effect of differential areas under the PSA process on the current disease state. On fitting the above model, we observed that for all trajectory lengths, 0o is significant (its 95% credible interval does not contain 0). For any particular interval (i.e choice of c), the posterior means and 95% credible intervals of 00 do not change much with the number of knots (K). In addition, 0o increases as the trajectory length decreases i.e as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process but it also seems to support the well known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al., 1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval I = (-10, -5) and 3 knots in the trajectory. The posterior mean of 0o is 0.298 LIST OF FIGURES Figure page 2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age. 36 2-2 Sensitivity of /3, 0o, 1i and disease probability estimates to case-deletions. 
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column : IRS Mean Income; 2nd column : IRS Median Income).
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999.
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom.
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a χ² distribution with 9 degrees of freedom.

Nurminen, M. and Mutanen, P. (1987). Exact Bayesian analysis of two proportions. Scandinavian Journal of Statistics 14, 67-77.
O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics 60, 739-746.
Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Non-parametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265-286.
Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing covariates. Applied Statistics 49, 145-156.
Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies. Biometrika 91, 321-330.
Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case control studies. Biometrika 66, 403-411.
Rao, J. N. K. (2003). Small Area Estimation. Wiley-Interscience, New York.
Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 89, 905-916.
Robinson, G. (1991). That BLUP is a good thing : the estimation of random effects. Statistical Science 6, 15-31.
Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722-732.
Roy, J. (2003). Modeling longitudinal data with non-ignorable dropouts using a latent dropout class model. Biometrics 59, 829-836.
Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for nonignorable dropouts with many possible dropout times. Biometrics 64, 538-545.
Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130-134.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735-757.
Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 42, 205-224.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge, U.K.
Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, 384-388.

normal prior variances to make the priors diffuse in nature so that inference is mainly controlled by the data distribution.

2.3.3 Posterior Computation

Likelihood Approximation

As mentioned in Section 3.1, we have used the data augmentation algorithm of Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture-of-normals regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.
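The closeness of the logit link and a Student-t link with 8 degrees of freedom can be checked numerically. The sketch below compares the logistic distribution function with a rescaled t distribution function over a grid. It is an illustration only: the scale factor 0.634 is the value commonly quoted for this approximation and is treated here as an assumption, and the helper names (`logistic_cdf`, `t_pdf`, `t_cdf`) are hypothetical.

```python
import math

NU = 8          # degrees of freedom of the t link
SCALE = 0.634   # assumed rescaling: logistic(x) is compared with t_8(SCALE * x)


def logistic_cdf(x):
    """Distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def t_pdf(x, nu):
    """Density of the Student-t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)


def t_cdf(x, nu, steps=400):
    """Student-t distribution function via Simpson's rule on [0, |x|]."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps                        # steps must be even for Simpson's rule
    s = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = s * h / 3.0                   # integral of the density over [0, |x|]
    return 0.5 + area if x > 0 else 0.5 - area


# Largest absolute gap between the two link functions on a grid over [-6, 6].
grid = [k / 10.0 for k in range(-60, 61)]
max_diff = max(abs(logistic_cdf(x) - t_cdf(SCALE * x, NU)) for x in grid)
print(max_diff)
```

Under these assumptions the two curves differ only slightly over the whole grid, which is what makes the scale-mixture-of-normals representation of the t distribution a convenient stand-in for the logistic likelihood.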
As in Albert and Chib (1993), we introduce latent variables Z, Z,, ..., ZN such that Di = 1 if Z, > 0 and Di = 0 otherwise. Let Z, be independently distributed from a t distribution with location Hi = a O/'Mi, + b'Qif, scale parameter 1 and degrees of freedom v. Equivalently, with the introduction of the additional random variable A,, the distribution of Z, can be expressed as scale mixtures of normal distribution Zi|A, N(Hi, A, 1), A, Gamma(v/2, 2/v) where the Gamma pdf is proportional to A /2-lexp(-vAi/2). Using this approximation, we can replace the logit link by a mixture of normals and can rewrite (2 -6) as L(Yi,Di,ai|ji) oc f{p(YuS, Sa} A Ip(Z|Hi, 1/Ai) G(Ai\v/2, 2/v)dzidA j= 1 N q x p(j3(2)),o)p( (2)jaia)p(b(2) ,) 1 gP(b2 i ) i=1 j=0 where, p(Ula, b) denotes a normal density with mean a and variance b while G(Vla, b) denotes a gamma density with shape a and rate b. Moreover, S, = p-,,(a,)'3 + Cq,,(a,)'b, and A, = {/(Z, > 0)I(Di = 1)+ I(Z, < 0)I(Di = 0)}. census median (b) and the adjusted census medians (c) corresponding to four person families and the weighted average of three and five person families as covariates. The base year census median denotes the median income estimate obtained from the most recent decennial census while the adjusted census median (c) for the current year is obtained by the relation Adjusted census median (c) = PC ) x census median (b) PCI (b) Here PCI(c) and PCI(b) denotes the per capital income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. 
Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay (1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

4.1.2 Related Literature

The estimation of median incomes for small areas has received sustained attention over the years. Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure. They also performed univariate and multivariate hierarchical Bayesian analyses of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard errors and coefficients of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al. (1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median income

exposures. They achieved this by replacing the usual binomial model by a multinomial one and using an MCMC scheme to estimate the log odds ratio of disease at each category with respect to the baseline category. As in Muller and Roeder, they assumed a prospective logistic likelihood and a flexible prior for the exposure distribution and derived the implied retrospective likelihood. Muller et al. (1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously.
Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin, 1981). They assigned a Dirichlet(1, 1,..., 1) prior on the probability vector corresponding to the grid points. Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively. Diggle et al. (2000) introduced Bayesian analysis for matched case-control studies when cases are individually matched to controls. They introduced nuisance parameters

Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and age at diagnosis when m = 0.5. For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al., 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e. as past exposure observations are taken into account. However, the rate of increase is higher for lower age at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios decrease as longer exposure trajectories are considered. This may imply that for a subject with very high age at diagnosis, his/her past exposure observations may not contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model

| Age | (-3,0) | (-5,0) | (-8,0) | (-10,0) |
|---|---|---|---|---|
| 50 | 3.96 (2.10, 7.63) | 4.57 (2.32, 8.73) | 5.26 (2.41, 11.01) | 5.46 (2.46, 11.24) |
| 55 | 3.34 (2.02, 5.78) | 3.75 (2.15, 6.43) | 4.19 (2.24, 7.77) | 4.32 (2.29, 7.95) |
| 60 | 2.83 (1.92, 4.39) | 3.08 (2.00, 4.77) | 3.36 (2.08, 5.46) | 3.44 (2.11, 5.63) |
| 65 | 2.41 (1.79, 3.35) | 2.55 (1.83, 3.59) | 2.70 (1.90, 3.91) | 2.76 (1.92, 4.02) |
| 70 | 2.06 (1.62, 2.70) | 2.12 (1.64, 2.77) | 2.19 (1.68, 2.89) | 2.22 (1.69, 2.97) |
| 75 | 1.78 (1.41, 2.32) | 1.77 (1.41, 2.24) | 1.79 (1.41, 2.31) | 1.80 (1.41, 2.38) |
| 80 | 1.54 (1.16, 2.12) | 1.48 (1.13, 1.98) | 1.46 (1.11, 1.99) | 1.47 (1.09, 2.07) |

As before, we fitted the disease model on the interval I = (-10, -5). The posterior means and 95% credible intervals of the intercept and slope of the linear influence function are respectively 1.24 (0.29, 2.19) and -0.015 (-0.029, 0.003), implying that exposure observations recorded 5-10 years prior to diagnosis also have a significant effect on the current disease status.
The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.

started to receive some attention. Park and Kim (2004) were among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independent correlation structure fail in this framework.

2.1.1 Setting

Case-control study designs generally incorporate exposure information for a single time point in the past. In some situations, however, an entire exposure history may be available for the cases and controls, containing relevant exposure information collected at multiple time points in the past. However, proper and rigorous statistical methods of incorporating longitudinally varying exposure information inside the case control framework have not yet been adequately developed. This may be due to the obvious complications in properly handling a longitudinal exposure profile and integrating it into an existing case control framework. But once done, there may be significant payoffs, notably the ability to learn how the present disease status of a subject is influenced by his/her past exposure conditions, conditional on the current ones. It can also lead to valuable insights about differences in the exposure patterns between the cases and controls over a long time span.
In analyzing the effect of a longitudinally varying exposure profile on a binary outcome variable (like disease status), some of the possible challenges are : (1) The longitudinal exposure observations may be unbalanced in nature i.e the number of observations and also the observation times may differ from subject to subject; (2) The exposure trajectory may be highly nonlinear; (3) The exposure observations may be subject to considerable measurement error and (4) The effect of the exposure profile on the disease outcome may itself be complex and can even change over time. In view of the above challenges, we propose to use functional data analytic techniques, specially nonparametric regression methodology to model both the time The likelihood for the augmented model will be LA =exp(- Adk hAdk/ndk1) d=O k=l d=O k=l K r r K cx exp (- 6k 1(i +zd)dk HH( ,'/:5). ndk k=1 dl d=0 k=1 exp k d (+ dldk (6k)E0 d ndk fJ(od)1 kndk dk)n Sk=1 d=1 k=l d=l d=l k=l The posterior based on the augmented likelihood will be n) r K ( ') , 6, Idn) N LA k = d 1 7r ) (d-1 kl- (A-8) r \ Noting that /exp (k ((1 Odldk) (k)2= Oc Sd ) we have, by integrating out 6 in (A-8), -ld6k OC ( r K K r Yd0 do 7(7, dn) x n (Oddk) ndk i ddk dN wk1 k=1 d= 1 S(rd=l 1 Now, integrating out (1, ..., r,) from (A-8), we have r 0o n - Z rdrldk d= 1 (A-9) ( K K r K r '1 n, T(7q,6 n) oc exp Z j k H(k)E-O nd-1 H( )(dk) ndk f N 6kd Sk=1 k=l d=lk=l d=l k=l Next, we make the transformation 6k = Ok and o the prior distribution in (A-7) becomes K 6i having jacobian '-1. Hence = 1 126 (A-1 0) ( r ( K d=1 k=1 situation where exposure observations for cases and controls are collected at a single time point in the past. Some medical studies however have suggested that it may be worthwhile to take into account an entire exposure history, if available, in assessing the disease-exposure relationship. Case-control studies involving longitudinal exposure trajectories is a relatively unexplored area. 
At the same time, it is a promising one given the wide variety of longitudinal data analytic tools that are now available. Moreover, recent developments in the area of semiparametric and nonparametric regression analysis have added more flexibility in this direction, especially when exposure trajectories have complicated and unknown functional forms. In this work, we have applied semiparametric regression techniques in analyzing longitudinal case control studies. We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e. when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a p-spline to account for any time varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way which allowed us to consider differential lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the logit-mixture of normal approximation (Albert and Chib, 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the dataset was collected retrospectively. Analyses have been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case control

and the 95% credible interval is (0.196, 0.421).
Thus, even exposure observations recorded as far as 5-10 years prior to diagnosis seem to have a significant influence on the current disease status of a subject. We formally compared the different models using the PPL criterion in Section 2.6.3. 2.6.2 Linear Influence Model We next fitted the model permitting a linear pattern of influence of the exposure trajectory on the disease outcome. For all trajectory lengths, 0o and 01 were significant since the 95% credible intervals excluded 0. To better understand the influence of differential lengths of exposure trajectories on disease status, we calculated the odds ratios for different age at diagnosis and trajectory lengths. Suppose the exposure trajectory for the ith subject changes from {X,(t + ad), -c < t < O} to {Z,(t + ad), -c < t < 0}. The corresponding odds ratio of disease is given by P(Di = 1Z,(t+af), -c< t <) P(D X(t a),-c < t < 0) Di =0IZit a ,-ct0 P(Di = O1Xit +a ,- t0 P(Di = OlZ,(t+ af),-C < t <0) XP(D, lX,(t ad),C < t < O) = exp [ {Z,(t a d) X(t+ ad)}I (t a )dt. (2-16) Parameterizing {Z,(t + af), -c < t < 0} as p,r,(t + af)'l q+ (q,(t + a)'d,, as in (2-2), we can rewrite (2-16) as exp [( )' ( p,(t + afd),c(t + ad)'dt) (0-c xexp [(d, b,)' ( q, (t + ad r,)((dt + ad'dt . If there is an uniform increase of "m" in the trajectory i.e {Z,(t + af) X,(t + a) = m, t e [-c, 0]}, (this can also be looked upon as a vertical shift of the trajectory upwards by "m"), the above odds ratio simplifies to exp m 7(t+ af)dt = exp [cm(o0 + (af c/2)01)] (2-17) S -C (i) Assuming w = logO, the posterior density of (w, 4) is *0 A] j {exp (w+ 4'c Zj(t)W(t)dt)} p*,w, 0|1y) N p(O) H 0 -J+^ (2-11) -1 + exp w+f Z/ (t)W(t)dt J- (ii) Assuming 0 = (60, ..., 0) and j = 6j/ 6k, the posterior density of (0, 4) is k=1 J i1 Oieexp (d Zj(t)W4t))dt) p(, 0y) N p() H j Yd+ (2-12) j{1 d=o Y-exp df Zj(At)Wt)dt j1 \ -c / (iii) The marginal posterior densities of 4 obtainable from p(w, 0|y) and p(0, 0|y) are the same. 
The proofs of the above theorem are similar in nature to those in Seaman and Richardson (2004) and are given in the Appendix A. Since we have considered near uniform prior for a and our prior on 4 ensures the existence and finiteness of E(O), the conditions of Theorem 2 are essentially satisfied for our framework. Based on the above results, it can be concluded that the marginal posterior distribution of 4 the parameter of interest, will be the same regardless of whether we fit a prospective or retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general case of multicategory case control setup, i.e when there are multiple (> 2) disease states. We have the following result Theorem 3. Let, {X(t), -c < t < 0} be any exposure trajectory with support {Z1(t),..., ZK(t), -c < t < 0}, the set of all observable exposure trajectories. Let there are r + 1 disease categories. Suppose Ydk (d = 0, 1, ..., r; k = 1,..., K) be independently to diagnosis. In other words, the same exposure observation recorded at the same time relative to diagnosis for two subjects with widely different age ranges should have different significance. We represent both f(ay) and gi(ay) using p-splines as follows K f(ad) = o+ 1 ad + ... /3pay /Wa, T/)p =+ pO,(a ) + k=l M gi(ay) = bio+ bila + ...+ bqa bi,qm(ao m) =q,,(aJ)'bi, (2-2) m=l where p,,P(a [) = [1, a, ..., ay, (ad -7i)P.. (a rK)]' and Pq,,(ad) = [1, a, ..., a, (a- K1),..., (a KM)+]' are truncated polynomial basis functions of degrees p and q with knots (Tr-,..., TK) and (, ..., KM) respectively (Durban et al., 2004). Generally, M < K. Disease Model The prospective disease model is given by P(Di = l|Xi(t +a), -c < t < 0) = L(a + Xi(t + a)7(t + a)dt (2-3) where L(.) 
is the logistic distribution function, $X_i(t + a_i^d)$ is the true, error-free, unobserved subject-specific exposure profile modeled as $f(t + a_i^d) + g_i(t + a_i^d)$, while $\gamma(t + a_i^d)$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the $i$th subject. In (2-3), we use the relation $a_{ij} = t + a_i^d$ to model the exposure trajectory $X(\cdot)$ and the influence function $\gamma(\cdot)$ as functions of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. Here $c$ is the length of time by which we go back into the past to record the exposure history of the $i$th subject; e.g., $c = 8$ implies that, for the $i$th subject, the exposure observations recorded during the eight years prior to diagnosis are considered for analysis. Thus, by changing the value of $c$, the effect of differential lengths of PSA trajectories on the current disease status can be studied.

Fay, R., Nelson, C., and Litow, L. (1993). Estimation of median income of four-person families by state, in Statistical Policy Working Paper 21, Indirect Estimators in Federal Programs.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1-141.
Gelfand, A. and Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1-11.
Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398-409.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7, 457-511.
Ghosh, M. and Chen, M.-H. (2002). Bayesian inference for matched case control studies. Sankhya, B 64, 107-127.
Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association 91, 1423-1431.
Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9, 55-76.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277-284.
Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732.
Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear models: A roughness penalty approach. Chapman and Hall/CRC.
Gustafson, P., Le, N., and Valle, M. (2002). A Bayesian approach to case-control studies with errors in covariables. Biostatistics 3, 229-243.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics: The approach based on influence functions. Wiley.
Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205-224.
Heagerty, P. (1999). Marginally specified logistic normal models for longitudinal binary data. Biometrics 55, 688-698.
Heagerty, P. (2002). Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 58, 342-351.

study dealing with the association between prostate specific antigen (PSA) and prostate cancer. We analyzed our model using differential lengths of exposure trajectories. In doing so, we concluded that past exposure observations do provide significant information for predicting the current disease status of a subject. Specifically, we have shown that, across all age-at-diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones.
We also observed that, for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are the most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh, 1998). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case deletion diagnostics. Both tools suggested that our model fits the data relatively well. Some interesting extensions of our setup are possible. For richer datasets, it will be interesting to model the subject-specific deviation functions as p-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger data set, a p-spline formulation can also be used for the influence function, which may bring out any underlying non-linear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multi-category disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

yields the following well known mixed effects model representation:

$y = X\beta + Z\gamma + \epsilon$ (1-17)

where $\mathrm{Cov}(\epsilon) = \sigma_\epsilon^2 I$ and $\gamma$ and $\epsilon$ are independent. Bayesian P-splines have recently become popular because they combine the flexibility of non-parametric models and the exact inference provided by the Bayesian inferential procedure.
This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand, 2003) as shown above. This equivalence also carries over to the manner in which smoothing is done. Smoothing can be achieved by imposing penalties on the spline coefficients, 7 as shown in (1-14) or by assuming a distributional form for 7, for example 7 ~ NK(O, 721K). In the Bayesian context, priors are placed on -2 and the other parameters and usual posterior sampling is carried out. Since samples are generated from the smoothing parameter alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above. respectively) yield the same profile likelihood of 0. Thus, inferences about the parameter of interest 4 can be obtained using the prospective likelihood which has fewer nuisance parameters than the retrospective one. Proof of Theorem 2.(i) The posterior density of (0, 6, 4) is (A-4) J 1 J p(0'6, 6,y) o p(4)fj 6 i-1 If (AdJ)~exp(-A,) j 1 d Oj 1 Replacing the expression of Adj from (2-10), we have p(O, 6,41y) oc x P() {Oexp Zjy(t)M (t)dt)} y+a-1 j-1 exp ([I +exp (1 Z' Z,(t)(t))dt)) 6 Integrating out 6, from the above expression, we have p(0, |y) oc o ) F(y+j + aj) exp Z Wt) j1 1 + exp(0/ Z (t)x(t) dt) d j Now, performing the transformation from 0 to w yields expression (2-11). J (ii) First, we perform the transformation from 6 to (0, b), where = yj. Thus, j=1 6j = Ojy, j = 1,...,J. The jacobian of transformation will be J-1. 
Using this transformation in (A-4) and after some manipulation, we have J p(iO, 0, 4,\ly) oc p(a)Y+++-1-yI'+-1I ojY a'-1exp ('YJ ZJ(t)W(t)dt j 1 -c x exp[ (A-5) 123 j= 1 ,'exp Z (t) (t)dt) \ -c / Here the area specific effects vi are assumed to be independently and identically distributed (i.i.d) with mean 0 and constant variance o2, e, = kyy where k, is known and ey's are i.i.d random variables independent of v,'s with mean 0 and constant variance a Often normality of vi and e,'s are assumed. For these models, the parameters of interest are the small area means Y, or the totals Y,. Battese et al. (1988) studied the nested error regression model (1-10) in estimating the area under corn and soyabeans for counties in North-Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means. Over the years, numerous extensions have been proposed for the above modeling frameworks including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random-effects structure etc. Rao (2003) presented a nice overview of the different estimation methods while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context. A proper review of model based small area estimation will be incomplete without explaining the EBLUP, EB and HB approaches that are being widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. 
So, they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between-area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown parameters with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications. In an EB approach, the posterior distribution of the parameters of

Table 2-3. Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots

| Knots | Model | (-2,0) | (-5,0) | (-8,0) | (-10,0) | (-10,-5) |
|---|---|---|---|---|---|---|
| 0 | Constant | 47.54 | 47.02 | 47.20 | 47.65 | 47.81 |
| 0 | Linear | 43.61 | 43.32 | 43.17 | 43.33 | 43.82 |
| 1 | Constant | 46.61 | 46.65 | 46.77 | 46.57 | 45.29 |
| 1 | Linear | 42.80 | 42.83 | 42.91 | 42.90 | 42.94 |
| 2 | Constant | 45.83 | 45.50 | 45.72 | 46.32 | 44.69 |
| 2 | Linear | 43.20 | 43.01 | 42.74 | 42.66 | 43.33 |
| 3 | Constant | 45.47 | 45.23 | 45.24 | 44.82 | 45.17 |
| 3 | Linear | 43.47 | 43.05 | 42.72 | 42.73 | 43.43 |
| 4 | Constant | 45.35 | 45.67 | 45.27 | 45.31 | 45.54 |
| 4 | Linear | 43.70 | 43.13 | 42.56 | 42.61 | 43.47 |
| 5 | Constant | 46.67 | 46.06 | 45.42 | 45.75 | 46.01 |
| 5 | Linear | 43.91 | 43.20 | 43.12 | 42.93 | 43.48 |

reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (it has the lowest PPL among all the models considered).

2.6.4 Model Assessment

As mentioned before, the number of knots and the length of the exposure trajectory tend to interact in influencing the fit of the constant and linear influence models. Thus, for a fixed trajectory length, the optimal model can be selected as the one with the lowest value of the PPL criterion across all the knot choices. For the linear influence model, the lowest PPL value was recorded for $I = (-8, 0)$ and 4 knots. So, we perform our model assessment procedure on this model. For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis.
We deleted each subject (with all the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case deleted posterior means and 95% credible intervals for 31, 0o and 01. (In disease status of a subject. In the spirit of our dataset, we assume that the exposure observations are continuous. Let the exposure profile for the ith subject be X,(t) = {X,, ...,X,,, i = 1, ..., N; -c < t < 0} where X, is thejth exposure observation recorded for the ith subject. Let X = {X1, ..X,1 ...X XNv ..., Xvnn} be the set of all exposure observations. Since an exposure trajectory is composed of a finite set of exposure observations, the discretizing mechanism proposed by Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole i.e {X,(t), -c < t < 0} can be assumed to be a discrete random variable with support {Z,(t),..., Zj(t), -c < t < 0}, the set of all observable exposure trajectories where {Z(t), -c < t < 0,j = 1,...,J} is a finite collection of elements in the support of the X,'s. Let Yoj and Yj be the number of controls and cases having exposure profile {Z,(t), -c < t < 0}. We denote the "Null" or "baseline" trajectory as X(t) =0,-c < t <0}. The odds ratio of disease corresponding to Zj(t), -c < t < 0} with respect to baseline exposure is exp (/ Zj(t)7(t)dt) Assuming that a control has exposure profile {Zj(t), -c < t < 0} with probability 6/ J=16k, it can be easily shown that 6,exp Z( t) ( t)dt P(X(t) = Z(t), -c < t < 0D = 1) = Z(t)(t)d S kexp (J_ Zk(t)7(t)dt) k=-c Thus, the retrospective likelihood is Ydj i j 5exp (dJ Zj(t)7(t)dt) L(56, ) = co _j -- (2-7) d=-O 1 6Skexp(d Zk()7(t)dt) k=1 c APPENDIX C FULL CONDITIONAL DISTRIBUTIONS C.1 Semiparametric Case Control Model The full conditional distribution of the parameters for the semiparametric case control model are as follows : 1. 
[/pa, A, b, A,, a,, Y, D, a] ~ N(M3, V) where /j + N n N -1 V- ( YZZ Pp,(a,)cp, (a +.)' I AIM 'M:) , e l j1 =i=1 N n, N M3 = Vp ( Y ,,(au)(yd q(aP)b,) + AMc(Zi a bQ), Je j=1 i= 1 and Zp is the p + K + 1 order prior variance-covariance matrix of 3. 2. [Zia, 4, bi, A,, Di] ~ N(a + O'MI + b'Qi, A, ) truncated at the left (right) by 0 if Di = 1(Di = 0), i = 1,..., N. 3. [bil|, a, ,, A, ao, ao, Y, D, a] N(Mb, b)( 1,..., N) where v = (zb + a (q,(a,).q,(a,)'+ AQ,/')Q , e =1 Mb = vb V q,,(a,)(y. _.(a)') + Ai (Z a 'Mi) , and Zb is the q + M + 1 order variance-covariance matrix of b. 4. [0*| b, A, ao, Y,D,a] ~ N(Ma,, V,) where* = (a, 4)', / N -1 V. = + (1, 0'M + b ,Q,)A,(1, 0'M, + b/Q,) , i 1 N Mo* = A i (1, 'Mi + yQj)/Zi and Z,* is the r + K* + 2 order variance-covariance matrix of (a, 0). 5. [Ail, a, b,Y, D,a] ~ G( +, v + (Z, a -'M -bQ ) where 6. [(r)-11, a, b, Y, D, a] G( 1, b) j= 0 q. (i 12 135 March 2000 CPS. Bayesian techniques are used to weigh the contributions of the CPS median income estimates and the regression predictions of the median income based on their relative precision. The standard deviations of the error terms are estimated by fitting a model to the estimates of sampling error covariance matrices of the CPS median household income estimates for several years. The mean function in this model is referred to as a "generalized variance function" (Bell, 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, both in the 1989 and 1999 models. 3.1.2 Related Research Estimation of median income for small areas contributes to the policy making process of many Federal and State agencies. Before the establishment of the SAIPE program, the estimation of median income for four-person families was of general interest. The Census Bureau used the ideas suggested by Fay (1987) in this regard. 
Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four and five person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates. In general, the time series model always performed better than its non-time series counterpart. Instead of proportional odd's model, we can also assume proportional hazards model i.e log log1 -P( S =l Di )] k = A/Di, k=l ... m 1 j 1 The other option would be to assume an ordinal probit formulation for the probabilities of the latent classes given by -1 P( S = Di =Ak A1D,, k= 1,...,m- 1 The predicted probabilities obtained from the ordinal probit model are similar to those obtained from the proportional odd's model. Moreover, an advantage of the former model is that, sampling from its posterior distribution is particularly efficient. For this reason, the ordinal probit model is sometimes preferred if a Bayesian analysis needs to be performed. Lastly, the drop-out times Di are assumed to follow a multinomial distribution with mass at each possible drop-out times, parameterized by p. Here we make the important assumption that Y, is independent of Di given 5,. Our main target of inference are the covariate effects averaged over the classes i.e PM averaged over M. 
The intercept Ai, in (5-6) is determined by the following relationship between the marginal and conditional models E(Ytl|) = Z p(SilDi)P(Di) J {E(Y |tyt-_ 1.... Yt-p, bi, S)p(yt_1,.... yt-plb, S)} D S A x p(bilSi)dbi where A = {it_, ..., Yt-p}. 5.2.3 Likelihood, Priors and Posteriors Let, the set of all parameters be denoted by w = (3, a, a ,..., o-, 6). We partition the complete response data for subject i, Yf into observed (values of Yf prior to dropout) components, denoted by Yi and missing (response observations after Using the above transformation, (A-10) can be rewritten as d= Ok=1 d= lk=1 r K k= ndkK z [exp(_ )(5)Z H ) 'o1 ( i (]fio" K no(- 1) ( Hifio )k Sr K r kd 7Onk 1 \x nd d=1 k=) I k=1 d=1k= r K k n ndkK r ld=1 \k=l \1 d LR ( 1 71(k ) (A-13) Integrating out (/ from (A-11), we have From (A-9) and (A-12), it is clear that posterior inference for the parameter of interest, ir remains the same under either the prospective likelihood L, or the retrospective likelihood LR as long as the posterior is proper. It can be shown that the posterior will be proper for any proper prior for 1n if nok > 1 V k = 1,..., K. 127 an underlying non-linear relationship with the CPS median income (Figure 3-2A), and so it is more suited to a semiparametric analysis. 3.4.1 Comparison Measures and Knot Specification Our dataset originally contained the median household income of all the states of the U.S. and the District of Columbia for the years 1995-2004. However, we only used the information for the five year period 1995-1999 since our target of inference are the state specific median household incomes for 1999. We evaluated the performance of our estimates by comparing them to the corresponding census figures for 1999. This is because, in small area estimation problems, the census estimates are often treated as "gold standard" against which all other estimates are compared. 
However, such a comparison is only possible for those years which immediately precede the census year, e.g., 1969, 1979, 1989 and 1999. To check the performance of our estimates, we use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i| / c_i$
* Average Squared Relative Bias (ASRB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i|^2 / c_i^2$
* Average Absolute Bias (AAB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$
* Average Squared Deviation (ASD) = $(51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

Here $c_i$ and $e_i$ respectively denote the census and model-based estimates of median household income for the $i$th state ($i = 1, \ldots, 51$). Clearly, lower values of these measures imply a better model-based estimate. The basic structure of our models remains the same as in Section 3.2.2. We have used the truncated polynomial basis for the P-spline component in both models. Since Fig 2a doesn't indicate a high degree of non-linearity, we have restricted

a penalty function as shown in (1-14). A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots, whereas in the latter the number of knots is much smaller, resulting in more flexibility. In fact, penalized splines can be seen as a generalization of regression and smoothing splines. The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite (1-14) as

$S = \sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda\, \theta' D \theta$ (1-15)

where $\theta = (\beta', \gamma')'$, $\beta =$
( 1 3p )',7 (71, 72 ..., 7K)' and D is a known positive semi-definite penalty matrix such that D 0(p+l)x(p+l) 0(p+l)x(K) 0(K)x(p+l) K1 Different types of penalties can be accommodated by specifying different forms of D. For example, the penalty / f (2(x 1, 7) used for smoothing splines can be achieved with D being the sample second moment matrix of the second derivatives of the spline basis functions. However, the above form of D only penalizes the spline coefficients (71 ..., 7K). Specifically, the penalty in (1-14) corresponds to setting -: = I. Let X be the matrix with the ith row Xi = (1, xi, ..., x) and Z be the matrix with the ith row Zi = {(xi Ti), ..., (x, Ti)P). Using this formulation in (1-15) with the basis function in (1-12) and dividing by the error variance ao, we have 1 S y = I- XP- z112 + 11712 (1-16) oe oe By assuming that is a vector of random effects with Cov(-) = o-I where o2 = o. /A while 0 as the set of fixed effects parameters, the above penalized spline framework APPENDIX B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS B.1 Univariate Small Area Model The proof of posterior propriety for the basic univariate semiparametric model (Model I) is outlined below. The necessary changes to the proof for the random walk model are mentioned at the end. Proof of Theorem : The basic parameter space is Q = (0, 0, 7, b, o-, o {,...,) where 0 = (0'i,..., 0')' and b = (bl,..., b,)'. Let S= ... / p(|Y, X, Z)dQ = ... {L(Yi j i)L(0i j,-7, bi, d Xi, Zi)L(bi b )I L(, i7 (0)7(07b) ( 7) 7 ( j)df i=l j=1 (B-1) We have to show that / < M where M is any finite positive constant. Integrating first w.r.t 3, we have / = w(/3) [L(O /3, b ,2, Xi, Zi)d/ = exp[- (Oi- Xi/3 Z -- bil)'W- (Oi- Xi3 Zi7 bil)] d i = X we Xi exp W'-Wi + (B-2) where Q = wlx ) xl 1X ) X(Z Xl -~ W Wi = 0 Z,7 b1l and V-1 = diag(b-2, .b-2 .. 2). Now, W -1'Wi, = W '-1/2-1/2Wi = S'S, where Si = X-1/2W,. Similarly, W/ -1Xi = S'Ti, X'V-JWi = TtSi and X'V- Xi = T'T, where T, = X-1/2X,. 128 7. 
[(,7a)-11,, ,b,Y,D,a] G nI i= 1 q,,(ay)'bi) . N ni 1, (yui p,,(aij)' i= 1 j=1 8. [(,72)-110, a, (K + 1 K ) 8. [(V )- |4, b, Y, D, a] G 2-+1, Yr * k=1 9_ 1M N q+M 2 9. [()- bY, D,a]G -+1,+ Y b . C.2 Semiparametric Small Area Models i=1 j= q1 K' 10. [(r )-ll|, ,0, bY, D,a] G +-1,2-Y k=1 Here, G(x, y) denotes a Gamma density with shape parameter x and rate parameter y respectively. C.2 Semiparametric Small Area Models C.2.1 Semiparametric Univariate Small Area Model The full conditional distributions of the parameters for the univariate semiparametric small area model are as follows : 1. [O, 3 2,7 2,b, X, Z] ~ N(Mb, V) where --( --1 1a 1 1 Y (X -- Z -- hi b ) V = + and Mo + 6 6 2. [bi /3, b, N(, 2a, X, Z] N(Mb, Vb) where + = and M = (0 X'..- Z' .) . b j= 1 j= 1 j= 1 3. [3|7, 0, b, b2, X, Z] ~ N(M3, V3) where V,3= ( and M,3= (- (m X-.)) i 1= 1 i=i 1 = 1 i= 1 j= 1 4. [y|/, b, b,2, 2 X, Z] ~ N(M7, V.) where V= --+'/) and M. = -( ,Z 11 d-). 136 where Q = ( WId'X X ( XyWX. 'x ( XyJ--i VW, and W, = O, Z - ij ij j bi. As before, the expression within the exponent in (B-8) can be rewritten as K* = -\ ( STS-( U) (5 TTU) (5T/Su) iJ j j i,j S S' [I- T(T'T)-T']S < 0. 2 where S = (S' ..., S')', T = (T ,..., T')', S, = V1/2W, and T, = V/2X,. Thus, exp- W.' W -1 + < 1 (B-9) /,J So, in order to prove posterior propriety, we have to show ./ /. t (m d r 1) [ V trace J /)-1-l / = ... I| Xolx. -1/2 1 f'- 1 2 exp -trace 2 d 1...d 1 ij j=j 1 < oo (B-10) Here r is the order of j,j = 1, 2,..., t. (r = 2 in our case). Let A1, Aj2,..., Ajr be the distinct eigen values of .l,j = 1, 2,..., t. Since ,j is a variance covariance matrix, it is positive definite and symmetric. Hence, W-1 also has the same properties. Thus, Ajk > 0, Vk = 1, 2,... r. Now, Vj = 1,2, ..., r, 1JX 1 > Aj- YIr where Aj'" = min(A, A2..., Ar). y XuW JfX' > Y A7in X isrXje yx,,J u l'X. > Am/in5X .X where Am1n = min(A'. SX,,-,'X Amin Y XX is non-negative definite. 
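The eigenvalue bound invoked at the end of this step (that $X'PX - \lambda_{\min}(P)\,X'X$ is non-negative definite for a symmetric positive definite $P$) can be checked numerically. The following is a minimal sketch for the $r = 2$ case only; the matrices `P` and `X` and the helper names are hypothetical stand-ins chosen for illustration, not quantities from the dissertation.

```python
import math

def min_eig_sym2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    return (a + d) / 2 - math.sqrt(((a - d) / 2) ** 2 + b ** 2)

def quad_form(X, P):
    """Return X' P X for 2x2 matrices stored as nested lists."""
    PX = [[sum(P[i][k] * X[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(X[k][i] * PX[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Hypothetical inputs: P plays the role of an inverse covariance matrix
# (symmetric positive definite, r = 2) and X the role of one design block.
P = [[2.0, 0.5], [0.5, 1.0]]
X = [[1.0, 3.0], [1.0, -2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

lam_min = min_eig_sym2(P)          # smallest eigenvalue of P
XtPX = quad_form(X, P)             # X' P X
XtX = quad_form(X, identity)       # X' X
gap = [[XtPX[i][j] - lam_min * XtX[i][j] for j in range(2)] for i in range(2)]
gap_min_eig = min_eig_sym2(gap)    # non-negative up to rounding error
```

Because $P - \lambda_{\min}(P) I$ is itself non-negative definite, the smallest eigenvalue of `gap` should be zero or positive apart from floating-point rounding.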
step function, a spline of degree 1 is a piecewise linear function and so on. For example, $f(\cdot)$ can be represented as a linear combination of a $p$th degree truncated polynomial basis having $K$ knots, given by

$1, x, \ldots, x^p, (x - \tau_1)_+^p, \ldots, (x - \tau_K)_+^p$. (1-12)

Here $(x - \tau_k)_+^p$ is the function $(x - \tau_k)^p I\{x > \tau_k\}$. Using the above basis, a spline of degree $p$ can be expressed as

$f(x \mid \beta, \gamma) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x - \tau_k)_+^p$ (1-13)

Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. The values $p = 1, 2, 3$ correspond to a linear, quadratic and cubic spline respectively. The above basis is one of the most commonly used, while other bases like radial bases or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases. One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as "sensors" in relaying information about the underlying "true" observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting through overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be spread evenly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of $x$, and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert, 2002). Recently, there have been interesting contributions on knot

four, three and five person families for the $i$th state and the $j$th year.
$Y_{iju}$ is assumed to estimate the true unknown median income $\theta_{iju}$ ($u = 1, 2, 3$). The corresponding adjusted census medians are denoted by $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$. The years correspond to 1979, ..., 1989. For the univariate setup, the response and covariate are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a pair whose first component is $Y_{ij1}$ and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75\,Y_{ij2} + 0.25\,Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our target of inference is the state-specific median income of four-person families for 1989.

4.4.1 Comparison Measures and Knot Specification

In this study, our target of inference is the state-specific median income of four-person families for the year 1989. We judged our estimates by comparing them to the corresponding census figures for 1989. In small area estimation problems, the census estimates are often treated as the "gold standard" against which all other estimates are compared. However, such a comparison is only possible for those years which immediately precede the census year, i.e., 1969, 1979, 1989 and 1999. To check the performance of our estimates, we use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

* Average Relative Bias (ARB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i| / c_i$
* Average Squared Relative Bias (ASRB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i|^2 / c_i^2$
* Average Absolute Bias (AAB) = $(51)^{-1} \sum_{i=1}^{51} |c_i - e_i|$
* Average Squared Deviation (ASD) = $(51)^{-1} \sum_{i=1}^{51} (c_i - e_i)^2$

these figures, the solid and dashed horizontal lines respectively indicate the estimated posterior mean and 95% credible intervals of the respective parameters based on the full data posterior.
The solid points denote the importance weighted case-deleted posterior mean while the vertical line segments are the 95% case-deleted posterior intervals). None of the subjects seems to be very influential on the parameter estimates. For every subject, we also looked at the difference in the predicted probability of disease based on the full data and with that subject deleted. Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences). Surprisingly, the observation for case number 108 departs significantly from the rest. On analyzing this subject, it was found that it had the unique combination of very high age and very high values of PSA. In fact, it had the highest mean age in the sample, the highest age at diagnosis and the third highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease. We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components. None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory I = (-8, 0) and 4 knots seems to fit the observed data relatively well.

2.7 Conclusion and Discussion

Case control studies have witnessed a wide variety of research over the years. Fundamental and far reaching contributions have been made both in the frequentist and Bayesian domains.
Generally, the bulk of research has dealt with the standard

the comparison measures for the raw CPS estimates, SAIPE estimates and the semiparametric estimates with the knot realignment while Table 3-3 depicts the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5)* and SPRWM(5)* respectively denote the semiparametric models with the realigned 5 knots.

Table 3-2. Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment

Estimate     ARB      ASRB     AAB        ASD
CPS          0.0415   0.0027   1,753.33   5,300,023
SAIPE        0.0326   0.0015   1,423.75   3,134,906
SPM(5)*      0.028    0.0012   1,173.71   2,334,379
SPRWM(5)*    0.0295   0.0013   1,256.08   2,747,010

Table 3-3. Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates

Estimate   Model        ARB      ASRB     AAB      ASD
SAIPE      SPM(5)*      14.11%   20.00%   17.56%   25.54%
SAIPE      SPRWM(5)*    9.51%    13.33%   11.78%   12.37%
CPS        SPM(5)*      32.53%   55.55%   33.06%   55.96%
CPS        SPRWM(5)*    28.92%   51.85%   28.36%   48.17%

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that, with the realigned knots, the semiparametric model estimates perform better than the SAIPE estimates. This improvement is apparently due to the additional coverage of the observational pattern that is achieved by relocating the knots. As a result of this increased coverage, a larger proportion of the underlying nonlinear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment

Table 3-1.
Parameter estimates of SPRWM with 5 knots

Parameter    Mean      Median    95% CI
$\beta_0$    4677.71   4660.08   (4633.31, 4758.7)
$\beta_1$    0.8156    0.816     (0.814, 0.817)
$\gamma_1$   -0.154    -0.154    (-0.158, -0.149)
$\gamma_2$   0.02      0.024     (-0.016, 0.040)
$\gamma_3$   -0.008    -0.016    (-0.056, 0.066)
$\gamma_4$   -0.093    -0.119    (-0.127, -0.037)
$\gamma_5$   -0.165    -0.173    (-0.187, -0.139)

3.4.4 Knot Realignment

As mentioned in Section 3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the fact that the semiparametric models have room for improvement. In this section, we will look for any possible deficiencies in our models and will try to come up with some improvements, if there are any. As mentioned in Section 3.4.1, the selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern. Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure.
Thus, in both the figures, the coverage area of knots (i.e. the part of the observational pattern which is captured by the knots) is the

CHAPTER 3
ESTIMATION OF MEDIAN HOUSEHOLD INCOMES OF SMALL AREAS : A BAYESIAN SEMIPARAMETRIC APPROACH

3.1 Introduction

Sample survey methodologies are widely used for collecting relevant information about a population of interest over time. Apart from providing population level estimates, surveys are also designed to estimate various features of subpopulations or domains. Domains may be geographic areas like a state or province, county, school district etc. or can even be identified by a particular socio-demographic characteristic like a specific age-sex group. Sometimes, the domain-specific sample size may be too small to yield direct estimates of adequate precision. This led to the development of small area estimation procedures which specifically deal with the estimation of various features of small domains. Generally, observations on various characteristics of small areas are collected over time, and thus may possess a complicated underlying time-varying pattern. It is likely that models which exploit the time varying pattern in the observations may perform better than classical small area models which do not utilize this feature. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal pattern in the response and the covariates.

3.1.1 SAIPE Program and Related Methodology

The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions.
There are also many state and local programs that depend on these estimates. Prior to the creation of the SAIPE program, the decennial census was the only source of income and poverty statistics for households, families and individuals related to small geographic areas

quality (in terms of their "closeness" to the census estimates) of the estimates tended to improve as the knots were positioned more uniformly throughout the range of the independent variable. It became apparent that the contribution of the knots towards deciphering the underlying observational pattern improved substantially when they were properly placed with an optimal coverage area. This in turn improved the approximation of the curve to the true unknown observational pattern. This proved interesting because, to date, there is no absolute rule which controls the positioning of knots. Our final estimates proved to be superior, not only to the raw CPS estimates, but also to the current U.S. Census Bureau (SAIPE) estimates. Although the basic semiparametric model performed much better than the semiparametric random walk model with 5 knots, more experiments need to be done with different knot positions and numbers before anything conclusive can be said about their relative performance as a whole. But it seems that, if adequate knots are used and if they are placed uniformly throughout the range of the independent variable, then a random walk component may not improve the fit any further, provided there is no strong trend in the income levels. The main advantage of our modeling procedure is that it can be used for any possible pattern in the response (income, poverty etc.) observations of small areas. In a subsequent work related to the estimation of median incomes of 4-person families, we have shown that the multivariate version of the basic semiparametric model performs quite well too and provides estimates which are consistently superior to the U.S. Census Bureau estimates.
The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the dataset. Some obvious extensions are given as follows: (1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the ith state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible

LIST OF TABLES

Table
1-1 A typical 2 x 2 table
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model
2-2 Posterior means and 95% confidence intervals of odds ratio for I = (-10, -5) for the linear influence model
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
3-1 Parameter estimates of SPRWM with 5 knots
3-2 Comparison measures for SPM(5)* and SPRWM(5)* estimates with knot realignment
3-3 Percentage improvements of SPM(5)* and SPRWM(5)* estimates over SAIPE and CPS estimates
3-4 Parameter estimates of SPM(5)*
3-5 Parameter estimates of SPRWM(5)*
3-6 Comparison measures for time series and other model estimates
4-1 Comparison measures for univariate estimates
4-2 Percentage improvements of univariate estimates over Census Bureau estimates
4-3 Comparison measures for bivariate non-random walk estimates
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
4-5 Comparison measures for bivariate random walk model
CHAPTER 4
ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES : A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH

4.1 Introduction

Small area estimation techniques have been widely used for estimating various features of small domains, i.e. domains for which the sample size is prohibitively small for the application of direct survey based estimation procedures. Small domains can be specific regions like a state, county or school district or can even be identified by a particular socio-demographic characteristic like a specific ethnic group. The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low income families. Since income characteristics for small areas are generally collected over time, there may well be a time varying pattern in those observations. Neglecting those patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median income of four-person families for the different states of the U.S. while explicitly accommodating the time varying pattern in the observations.

4.1.1 Census Bureau Methodology

The estimation of median incomes for different family sizes used to be carried out by the U.S. Census Bureau until a few years ago. More recently, they have established the Small Area Income and Poverty Estimates (SAIPE) program which exclusively deals with the estimation of median household income and poverty estimates for small areas across the United States.
But the estimation of the median income of four-person

variables is too complex to be expressible using a known functional form. One of the main differences between parametric and nonparametric regression methodologies is that, in the former, the true shape of the functional pattern is determined by the model while in the latter, the shape is determined by the data itself. Suppose the response y and the covariate x are related as

$$y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n, \qquad (1\text{-}11)$$

where f(x) is an unknown and unspecified smooth function of x and $e_i \sim N(0, \sigma^2)$. The basic problem of "nonparametric regression" is to estimate the function f(.) using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate f(.), viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions while smoothing splines may be better for sparse regions. Here, we will briefly review the main characteristics of splines in general and penalized splines in particular. The basic idea behind splines is to express the unknown function f(x) using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of x known as "knots". The knots, say $(\tau_1, \ldots, \tau_K)$, partition the range of x into K distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of certain degree is defined.
A polynomial spline of degree p has (p - 1) continuous derivatives and a discontinuous pth derivative at any interior knot. The pth derivative reflects the "jump" of the spline at the knots. Thus, a spline of degree 0 is a

the jth year. In that case, we may be interested in estimating $(\theta_{1u1}, \ldots, \theta_{mu1})'$, the median income of four-person families for all the states at time u. We may also want to estimate the difference in median incomes of four-person families at times v and u, i.e. $(\theta_{1v1} - \theta_{1u1}, \ldots, \theta_{mv1} - \theta_{mu1})'$. Correspondingly, let $X_{ij} = (X_{ij1}, \ldots, X_{ijs})'$ be the predictors corresponding to the ith state and jth year.

4.2.2 Semiparametric Modeling Framework

We consider both univariate and bivariate income trajectory models for the family-size dataset. The univariate modeling framework is exactly the same as explained in Chapter 3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2.

4.2.2.1 Simple bivariate model

The bivariate non-random walk model is given by

$$Y_{ij1} = \beta_{01} + \beta_{11} X_{ij1} + \ldots + \beta_{p1} X_{ij1}^p + \sum_{k=1}^{K_1} \gamma_{k1} (X_{ij1} - \tau_{k1})_+^p + b_{i1} + u_{ij1} + e_{ij1}$$
$$Y_{ij2} = \beta_{02} + \beta_{12} X_{ij2} + \ldots + \beta_{q2} X_{ij2}^q + \sum_{k=1}^{K_2} \gamma_{k2} (X_{ij2} - \tau_{k2})_+^q + b_{i2} + u_{ij2} + e_{ij2} \qquad (4\text{-}1)$$

This is the most general structure since the degrees of the spline as well as the number and position of the knots are different for the two models. If, for i = 1, 2, ..., m; j = 1, 2, ..., t, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume p = q and $\tau_{k1} = \tau_{k2}$, k = 1, 2, ..., $K_1$ (= $K_2$). Equation (4-1) can be rewritten as

$$Y_{ij} = U_{ij}\beta + Z_{ij}\gamma + b_i + u_{ij} + e_{ij} = \theta_{ij} + e_{ij}, \qquad (4\text{-}2)$$

where $\theta_{ij} = U_{ij}\beta + Z_{ij}\gamma + b_i + u_{ij}$.

2.3 Posterior Inference

2.3.1 Likelihood Function

Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ and $D_i$ be the exposure vector and disease status, while $a_i = (a_{i1}, \ldots, a_{in_i})'$ and $t_i = (t_{i1}, \ldots, t_{in_i})'$ are the observed values of age and time with respect to diagnosis for the ith subject respectively.
So, the response vector for the ith subject will be the pair (Yi, Di). Let 02 = (c, 3, 2fa, b, ,, ac, a e ..., C}) be the parameter space corresponding to the ith subject. Thus, the full parameter space will be given by 0 = E2 u ,2 U ... U vN. The likelihood for to the ith subject, conditional on the random effects is given by L(Y,, Di, ail i) oc p(Yi,|, a,, bi, 0 )p(Dia, /, 4P )p(3(2)l )p( (2) ) N q xp(b(2) o) l p(b 1iJ2) (2-6) il j 0 where p(Yi,|/, a,, bi, o- ) is the probability distribution corresponding to the trajectory model, p(Dla, 0, 4) denotes the logistic distribution corresponding to the disease model while the rest deals with the distributional structures on the spline coefficients and random effects. Since the trajectory model (2-1) has a normal distributional structure while the disease model (2-3) has a logistic structure, the likelihood function and hence the posterior have a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 3.3. 2.3.2 Priors To complete the Bayesian specification of our model, we need to assign prior distributions to the unknown parameters. We assume diffuse normal priors for the polynomial coefficients (/3, ..., /p) and (a, o,..., ,). For the variance components (o, ao, oa o {o, ..., ao}), we assume uniform priors with large upper bounds. The prior distributions are assumed to be mutually independent. We choose large values for the is necessary, as we will see, in order to account for the three types of dependencies mentioned above. We assume that Y,, conditional on the random effects bi and latent class 5,, are from an exponential family with distribution f( Yt Ik, k < t, bi, Si) = exp [{ t it (Tit)}/(mi) + h( t, ( )] where E(Y,t Yk, k < t, b,, Si) = g-l(it) = '(lit). 
Here Tlit is the linear predictor, b() is a known function, 0 is a scale parameter and m, is the prior weight. We next specify the conditional model as m p g{E(Yt Yk, k < t, bi, Si)} = Ait + SuZ() + 7t,kyit-k + b (5-6) j=1 k=l where, in the most general case, [bi, Sy = 1, X] ~ N(0, o(7(Xi)) and 7it,k(S = 1) = V'tkk forj = 1, 2,..., m and k = 1, 2,..., p, where Vi, and Zit are both subsets of Xt. Thus, the variance of bi may depend on the latent class and the covariate vector for the ith subject. Moreover, 62, 6, .... 6) determines how the dependence between Y, and Yt-k varies as a function of the covariates Vit,k conditional on the latent classes. We also make the sum-to-zero constraint i.e at = Y1 a for the purpose of identifiability. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate, is allowed to differ by dropout class via the regression coefficients, aO). The probabilities of the latent classes given the drop-out times are specified as proportional odd's model (Agresti, 2002) given by logit P Su= = oDi /k A1Di, k = 1..., m 1. (5-7) j= 1 where Ao,1 < Ao,2 < ... < AO,M-1 and A1 are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale). 113 effects assumed to be independent and identically distributed with mean 0 and constant variance a Lastly, big's are known positive constants and 3 = (/1, ..., 3p)' is the vector of regression coefficients. In order to infer about the small area means, Y,, direct estimators, Y, are assumed to be known and available. The linear model Oi =g( ) = Oi + e, i = 1,... m (1-8) is assumed where the sampling errors, e, are independent with Ep(eii0) = 0, Vp(eii0) = i,, i, known which implies that 0, are design-unbiased. 
By setting $\sigma_v^2 = 0$ in (1-7), we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining (1-7) and (1-8), we have

$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i, \qquad (1\text{-}9)$$

which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent. Fay and Herriot (1979) studied the above area level model (1-9) in the context of estimating the per capita income (PCI) for small places in the United States and proposed an Empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model based county estimates of poor school age children in the United States. In the unit level model, it is assumed that unit specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element j in each small area i. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, N_i. \qquad (1\text{-}10)$$

Henderson, C. (1950). Estimation of genetic parameters (abstract). Annals of Mathematical Statistics 21, 309-310.
Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16, 239-257.
Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics : Handling drop-out in longitudinal studies. Statistics in Medicine 23, 1455-1497.
Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test 15, 1-96.
Johnson, V. (2004). A Bayesian $\chi^2$ test for goodness-of-fit. Annals of Statistics 32, 2361-2384.
Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The increased risk of venous thromboembolism and the use of third generation progestagens : Role of bias in observational research.
Contraception 54, 5-13. Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for proportional hazards models. Biometrics 62, 803-812. Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of Computa- tional and Graphical Statistics 8, 333-352. Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic regression with missing covariates. Biometrics 54, 295-303. Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons. MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 2, 223-238. Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719-748. Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7, 1223 1230. Morris, C. (1983). Parametric empirical Bayes inference : theory and applications. Journal of the American Statistical Association 78, 47-54. Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55, 858-866. Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika 84, 523-537. 142 Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines. 
In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the north eastern states of the U.S.

3.1.3 Motivation and Overview

The motivation of our work also originates from the repetitive nature of the CPS median income estimates. But, in contrast to the approach of Ghosh et al. (1996), we have viewed the state specific annual household median income values as longitudinal profiles or "income trajectories". This gained more ground because we used the state wide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying non-linear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx, 1996), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is
[Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. Panel A: IRS mean income plot. Panel B: IRS median income plot.]

analysis with regard to the median household income dataset. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness-of-fit of our models. We end with a discussion in Section 3.6. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

3.2 Model Specification

3.2.1 General Notation

Let $Y_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the ith small area at the jth time (i = 1, 2, ..., m; j = 1, 2, ..., t). The target of inference is generally $\theta_{ij}$ or some function of it. Specifically, in our context, $\theta_{ij}$ denotes the median household income of the ith state at the jth year. We are interested in estimating $(\theta_{1u}, \ldots, \theta_{mu})'$, i.e. the median household income for all the states at time u. We may also want to estimate the difference in median household incomes at times v and u, i.e. $(\theta_{1v} - \theta_{1u}, \ldots, \theta_{mv} - \theta_{mu})'$. We denote by $X_{ij}$ the covariate corresponding to the ith state and jth year.

distributed as Poisson(Adk) where log(Adk) = Ig(d) + og(rdk) + log(6k), log(Aok) = Iog(k). 0d being the baseline odds for disease category d and rl being the parameter of interest. Assume independent improper priors, r (Od) oc 1d, Tr(6k) oc 61 for 0 and 6 and a prior 7r(rl) forrl that is independent of 6 and 6 and proper i.e E(rl) exists and is finite. Let ndk be the number of individuals with D = d and {X(t) = Zk(t), -c < t < 0}.
Then the following two statements holds (i) The posterior density of (rl, 6) is r K K r =o ndk 7(7, VIn) x nn( d)"dk)ndk d+ nldk 1 7r(77) d lk 1 k=1 d=1 d=1 K (ii) Assuming 0 = (6, ..., OK) and Ok = k/ 6 6, the posterior density of (0, r1) is = 1 ndk o(, 0|n) N K nOk r kdk K ) -1 k=1 d=1 k=1 /d k=1 (iii) The marginal posterior densities of rl obtainable from w(1r, O n) and (rl1, 0 n) are the same. The proof of the above theorem is given in Appendix A. 2.5 Model Comparison and Assessment 2.5.1 Posterior Predictive Loss We performed model comparison using the posterior predictive loss (PPL) criterion proposed by Gelfand and Ghosh (1998). This criterion is based on the idea that an optimal model should provide accurate prediction of a replicate of the observed data. varying exposure profile and also the influence pattern of the exposure profile on the binary outcome. Specifically, we model the underlying exposure trajectory using penalized splines or p-splines (Eilers and Marx (1996); Ruppert et al. (2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible non-linear time varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting. In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as that obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. 
Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively. We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and [...]

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
BAYESIAN SEMIPARAMETRIC REGRESSION AND RELATED APPLICATIONS
By Dhiman Bhadra
August 2010
Chair: Malay Ghosh
Cochair: Michael J. Daniels
Major: Statistics

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom. Panel A: positioning of 5 knots. Panel B: positioning of 7 knots. Horizontal axis: IRS mean income.

[...] region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low-density area of the plot, i.e. the region lying to the right of IRS mean = 50000.
Evidently, none of the knots lies in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured. As a natural solution to this issue, we decided to place half of the knots in the low-density region of the graph and the other half in the high-density region. The exact boundary between the high-density and low-density regions is hard to determine. We tested different alternatives and settled on IRS mean = 47000 as a tentative boundary because it gave the best results. In both regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots. It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot [...]

[...] the appendix). Markov chain Monte Carlo methodology, specifically Gibbs sampling (Gelfand and Smith, 1990), has been used to obtain the parameter estimates. We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states.
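The knot realignment described above (half of the knots on each side of a boundary such as IRS mean = 47000, each half placed at equally spaced sample quantiles) can be sketched as follows. This is only an illustration under stated assumptions: the function names `quantile_knots` and `realigned_knots` are hypothetical, the text does not say which region receives the extra knot when the number of knots is odd (here it is given to the upper region), and the quantile rule used by the author is not specified.

```python
import statistics


def quantile_knots(x, n_knots):
    """Return n_knots knots at equally spaced sample quantiles of x."""
    # n = n_knots + 1 intervals gives exactly n_knots interior cut points;
    # method="inclusive" keeps every knot inside the observed range.
    return statistics.quantiles(x, n=n_knots + 1, method="inclusive")


def realigned_knots(x, n_knots, boundary):
    """Place about half of the knots on each side of `boundary`."""
    low = [v for v in x if v <= boundary]
    high = [v for v in x if v > boundary]
    # Assumption: with an odd number of knots, the extra one goes to the
    # upper region, where the non-linear pattern is reported.
    n_low = n_knots // 2
    n_high = n_knots - n_low
    return quantile_knots(low, n_low) + quantile_knots(high, n_high)
```

Calling `realigned_knots(irs_mean, 5, 47000.0)` on a list of IRS mean incomes would return two knots at or below the boundary and three above it.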
Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time. The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some references towards future work in Section 4.5. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions for our models.

4.2 Model Specification
4.2.1 Notation
Let $Y_{ij} = (Y_{ij1}, \ldots, Y_{ijs})'$ be the sample survey estimators of some characteristics $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijs})'$ for the $i$th small area at the $j$th time ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$). In this study, we are concerned with the estimation of $\theta_{ij}$ or some function of it. For example, $\theta_{ij1}$ may be the median income of four-person families for the $i$th state at [...]

[...] ourselves to a linear spline ($p = 1$). The selection of knots is always a subjective and tricky issue in this kind of problem. Sometimes experience with the subject matter may be a guiding force in placing the knots at the "optimum" locations where a sharp change in the curve pattern can be expected. Too few or too many knots generally worsen the fit. If too few knots are used, the complete underlying pattern may not be captured properly, resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may even degrade its quality (Ruppert, 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions.
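To make the linear spline ($p = 1$) with knots concrete, the sketch below builds a design matrix from the truncated-line basis $1, x, (x - \kappa_1)_+, \ldots, (x - \kappa_K)_+$. It is an illustration only: the truncated-line basis is the standard choice in Ruppert et al. (2003), but the excerpt does not confirm that the author used exactly this form, and the names `linear_spline_design` and `spline_value` are hypothetical.

```python
def linear_spline_design(x, knots):
    """Rows [1, x_i, (x_i - k_1)_+, ..., (x_i - k_K)_+] for a linear spline."""
    return [[1.0, float(xi)] + [max(xi - k, 0.0) for k in knots] for xi in x]


def spline_value(row, coefs):
    """Evaluate the spline at one design row: inner product with the coefficients."""
    return sum(r * c for r, c in zip(row, coefs))
```

Each knot adds one column that is zero to the left of the knot and rises linearly to the right, so the fitted curve can change slope at every knot. Too few columns cannot follow the pattern, while penalizing the knot coefficients keeps a larger basis from overfitting.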
Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

3.4.2 Computational Details
We implemented and monitored the convergence of the Gibbs sampler following the general guidelines given in Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. Once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.

4.3 Hierarchical Bayesian Analysis
In this section, the notation and expressions correspond to the bivariate setup. The expressions for the univariate setup are analogous and are given in detail in Chapter 3.

4.3.1 Likelihood Function
Let $Y_i = (Y_{i1}, \ldots, Y_{it})'$ be the response and $U_i = (U_{i1}, \ldots, U_{it})'$ and $Z_i = (Z_{i1}, \ldots, Z_{it})'$ be the covariate vectors corresponding to the $i$th state. Here, $Y_{ij} = (Y_{ij1}, Y_{ij2})'$ and the expressions for $U_{ij}$ and $Z_{ij}$ are given above. Let $\Omega_i = (\theta_i, \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma)$ be the parameter space corresponding to the $i$th state, where $\theta_i = (\theta_{i1}, \ldots, \theta_{it})'$. Thus, the full parameter space is given by $\Omega = \Omega_1 \times \cdots \times \Omega_m$. For the bivariate non-random walk model, the likelihood function for the $i$th state is given by

$$L(Y_i, U_i, Z_i \mid \Omega_i) \propto L(Y_i \mid \theta_i)\, L(\theta_i \mid \beta, \gamma, b_i, \{\Psi_1, \ldots, \Psi_t\}, U_i, Z_i)\, L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma)$$
$$= \prod_{j=1}^{t} \left\{ L(Y_{ij} \mid \theta_{ij}, \Sigma_{ij})\, L(\theta_{ij} \mid U_{ij}'\beta + Z_{ij}'\gamma + b_i, \Psi_j) \right\} L(b_i \mid \Sigma_0)\, L(\gamma \mid \Sigma_\gamma) \qquad (4\text{-}4)$$

Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance-covariance matrix $\Sigma$. For the bivariate random walk model, the parameter space for the $i$th state is $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$, where $v = (v_1, \ldots, v_t)'$ is the vector of time specific random effects. The hierarchical Bayesian framework is given by
1. $(Y_{ij} \mid \theta_{ij}) \sim N(\theta_{ij}, \Sigma_{ij})$
2. $(\theta_{ij} \mid \beta, \gamma, b_i, v_j, \Psi_j) \sim N(X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j, \Psi_j)$
3. $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$, assuming $v_0 = 0$
4. $(b_i \mid \Sigma_0) \sim N(0, \Sigma_0)$
5. $\gamma \sim N(0, \Sigma_\gamma)$
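The Gelman-Rubin diagnostic mentioned in Section 3.4.2 compares the spread between chains with the spread within chains. The sketch below computes the basic potential scale reduction factor for one scalar parameter from several post burn-in chains of equal length. It is a minimal illustration with a hypothetical function name, not the author's code, and it omits the degrees-of-freedom correction in Gelman and Rubin (1992).

```python
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of post burn-in draws,
    one list per independent chain.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B: n times the sample variance of the chain means.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5
```

Values close to 1 suggest that the chains have mixed. Values well above 1 indicate that the chains are still exploring different regions, which is why overdispersed starting values are useful for this check.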

Full Text |

ACKNOWLEDGMENTS

I had the good fortune to be a student at the Department of Statistics at University of Florida. It is here that I came in close contact with some of the preeminent statisticians of the day and learnt a lot from them. I deeply acknowledge the tremendous help, encouragement and endless support that I received from my advisor Prof. Malay Ghosh, my co-advisor Prof. Michael J. Daniels and Prof. Alan Agresti throughout the highs and lows of doing my research work. They not only taught me statistics or the art of writing papers or solving problems - they introduced me to the spirit of discovery and the joy of learning, something that will stay with me forever and would motivate me in ways I can never imagine. However the list doesn't end here since each and every member of the faculty opened up new doors for me through which knowledge flowed past and enriched me along the way. My endless gratitude to each and every one of them. I also wish to thank Prof. Bhramar Mukherjee (currently at the Department of Biostatistics at University of Michigan) for her help and inspiration over the years. Last but not the least, my unending gratitude to my mother whose sacrifice, unconditional love and blessing was always with me, guiding me along the way. I would end by conveying my deepest respect to the memory of my father - he was there with me always throughout this journey.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Overview of Dissertation
  1.2 Review of Case-Control Studies
  1.3 Review of Small Area Estimation
  1.4 Non-Parametric Regression Methodology

2 BAYESIAN SEMIPARAMETRIC ANALYSIS OF CASE CONTROL STUDIES WITH TIME VARYING EXPOSURES
  2.1 Introduction
    2.1.1 Setting
    2.1.2 Motivating Dataset: Prostate Cancer Study
  2.2 Model Specification
    2.2.1 Notation
    2.2.2 Model Framework
  2.3 Posterior Inference
    2.3.1 Likelihood Function
    2.3.2 Priors
    2.3.3 Posterior Computation
  2.4 Bayesian Equivalence
  2.5 Model Comparison and Assessment
    2.5.1 Posterior Predictive Loss
    2.5.2 Kappa statistic
    2.5.3 Case Influence Analysis
  2.6 Analysis of PSA Data
    2.6.1 Constant Influence Model
    2.6.2 Linear Influence Model
    2.6.3 Overall Model Comparison
    2.6.4 Model Assessment
  2.7 Conclusion and Discussion

3 [...]
  3.1 Introduction
    3.1.1 SAIPE Program and Related Methodology
    3.1.2 Related Research
    3.1.3 Motivation and Overview
  3.2 Model Specification
    3.2.1 General Notation
    3.2.2 Semiparametric Income Trajectory Models
      3.2.2.1 Model I: Basic Semiparametric Model (SPM)
      3.2.2.2 Model II: Semiparametric Random Walk Model (SPRWM)
  3.3 Hierarchical Bayesian Inference
    3.3.1 Likelihood Function
    3.3.2 Prior Specification
    3.3.3 Posterior Distribution and Inference
  3.4 Data Analysis
    3.4.1 Comparison Measures and Knot Specification
    3.4.2 Computational Details
    3.4.3 Analytical Results
    3.4.4 Knot Realignment
    3.4.5 Comparison with an Alternate Model
  3.5 Model Assessment
  3.6 Discussion

4 ESTIMATION OF MEDIAN INCOME OF FOUR PERSON FAMILIES: A MULTIVARIATE BAYESIAN SEMIPARAMETRIC APPROACH
  4.1 Introduction
    4.1.1 Census Bureau Methodology
    4.1.2 Related Literature
    4.1.3 Motivation and Overview
  4.2 Model Specification
    4.2.1 Notation
    4.2.2 Semiparametric Modeling Framework
      4.2.2.1 Simple bivariate model
      4.2.2.2 Bivariate random walk model
  4.3 Hierarchical Bayesian Analysis
    4.3.1 Likelihood Function
    4.3.2 Prior Specification
    4.3.3 Posterior Distribution and Inference
  4.4 Data Analysis
    4.4.1 Comparison Measures and Knot Specification
    4.4.2 Computational Details
    4.4.3 Analytical Results
  4.5 Conclusion and Discussion

5 [...]
  5.1 Adaptive Knot Selection
  5.2 Analyzing Longitudinal Data with Many Possible Dropout Times using Latent Class and Transitional Modelling
    5.2.1 Introduction and Brief Literature Review
    5.2.2 Modeling Framework
    5.2.3 Likelihood, Priors and Posteriors
    5.2.4 Specification of Priors

APPENDIX

A PROOF OF BAYESIAN EQUIVALENCE RESULTS
B PROOF OF POSTERIOR PROPRIETY FOR THE SMALL AREA MODELS
  B.1 Univariate Small Area Model
  B.2 Bivariate Small Area Model
C FULL CONDITIONAL DISTRIBUTIONS
  C.1 Semiparametric Case Control Model
  C.2 Semiparametric Small Area Models
    C.2.1 Semiparametric Univariate Small Area Model
    C.2.2 Univariate Random Walk Model
    C.2.3 Bivariate Random Walk Model

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 A typical 2 x 2 table
2-1 Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model
2-2 Posterior means and 95% confidence intervals of odds ratio for I = (10, 5) for the linear influence model
2-3 Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots
3-1 Parameter estimates of SPRWM with 5 knots
3-2 Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment
3-3 Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates
3-4 Parameter estimates of SPM(5)
3-5 Parameter estimates of SPRWM(5)
3-6 Comparison measures for time series and other model estimates
4-1 Comparison measures for univariate estimates
4-2 Percentage improvements of univariate estimates over Census Bureau estimates
4-3 Comparison measures for bivariate non-random walk estimates
4-4 Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates
4-5 Comparison measures for bivariate random walk model

LIST OF FIGURES

2-1 Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.
2-2 Sensitivity of [...] and disease probability estimates to case-deletions.
3-1 Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income).
3-2 Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999.
3-3 Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the bold faced triangles at the bottom.
3-4 Positions of 5 knots after realignment. The knots are the bold faced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.
3-5 Quantile-quantile plot of RB values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom.

ABSTRACT

Case-Control studies and small area estimation are two distinct areas of modern Statistics. The former deals with the comparison of diseased and healthy subjects with respect to risk factor(s) of a disease with the aim of capturing disease-exposure association specially for rare diseases. The latter area is concerned with the measurements of characteristics of small domains - regions whose sample size is so small that the usual survey based estimation procedures cannot be applied in the inferential routines. Both these areas are important in their own right. Case-control studies form one of the pillars of modern biostatistics and epidemiology and have diverse applications in various health related issues, specially those involving rare diseases like Cancer. On the other hand, estimates of characteristics for small areas are widely used by Federal and local governments for formulating policies and decisions, in allocating federal funds to local jurisdictions and in regional planning. My dissertation deals with the application of Bayesian semiparametric procedures in modeling unorthodox data scenarios that may arise in case control studies and small area estimation.
The first part of the dissertation deals with an analysis of longitudinal case-control studies, i.e. case-control studies for which time varying exposure information is available for both cases and controls. In a typical case-control study, the exposure information is collected only once for the cases and controls. However, some recent medical studies have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to more precise estimates of the odds ratios of disease. We use semiparametric regression procedures to model the exposure profiles of the cases and controls and also the influence pattern of the exposure profile on the disease status. [...]

The second and third parts of my dissertation deal with univariate and multivariate semiparametric procedures for estimating characteristics of small areas across the United States. In the second part, we put forward a semiparametric modeling procedure for estimating the median household income for all the states of the U.S. and the District of Columbia. Our models include a nonparametric functional part for accommodating any unspecified time varying income pattern and also a state specific random effect to account for the within-state correlation of the income observations. Model fitting and parameter estimation are carried out in a hierarchical Bayesian framework using Markov chain Monte Carlo (MCMC) methodology. It is seen that the semiparametric model estimates can be superior to both the direct estimates and the Census Bureau estimates. Overall, our study indicates that proper modeling of the underlying longitudinal income profiles can improve the performance of model based estimates of household median income of small areas.

In the third part of the dissertation, we put forward a bivariate semiparametric modeling procedure for the estimation of median income of four-person families for the different states of the U.S. and the District of Columbia while explicitly accommodating the time varying pattern in the income observations. Our estimates tend to have better performances than those provided by the Census Bureau and also have [...]

[...] (Eilers and Marx, 1996).
In Chapter 2, I present an analysis of a case-control study when longitudinal, time varying exposure observations are available for the cases and controls. Semiparametric regression procedures are used to flexibly model the subject specific exposure profiles and also the influence pattern of the exposure profiles on the disease status. This enables us to analyze whether past exposure observations affect the current disease status of a subject conditional on his/her current exposure condition. The proposed methodology is motivated by and applied to a case control study of prostate cancer where longitudinal biomarker information are available for the cases and controls. We also show the details of the hierarchical Bayesian implementation of our models and some equivalence results that have enabled us to use a prospective modeling framework on a retrospectively collected dataset.

In Chapter 3, I propose a Bayesian semiparametric modeling procedure for estimating the median household income of small areas when area-specific longitudinal income observations are available. Our models include a nonparametric functional [...]

Chapter 4 deals with an extension of the methodology in Chapter 3 where a bivariate semiparametric procedure has been used to estimate the median income of families of varying sizes across small areas. This can also be seen as an extension of the time series modeling framework of Ghosh et al. (1996). We show that the semiparametric models generally have better performance than their time series counterparts and in a few situations, the performances are comparable. We want to convey the message that semiparametric regression methodology can provide an attractive alternative to the traditional modeling techniques specially when time varying information are available for small areas.

In Chapter 5, we provide an overall discussion of our results and also point to some interesting open problems and areas for future research that may be worth pursuing.
Case-control studies have consistently attracted the attention of statisticians, and as a result, a rich and voluminous body of work has developed over the years. Notable work in the Frequentist domain includes Cornfield (1951), who pioneered the logistic model for the probability of disease given exposure. He was the first to demonstrate that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed and that the latter in turn approximates the ratio of the disease rates if the disease is rare. Let $D$ and $E$ be dichotomous factors respectively characterizing the disease and exposure status of individuals in a population. A common measure of association between $D$ and $E$ is the (disease) odds ratio

$$\frac{P(D=1 \mid E=1)\, P(D=0 \mid E=0)}{P(D=1 \mid E=0)\, P(D=0 \mid E=1)}.$$

By applying the Bayes theorem, the above expression can be rewritten as

$$\frac{P(E=1 \mid D=1)\, P(E=0 \mid D=0)}{P(E=1 \mid D=0)\, P(E=0 \mid D=1)},$$

which is the exposure odds ratio. Another well known measure of association is the relative risk (RR) of disease for different exposure values, given by $P(D=1 \mid E=1) / P(D=1 \mid E=0)$. For rare diseases, both $P(D=0 \mid E=0)$ and $P(D=0 \mid E=1)$ are close to one and the disease odds ratio is approximately equal to the relative risk of disease. The classic paper by Mantel and Haenszel (1959) further clarified the relationship between a retrospective case-control study and a prospective cohort study. They considered a series of $2 \times 2$ tables as in Table 1-1.

Table 1-1. A typical 2 x 2 table

| Disease Status | Exposed | Not Exposed | Total |
|---|---|---|---|
| Case | $n_{11i}$ | $n_{10i}$ | $n_{1i}$ |
| Control | $n_{01i}$ | $n_{00i}$ | $n_{0i}$ |
| Total | $e_{1i}$ | $e_{0i}$ | $N_i$ |

[...]

$$\frac{\sum_{i=1}^{I} n_{11i}\, n_{00i} / N_i}{\sum_{i=1}^{I} n_{01i}\, n_{10i} / N_i}$$

It may be of interest to test for the equality of the odds ratios across the $I$ tables, i.e. [...] which follows an approximate $\chi^2$ distribution with $I - 1$ degrees of freedom under the null hypothesis. The derivation of the variance of the above estimator initially posed some challenge but was eventually addressed in several subsequent papers (Breslow 1996). Breslow and Day (1980) marked the development of likelihood based inference methods for the odds ratio. Methods to evaluate the simultaneous effects of multiple quantitative risk factors on disease rates were pioneered in the 1960's.

In a case-control study, the appropriate likelihood is the retrospective likelihood of exposure given the disease status. Cornfield et al. (1961) noted that if the exposure distributions in the case and control populations are normal with different means but a common covariance matrix, then the prospective probability of disease ($D$) given the exposure ($X$) has the logistic form, i.e.

$$P(D = 1 \mid X = x) = L(\alpha + \beta' x),$$

where $L(u) = 1 / \{1 + \exp(-u)\}$. However, there is a conceptual complication in using a prospective likelihood based on $P(D \mid X)$ whereas a case-control sampling [...] Prentice and Pyke (1979), who showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from the retrospective likelihood are the same as those obtained from a prospective likelihood under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood which generally involves fewer nuisance parameters than a retrospective likelihood. Carroll et al. (1995) extended the prospective formulation to the situation of missing data and measurement error in the exposure variables.

In a case control set-up, matching is often used for selecting comparable controls to eliminate bias due to confounding. Statistical techniques for analyzing matched case-control data were first developed by Breslow et al. (1978). In the simplest setting, the data consist of $m$ matched sets, say, $S_1, \ldots, S_m$, with $M_i$ controls matched with a case in each set or stratum. A prospective stratified logistic disease incidence model given by

$$P(D = 1 \mid x, S_i) = L(\alpha_i + \beta' x)$$

is assumed. The $\alpha_i$'s are the stratum specific intercept terms, treated as nuisance parameters, and are eliminated by conditioning on the number of cases in each stratum. The generated conditional logistic likelihood yields the optimum estimating function (Godambe 1976) for estimating $\beta$. The classical methods for analyzing unmatched and matched studies suffer from loss of efficiency when the exposure variable is partially missing. Lipsitz et al. (1998) proposed a pseudo-likelihood method to handle missing exposure variables. Rathouz et al. (2002) developed a more efficient semiparametric method of estimation which took into account missing exposures in matched case control studies.
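As a small worked illustration of the odds ratio and of the Mantel-Haenszel pooling over a series of 2 x 2 tables discussed above, the sketch below uses the cell layout of Table 1-1 (cases exposed, cases not exposed, controls exposed, controls not exposed). The function names and the counts in the usage note are made up for illustration and are not taken from the dissertation.

```python
def odds_ratio(n11, n10, n01, n00):
    """Sample odds ratio of a single 2 x 2 table."""
    return (n11 * n00) / (n10 * n01)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel estimate of a common odds ratio.

    tables: iterable of (n11, n10, n01, n00) tuples, one per stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for n11, n10, n01, n00 in tables:
        total = n11 + n10 + n01 + n00  # N_i, the stratum size
        numerator += n11 * n00 / total
        denominator += n01 * n10 / total
    return numerator / denominator
```

For two hypothetical strata with counts `(10, 20, 5, 40)` and `(20, 10, 10, 20)`, each table has odds ratio 4, and the pooled estimate is also 4, since the estimator is a weighted average of the stratum-specific odds ratios.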
Satten and Kupper (1993), Paik and Sacco (2000) and Satten and Carroll (2000) addressed the problem of missing exposure from a full likelihood approach by assuming a distribution of the exposure variable in the control population.

[...] Althman (1971) is probably the first Bayesian work which considered several $2 \times 2$ contingency tables with a common odds ratio and performed a Bayesian test of association based on the common odds ratio. Later, Zelen and Parker (1986), Nurminen and Mutanen (1987) and Marshall (1988) considered identical Bayesian formulations of a case control model with a single binary exposure. These works dealt with inference from the posterior distribution of summary statistics like the log odds ratio, risk ratio and risk difference. Ashby et al. (1993) analyzed a case control study from a Bayesian perspective and used it as a source of prior information for a second study. Their paper emphasized the practical relevance of the Bayesian perspective in an epidemiological study as a natural framework for integrating and updating knowledge available at each stage. Muller and Roeder (1997) introduced a novel aspect to Bayesian treatment of case-control studies by considering continuous exposure with measurement error. Their approach is based on a nonparametric model for the retrospective likelihood of the covariates and the imprecisely measured exposure. They chose the non-parametric distribution to be a class of flexible mixture distributions, obtained by using a mixture of normal models with a Dirichlet process prior on the mixing measure (Escobar and West 1995). The prospective disease model relating disease to exposure is assumed to have a logistic form characterized by a vector of log odds ratio parameters. This paper pioneered the use of continuous covariates, measurement error and flexible non-parametric modeling of exposures in a Bayesian setting and brought to light the tremendous possibility of modern Bayesian computational techniques in solving complex data scenarios in case-control studies. Seaman and Richardson (2001) extended the binary exposure model of Zelen and Parker to any number of categorical [...]

Muller et al. (1999) considered any number of continuous and binary exposures. However, in contrast to Seaman and Richardson, they specified a retrospective likelihood and then derived the implied prospective likelihood. They also addressed the problem of handling categorical and quantitative exposures simultaneously.

Continuous covariates can be treated in the Seaman and Richardson framework by discretizing them into groups, and little information is lost if the discretization is sufficiently fine. Gustafson et al. (2002) treated the problem of measurement errors in exposure by approximating the imprecisely measured exposure by a discrete distribution supported on a suitably chosen grid. In the absence of measurement error, the support is chosen as the set of observed values of the exposure, a device that resembles the Bayesian Bootstrap (Rubin 1981). They assigned a Dirichlet(1, 1, ..., 1) prior on the probability vector corresponding to the grid points.

Seaman and Richardson (2004) proved equivalence between the prospective and retrospective likelihood in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data was generated prospectively.

Diggle et al. (2000) introduced Bayesian analysis for matched case control studies when cases are individually matched to controls. They introduced nuisance parameters [...]

Ghosh and Chen (2002) developed general Bayesian inferential techniques for matched case-control problems in the presence of one or more binary exposure variables. Their framework was more general than that of Zelen and Parker (1986). Unlike Diggle et al. (2000), they based their analysis on the unconditional rather than the conditional likelihood after elimination of the nuisance parameters. Their framework included a wide variety of links like complementary log links and some symmetric and skewed links in addition to the usual logit and probit links. Recently Sinha et al. (2004) and Sinha et al. (2005) proposed a unified Bayesian framework for matched case-control studies with missing exposures. They also motivated a semiparametric alternative for modeling varying stratum effects on the exposure distributions. The parameters were estimated in a Bayesian framework by using a non-parametric Dirichlet process prior on the stratum specific effects in the distribution of the exposure variable and parametric priors on all other parameters. The interesting aspect of the Bayesian semiparametric methodology is that it can capture unmeasured stratum heterogeneity in the distribution of the exposure variable in a robust manner. They also extended the proposed method to situations with multiple disease states.

In a typical case-control study design, the exposure information is collected only once for the cases and controls. However, some recent medical studies (Lewis et al. 1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject vis-a-vis more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. Unfortunately, proper and rigorous statistical methods of incorporating longitudinally varying exposure information inside the case control framework have not yet been properly developed. In this work, [...]

[...] Ghosh and Rao (1994) provide a nice review of the different types of estimators and inferential procedures used in survey sampling and small area estimation.
Since sample surveys are generally designed for large areas, the estimates of means or totals obtained thereof are reliable for large domains. Direct survey based estimators for small domains often yield large standard errors due to the small sample size of the concerned area. This is due to the fact that the original survey was designed to provide accuracy at a much higher level of aggregation than for local areas. This makes it a necessity to borrow strength from adjacent or related areas to find indirect estimators that increase the effective sample size and thus increase the precision of the resulting estimate for a given small area. Broadly speaking, a small area model has a generalized linear form with a mean term, a random area-specific effect term and a measurement error term which reflects the noise for not sampling the entire domain.

During the last 10-15 years, model based inference has been widely used in the small area context. This is mainly due to the wide range of functionalities that comes with the linear mixed effects modeling framework. Some of the main advantages of this framework are:
(i) Random area-specific effects account for between-area variation above and beyond that explained by auxiliary variables in the model.
(ii) Different variations like non-linear mixed effects models, logistic regression models and generalized linear models can be entertained.
(iii) Area-specific measures of precision can be associated with each small area estimate, unlike the global measures.
(iv) Complex data structures like spatial dependence, time series structures and longitudinal measurements can be explored.
(v) Recent methodological developments for random effects models can be utilized to achieve accurate small area inferences.

Generally, there are two kinds of small area models depending on whether the response is observed at the area or the unit level.
1. Area (or aggregate) level models relate small area means to area-specific auxiliary variables.
2. Unit level models relate the unit values of the study variable to unit-specific auxiliary variables.
The basic area level model is given by
$$\theta_i = z_i'\beta + b_i v_i.$$
Here $\theta_i$ is often assumed to be a function of the population mean, $\bar{Y}_i$, of the $i$th small area, $z_i = (z_{i1}, \ldots, z_{ip})'$ is the corresponding auxiliary data, and the $v_i$'s are area-specific random [...]

In order to infer about the small area means, $\bar{Y}_i$, direct estimators, $\hat{\bar{Y}}_i$, are assumed to be known and available. The linear model
$$\hat{\theta}_i = \theta_i + e_i$$
is assumed, where the sampling errors $e_i$ are independent with $E_p(e_i \mid \theta_i) = 0$, $V_p(e_i \mid \theta_i) = \psi_i$, $\psi_i$ known, which implies that the $\hat{\theta}_i$ are design-unbiased. By setting $\sigma_v^2 = 0$ in the area level model, we have $\theta_i = z_i'\beta$, which leads to synthetic estimators that do not account for local variation above and beyond that reflected in the auxiliary variables $z_i$. Combining the two models above, we have
$$\hat{\theta}_i = z_i'\beta + b_i v_i + e_i,$$
which is a special case of a linear mixed model. Here, $v_i$ and $e_i$ are assumed to be independent.

Fay and Herriot (1979) studied the above area level model in the context of estimating the per capita income (PCI) for small places in the United States and proposed an empirical Bayes estimator for that case. Ericksen and Kadane (1985) used the same model with $b_i = 1$ and known $\sigma_v^2$ to estimate the undercount in the decennial census of the U.S. The area level model has also been used recently to produce model based county estimates of poor school-age children in the United States.

In the unit level model, it is assumed that unit-specific auxiliary data $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ are available for each population element $j$ in each small area $i$. Moreover, it is assumed that the variable of interest, $y_{ij}$, is related to $x_{ij}$ through a one-fold nested error linear regression model
$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}$$
[...]

Battese et al. (1988) studied the nested error regression model in estimating the area under corn and soybeans for counties in North-Central Iowa using sample survey data and satellite information. In doing so, they came up with an empirical best linear unbiased predictor (EBLUP) for the small area means.

Over the years, numerous extensions have been proposed for the above modeling frameworks, including multivariate Fay-Herriot models, generalized linear models, spatial models and models with more complicated random-effects structures. Rao (2003) presented a nice overview of the different estimation methods while Jiang and Lahiri (2006) reviewed the development of mixed model estimation in the small area context.
A proper review of model based small area estimation would be incomplete without explaining the EBLUP, EB and HB approaches that are widely used in this context. As shown above, small area models are special cases of general linear mixed models involving fixed and random effects, such that small area parameters can be expressed as linear combinations of these effects. Henderson (1950) derived the BLUP estimators of small area parameters in the classical frequentist framework. These are so called because they minimize the mean squared error among the class of linear unbiased estimators and do not depend on normality. So, they are similar to the best linear unbiased estimators (BLUEs) of fixed parameters. The BLUP estimator takes proper account of the between-area variation relative to the precision of the direct estimator. An EBLUP estimator is obtained by replacing the unknown parameters with asymptotically consistent estimators. Robinson (1991) gives an excellent account of BLUP theory and some applications. In an EB approach, the posterior distribution of the parameters of [...] Morris (1983). Last but not least, in the HB approach, a prior distribution is specified on the model parameters and the posterior distribution of the parameter of interest is obtained. Inferences about the parameters are based on the posterior distribution. The parameter of interest is estimated by its posterior mean while its precision is estimated by its posterior variance. Recent advances in Markov chain Monte Carlo techniques, specifically Gibbs and Metropolis-Hastings samplers, have considerably simplified the computational aspect of HB procedures.
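To make the idea of "borrowing strength" concrete, the following is a minimal sketch of the BLUP for the area level model under simplifying assumptions that are not taken from the text: an intercept-only mean ($z_i'\beta = \beta$), $b_i = 1$, and a known between-area variance. The function name `fay_herriot_blup` and the toy numbers are hypothetical. Each direct estimate is pulled toward the weighted mean by the shrinkage weight $\sigma_v^2 / (\sigma_v^2 + \psi_i)$, so areas with noisier direct estimates are shrunk more.

```python
def fay_herriot_blup(direct, psi, sigma2_v):
    """BLUP for an intercept-only area level model with known variances.

    direct   : direct survey estimates, one per area
    psi      : known sampling variances, one per area
    sigma2_v : between-area variance (assumed known in this sketch)
    """
    # Generalized least squares estimate of the common mean.
    weights = [1.0 / (sigma2_v + p) for p in psi]
    beta = sum(w * d for w, d in zip(weights, direct)) / sum(weights)
    # Shrinkage weight: share of total variance due to between-area variation.
    shrink = [sigma2_v / (sigma2_v + p) for p in psi]
    # Weighted compromise between the direct and the synthetic estimate.
    blup = [g * d + (1.0 - g) * beta for g, d in zip(shrink, direct)]
    return beta, shrink, blup


beta, shrink, blup = fay_herriot_blup([10.0, 20.0], [1.0, 9.0], 1.0)
print(beta, shrink, blup)
```

Replacing the known `sigma2_v` by a consistent estimate would give the empirical (EBLUP) version; that estimation step is deliberately left out here.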
The Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau was established with the aim of providing annual estimates of income and poverty statistics for all states, counties and school districts across the United States. The resulting estimates are generally used for the administration of federal programs and the allocation of federal funds to local jurisdictions. The SAIPE program also provides annual state and county level estimates of median household income. Generally, observations on various characteristics of small areas that are collected over time may possess a complicated underlying time-varying pattern. It is likely that models which take this longitudinal pattern into account may perform better than classical small area models which do not utilize this information. In this study, we present a semiparametric Bayesian framework for the analysis of small area level data which explicitly accommodates the longitudinal time-varying pattern in the response and the covariates.

Suppose the response $y$ and the covariate $x$ are related as
$$y_i = f(x_i) + e_i$$
where $f(x)$ is an unknown and unspecified smooth function of $x$ and $e_i \sim N(0, \sigma_e^2)$. The basic problem of nonparametric regression is to estimate the function $f(\cdot)$ using the data points $(x_i, y_i)$. In doing so, it is typically assumed that beneath a rough observational data pattern there is a smooth trajectory. This underlying smooth pattern is estimated by various smoothing techniques. Broadly, there are four major classes of smoothers used to estimate $f(\cdot)$, viz. local polynomial kernel smoothers (Fan and Gijbels (1996); Wand and Jones (1995)), regression splines (Eubank (1988), Eubank (1999)), smoothing splines (Wahba (1990); Green and Silverman (1994)) and penalized splines (Eilers and Marx (1996); Ruppert et al. (2003)). Each smoother has its own strengths and weaknesses. For example, local polynomial smoothers are computationally advantageous for handling dense regions while smoothing splines may be better for sparse regions. Here, we will briefly review the main characteristics of splines in general and penalized splines in particular.
The basic idea behind splines is to express the unknown function $f(x)$ using piecewise polynomials. Two adjacent polynomials are smoothly joined at specific points in the range of $x$ known as knots. The knots, say $(\tau_1, \ldots, \tau_K)$, partition the range of $x$ into $K$ distinct subintervals (or neighborhoods). Within each such neighborhood, a polynomial of a certain degree is defined. A polynomial spline of degree $p$ has $(p-1)$ continuous derivatives and a discontinuous $p$th derivative at any interior knot. The $p$th derivative reflects the jump of the spline at the knots. Thus, a spline of degree 0 is a [...]

Here $(x-\tau_k)_+^p$ is the function $(x-\tau_k)^p I\{x > \tau_k\}$. Using the above basis, a spline of degree $p$ can be expressed as
$$f(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} \gamma_k (x-\tau_k)_+^p.$$
Here, $(\beta_0, \ldots, \beta_p)$ and $(\gamma_1, \ldots, \gamma_K)$ are the coefficients of the polynomial and spline portions of the above structure and must be estimated. $p = 1, 2, 3$ corresponds to a linear, quadratic or cubic spline respectively. The above basis is one of the most commonly used, while other bases like radial basis functions or B-splines can also be used. It can be shown that there exists a very rich class of spline-generating functions, which in turn greatly increases the scope and applicability of splines in various modeling frameworks. Moreover, the very structure of splines makes them extremely good at capturing local variations in a pattern of observations, something which cannot be achieved using Fourier or polynomial bases.

One of the most important aspects of smoothing is the proper selection and positioning of the knots. This is because the knots act as sensors in relaying information about the underlying true observational pattern. Too few knots often lead to a biased fit, while an excessive number of knots leads to overfitting vis-a-vis overparametrization and may even worsen the resulting fit. Thus, a sufficient number of knots should be used and they should be placed uniformly throughout the range of the independent variable. Generally, the knots are placed on a grid of equally spaced sample quantiles of $x$, and a maximum of 35 to 40 knots suffices for any practical problem (Ruppert 2002). Recently, there have been interesting contributions on knot [...] Friedman (1991); Stone et al. (1997); Denison et al. (1998); Lindstrom (1999); DiMatteo et al.
(2001); Botts and Daniels (2008)). The flexibility and wide applicability of splines is due to the fact that, provided the knots are evenly spread out over the range of $x$, $f(x \mid \beta, \gamma)$ can accurately estimate a very large class of smooth functions $f(\cdot)$ even if the degree of the spline is kept relatively low (say, 1 or 2).

The spline coefficients $(\gamma_1, \ldots, \gamma_K)$ correspond to the discontinuous $p$th derivative of the spline; thus, they measure the jumps of the spline at the knots $(\tau_1, \ldots, \tau_K)$ and contribute to the roughness of the resulting spline. In order to smooth out the fit, a roughness penalty is placed on these parameters. This is often done by minimizing the expression
$$\sum_{i=1}^{n} \{y_i - f(x_i \mid \beta, \gamma)\}^2 + \lambda\, \gamma'\gamma$$
where $\lambda$ is known as the smoothing parameter. This is synonymous with minimizing the first part of the expression subject to a constraint of the form $\gamma'\gamma \le C$ for some constant $C$. $\lambda$ plays a crucial role in the smoothing process since it controls the goodness of fit and the roughness of the fitted model. As $\lambda$ decreases, the spline tends to overfit, becoming an interpolating curve as $\lambda \to 0$. As $\lambda$ increases, the spline becomes smoother and tends to the least squares fit as $\lambda \to \infty$. There are different methods for choosing the optimal $\lambda$, like cross-validation, generalized cross-validation, Mallows's $C_p$ criterion etc.

Broadly speaking, there are three main types of splines: regression splines, smoothing splines and penalized splines (or P-splines). All of them are based on the same principle as detailed above but differ in the specific manner in which smoothing is done or the knots are selected. In regression splines, smoothing is achieved by the deletion of non-essential knots or, equivalently, by setting the jumps at those knots to zero while keeping the jumps at the other knots undisturbed. In smoothing and penalized splines, smoothing is achieved by shrinking the jumps at all the knots towards zero using a roughness penalty of the kind described above. A major difference between smoothing splines and penalized splines is that, in the former, all the unique data points are used as knots, but in the latter the number of knots is much smaller, resulting in more flexibility. In fact, penalized splines can be seen as a generalization of regression and smoothing splines.
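The penalized least squares idea above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in this dissertation: it uses a truncated power basis, penalizes only the knot coefficients with a ridge-type penalty, and solves the normal equations with a small hand-written Gaussian elimination so that no external library is needed. All function names (`truncated_power_basis`, `fit_pspline`, `solve`, `predict`) and the toy data are hypothetical.

```python
def truncated_power_basis(x, knots, degree=1):
    """Row [1, x, ..., x^p, (x-k_1)_+^p, ..., (x-k_K)_+^p] for a single x (degree >= 1)."""
    poly = [x ** j for j in range(degree + 1)]
    trunc = [(x - k) ** degree if x > k else 0.0 for k in knots]
    return poly + trunc


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_pspline(xs, ys, knots, lam, degree=1):
    """Minimize sum (y - f(x))^2 + lam * (sum of squared knot coefficients)."""
    rows = [truncated_power_basis(x, knots, degree) for x in xs]
    dim = degree + 1 + len(knots)
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    for k in range(degree + 1, dim):
        gram[k][k] += lam  # the polynomial part is left unpenalized
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(dim)]
    return solve(gram, rhs)


def predict(theta, x, knots, degree=1):
    basis = truncated_power_basis(x, knots, degree)
    return sum(t * b for t, b in zip(theta, basis))


# Toy data: a V-shaped curve with its kink exactly at the middle knot.
xs = [i / 20 for i in range(21)]
ys = [abs(x - 0.5) for x in xs]
knots = [0.25, 0.5, 0.75]
rough = fit_pspline(xs, ys, knots, lam=0.0)   # no penalty: jumps are free
smooth = fit_pspline(xs, ys, knots, lam=1e6)  # heavy penalty: close to a straight line
print(rough, smooth)
```

With no penalty the fit reproduces the kink through a large jump coefficient at the middle knot; with a heavy penalty the jump coefficients are shrunk essentially to zero and the fit approaches the ordinary least squares line, which mirrors the two limits of the smoothing parameter described above.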
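The wide applicability of penalized splines in diverse settings is mainly due to their correspondence with linear mixed effects models. In fact, penalized splines can be shown to be best linear unbiased predictors (BLUPs) in a mixed model framework. To see this, we rewrite the penalized criterion as
$$\sum_{i=1}^{n} \{y_i - f(x_i \mid \theta)\}^2 + \lambda\, \theta' D\, \theta$$
where $\theta = (\beta', \gamma')'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)'$ and $D$ is a known positive semi-definite penalty matrix such that
$$D = \begin{pmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1}_{K\times K} \end{pmatrix}.$$
The earlier penalty $\lambda\,\gamma'\gamma$ corresponds to setting $\Sigma = I$.

Let $X$ be the matrix with $i$th row $X_i = (1, x_i, \ldots, x_i^p)$ and $Z$ be the matrix with $i$th row $Z_i = \{(x_i-\tau_1)_+^p, \ldots, (x_i-\tau_K)_+^p\}$. Using this formulation with the truncated polynomial basis and dividing by the error variance $\sigma_e^2$, we have
$$\frac{1}{\sigma_e^2}\,\|y - X\beta - Z\gamma\|^2 + \frac{\lambda}{\sigma_e^2}\,\|\gamma\|^2.$$
By treating $\gamma$ as a vector of random effects with $\mathrm{Cov}(\gamma) = \sigma_\gamma^2 I$, where $\sigma_\gamma^2 = \sigma_e^2/\lambda$, and $\beta$ as the set of fixed effects parameters, the above penalized spline framework can be cast as the linear mixed model
$$y = X\beta + Z\gamma + e$$
where $\mathrm{Cov}(e) = \sigma_e^2 I$ and $\gamma$ and $e$ are independent.

Bayesian P-splines have recently become popular because they combine the flexibility of nonparametric models and the exact inference provided by the Bayesian inferential procedure. This is even more true because of the seamless fusion of penalized splines into the mixed model framework (Wand 2003), as shown above. This equivalence also carries over to the manner in which smoothing is done. Smoothing can be achieved by imposing penalties on the spline coefficients, as shown above, or by assuming a distributional form for $\gamma$, for example $N_K(0, \sigma_\gamma^2 I_K)$. In the Bayesian context, priors are placed on $\sigma_\gamma^2$ and the other parameters, and the usual posterior sampling is carried out. Since samples of the smoothing parameter are generated alongside the other parameters, this method is also known as automatic scatterplot smoothing. In all the problems tackled in this dissertation, we will be using Bayesian inferential procedures on penalized splines as shown above.

[...] (Lewis et al.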
1996) have indicated that a longitudinal approach of incorporating the entire exposure history, when available, may lead to a gain in information on the current disease status of a subject and more precise estimation of the odds ratio of disease. It may also provide insights on how the present disease status of a subject is being influenced by past exposure conditions conditional on the current ones. In this work, we present a Bayesian semiparametric approach for analyzing case-control data when longitudinal exposure information is available for both cases and controls.

Statistical analysis of case-control data was pioneered by Cornfield (1951), Cornfield et al. (1961) and Mantel and Haenszel (1959). Since then, important and far-reaching contributions have been made in virtually every aspect of the field. Some of the notable ones are equivalence of prospective and retrospective likelihoods (Prentice and Pyke 1979), measurement error in exposures (Roeder et al. 1996) and matched case-control studies (Breslow et al. 1978). Important contributions in the Bayesian paradigm include binary exposures (Zelen and Parker 1986), continuous exposures (Müller and Roeder 1997), categorical exposures (Seaman and Richardson 2001), equivalence (Seaman and Richardson 2004) and matching (Diggle et al. (2000); Ghosh and Chen (2002)).

The analysis of complex data scenarios in a case-control framework is a relatively new area of research. Specifically, analysis of longitudinal case-control studies has only [...]

Park and Kim (2004) were among the first contributors to this area. They proposed an ordinary logistic model to analyze longitudinal case-control data but ignored the longitudinal nature of the cohort. They also showed that ordinary generalized estimating equations (GEE) based on an independent correlation structure fail in this framework.

In view of the above challenges, we propose to use functional data analytic techniques, especially nonparametric regression methodology, to model both the time [...] Eilers and Marx (1996); Ruppert et al.
(2003)). We also express the effect of the exposures on the current disease state as a penalized spline to account for any possible time-varying patterns of influence. Analysis is carried out in a hierarchical Bayesian framework. Our modeling framework is quite flexible since it can accommodate any possible non-linear time-varying pattern in the exposure and influence profiles. It is difficult to achieve the same goal in a purely parametric setting.

In a case-control study, the natural likelihood is the retrospective likelihood, based on the probability of exposure given the disease status. Prentice and Pyke (1979) showed that the maximum likelihood estimators and asymptotic covariance matrices of the log-odds ratios obtained from a retrospective likelihood are the same as those obtained from a prospective likelihood (based on the probability of disease given exposure) under a logistic formulation for the latter. Thus, case-control studies can be analyzed using a prospective likelihood, which generally involves fewer nuisance parameters than a retrospective likelihood. Seaman and Richardson (2004) proved a similar result in the Bayesian context. Specifically, they showed that the posterior distribution of the log-odds ratios based on a prospective likelihood with a uniform prior distribution on the log odds (that an individual with baseline exposure is diseased) is exactly equivalent to that based on a retrospective likelihood with a Dirichlet prior distribution on the exposure probabilities in the control group. Thus, Bayesian analysis of case-control studies can be carried out using a logistic regression model under the assumption that the data were generated prospectively.

We show that the results of Seaman and Richardson (2004) apply to the proposed semiparametric framework, thus enabling us to perform the analysis based on a prospective likelihood even though a case-control study is retrospective in nature. We perform model checking based on the posterior predictive loss criterion (Gelfand and Ghosh 1998). Once the optimal model is identified, model assessment is carried out using case deletion diagnostics (Bradlow and Zaslavsky 1997).

[...] (Etzioni et al.
1999). This dataset is based on a biomarker based screening procedure for prostate cancer to elucidate the association between prostate cancer and prostate-specific antigen (PSA). The effectiveness of biomarker based screening procedures for prostate cancer is currently a topic of intense debate and investigation in the realms of health care practice, policy and research. Since the discovery of prostate-specific antigen (PSA) and the observation that serum PSA levels may be significantly increased in prostate cancer patients, a lot of effort has been dedicated to identifying effective PSA based testing programs with favorable diagnostic properties.

In this study, the levels of free and total PSA were measured in the sera of 71 prostate cancer cases and 70 controls. Participants in this study included men aged 50 to 65 at high risk of lung cancer. They were randomized to receive either placebo or Beta Carotene and Retinol. The intervention had no noticeable effect on the incidence of prostate cancer, with similar numbers of cases observed in the intervention and control arms. Several PSA measurements recorded for the cases were taken as long as 10 years prior to their diagnosis. The 71 prostate cancer cases were diagnosed between September 1988 and September 1995 inclusive. The individuals deemed controls were selected among individuals not yet diagnosed as having cancer by the time of analysis. As the exposure variable, we use the natural logarithm of the total PSA (Ptotal), although the negative logarithm of the ratio of free to total PSA (Pratio) can also be considered. In addition to the above measurements, observations were collected on time (years) relative to prostate cancer diagnosis and age at blood draw for the cases [...]

Figure 2-1 shows the PSA trajectory against age for some randomly chosen cases and controls. Etzioni et al. (1999) analyzed this dataset by modeling the receiver operating characteristic (ROC) curves associated with both the biomarkers (Ptotal and Pratio) as a function of the time with respect to diagnosis. They observed that although the two markers performed similarly eight years prior to diagnosis, Ptotal was superior to Pratio at times closer to diagnosis.
The rest of the chapter is organized as follows. In Section 2.2, we introduce the semiparametric modeling framework. Section 2.3 describes the details of posterior inference. In Section 2.4, we discuss relevant Bayesian equivalence results for our framework. Section 2.5 outlines the model comparison and model assessment procedures we performed. We describe the data analysis results based on the prostate cancer dataset in Section 2.6 and end with a discussion in Section 2.7.

2.2.1 Notation

[...]

Figure 2-1. Longitudinal exposure (PSA) profiles of 3 randomly sampled cases (1st column) and 3 randomly sampled controls (2nd column) plotted against age.

Our modeling framework bears some resemblance to that of Zhang et al. (2007), who used a two-stage functional mixed model approach for modeling the effect of a longitudinal covariate profile on a scalar outcome. They proposed a linear functional mixed effects model for modeling the repeated measurements on the covariate. The effect of the covariate profile on the scalar outcome was modeled using a partial functional linear model. In doing so, they treated the unobserved true subject-specific covariate time profile as a functional covariate. For fitting purposes, they developed a two-stage nonparametric regression calibration method using smoothing splines. Thus, estimation at both stages was conveniently cast into a unified mixed model framework by using the relation between smoothing splines and mixed models. The key differences between their framework and ours are that we use Bayesian inferential techniques to simultaneously estimate the parameters of the exposure and disease models and that, instead of a linear modeling framework, we use a combination of linear and logistic models since our response is binary.

[...]
$$Y_{ij} = f(a_{ij}) + g_i(a_{ij}) + e_{ij}$$
where $e_{ij} \sim N(0, \sigma_e^2)$, $f(a)$ is the population mean function modeling the overall PSA trend as a function of age for all the subjects, while $g_i(a)$ is the subject-specific deviation function reflecting the deviation of the $i$th subject-specific profile from the mean population profile.
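The reason for modeling exposure as a function of age is that for a randomly chosen subject with unknown disease status, the PSA value at a certain time point should depend on the subject's age at that time point, controlling for the time with respect [...]

We represent both $f(a_{ij})$ and $g_i(a_{ij})$ using P-splines as follows
$$f(a_{ij}) = \Phi_{p,\tau}(a_{ij})'\beta, \qquad g_i(a_{ij}) = \Phi_{q,\eta}(a_{ij})'b_i$$
where $\Phi_{p,\tau}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^p, (a_{ij}-\tau_1)_+^p, \ldots, (a_{ij}-\tau_K)_+^p]'$ and $\Phi_{q,\eta}(a_{ij}) = [1, a_{ij}, \ldots, a_{ij}^q, (a_{ij}-\eta_1)_+^q, \ldots, (a_{ij}-\eta_M)_+^q]'$ are truncated polynomial basis functions of degrees $p$ and $q$ with knots $(\tau_1, \ldots, \tau_K)$ and $(\eta_1, \ldots, \eta_M)$ respectively (Durban et al. 2004). Generally, $M \le K$.

[...]
$$P(D_i = 1 \mid X_i(\cdot)) = L\left(\alpha + \int_{-c}^{0} X_i(t+a_{di})\,\gamma(t+a_{di})\,dt\right)$$
where $L(\cdot)$ is the logistic distribution function, $X_i(t+a_{di})$ is the true, error-free, unobserved subject-specific exposure profile modeled as $f(t+a_{di}) + g_i(t+a_{di})$, while $\gamma(t+a_{di})$ is an unknown smooth function of age which reflects the time pattern of the effect of the PSA trajectory on the current disease status for the $i$th subject. In the disease model, we use the relation $a_{ij} = t_{ij} + a_{di}$ to model the exposure trajectory $X(\cdot)$ and the influence function $\gamma(\cdot)$ as functions of time with respect to diagnosis. In doing so, we can easily assess the effect of the trajectory on the current disease state at any given point before diagnosis for a particular subject. $c$ is the time by which we go back in the past to record the exposure history for the $i$th subject; e.g. $c = 8$ would imply that, for the $i$th subject, the exposure observations recorded since eight years prior to diagnosis are being considered for analysis. Thus, by changing the value of $c$, the effect of different lengths of PSA trajectories on the current disease status can be studied.

[...]
$$\gamma(t+a_{di}) = \Phi_{r,\xi}(t+a_{di})'\gamma$$
where $\Phi_{r,\xi}(t+a_{di}) = [1, (t+a_{di}), \ldots, (t+a_{di})^r, (t+a_{di}-\xi_1)_+^r, \ldots, (t+a_{di}-\xi_K)_+^r]'$, $\gamma = (\gamma_0, \ldots, \gamma_{K+r})'$ and $(\xi_1, \ldots, \xi_K)$ are the knots.

As special cases of the disease model, we may consider $\gamma(t+a_{di}) = \gamma_0$, in which case the covariate is the area under the PSA process $\{X_i(t+a_{di}), -c \le t \le 0\}$ and $\gamma_0$ is its effect on the disease probability (or logit of the disease probability). We can also assume $\gamma(t+a_{di}) = \gamma_0 + \gamma_1(t+a_{di})$, which signifies a linear pattern of the effect of the exposure trajectory on the disease probability. In the above models, the knots can be chosen on a grid of equally spaced quantiles of the ages.

Replacing the spline representations of the exposure profile and the influence function in the right-hand side of the disease model, we have
$$\mathrm{logit}\, P(D_i = 1) = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$$
where $M_i = \int_{-c}^{0} \Phi_{p,\tau}(t+a_{di})\,\Phi_{r,\xi}(t+a_{di})'\,dt$ and $Q_i = \int_{-c}^{0} \Phi_{q,\eta}(t+a_{di})\,\Phi_{r,\xi}(t+a_{di})'\,dt$.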
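As a rough numerical illustration of the matrix $M_i$ defined above, the sketch below approximates the integral of the outer product of two truncated polynomial bases over the exposure window by composite Simpson's rule. This is a cross-check under assumptions, not the closed-form computation referred to in the text; the function names `basis` and `cross_moment_matrix`, the knot at age 55 and the values $a_{di} = 60$, $c = 8$ are all hypothetical.

```python
def basis(u, degree, knots):
    """Truncated polynomial basis [1, u, ..., u^d, (u-k_1)_+^d, ...] at age u (degree >= 1)."""
    poly = [u ** j for j in range(degree + 1)]
    trunc = [(u - k) ** degree if u > k else 0.0 for k in knots]
    return poly + trunc


def cross_moment_matrix(age_dx, c, p, knots_p, r, knots_r, steps=2000):
    """Simpson approximation of the integral over t in [-c, 0] of
    basis_p(t + age_dx) * basis_r(t + age_dx)' (steps must be even)."""
    h = c / steps
    n_rows = p + 1 + len(knots_p)
    n_cols = r + 1 + len(knots_r)
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for s in range(steps + 1):
        # Simpson weights 1, 4, 2, 4, ..., 2, 4, 1
        weight = 1 if s in (0, steps) else (4 if s % 2 else 2)
        u = age_dx - c + s * h
        bp, br = basis(u, p, knots_p), basis(u, r, knots_r)
        for i in range(n_rows):
            for j in range(n_cols):
                out[i][j] += weight * bp[i] * br[j]
    return [[value * h / 3.0 for value in row] for row in out]


# Linear exposure basis with one knot at age 55, linear influence basis without knots.
M = cross_moment_matrix(age_dx=60.0, c=8.0, p=1, knots_p=[55.0], r=1, knots_r=[])
print(M)
```

In this toy setting the entries can be checked by hand: the first entry is the window length $c = 8$, and the entry pairing the linear exposure term with the constant influence term equals $c(a_{di} - c/2) = 448$, the same quantity that appears in the vertical-shift odds ratio for the linear influence model.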
For pre-chosen degrees of the basis functions and the knots, both $M_i$ and $Q_i$ are matrices and are available in closed form. We assume normal distributional forms for the spline coefficients in the exposure and influence models in order to penalize the jumps of the spline at the knots. Thus, we have $\beta_{p+k} \sim N(0, \sigma_\beta^2)$ $(k = 1, \ldots, K)$; $b_{i,q+m} \sim N(0, \sigma_b^2)$ $(m = 1, \ldots, M)$ and $\gamma_{k+r} \sim N(0, \sigma_\gamma^2)$ $(k = 1, \ldots, K)$. Finally, the random subject-specific deviation function $g_i(a_{ij})$ is modeled as $b_{ij} \sim N(0, \sigma_j^2)$ $(i = 1, \ldots, N;\ j = 0, \ldots, q)$.

2.3.1 Likelihood Function

The likelihood for the $i$th subject, conditional on the random effects, is given by [...] where $p(Y_{ij} \mid \beta, a_i, b_i, \sigma_e^2)$ is the probability distribution corresponding to the trajectory model, $p(D_i \mid \alpha, \beta, \gamma)$ denotes the logistic distribution corresponding to the disease model, while the rest deals with the distributional structures on the spline coefficients and random effects.

Since the trajectory model has a normal distributional structure while the disease model has a logistic structure, the likelihood function and hence the posterior have a complicated form. To alleviate this problem, we approximate the logistic distribution as a mixture of normals using a well known data augmentation algorithm proposed by Albert and Chib (1993). This is briefly explained in Section 3.3.

Likelihood Approximation

[...] Albert and Chib (1993) to approximate the likelihood and thus simplify posterior inference. They showed that a logistic regression model on binary outcomes can be well approximated by an underlying mixture of normal regression structure on latent continuous data. In doing so, it can be shown that a logit link is approximately equivalent to a Student-t link with 8 degrees of freedom.
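The closeness of the logit and Student-t(8) links mentioned above can be checked numerically. The sketch below is only an illustration under assumptions of its own: the t distribution is rescaled so that its variance matches the logistic variance $\pi^2/3$ (a convenient choice made here, not necessarily the scaling used by Albert and Chib), and the t cdf is obtained by Simpson integration of the density so that only the standard library is needed. The function names are hypothetical.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def t_pdf(t, df):
    """Density of the standard Student-t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return const * (1.0 + t * t / df) ** (-(df + 1) / 2.0)


def t_cdf(x, df, steps=2000):
    """Student-t cdf via composite Simpson integration of the density on [0, |x|]."""
    if x < 0:
        return 1.0 - t_cdf(-x, df, steps)
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3.0


def max_gap(df=8, lo=-8.0, hi=8.0, n=161):
    """Largest absolute difference between the logistic cdf and a variance-matched t cdf."""
    # Scale chosen so that scale * t_df has the logistic variance pi^2 / 3 (an assumption).
    scale = math.sqrt((math.pi ** 2 / 3.0) / (df / (df - 2.0)))
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return max(abs(logistic_cdf(x) - t_cdf(x / scale, df)) for x in grid)


print(max_gap())
```

On this grid the two distribution functions stay within about a hundredth of each other, which is the sense in which the t(8) link can stand in for the logit link during data augmentation.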
As in Albert and Chib (1993), we introduce latent variables $Z_1, Z_2, \ldots, Z_N$ such that $D_i = 1$ if $Z_i > 0$ and $D_i = 0$ otherwise. Let $Z_i$ be independently distributed from a $t$ distribution with location $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$, scale parameter 1 and $\nu$ degrees of freedom. Equivalently, with the introduction of the additional random variable $\lambda_i$, the distribution of $Z_i$ can be expressed as a scale mixture of normal distributions
$$Z_i \mid \lambda_i \sim N(H_i, \lambda_i^{-1}), \qquad \lambda_i \sim \mathrm{Gamma}(\nu/2, \nu/2)$$
[...]

Since the marginal posterior distribution of the parameters is analytically intractable, we construct an MCMC algorithm to sample from the full conditionals. In doing so, we use multiple chains and monitor convergence of the samplers using Gelman and Rubin diagnostics (Gelman and Rubin 1992).

As noted in Section 1.2, Seaman and Richardson (2004) showed that for certain choices of the priors on the log odds, posterior inference for the parameter of interest based on a prospective logistic model can be shown to be equivalent to that based on a retrospective one. As a result, a prospective modeling framework can be used to analyze case-control data, which are generally collected retrospectively. Here we show that the Bayesian equivalence results of Seaman and Richardson (2004) can be extended to the semiparametric framework we have proposed. This enables us to use a prospective logistic framework (as described in Section 2.2.2) to analyze the PSA dataset.

Our modeling framework hinges on the idea that for every subject, instead of a single exposure observation, a series of past exposure observations is available. We use this exposure trajectory or exposure profile in analyzing the present [...] Rubin (1981) and later by Gustafson et al. (2002) can be applied to the trajectory as a whole, i.e. $\{X_i(t), -c \le t \le 0\}$ can be assumed to be a discrete random variable with support $\{Z_1(t), \ldots, Z_J(t), -c \le t \le 0\}$, the set of all observable exposure trajectories, where $\{Z_j(t), -c \le t \le 0, j = 1, \ldots, J\}$ is a finite collection of elements in the support of the $X_{ij}$'s. Let $Y_{0j}$ and $Y_{1j}$ be the number of controls and cases having exposure profile $\{Z_j(t), -c \le t \le 0\}$. We denote the null or baseline trajectory as $\{X(t) = 0, -c \le t \le 0\}$.
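The odds ratio of disease corresponding to $\{Z_j(t), -c \le t \le 0\}$ with respect to baseline exposure is $\exp\left\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\right\}$. Assuming that a control has exposure profile $\{Z_j(t), -c \le t \le 0\}$ with probability $\delta_j / \sum_{k=1}^{J} \delta_k$, it can be easily shown that
$$P(X(t) = Z_j(t), -c \le t \le 0 \mid D = 1) = \frac{\delta_j \exp\left\{\int_{-c}^{0} Z_j(t)\,\gamma(t)\,dt\right\}}{\sum_{k=1}^{J} \delta_k \exp\left\{\int_{-c}^{0} Z_k(t)\,\gamma(t)\,dt\right\}}.$$
[...] since $\gamma(t) = \Phi(t)'\gamma = \gamma'\Phi(t)$ by the spline representation of the influence function. We assume $\delta_1 = 1$ for identifiability. Here $d = 0$ and $1$ stand for controls and cases respectively. Assuming $\vartheta$ to be the baseline odds of disease, the prospective likelihood is given by [...]

Based on the above setup, we have the following equivalence results:

[...] with respect to $\delta$ is the same as that obtained by maximizing $L(\vartheta, \gamma)$ with respect to $\vartheta$.

[...]

(ii) Assuming $\delta = (\delta_1, \ldots, \delta_J)$ and $\pi_j = \delta_j / \sum_{k=1}^{J} \delta_k$, the posterior density of $(\pi, \gamma)$ is [...]

(iii) The marginal posterior densities of $\gamma$ obtainable from $p(\omega, \gamma \mid y)$ and $p(\pi, \gamma \mid y)$ are the same.

The proofs of the above theorem are similar in nature to those in Seaman and Richardson (2004) and are given in Appendix A. Since we have considered a near uniform prior for the baseline log odds and our prior on the influence parameters ensures the existence and finiteness of the required expectation, the conditions of Theorem 2 are essentially satisfied for our framework.

Based on the above results, it can be concluded that the marginal posterior distribution of $\gamma$, the parameter of interest, will be the same regardless of whether we fit a prospective or retrospective model. Thus, we can analyze the PSA data using the prospective semiparametric modeling framework described above. Bayesian equivalence can also be shown in the more general case of a multicategory case-control setup, i.e. when there are multiple (>2) disease states. We have the following result [...]

The proof of the above theorem is given in Appendix A.

2.5.1 Posterior Predictive Loss

[...] Gelfand and Ghosh (1998). This criterion is based on the idea that an optimal model should provide accurate prediction of a replicate of the observed data.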
Gelfand and Ghosh (1998) obtained this criterion by minimizing the posterior loss for a given model and then, for all models under consideration, selecting the one which minimizes this criterion. For a general loss function, this criterion can be expressed as a linear combination of two distinct parts, i.e. a goodness-of-fit part and a penalty part. For our framework, the posterior predictive loss can be written as
$$\mathrm{PPL} = \frac{k}{k+1} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2 + \sum_{i=1}^{N} \mathrm{Var}(\hat{D}_i)$$
where $\hat{D}_i = E(D_i^{rep} \mid y, D)$ and $\mathrm{Var}(\hat{D}_i) = \mathrm{Var}(D_i^{rep} \mid y, D) = E(D_i^{rep} \mid y, D) - (E(D_i^{rep} \mid y, D))^2$. For our framework, $D^{rep} = (D_1^{rep}, \ldots, D_N^{rep})$ is the replicated disease status vector for all the subjects. It is straightforward to calculate the expected value of the above criterion using the posterior samples obtained from the Gibbs sampler. Lower values of this criterion imply a better model fit. We assume $k = 1$ and obtain the values of posterior predictive loss for different lengths of exposure trajectories and different numbers of knots. The results are given in Table 2-3 and explained in Section 2.6. For the optimal model selected using the posterior predictive loss criterion, model assessment was performed using Kappa measures of agreement and case deletion diagnostics. The methodology is described below.

[...] Agresti 2002) which compares agreement against that which might be expected by chance. The value of $\kappa$ ranges from $-1$ to $1$; $\kappa = 1$ implies perfect agreement while $\kappa = -1$ implies complete disagreement. A value of 0 indicates no agreement above and beyond that expected by chance.

The observed disease status (vis-a-vis case or control status) of a subject is obtained from the dataset while the predicted disease status is calculated from the posterior estimates of the parameters. At iteration $n$ of the Gibbs sampler, we can calculate the quantity
$$\hat{p}_i^{(n)} = \hat{P}^{(n)}(D_i = 1 \mid X_i(t+a_{di}), t \in [-c, 0]) = L^{(n)}(\alpha + \beta' M_i \gamma + b_i' Q_i \gamma)$$
where $L(\cdot)$ can be either the exact logit cdf or the approximate Student-t cdf (with 8 degrees of freedom). Based on the value of $\hat{p}_i^{(n)}$, we can assign
$$\hat{D}_i^{(n)} = \begin{cases} 1 & \text{if } \hat{p}_i^{(n)} > 0.5 \\ 0 & \text{if } \hat{p}_i^{(n)} \le 0.5 \end{cases}$$

[...] Hampel et al. 1987). These diagnostics can be used to detect observations with an unusual effect on the fitted model and thus may lead to identification of data or model errors.
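The posterior predictive loss and the kappa measure described above can both be computed directly from sampler output. The following is a minimal sketch, assuming the replicated disease statuses are stored as a list of 0/1 vectors (one per posterior draw); the function names `posterior_predictive_loss` and `cohen_kappa` and the tiny inputs are hypothetical and are not the dissertation's code.

```python
def posterior_predictive_loss(observed, replicates, k=1.0):
    """PPL = k/(k+1) * sum (D_i - Dhat_i)^2 + sum Var(D_i^rep) for binary outcomes.

    observed   : list of observed 0/1 disease statuses
    replicates : list of posterior predictive draws, each a list of 0/1 values
    """
    n_draws = len(replicates)
    fit, penalty = 0.0, 0.0
    for i, d in enumerate(observed):
        mean_i = sum(draw[i] for draw in replicates) / n_draws
        fit += (d - mean_i) ** 2
        penalty += mean_i * (1.0 - mean_i)  # variance of a 0/1 replicate: E - E^2
    return k / (k + 1.0) * fit + penalty


def cohen_kappa(observed, predicted):
    """Chance-corrected agreement between two 0/1 classifications."""
    n = len(observed)
    p_agree = sum(o == p for o, p in zip(observed, predicted)) / n
    obs1 = sum(observed) / n
    pred1 = sum(predicted) / n
    p_chance = obs1 * pred1 + (1.0 - obs1) * (1.0 - pred1)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_agree - p_chance) / (1.0 - p_chance)


print(posterior_predictive_loss([1, 0], [[1, 0], [1, 1]]))
print(cohen_kappa([1, 0, 1, 0], [1, 0, 1, 0]))
```

In a full analysis, kappa would be evaluated at each sampler iteration using the thresholded predictions, which yields a posterior distribution for the agreement measure rather than a single number.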
Bradlow and Zaslavsky (1997) applied case influence tools in [...]

Let $H_i = \alpha + \beta' M_i \gamma + b_i' Q_i \gamma$ and $S_{ij} = \Phi_{p,\tau}(a_{ij})'\beta + \Phi_{q,\eta}(a_{ij})'b_i$. Suppose $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ is the density function corresponding to the trajectory model, while $L(D_i \mid H_i)$ is the one for the disease model. We worked with the following three types of weighting schemes based on those proposed by Bradlow and Zaslavsky (1997) [...]

Here $n$ denotes the $n$th iteration of the Gibbs sampler, the subscript $-i$ denotes the deletion of $y_i$ and the superscript $*$ denotes unnormalized weights. In the last weighting scheme, $L(Y_{ij} \mid S_{ij}, \sigma_e^2)$ and $L(D_i \mid H_i)$ are the usual likelihoods with the population level parameters, i.e. $(\alpha, \beta, \gamma, \sigma_e^2)$, replaced by the full data posterior medians. Here the full data posterior is the posterior distribution obtained from the complete dataset, i.e. the one having all the subjects.

[...] 2.2 to analyze the prostate cancer dataset described in Section 2.1.2. Multiple observations on free and total PSA were obtained for 71 prostate cancer cases and 70 controls. For some subjects, observations were collected as far as 10 years prior to diagnosis. We use the natural logarithm of total PSA (Ptotal) as our exposure of interest. Our principal [...]

For the purpose of our analysis, we have used a linear P-spline ($p = 1$) with a subject-specific slope parameter to model the exposure trajectory as follows [...]

For the prospective disease model, we considered two specific scenarios, viz. constant influence, $\gamma(t+a_{di}) = \gamma_0$, and linear influence, $\gamma(t+a_{di}) = \gamma_0 + \gamma_1(t+a_{di})$. The results for these two cases are summarized below.

On fitting the above model, we observed that for all trajectory lengths, $\gamma_0$ is significant (its 95% credible interval does not contain 0). For any particular interval (i.e. choice of $c$), the posterior means and 95% credible intervals of $\gamma_0$ do not change much with the number of knots ($K$). In addition, $\gamma_0$ increases as the trajectory length decreases, i.e. as we move closer to the point of diagnosis. This is likely related to the scale of the area under the PSA process, but it also seems to support the well known medical fact that total PSA is a better discriminator of prostate cancer at times closer to diagnosis than at times further off (Catalona et al.
1998). To assess the impact of only the past PSA observations on the current disease state, we considered the exposure interval $I = (-10, -5)$ and 3 knots in the trajectory. The posterior mean of $\gamma_0$ is 0.298 [...]

[...] 2.6.3 [...]

Parameterizing $\{Z_i(t+a_{di}), -c \le t \le 0\}$ as $\Phi_{p,\tau}(t+a_{di})'\phi + \Phi_{q,\eta}(t+a_{di})'d_i$, as in the trajectory model, we can rewrite the odds ratio as
$$\exp\left\{(\phi-\beta)'\int_{-c}^{0}\Phi_{p,\tau}(t+a_{di})\,\Phi_{r,\xi}(t+a_{di})'\,dt\;\gamma\right\}\exp\left\{(d_i-b_i)'\int_{-c}^{0}\Phi_{q,\eta}(t+a_{di})\,\Phi_{r,\xi}(t+a_{di})'\,dt\;\gamma\right\}.$$
[...]
$$\exp\left\{m\int_{-c}^{0}\gamma(t+a_{di})\,dt\right\} = \exp\{c\,m\,(\gamma_0 + (a_{di} - c/2)\,\gamma_1)\}.$$

[...] Table 2-1 shows the posterior means and 95% credible intervals of the odds ratios corresponding to different trajectory lengths and ages at diagnosis when $m = 0.5$. For a fixed trajectory length, the odds ratios decrease as age at diagnosis increases. This seems to support the notion that younger subjects tend to have a more aggressive form of prostate cancer than older ones and thus are most likely to benefit from early detection (Catalona et al. 1998). For most ages at diagnosis, the odds ratios steadily increase as longer exposure trajectories are considered, i.e. as past exposure observations are taken into account. However, the rate of increase is higher for lower age at diagnosis. Thus, consideration of past exposure observations in addition to recent ones results in a significant gain in information about the current disease status of a subject. Finally, for the highest age at diagnosis considered (80), the odds ratios decrease as longer exposure trajectories are considered. This may imply that for a subject with a very high age at diagnosis, his past exposure observations may not contain significant amounts of information about the present disease status.

Table 2-1. Estimates of odds ratios for different trajectory lengths and age at diagnosis for a 0.5 vertical shift of the exposure trajectory for the linear influence model. Columns: Age, $(-3, 0)$, $(-5, 0)$, $(-8, 0)$, $(-10, 0)$. [...]

As before, we fitted the disease model on the interval $I = (-10, -5)$. The posterior means and 95% credible intervals of $\gamma_0$ and $\gamma_1$ are respectively 1.24 (0.29, 2.19) and -0.015 (-0.029, 0.003), implying that exposure observations recorded 5-10 years prior to diagnosis also have a significant effect on the current disease status. The posterior means and 95% credible intervals of the odds ratios shown in Table 2-2 corroborate the above conclusion.
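The vertical-shift odds ratio for the linear influence model can be evaluated with a few lines of code. The sketch below is an illustration, not the dissertation's computation: the function `shift_odds_ratio` is a hypothetical helper that integrates $\gamma_0 + \gamma_1(t + a_{di})$ analytically over an arbitrary window, so it covers both $(-c, 0)$ and the window $(-10, -5)$. Plugging in the reported posterior means of $\gamma_0$ and $\gamma_1$ gives only a rough plug-in value; it is not the posterior mean of the odds ratio, which averages the exponential over the whole posterior.

```python
import math


def shift_odds_ratio(m, gamma0, gamma1, age_dx, start, end=0.0):
    """Odds ratio for shifting the exposure trajectory up by m over [start, end].

    Computes exp{ m * integral_{start}^{end} [gamma0 + gamma1 * (t + age_dx)] dt },
    with t measured in years relative to diagnosis (so start is negative).
    """
    length = end - start
    integral = gamma0 * length + gamma1 * (age_dx * length + (end ** 2 - start ** 2) / 2.0)
    return math.exp(m * integral)


# Agreement with the closed form exp{c m (gamma0 + (a_d - c/2) gamma1)} on (-c, 0),
# using made-up coefficient values.
c, a_d, m, g0, g1 = 8.0, 60.0, 0.5, 1.0, -0.01
closed_form = math.exp(c * m * (g0 + (a_d - c / 2.0) * g1))
print(shift_odds_ratio(m, g0, g1, a_d, start=-c), closed_form)

# Plug-in values on I = (-10, -5) with the reported posterior means 1.24 and -0.015.
for age in (50, 60, 70, 80):
    print(age, round(shift_odds_ratio(0.5, 1.24, -0.015, age, start=-10.0, end=-5.0), 2))
```

The plug-in values fall from roughly 4.5 at age 50 to roughly 1.5 at age 80, reproducing the decreasing pattern in age at diagnosis discussed in the text, while sitting somewhat below the reported posterior means, as expected for a plug-in of a convex function.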
Table 2-2. Posterior means and 95% confidence intervals of odds ratio for $I = (-10, -5)$ for the linear influence model

| Age at Diagnosis | 50 | 60 | 70 | 80 |
|---|---|---|---|---|
| Mean | 4.99 | 3.27 | 2.22 | 1.56 |
| 95% C.I. | (1.96, 10.41) | (1.91, 5.36) | (1.67, 2.98) | (1.10, 2.29) |

Table 2-3. Posterior predictive losses (PPL) for the constant and linear influence models for varying exposure trajectories and knots. Columns: Knots; Model; $(-2, 0)$; $(-5, 0)$; $(-8, 0)$; $(-10, 0)$; $(-10, -5)$.

The PPL values for the linear model were smaller than those corresponding to the constant influence model. Thus, we can conclude that for the prostate cancer data, the class of linear influence models fits better than the class of constant influence models. For both setups, the model with 0 knots has the worst fit (highest PPL criterion) across all trajectory lengths. For a given trajectory, the models tend to improve with an increase in the number of knots until a certain number of knots is reached. A further increase in knots tends to worsen the fit; this agrees with the findings of Ruppert (2002). The important point to note here is that the number of knots and the length of the exposure trajectory seem to interact in their effect on model fit. The best-fitting constant influence model seems to be the one with exposure trajectory $(-10, 0)$ and 3 knots.

For the linear influence setup, the PPL criterion has a decreasing trend as longer exposure trajectories are taken into account. Thus, inclusion of past exposures results in an improvement of model fit. This may be indicative of the fact that past exposure observations contain a significant amount of information about the current disease status. In addition, for the trajectory interval $I = (-10, -5)$, the PPL criteria corresponding to the linear and constant influence models are moderately small. Thus, exposure observations recorded 5-10 years prior to diagnosis also provide a modest amount of information toward predicting the current disease status, corroborating the conclusions reached earlier. For the linear setup, the model with exposure trajectory $I = (-8, 0)$ and 4 knots performs the best (has the lowest PPL criterion among all the models considered).
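As a rough illustration of how a posterior predictive loss of the Gelfand and Ghosh (1998) type is assembled, the sketch below uses the squared-error version, a penalty term (sum of predictive variances) plus a weighted goodness-of-fit term. This is an assumption made for illustration: the exact loss used for the binary disease outcome is not recoverable from the text, and the function name, argument names and toy replicates are hypothetical.

```python
import numpy as np


def posterior_predictive_loss(y_obs, y_rep, k=np.inf):
    """Squared-error PPL: D = P + k/(k+1) * G.

    y_obs : observed responses, shape (n,)
    y_rep : posterior predictive replicates, shape (draws, n)
    P     : sum of predictive variances (penalty term)
    G     : sum of squared differences between predictive means and data
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    pred_mean = y_rep.mean(axis=0)
    P = float(y_rep.var(axis=0).sum())  # population variance over draws (ddof=0)
    G = float(((pred_mean - y_obs) ** 2).sum())
    weight = 1.0 if np.isinf(k) else k / (k + 1.0)
    return P + weight * G, G, P


# Toy replicates for two observations (made-up numbers).
D, G, P = posterior_predictive_loss([0.0, 1.0], [[0.0, 0.0], [2.0, 2.0]])
print(D, G, P)
```

Smaller values indicate a better trade-off between fit and predictive variability, which is how the criterion is read in the comparison above.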
For this model, the posterior mean of $\kappa$ was about 0.6 with 95% credible interval (0.535, 0.680), which indicates substantial agreement beyond what is expected by chance. We next performed case deletion analysis. We deleted each subject (with all the observations) rather than each observation for a subject. Figure 2-2 (a)-(c) shows the case-deleted posterior means and 95% credible intervals for three of the model parameters. (In

Figure 2-2 (d) shows the plot of the posterior means of the difference probabilities and the corresponding confidence intervals. (In this figure, the solid line represents zero difference. The solid points represent the difference in disease probabilities based on the full and case-deleted posteriors. The vertical line segments are the 95% posterior intervals of the differences.) Surprisingly, the observation for case number 108 has a significant departure from the rest. On analyzing this subject, it was found that it had the unique combination of very high age and very high values of PSA. In fact, it had the highest mean age in the sample and the highest age at diagnosis, along with the third-highest mean Ptotal value. These characteristics may have contributed to the exceptionally high difference in the predicted probability of disease.

We also performed case deletion analysis of the intercept parameters of the disease and trajectory models and the variance components. None of the subjects were found to be influential on the posterior estimates of these parameters. Thus, based on the above two measures, we may conclude that the semiparametric linear influence model with trajectory $I = (-8, 0)$ and 4 knots seems to fit the observed data relatively well.

Figure 2-2. Sensitivity of the parameter and disease probability estimates to case deletions.
In this work, we have applied semiparametric regression techniques in analyzing longitudinal case-control studies. We have used penalized regression splines in modeling the exposure trajectories for the cases and the controls. Thus our framework can be used even when exposure observations are collected at different time points across subjects, i.e. when exposures are unbalanced in nature. The exposure trajectory is used as the predictor in a prospective logistic model for the binary disease outcome. We have also modeled the slope parameter of the disease model as a p-spline to account for any time-varying influence pattern of the exposure trajectory on the current disease status. In doing so, we have summarized the exposure history for the cases and controls in a flexible way, which allowed us to consider differential lengths of the exposure trajectory in analyzing its effect on the current disease status. In order to simplify the analysis, we used the logit-mixture of normal approximation (Albert and Chib 1993). We showed that the Bayesian equivalence results of Seaman and Richardson (2004) essentially hold for our framework, thus allowing us to use a prospective logistic model having fewer nuisance parameters although the data set was collected retrospectively. The analysis has been carried out in a hierarchical Bayesian framework. Parameter estimates and associated credible intervals are obtained using MCMC samplers. We have applied our methodology to a longitudinal case control

We analyzed our model using differential lengths of exposure trajectories. In doing so, we have concluded that past exposure observations do provide significant information towards predicting the current disease status of a subject. Specifically, we have shown that across all age at diagnosis groups, the odds of disease steadily increase as past exposure observations are taken into account in addition to the recent ones. We also observed that for a fixed trajectory length, the odds of disease steadily decrease as the age at diagnosis increases, corroborating the medical fact that younger subjects tend to have a more aggressive form of prostate cancer and thus are most likely to benefit from early detection. We performed model comparison using posterior predictive loss (Gelfand and Ghosh 1998
). This criterion indicated that models with longer exposure trajectories tend to perform better than those with shorter trajectories. Lastly, model assessment was performed on the optimal model using the kappa statistic and case deletion diagnostics. Both these tools suggested that our model fits the data relatively well.

Some interesting extensions can be made to our setup. For richer data sets, it will be interesting to model the subject-specific deviation functions as p-splines. In addition, we have only assumed constant and linear parameterizations of the influence function of the prospective disease model. For a larger data set, a p-spline formulation can also be used for the influence function, which may bring out any underlying non-linear pattern of influence of the exposure trajectory on the current disease status. Although we have used a binary disease outcome, it will be interesting to extend our framework to accommodate multi-category disease states. Our modeling framework can also be generalized by incorporating a larger class of nonparametric distributional structures (like Dirichlet processes or Polya trees) for the subject-specific random effects.

The current methodology of the SAIPE program is based on combining state and county estimates of poverty and income obtained from the American Community Survey (ACS) with other indicators of poverty and income using the Fay-Herriot class of models (Fay and Herriot 1979). The indicators are generally the mean and median adjusted gross income (AGI) from IRS tax returns, SNAP benefits data (formerly known as Food Stamp Program data), the most recent decennial census, intercensal population estimates, Supplemental Security Income recipiency and other economic data obtained from the Bureau of Economic Analysis (BEA). Estimates from ACS have been used since January 2005 on the recommendation of the National Academy of Sciences Panel on Estimates of Poverty for Small Geographic Areas (2000). Income and poverty estimates until 2004 were based on data from the Annual Social and Economic Supplement (ASEC) of the Current Population Survey (CPS).
Apart from various poverty measures, the SAIPE program provides annual state and county level estimates of median household income. At this point, direct ACS estimates of median household income are only available for the period 2005-2008. Thus, for illustration purposes, we have considered data from ASEC for the period 1995-1999 in order to estimate the state level median household income for 1999. This is because the most recent census estimates correspond to the year 1999 and these census values can be used for comparison purposes. The SAIPE regression model for estimating the median household income for 1999 uses as covariates the median adjusted gross income (AGI) derived from IRS tax returns and the median household income estimate for 1999 obtained from the 2000 Census. The response variable is the direct estimate of median household income for 1999 obtained from the

Bell 1999). Noninformative prior distributions are placed on the regression parameter corresponding to the IRS median income since it was found to be statistically significant even in the presence of census data, both in the 1989 and 1999 models.

Fay (1987) in this regard. Estimation was carried out in an empirical Bayes (EB) framework suggested by Fay et al. (1993). Later, Datta et al. (1993) extended the EB approach of Fay (1987) and also put forward univariate and multivariate hierarchical Bayes (HB) models. The estimates from their EB and HB procedures significantly improved over the CPS median income estimates for 1979. Ghosh et al. (1996) exploited the repetitive nature of the state-specific CPS median income estimates and proposed a Bayesian time series modeling framework to estimate the statewide median income of four-person families for 1989. In doing so, they used a time-specific random component and modeled it as a random walk. They concluded that the bivariate time series model utilizing the median incomes of four- and five-person families performs the best and produces estimates which are much superior to both the CPS and Census Bureau estimates. In general, the time series model always performed better than its non-time series counterpart.

Opsomer et al.
(2008) in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines. In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. Theoretical results were presented on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. The methodology was used to analyze a non-longitudinal, spatial data set concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have viewed the state-specific annual household median income values as longitudinal profiles or income trajectories. This gained more ground because we used the statewide CPS median household income values for only five years (1995-1999) in our estimation procedure. Figure 3-1 shows sample longitudinal CPS median household income profiles for six states spanning 1995 to 2004, while Figure 3-2 shows the plots of the CPS median income against the IRS mean and median incomes for all the states for the years 1995 through 1999. It is apparent that CPS median income may have an underlying non-linear pattern with respect to IRS mean income, especially for large values of the latter. The above two features motivated us to use a semiparametric regression approach. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline) (Eilers and Marx 1996), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is

Figure 3-1. Longitudinal CPS median income profiles for 6 states plotted against IRS mean and median incomes. (1st column: IRS Mean Income; 2nd column: IRS Median Income.)

Gelfand and Ghosh 1998) has been used to obtain the parameter estimates.
We have compared the state-specific estimates of median household income for 1999 with the corresponding decennial census values in order to test for their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the SAIPE estimates. Interestingly, the positioning of the knots had a significant influence on the results, as will be discussed later on. We want to mention here that the SAIPE model had a considerable advantage over ours in that it used the census estimates of the median income for 1999 as a predictor. In small area estimation problems, the census estimates are regarded as the gold standard since these are the most accurate estimates available, with virtually negligible standard errors. So, using those as explanatory variables was an added advantage of the SAIPE state level models. The fact that our estimates still improve on the SAIPE model-based estimates is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the different states of the U.S.

The rest of the chapter is organized as follows. In Section 3.2 we introduce the two types of semiparametric models we have used. Section 3.3 goes over the hierarchical Bayesian analysis we performed. In Section 3.4, we describe the results of the data analysis with regard to the median household income data set. In Section 3.5, we discuss the Bayesian model assessment procedure we used to test the goodness-of-fit of our models. We end with a discussion in Section 3.6. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions.

Figure 3-2. Plots of CPS median income against IRS mean and median incomes for all the states of the U.S. from 1995 to 1999. (Panel B: IRS median income plot.)

3.2.1 General Notation

where $f(x_{ij})$ is an unspecified function of $x_{ij}$ reflecting the unknown response-covariate relationship.

We approximate $f(x_{ij})$ using a P-spline and rewrite (3) as

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + u_{ij}$ is our target of inference.
Here $X_{ij} = (1, x_{ij}, \ldots, x_{ij}^p)'$, $Z_{ij} = \{(x_{ij} - \tau_1)_+^p, \ldots, (x_{ij} - \tau_K)_+^p\}'$, $\beta = (\beta_0, \ldots, \beta_p)'$ is the vector of regression coefficients while $\gamma = (\gamma_1, \ldots, \gamma_K)'$ is the vector of spline coefficients. The above spline model with degree $p$ can adequately approximate any unspecified smooth function. Typically, linear ($p = 1$) or quadratic ($p = 2$) splines serve most practical purposes since they ensure adequate smoothness in the fitted curve. $m$ and $t$ respectively denote the number of small areas and the number of time points at which the response and covariates are measured. Thus, in our case, $m = 51$, for all the 50 states of the U.S. and the District of Columbia, and $t = 5$ for the years 1995-1999. $b_i$ is a state-specific random effect while $u_{ij}$ represents an interaction effect between the $i$th state and the $j$th year. We assume $b_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_b^2)$ and $\gamma \sim N(0, \sigma_\gamma^2 I_K)$. $\sigma_\gamma^2$ controls the amount of smoothing of the underlying income trajectory. Moreover, it is assumed

3.1.1. In the data sets provided by the Census Bureau, these estimates are given for all the states at each of the time points. The knots $(\tau_1, \ldots, \tau_K)$ are usually placed on a grid of equally spaced sample quantiles of the $x_{ij}$'s.

From (3) and (3), we have

3) and modeled it as a random walk as follows

where $\theta_{ij} = X_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

Before proceeding to the next section, we may note that unlike the models of Ghosh et al. (1996), the models given in (3) and (3) incorporate state-specific random effects ($b_i$). This rectifies a limitation of the former as pointed out in Rao (2003).

3.3.1 Likelihood Function

Here, $L(U \mid a, b)$ denotes a normal density with mean $a$ and variance $b$, while $L(b_i \mid \sigma_b^2)$ and $L(\gamma \mid \sigma_\gamma^2)$ denote normal distributions with mean 0 and variances $\sigma_b^2$ and $\sigma_\gamma^2$ respectively.

For the random walk model, the parameter space for the $i$th state would be $(\theta_i, \beta, \gamma, b_i, v, \psi^2, \sigma_b^2, \sigma_\gamma^2, \sigma_v^2)$, where $v = (v_1, \ldots, v_t)$ is the vector of time-specific random effects. Thus, the likelihood function for the $i$th state will have an extra component corresponding to $v$ as follows

where $L(v_j \mid v_{j-1}, \sigma_v^2)$ denotes a normal distribution with mean $v_{j-1}$ and variance $\sigma_v^2$, where $v_0 = 0$.
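The truncated polynomial design described in this section, a polynomial part (1, x, ..., x^p) plus one truncated power term per knot, with knots at equally spaced sample quantiles, can be sketched as follows. The function names, the knot probabilities k/(K+1) and the synthetic covariate are assumptions for illustration only; the sketch assumes degree p >= 1.

```python
import numpy as np


def quantile_knots(x, num_knots):
    """Knots at the equally spaced sample quantiles k/(K+1), k = 1..K (assumed convention)."""
    probs = np.arange(1, num_knots + 1) / (num_knots + 1)
    return np.quantile(x, probs)


def truncated_power_design(x, knots, degree=1):
    """Return (X, Z): polynomial columns 1, x, ..., x^p and truncated power columns (x - knot)_+^p."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0) ** degree
    return X, Z


# Synthetic stand-in for 51 areas x 5 years of a covariate (made-up values).
rng = np.random.default_rng(0)
x = rng.uniform(20000.0, 60000.0, size=255)
knots = quantile_knots(x, num_knots=5)
X, Z = truncated_power_design(x, knots, degree=1)
print(X.shape, Z.shape)
```

Each row of `X` and `Z` then plays the role of the fixed-effect and spline design vectors in the linear predictor; the random effects enter additively on top of these terms.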
Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+1})$, $(\psi_j^2)^{-1} \sim G(c_j, d_j)$ $(j = 1, \ldots, t)$, $(\sigma_b^2)^{-1} \sim G(c, d)$, $(\sigma_\gamma^2)^{-1} \sim G(c_\gamma, d_\gamma)$ and $(\sigma_v^2)^{-1} \sim G(c_v, d_v)$. Here $X \sim G(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$ having the expression $f(x) \propto x^{a-1} \exp(-bx)$, $x \ge 0$. Since we have chosen improper priors for $\beta$, posterior propriety of the full posterior has been shown. We have the following theorem

For the random walk model, there will be an additional term $(\sigma_v^2)$. By the conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \sigma_b^2, \sigma_\gamma^2, \{\psi_1^2, \ldots, \psi_t^2\} \mid Y, X, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\psi_1^2, \ldots, \psi_t^2\}, X, Z]\,[b \mid \sigma_b^2]\,[\gamma \mid \sigma_\gamma^2]\,[\beta]\,[\sigma_\gamma^2]\,[\sigma_b^2] \prod_{j=1}^{t} [\psi_j^2]$$

Gelman and Rubin (1992) and run $n$ ($\ge 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

3.2.2 to analyze the median household income data set referred to in Section 3.1.3. The response variable $Y_{ij}$ and the covariates $X_{ij}$ respectively denote the CPS median household income estimate and the corresponding IRS mean (or median) income estimate for the $i$th state at the $j$th year ($i = 1, \ldots, 51$; $j = 1, \ldots, 5$). The state-specific mean or median income figures are obtained from IRS tax return data. The Census Bureau gets files of individual tax return data from the IRS for use in specifically approved projects such as SAIPE. For each state, the IRS mean (median) income is the mean (median) adjusted gross income (AGI) across all the tax returns in that state. Like other SAIPE model covariates obtained from administrative records data, these variables do not exactly measure the median income across all households in the state. One of the reasons for this is that the AGI would not necessarily be the same as the exact income figure, and the tax return universe does not cover the entire population, i.e. some households do not need to file tax returns, and those that do not are likely to differ in regard to income from those that do. However, the use of the mean or median AGI as a covariate only requires it to be correlated with median household income, not necessarily to be the same thing. Specifically for this study, we have used IRS mean income as our covariate. This is because it seems to possess

3-2A), and so it is more suited to a semiparametric analysis.

In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 3.2.2. We have used the truncated polynomial basis for the P-spline component in both the models. Since Fig 2a does not indicate a high degree of non-linearity, we have restricted

Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (IRS mean income).

Gelman and Rubin (1992). We ran three independent chains, each with a sample size of 10,000 and with a burn-in sample of another 5,000. We initially sampled the $\theta_{ij}$'s from t-distributions with 2 df having the same location and scale parameters as the corresponding normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing certain samples of the chain from overdispersed distributions. However, once initialized, the successive samples of the $\theta_{ij}$'s are generated from regular univariate normal distributions. Convergence of the Gibbs sampler was monitored by visually checking the dynamic trace plots and acf plots and by computing the Gelman-Rubin diagnostic. The comparison measures deviated slightly for different initial values. We chose the least of those as the final measures presented in the tables that follow.

We fitted Model I (SPM) with all possible knot choices from 0 to 40, but the best results were achieved with 5 knots. The estimates (with 5 knots) improved significantly over the CPS estimates based on all the four comparison measures. Addition of more knots seemed to degrade the fit of the model. This may happen, as pointed out in Ruppert (2002). On the other hand, the SAIPE model-based estimates were slightly superior to the SPM estimates.
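The definitions of the four comparison measures did not survive in the text above, so the sketch below assumes the forms commonly used in this literature for the abbreviations that appear in the later tables: average relative bias (ARB), average squared relative bias (ASRB), average absolute bias (AAB) and average squared deviation (ASD), each averaged over areas with the census value as the reference. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def comparison_measures(census, estimate):
    """Assumed definitions, averaged over the small areas:

    ARB  = mean(|c - e| / c)       ASRB = mean((c - e)^2 / c^2)
    AAB  = mean(|c - e|)           ASD  = mean((c - e)^2)
    """
    census = np.asarray(census, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    diff = census - estimate
    return {
        "ARB": float(np.mean(np.abs(diff) / census)),
        "ASRB": float(np.mean(diff ** 2 / census ** 2)),
        "AAB": float(np.mean(np.abs(diff))),
        "ASD": float(np.mean(diff ** 2)),
    }


# Toy check with two areas (made-up values, not income data).
toy = comparison_measures([100.0, 200.0], [110.0, 190.0])
print(toy)
```

Under these definitions, lower values of all four measures indicate estimates closer to the census figures.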
Next, we fitted the semiparametric random walk model (SPRWM) to our data. Overall, the random walk structure led to some improvement in the performance of the estimates. However, for the model with 5 knots, the performance of the estimates remained nearly the same. This may be because 5 knots are sufficient to capture the underlying pattern in the income trajectory and the random walk component does not lead to any further improvement. Last but not least, the random walk model estimates, although generally better than those of the basic semiparametric model, still cannot claim to be superior to the SAIPE estimates for all the comparison measures. Table 3-1 reports the posterior mean, median and 95% CI for the parameters of the SPRWM with 5 knots.

It is of interest that the 95% CIs for $\gamma_1$, $\gamma_4$ and $\gamma_5$ do not contain 0, indicating the significance of the first, fourth and fifth knots. This is indicative of the relevance of knots in the penalized spline fit on the CPS median income observations. The same is true for the coefficients of SPM.

Table 3-1. Parameter estimates of SPRWM with 5 knots. Columns: Parameter; Mean; Median; 95% CI.

3.1.1, the SAIPE state models use the census estimates of median income (for 1999) as one of the predictors, which essentially gives them a big edge over us. This may be one of the reasons why the estimates obtained from the semiparametric models are at most comparable, but not superior, to the SAIPE estimates. But that does not rule out the fact that the semiparametric models have room for improvement. In this section, we will look for any possible deficiencies in our models and will try to come up with some improvements, if there are any.

As mentioned in Section 3.4.1, selection and proper positioning of knots play a pivotal role in capturing the true underlying pattern in a set of observations. Poorly placed knots do little in this regard and can even lead to an erroneous or biased estimate of the underlying trajectory. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable to accurately capture the underlying observational pattern.
Figures 3-3A and 3-3B show the exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. In both cases, the knots are placed on a grid of equally spaced sample quantiles of IRS mean income. In both figures, the knots lie to the left of IRS mean = 50000, the region where the density of observations is high. The knots tend to lie in this region because they are selected based on quantiles, which are a density-dependent measure. Thus, in both figures, the coverage area of knots (i.e. the part of the observational pattern which is captured by the knots) is the region to the left of the dotted vertical lines. On the other hand, the non-linear pattern is tangible only in the low-density area of the plot, i.e. the region lying to the right of IRS mean = 50000. Evidently, none of the knots lie in this part of the graph. Thus, we can presume that in both cases (5 and 7 knots), the underlying non-linear observational pattern is not being adequately captured.

Figure 3-3. Exact positions of 5 and 7 knots in the plot of CPS median income against IRS mean income. The knots are depicted as the boldfaced triangles at the bottom. (Panel B: Positioning of 7 Knots.)

As a natural solution to this issue, we decided to place half of the knots in the low-density region of the graph and the other half in the high-density region. The exact boundary line between the high-density and low-density regions is hard to determine. We tested different alternatives and came up with IRS mean = 47000 as a tentative boundary because it gave the best results. In both regions, we placed the knots at equally spaced sample quantiles of the independent variable. Figure 3-4 shows the new knot positions for 5 knots.

Figure 3-4. Positions of 5 knots after realignment. The knots are the boldfaced triangles at the bottom. The region between the dashed and bold lines is the additional coverage area gained from the realignment.

It is clear from Figure 3-4 that the new knots are more dispersed throughout the range of IRS mean than the old ones. The region between the bold and dashed vertical lines denotes the additional coverage that has been achieved with the knot
rearrangement. Based on the number of data points inside this region, it is clear that a much larger proportion of observations has been captured with the knot realignment. No knots are in the region beyond the bold vertical lines (i.e. beyond IRS mean 56000), possibly due to the very low density of the observations in that area. Overall, it seems that the new knots can capture some of the underlying non-linear pattern in the data set, which the old knots failed to achieve. We also experimented by placing all the knots in the low-density region (beyond IRS mean = 47000), but the results were not satisfactory. This indicates that the knots should be uniformly placed throughout the range of the independent variable to get an optimal fit.

We have worked with 5 knots because this choice performed consistently well for both the SPM and SPRW models. On fitting the semiparametric models with the new knot alignment, we did achieve some improvement in the results. Table 3-2 reports

Table 3-3 depicts the percentage improvement of the semiparametric estimates over the CPS and SAIPE estimates. Here, SPM(5) and SPRWM(5) respectively denote the semiparametric models with the realigned 5 knots.

Table 3-2. Comparison measures for SPM(5) and SPRWM(5) estimates with knot realignment

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| CPS | 0.0415 | 0.0027 | 1,753.33 | 5,300,023 |
| SAIPE | 0.0326 | 0.0015 | 1,423.75 | 3,134,906 |
| SPM(5) | 0.028 | 0.0012 | 1,173.71 | 2,334,379 |
| SPRWM(5) | 0.0295 | 0.0013 | 1,256.08 | 2,747,010 |

Table 3-3.
Percentage improvements of SPM(5) and SPRWM(5) estimates over SAIPE and CPS estimates

| Estimate | Model | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|---|
| SAIPE | SPM(5) | 14.11% | 20.00% | 17.56% | 25.54% |
| SAIPE | SPRWM(5) | 9.51% | 13.33% | 11.78% | 12.37% |
| CPS | SPM(5) | 32.53% | 55.55% | 33.06% | 55.96% |
| CPS | SPRWM(5) | 28.92% | 51.85% | 28.36% | 48.17% |

It is clear that, with the knot realignment, the comparison measures corresponding to the semiparametric estimates have decreased substantially, especially for the SPM. The new comparison measures for the semiparametric models are considerably lower than those corresponding to the SAIPE estimates. Thus, we may say that the semiparametric model estimates perform better than the SAIPE estimates with the realigned knots. This improvement is apparently due to the additional coverage of the observational pattern that is being achieved with the relocation of the knots. As a result of this increased coverage, a larger proportion of the underlying non-linear pattern in the observations is being captured by the new knots. Although we have done this exercise with only 5 knots, it would be interesting to experiment with other types of knot alignment

Tables 3-4 and 3-5 report the posterior mean, median and 95% CI for the parameters in SPM(5) and SPRWM(5) respectively.

Table 3-4. Parameter estimates of SPM(5)

Table 3-5. Parameter estimates of SPRWM(5)

It is of interest to note that, with the knot realignment, all the knot coefficients (i.e. the $\gamma$'s) are significant for both SPM and SPRWM. For the old configuration, some of the knot coefficients were not significant for the models. This corroborates the fact that, with the knot realignment, all five knots are significantly contributing to the curve fitting process in terms of capturing the true underlying non-linear pattern in the observations.

Ghosh et al. (1996), henceforth referred to as the GNK model. Their univariate model is as follows

where $(b_j \mid b_{j-1}) \sim N(0, \sigma_b^2)$, $u_{ij} \sim N(0, \psi_j^2)$ and $e_{ij} \sim N(0, \sigma_{ij}^2)$.

where $b_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_b^2)$ while $u_{ij}$ and $e_{ij}$ have the same distributions as above. Clearly, the only difference between (3) and (3) is that the former contains a time-specific random component while the latter contains an area-specific random component. Ghosh et al.
(1996) showed that the estimates from the bivariate version of the GNK model (3) perform much better than the Census Bureau estimates in estimating the median household income of 4-person families in the United States. Table 3-6 depicts the comparison measures corresponding to the above models.

Table 3-6. Comparison measures for time series and other model estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| CPS | 0.0415 | 0.0027 | 1,753.33 | 5,300,023 |
| SAIPE | 0.0326 | 0.0015 | 1,423.75 | 3,134,906 |
| GNK | 0.0397 | 0.0025 | 1,709.58 | 5,229,869 |
| SPM(0) | 0.0337 | 0.0017 | 1,408.7 | 3,137,978 |
| SPM(5) | 0.028 | 0.0012 | 1,173.71 | 2,334,379 |
| SPRWM(5) | 0.0295 | 0.0013 | 1,256.08 | 2,747,010 |

It is clear that, although the estimates from the GNK model perform slightly better than the CPS estimates, they are quite inferior to the semiparametric and SAIPE estimates. This may be because the state-specific random effects in the semiparametric models can account for the within-state correlations in the income values, something which the GNK model fails to do. Since the comparison measures for SPM(0) are much lower than those for the GNK model, we can also conclude that the area-specific random effect is much more critical than a time-specific random component in this situation.

Johnson (2004). This is essentially an extension of the classical chi-square goodness-of-fit test where the statistic is calculated at every iteration of the Gibbs sampler as a function of the parameter values drawn from the respective posterior distribution. Thus, a posterior distribution of the statistic is obtained, which can be used for constructing global goodness-of-fit diagnostics.
To construct this statistic, we form 10 equally spaced bins $((k-1)/10,\ k/10)$, $k = 1, \ldots, 10$, with fixed bin probabilities $p_k = 1/10$. The main idea is to consider the bin counts $m_k(\tilde\theta)$ to be random, where $\tilde\theta$ denotes a posterior sample of the parameters. At each iteration of the Gibbs sampler, bin allocation is made based on the conditional distribution of each observation given the generated parameter values, i.e. $Y_{ij}$ would be allocated to the $k$th bin if $F(Y_{ij} \mid \tilde\theta) \in ((k-1)/10,\ k/10)$, $k = 1, \ldots, 10$. The Bayesian chi-square statistic is then calculated as

$$R^B(\tilde\theta) = \sum_{k=1}^{10} \left[ \frac{m_k(\tilde\theta) - n p_k}{\sqrt{n p_k}} \right]^2$$

The only assumptions for this statistic to work are that the observations should be conditionally independent and the parameter vector should be finite dimensional. The second assumption naturally holds in our case. Regarding the first one, since we have multiple observations over time for every state, there may be within-state dependence between those. Thus, instead of taking all the observations (i.e. the CPS median income values), we decided to use the last observation for each state. For the basic semiparametric model (SPM), the above summary measures were respectively 0.049 and 0.5, while for the random walk model (SPRWM), these were 0.047 and 0.51. These measures suggest that both SPM and SPRWM fit the data quite well. Figures 3-5A and 3-5B show the quantile-quantile plots of $R^B$ values obtained from 10000 samples of SPM and SPRWM with 5 knots. Both plots demonstrate excellent agreement between the distribution of $R^B$ and that of a $\chi^2(9)$ random variable.

Figure 3-5. Quantile-quantile plot of $R^B$ values for 10000 draws from the posterior distribution of the basic semiparametric and semiparametric random walk models. The X-axis depicts the expected order statistics from a $\chi^2$ distribution with 9 degrees of freedom. (Panel B: Semiparametric RW Model.)
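For a single posterior draw, the binning step behind the Bayesian chi-square statistic can be sketched as below. The sketch assumes each observation is conditionally normal with a mean and standard deviation supplied by the draw; the function name and inputs are hypothetical, and a full analysis would repeat this over all retained Gibbs iterations and compare the resulting values with a chi-square distribution with 9 degrees of freedom.

```python
from statistics import NormalDist

import numpy as np


def bayes_chi_square(y, mean, sd, num_bins=10):
    """R^B for one draw: sum over bins of (m_k - n*p_k)^2 / (n*p_k), with p_k = 1/num_bins.

    y, mean, sd are equal-length sequences: observations and their conditional
    normal means and standard deviations under the current parameter draw.
    """
    u = np.array([NormalDist(m, s).cdf(v) for v, m, s in zip(y, mean, sd)])
    # Bin k (0-based) holds u in [k/num_bins, (k+1)/num_bins); u == 1.0 goes to the last bin.
    idx = np.minimum((u * num_bins).astype(int), num_bins - 1)
    counts = np.bincount(idx, minlength=num_bins)
    expected = len(u) / num_bins
    return float(np.sum((counts - expected) ** 2 / expected))


# Perfectly spread observations: one per bin, so the statistic is zero.
spread = [NormalDist().inv_cdf((k + 0.5) / 10) for k in range(10)]
r_spread = bayes_chi_square(spread, [0.0] * 10, [1.0] * 10)

# All ten observations at the conditional mean fall in a single bin.
r_clumped = bayes_chi_square([0.0] * 10, [0.0] * 10, [1.0] * 10)
print(r_spread, r_clumped)
```

In the setting above, `y` would hold the last CPS observation for each of the 51 states, so that the conditional independence assumption is more plausible.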
Johnson points out that the Bayesian chi-square test statistic is also a useful tool for code verification. If the posterior distribution of $R^B$ deviates significantly from its null distribution, it may imply that the model is incorrectly specified or that there are coding errors. Since the summary measures are quite close to the corresponding null values,

Fay and Herriot 1979). In this study, we have proposed a semiparametric class of models which exploit the longitudinal trend in the state-specific income observations. In doing so, we have modeled the CPS median income observations as an income trajectory using penalized splines (Eilers and Marx 1996). We have also extended the basic semiparametric model by adding a time series random walk component which can explain any specific trend in the income levels over time. We have used as our covariate the mean adjusted gross income (AGI) obtained from IRS tax returns for all the states. The analysis has been carried out in a hierarchical Bayesian framework. Our target of inference has been the median household incomes for all the states of the U.S. and the District of Columbia for the year 1999. We have evaluated our estimates by comparing them with the corresponding census estimates of 1999 using some commonly used comparison measures.
Our analysis has shown that information on past median income levels of different states does provide strength towards the estimation of state-specific median incomes for the current period. In fact, if there is an underlying non-linear pattern in the median income levels, it may be worthwhile to capture that pattern as accurately as possible and use it in the inferential procedure. In terms of modeling the underlying observational pattern, the positioning of knots proved to be both important and interesting. The

The above models can be extended in various ways based on the nature of the observational pattern and the quality (or richness) of the data set. Some obvious extensions are given as follows: (1) In the models considered above, the spline structure $f(x_{ij})$ represents the population mean income trajectory for all the states combined. The deviation of the $i$th state from the mean is modeled through the random intercept $b_i$. This implies that the state-specific trajectories are parallel. A more flexible

Here $g_i(x)$ is an unspecified nonparametric function representing the deviation of the $i$th state-specific trajectory from the population mean trajectory $f(x)$. $g_i(x)$ is also modeled using a P-spline with a linear part, $b_{i1} + b_{i2}x$, and a non-linear one, $\sum_{k=1}^{K} w_{ik}(x - \tau_k)_+$, thus allowing for more flexibility. Both these components are random, with $(b_{i1}, b_{i2})' \sim N(0, \Sigma)$ ($\Sigma$ being unstructured or diagonal) and $w_{ik} \sim N(0, \sigma_w^2)$. This extension is particularly relevant in situations where the state-specific income trajectories are quite distinct from the population mean curve and thus need to be modeled explicitly. We plan to pursue this extension if we can procure a richer data set with longer state-specific income trajectories. (2) Sometimes the function to be estimated (here the median income pattern) may have varying degrees of smoothness in different regions. In that case, a single smoothing parameter may not be proper and a spatially adaptive smoothing procedure can be used (Ruppert and Carroll 2000
). (3) We used the truncated polynomial basis function to model the income trajectory, but other types of bases like B-splines, radial basis functions etc. can also be used. (4) Although we used a parametric normal distributional assumption for the random state and time-specific effects, a broader class of distributions like the mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) may be tested.

Last but not least, we think that the semiparametric modeling approach holds a lot of promise for small domain problems, especially when observations for each domain are collected over time. The associated class of semiparametric models may well be an attractive alternative to the models generally employed by the U.S. Census Bureau.

The U.S. Census Bureau has always been concerned with the estimation of income and poverty characteristics of small areas across the United States. These estimates play a vital role in the administration of federal programs and the allocation of federal funds to local jurisdictions. For example, state-level estimates of median income for four-person families are needed by the U.S. Department of Health and Human Services (HHS) in order to formulate its energy assistance program for low-income families. Since income characteristics for small areas are generally collected over time, there may well be a time-varying pattern in those observations. Neglecting those patterns may lead to biased estimates which do not reflect the true picture. In this study, we put forward a multivariate Bayesian semiparametric procedure for the estimation of median income of four-person families for the different states of the U.S. while explicitly accommodating the time-varying pattern in the observations.
In estimating the median income of four-person families, the U.S. Census Bureau relied on data from three sources. The basic source was the annual demographic supplement to the March sample of the Current Population Survey (CPS), which used to provide the state-specific median income estimates for different family sizes. The second source was the decennial census estimates for the year preceding the census year, i.e., 1969, 1979, 1989 and so on. Lastly, the Census Bureau also used the annual estimates of per capita income (PCI) provided by the Bureau of Economic Analysis (BEA) of the U.S. Department of Commerce. Each of the above data sources (and the resulting estimates) has some disadvantages, which necessitated an estimation procedure that used a combination of all three to produce the final median income estimates. The CPS estimates were based on small samples, which resulted in substantial variability. On the other hand, decennial census estimates, although having negligible standard errors, were only available every 10 years. Due to this lag in the release of successive census estimates, there was a significant loss of information concerning fluctuations in the economic situation of the country in general and small areas in particular. Lastly, the per capita income estimates did not have associated sampling errors since they were not obtained using the usual sampling techniques. The details of the estimation procedure appear in Fay et al. (1993).
The Census Bureau based their estimation procedure on a bivariate regression model suggested by Fay (1987). In doing so, they used median income observations for three and five person families in addition to those of four person families. The basic dataset for each state was a bivariate random vector with one component being the CPS median income estimate of four person families and the other component being the weighted average of CPS median incomes of three and five person families, with weights 0.75 and 0.25 respectively. Both the regression equations used the base year

$$\text{Adjusted census median}(c) = \frac{\text{PCI}(c)}{\text{PCI}(b)} \times \text{census median}(b)$$

Here PCI(c) and PCI(b) denote the per capita income estimates produced by the BEA for the current and base years respectively. Thus, in the above expression, the current year adjusted census median estimate is obtained by adjusting the base year census median by the proportional growth in the PCI between the base year and the current year. In the regression equation, the base year census median adjusts for any possible overstatement of the effect of change in the PCI in estimating the current median incomes. Finally, the Census Bureau used an empirical Bayesian (EB) technique (Fay (1987); Fay et al. (1993)) to calculate the weighted average of the current CPS median income estimate and the estimates obtained from the regression equation.

Datta et al. (1993) extended and refined the ideas of Fay (1987) and proposed a more appealing empirical Bayesian procedure. They also performed a univariate and a multivariate hierarchical Bayesian (HB) analysis of the same problem and showed that both the EB and HB procedures resulted in significant improvement over the CPS median income estimates for the univariate and multivariate models. However, the multivariate model resulted in considerably lower standard error and coefficient of variation than the univariate model, although the point estimates were similar. Later, Ghosh et al.
(1996) (henceforth referred to as GNK) presented a Bayesian time series analysis of the same problem by exploiting the inherent repetitive nature of the CPS median income estimates. In doing so, they estimated the statewide median income

Semiparametric regression methods have not been used in small area estimation contexts until recently. This was mainly due to methodological difficulties in combining the different smoothing techniques with the estimation tools generally used in small area estimation. The pioneering contribution in this regard is the work by Opsomer et al. (2008), in which they combined small area random effects with a smooth, non-parametrically specified trend using penalized splines (Eilers and Marx 1996). In doing so, they expressed the non-parametric small area estimation problem as a mixed effects regression model and analyzed it using restricted maximum likelihood. They also presented theoretical results on the prediction mean squared error and likelihood ratio tests for random effects. Inference was based on a simple non-parametric bootstrap approach. They applied their model to a non-longitudinal, spatial dataset concerning the estimation of mean acid neutralizing capacity (ANC) of lakes in the northeastern states of the U.S.

Ghosh et al. (1996), we have treated the state-specific median income observations as longitudinal profiles or income trajectories. As with any longitudinally varying observations, the income profiles (both state-specific and overall) may have a non-linear pattern over time. Moreover, the successive income observations may be unbalanced in nature. These features motivated us to use a semiparametric regression approach in our modeling framework. In doing so, we have modeled the income trajectory using a penalized spline (or P-spline), which is a commonly used but powerful function estimation tool in non-parametric inference. The P-spline is expressed using truncated polynomial basis functions with varying degrees and number of knots, although other types of basis functions like B-splines or thin plate splines can also be used. As covariates, we have used the adjusted census median incomes, since this was found to be the most effective covariate by Ghosh et al.
(1996). We tested four different regression models, viz. (1) a univariate model with only the CPS median income of four-person families as the response variable; (2) a bivariate model with the CPS median incomes of three and four person families as the response variables; (3) a bivariate model with the CPS median incomes of four and five person families as the response variables; and lastly (4) a bivariate model with the CPS median income of four person families and the weighted average of the CPS median incomes of three and five person families (with weights 0.75 and 0.25) as the response variables. In all the cases, our primary objective has been the estimation of median incomes of four-person families for all 50 U.S. states and the District of Columbia for 1989. For each of these models, the analysis has been carried out using a hierarchical Bayesian approach. Since we chose non-informative improper priors for the regression parameters, propriety of the posterior has been rigorously proved before proceeding with the computations (see Theorem 3 in

Gelfand and Smith 1990) has been used to obtain the parameter estimates.

We have compared the state-specific estimates of median household income for 1989 with the corresponding decennial census values in order to test their accuracy. In doing so, we observed that the semiparametric model estimates improve upon both the CPS and the Census Bureau estimates. Interestingly, for all the above models, the semiparametric estimates are generally superior or at least comparable to the corresponding estimates from the time series models of Ghosh et al. (1996). This is a testament to the flexibility and strength of the semiparametric methodology, especially when observations are collected over time. It also indicates that it may be worthwhile to take into account the longitudinal income patterns in estimating the current income conditions of the U.S. states. Lastly, the semiparametric modeling framework is very general and can be applied to any situation where various characteristics of small areas are collected over time.
The rest of the chapter is organized as follows. In Section 4.2 we introduce the bivariate semiparametric modeling framework. Section 4.3 goes over the hierarchical Bayesian analysis we performed. In Section 4.4, we describe the results of the data analysis with regard to the median household income dataset. Finally, we end with a discussion and some references towards future work in Section 4.5. The appendix contains the proofs of the posterior propriety and the expressions of the full conditional distributions for our models.

4.2.1 Notation

3. Here, we will explain the bivariate framework, which is of two types, viz. a simple bivariate model and a bivariate random walk model. These can also be seen as extensions of the univariate models explained in Section 3.2.2.

This is the most general structure since the degrees of the spline as well as the number and position of the knots are different for the two models. If, for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, t$, $\{Y_{ij1}, X_{ij1}\}$ and $\{Y_{ij2}, X_{ij2}\}$ have a similar relationship, we can assume $p = q$ and $\kappa_{k1} = \kappa_{k2}$, $k = 1, 2, \ldots, K_1 (= K_2)$.

Equation (4) can be rewritten as

4) as follows

where $\theta_{ij} = U_{ij}'\beta + Z_{ij}'\gamma + b_i + v_j + u_{ij}$.

As in Section 3.2.2.2, we assume that $(v_j \mid v_{j-1}, \Sigma_v) \sim N(v_{j-1}, \Sigma_v)$ with $v_0 = 0$. Alternatively, we may write $v_j = v_{j-1} + w_j$ where $w_j \overset{\text{i.i.d.}}{\sim} N(0, \Sigma_v)$.

3

Here, $L(X \mid \mu, \Sigma)$ denotes a multivariate normal density with mean vector $\mu$ and variance covariance matrix $\Sigma$.

For the bivariate random walk model, the parameter space for the $i$th state would be $\Omega_i = (\theta_i, \beta, \gamma, b_i, v, \{\Psi_1, \ldots, \Psi_t\}, \Sigma_0, \Sigma_\gamma, \Sigma_v)$, where $v = (v_1', \ldots, v_t')'$ is the vector of time-specific random effects. The hierarchical Bayesian framework is given by

1.

4) will have an extra component corresponding to $v$ given by $L(v_j \mid v_{j-1}, \Sigma_v)$, which has a normal distribution with mean $v_{j-1}$ and covariance matrix $\Sigma_v$.

Thus, we have the following priors: $\beta \sim \text{uniform}(R^{p+q+2})$, $\Psi_j \sim IW(S_j, d_j)$ $(j = 1, \ldots, t)$, $\Sigma_\gamma \sim IW(S_\gamma, d_\gamma)$, $\Sigma_0 \sim IW(S_0, d_0)$ and $\Sigma_v \sim IW(S_v, d_v)$. Here $X \sim IW(A, b)$ denotes an inverse Wishart distribution with scale matrix $A$ and degrees of freedom $b$, having the expression $f(X) \propto |X|^{-(b+p+1)/2} \exp\{-\mathrm{tr}(AX^{-1})/2\}$, $p$ being the order of $A$.
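The random walk assumption on the time-specific effects, $v_j = v_{j-1} + w_j$ with $v_0 = 0$, can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the function names, the seed, and the particular $2 \times 2$ covariance matrix are hypothetical and are not taken from the dissertation. It draws bivariate normal increments through a hand-written Cholesky factor and accumulates them over 11 time points.

```python
import math
import random


def cholesky_2x2(cov):
    """Return the lower-triangular Cholesky factor of a 2x2 covariance matrix."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return ((l11, 0.0), (l21, l22))


def simulate_random_walk(t, cov_v, seed=0):
    """Simulate v_1, ..., v_t with v_j = v_{j-1} + w_j, w_j ~ N(0, cov_v), v_0 = 0."""
    rng = random.Random(seed)
    (l11, _), (l21, l22) = cholesky_2x2(cov_v)
    v_prev = (0.0, 0.0)
    path = []
    for _ in range(t):
        # Correlated increment w_j = L z, with z a pair of standard normals.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w = (l11 * z1, l21 * z1 + l22 * z2)
        v_prev = (v_prev[0] + w[0], v_prev[1] + w[1])
        path.append(v_prev)
    return path


# Hypothetical covariance matrix, chosen only for illustration.
path = simulate_random_walk(t=11, cov_v=((1.0, 0.5), (0.5, 2.0)))
```

Because each $v_j$ carries the previous value forward, successive time effects are positively dependent, which is the within-state serial dependence the random walk model is meant to absorb.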
For the random walk model there will be an additional term $\pi(\Sigma_v)$. By conditional independence properties, we can factorize the full posterior as

$$[\theta, \beta, \gamma, b, \Sigma_0, \Sigma_\gamma, \{\Psi_1, \ldots, \Psi_t\} \mid Y, U, Z] \propto [Y \mid \theta]\,[\theta \mid \beta, \gamma, b, \{\Psi_1, \ldots, \Psi_t\}, X, Z]\,[b \mid \Sigma_0]\,[\gamma \mid \Sigma_\gamma]\,[\beta]\,[\Sigma_\gamma]\,[\Sigma_0] \prod_{j=1}^{t} [\Psi_j]$$

Gelman and Rubin (1992) and run $n$ ($\geq 2$) parallel chains. For each chain, we run $2d$ iterations with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distributions, the first $d$ iterations of each chain are discarded and posterior summaries are calculated based on the remaining $d$ iterates. The full conditionals for both the models are given in the appendix.

Once posterior samples are generated from the full conditionals of the parameters, Rao-Blackwellization yields the following posterior means and variances of $\theta_{ij}$

and

4.2.2 to analyze the median income dataset referred to in Section 4.1.3. The basic dataset for our problem is the triplet $(Y_{ij1}, Y_{ij2}, Y_{ij3})$ and the associated variance covariance matrix $\Sigma_{ij}$ $(i = 1, \ldots, 51;\ j = 1, \ldots, 11)$. Here $Y_{ij1}$, $Y_{ij2}$ and $Y_{ij3}$ respectively denote the CPS median incomes of

For the univariate setup, the response and covariates are respectively $Y_{ij1}$ and $X_{ij1}$. For the bivariate setup, the basic data vector is a duplet whose first component is $Y_{ij1}$ and whose second component is either $Y_{ij2}$, $Y_{ij3}$ or $0.75Y_{ij2} + 0.25Y_{ij3}$. The adjusted census medians are chosen analogously. As mentioned before, our targets of inference are the state-specific median incomes of four person families for 1989.
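The convergence monitoring scheme described above (several parallel chains of $2d$ iterations, first $d$ discarded) is usually summarized by the Gelman and Rubin potential scale reduction factor. The following is a minimal sketch of that diagnostic for a single scalar parameter, assuming the commonly quoted form based on within-chain and between-chain variances; the function names and the simulated draws are hypothetical stand-ins for real sampler output.

```python
import random
import statistics


def discard_burn_in(chain):
    """Drop the first half of a chain of 2d draws, keeping the last d."""
    return chain[len(chain) // 2:]


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    `chains` is a list of equal-length lists of post-burn-in draws.
    """
    d = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B / d: sample variance of the chain means.
    between_over_d = statistics.variance(chain_means)
    pooled = (d - 1) / d * within + between_over_d
    return (pooled / within) ** 0.5


# Simulated stand-in for sampler output: three chains of 2d = 2000 draws.
rng = random.Random(42)
raw_chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
kept = [discard_burn_in(c) for c in raw_chains]
r_hat = gelman_rubin(kept)
```

Values close to 1 suggest that the chains have mixed; chains centred at different locations inflate the between-chain term and push the factor well above 1.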
In order to check the performance of our estimates, we plan to use four comparison measures. These were originally recommended by the panel on small area estimates of population and income set up by the Committee on National Statistics in July 1978 and are available in their July 1980 report (p. 75). These are

The basic structure of our models would remain the same as in Section 4.2.2. We have used linear truncated polynomial basis functions for the P-spline component in our models since the median income profiles did not exhibit a high degree of non-linearity. For highly non-linear profiles, a quadratic or cubic polynomial basis function representation can be used. In non-parametric regression problems, the proper selection of knots plays a critical role. Ideally, a sufficient number of knots should be selected and placed uniformly throughout the range of the independent variable so that the underlying observational pattern is properly captured. Too few or too many knots generally degrade the quality of the fit. If too few knots are used, the complete underlying pattern may not be captured properly, thus resulting in a biased fit. On the other hand, once there are enough knots to fit the important features of the data, a further increase in the number of knots has little effect on the fit and may lead to overparametrization (Ruppert 2002). Generally, at most 35 to 40 knots are recommended for effectively all sample sizes and for nearly all smooth regression functions. Following the general convention, we have placed the knots on a grid of equally spaced sample quantiles of the independent variable (adjusted census median income).
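The knot placement convention and the linear truncated polynomial basis described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the quantile rule used here (linear interpolation at probabilities $k/(K+1)$) is one common convention, assumed rather than taken from the dissertation.

```python
def quantile_knots(x, num_knots):
    """Place knots at equally spaced sample quantiles of x.

    Knot k sits at probability k / (num_knots + 1), using linear
    interpolation between order statistics (an assumed convention).
    """
    xs = sorted(x)
    n = len(xs)
    knots = []
    for k in range(1, num_knots + 1):
        pos = k / (num_knots + 1) * (n - 1)  # fractional index into xs
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        knots.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return knots


def linear_truncated_basis(x, knots):
    """Design rows [1, x, (x - k_1)_+, ..., (x - k_K)_+] of a linear P-spline."""
    return [[1.0, xi] + [max(xi - k, 0.0) for k in knots] for xi in x]


covariate = list(range(11))            # toy covariate values 0, 1, ..., 10
knots = quantile_knots(covariate, 3)   # three interior knots
design = linear_truncated_basis([6.0], knots)
```

Each truncated term $(x - \kappa_k)_+$ is zero to the left of its knot, so every knot lets the slope of the fitted trajectory change at that point while keeping the curve continuous.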
Gelman and Rubin (1992). We ran three parallel chains, with varying lengths and burn-ins. We initially sampled the $\theta_{ij}$'s from multivariate t-distributions with 2 df having the same location and scale matrices as the corresponding multivariate normal conditionals given in the Appendix. This is based on the Gelman-Rubin idea of initializing the chain at overdispersed distributions. However, once initialized, the

We fitted both the univariate and bivariate models to the median income dataset. In doing so, we worked with all possible knot choices from 0 to 40. Here, we only show the results corresponding to the best performing model, i.e., the model with the lowest values of the comparison measures.

In the univariate framework, the model with 3 knots in the income trajectory performed the best. Table 4-1 reports the comparison measures for this model (denoted as USPM(3)) along with those of the CPS estimates (CPS), Census Bureau estimates (Bureau), and the univariate GNK time series (GNK.TS) and non-time series (GNK.NTS) estimates. Table 4-2 reports the percentage improvement of the time series, non-time series and the semiparametric estimates over the Census Bureau estimates.

From Table 4-1, it is clear that the semiparametric estimates significantly improve upon the CPS, time series and non-time series estimates with respect to all the comparison measures. In fact, the semiparametric estimates perform slightly better than the bivariate Census Bureau estimates too with respect to ARB and AAB. This

Table 4-1. Comparison measures for univariate estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| CPS | 0.0735 | 0.0084 | 2,928.82 | 13,811,122.39 |
| Bureau | 0.0296 | 0.0013 | 1,183.90 | 2,151,350.18 |
| GNK.TS | 0.0338 | 0.0018 | 1,351.67 | 3,095,736.14 |
| GNK.NTS | 0.0363 | 0.0021 | 1,457.47 | 3,468,496.61 |
| USPM(3) | 0.0289 | 0.0014 | 1,169.74 | 2,549,698.26 |

Table 4-2.
Percentage improvements of univariate estimates over Census Bureau estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| GNK.TS | -14.19% | -38.46% | -14.17% | -43.90% |
| GNK.NTS | -22.64% | -61.54% | -23.11% | -61.22% |
| USPM(3) | 2.37% | -7.69% | 1.2% | -18.52% |

is also reflected in Table 4-2, where the semiparametric estimates marginally improve upon the Bureau estimates for the above two comparison measures. Overall, the degree of dominance of the Bureau estimates over the time series and non-time series estimates is much larger than that over the semiparametric estimates. These results indicate that, in the univariate framework, the semiparametric model with 3 knots performs significantly better than the time series and non-time series models of Ghosh et al. (1996).

Now, we move on to the bivariate non-random walk setup. First, we consider the model whose response vector consists of the CPS median incomes of 4 and 3 person families, i.e., ($Y_{ij1}$ and $Y_{ij2}$). The covariates are the corresponding adjusted census medians. Since we assumed inverse Wishart priors for the variance covariance matrices, the values of the comparison measures were dependent on the degrees of freedom of the Wishart distribution and the number of knots in the income trajectory. We worked with different combinations of the two in fitting these models. The best results (lowest comparison measures) were obtained for two models, both with 6 knots but with degrees of freedom 7 and 9 respectively. These models are denoted by BSPM(1)(4,3) and BSPM(2)(4,3) respectively.

Table 4-3. Comparison measures for bivariate non-random walk estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| CPS | 0.0735 | 0.0084 | 2,928.82 | 13,811,122.39 |
| Bureau | 0.0296 | 0.0013 | 1,183.90 | 2,151,350.18 |
| GNK.TS(4,3) | 0.0295 | 0.0013 | 1,171.71 | 2,194,553.67 |
| GNK.NTS(4,3) | 0.0323 | 0.0016 | 1,287.78 | 2,610,249.94 |
| BSPM(1)(4,3) | 0.0274 | 0.0013 | 1,079.63 | 2,182,669.56 |
| BSPM(2)(4,3) | 0.0286 | 0.0011 | 1,131.61 | 1,880,089.29 |
| GNK.TS(4,5) | 0.0230 | 0.0009 | 932.51 | 1,618,025.33 |
| GNK.NTS(4,5) | 0.0295 | 0.0013 | 1,179.94 | 2,216,738.06 |
| BSPM(4,5) | 0.0255 | 0.0010 | 1,033.12 | 1,859,373.98 |
| GNK.TS(4,3+5) | 0.0287 | 0.0013 | 1,150.24 | 2,116,692.71 |
| GNK.NTS(4,3+5) | 0.0324 | 0.0015 | 1,297.12 | 2,530,938.06 |
| BSPM(1)(4,3+5) | 0.0271 | 0.0012 | 1,078.5 | 2,128,679.65 |
| BSPM(2)(4,3+5) | 0.0289 | 0.0012 | 1,132.10 | 1,838,598.30 |

When we consider the median incomes of 4 and 5 person
families, the lowest comparison measures were obtained for the model with 4 knots in the income trajectory and 7 degrees of freedom. We denote this model by BSPM(4,5).

Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for two models, both with 6 knots and with degrees of freedom 7 and 9 respectively. We denote these models as BSPM(1)(4,3+5) and BSPM(2)(4,3+5) respectively. Table 4-3 reports the comparison measures for these models along with those of CPS, Bureau, and the corresponding bivariate GNK time series and non-time series estimates. Table 4-4 reports the percentage improvement of the above estimates over the Census Bureau estimates.

From Table 4-3 and Table 4-4, it is clear that both the BSPM(4,3) and BSPM(4,3+5) estimates improve upon the bivariate time series and non-time series estimates with respect to nearly all four comparison measures. The semiparametric estimates also improve upon the Census Bureau estimates and the raw CPS estimates. For the model with median incomes of four and five person families as response, the semiparametric estimates fall well behind the bivariate time series estimates of Ghosh et al. (1996) but significantly improve upon the CPS and Census Bureau estimates.
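The percentage improvements in Tables 4-2 and 4-4 appear to be relative reductions of each comparison measure with respect to the Census Bureau value; this reading is an assumption, but it reproduces the GNK.TS row of Table 4-2 from the rounded entries of Table 4-1. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def percent_improvement(bureau, estimate):
    """Relative reduction of a comparison measure versus the Bureau value, in percent.

    Positive values mean the estimate has a smaller (better) measure than
    the Census Bureau estimate.
    """
    return 100.0 * (bureau - estimate) / bureau


# Rounded values copied from Table 4-1, in the order ARB, ASRB, AAB, ASD.
bureau = (0.0296, 0.0013, 1183.90, 2151350.18)
gnk_ts = (0.0338, 0.0018, 1351.67, 3095736.14)

improvements = [
    round(percent_improvement(b, e), 2) for b, e in zip(bureau, gnk_ts)
]
```

Since the tabulated measures are themselves rounded, recomputing other rows this way can differ from the published percentages in the last digit.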
Table 4-4. Percentage improvements of bivariate non-random walk estimates over Census Bureau estimates

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| GNK.TS(4,3) | -0.48% | -2.52% | 1.03% | -2.01% |
| GNK.NTS(4,3) | -8.99% | -22.45% | -8.77% | -21.33% |
| BSPM(1)(4,3) | 7.43% | 0.00% | 8.81% | -1.46% |
| BSPM(2)(4,3) | 3.38% | 15.38% | 4.42% | 12.61% |
| GNK.TS(4,5) | 22.19% | 30.52% | 21.23% | 24.79% |
| GNK.NTS(4,5) | 0.31% | -0.18% | 0.33% | -3.04% |
| BSPM(4,5) | 13.85% | 23.08% | 12.74% | 13.57% |
| GNK.TS(4,3+5) | 2.94% | 3.56% | 2.84% | 1.61% |
| GNK.NTS(4,3+5) | -9.36% | -17.18% | -9.56% | -17.64% |
| BSPM(1)(4,3+5) | 8.45% | 7.69% | 8.90% | 1.05% |
| BSPM(2)(4,3+5) | 2.37% | 7.69% | 4.37% | 14.54% |

Now let us consider the bivariate random walk model. For the case with 4 and 3 person families, the lowest comparison measures were obtained for three models with degrees of freedom and number of knots (3,6), (5,6) and (9,1) respectively. We denote these models as BRWM(1)(4,3), BRWM(2)(4,3) and BRWM(3)(4,3) respectively. Each of these models significantly improves upon the CPS and Census Bureau estimates and is also superior to the bivariate time series and non-time series models proposed by Ghosh et al. (1996) (GNK). The random walk estimates also seem to improve marginally over those corresponding to the non-random walk semiparametric model. When we consider the median income estimates of 4 and 5 person families, the random walk model with 5 degrees of freedom and 1 knot in the trajectory seems to perform the best. The comparison measures are significantly better than those of the CPS, Bureau and the non-time series model of GNK. However, they fall marginally short of the time series estimates but fare better than the corresponding estimates obtained from the non-random walk model (BSPM(4,5)). We denote this model as BRWM(4,5). Lastly, for the model with median incomes of 4 person families and the weighted average incomes of 3 and 5 person families (with weights 0.75 and 0.25) as response vectors, the best results were obtained for the model with 5 degrees of freedom and 1 knot in the trajectory. The comparison measures were significantly better than the CPS,

Table 4-5. Comparison measures for bivariate random walk model

| Estimate | ARB | ASRB | AAB | ASD |
|---|---|---|---|---|
| BRWM(1)(4,3) | 0.0261 | 0.0011 | 1,043.33 | 1,902,416.1 |
| BRWM(2)(4,3) | 0.0274 | 0.0010 | 1,094.25 | 1,804,969.06 |
| BRWM(3)(4,3) | 0.0258 | 0.0012 | 1,037.03 | 2,114,599.65 |
| BRWM(4,5) | 0.0245 | 0.0010 | 978.12 | 1,672,183.6 |
| BRWM(4,3+5) | 0.0244 | 0.0011 | 990.50 | 1,941,833.29 |

Bureau and GNK (both time series and non-time series) while it also improved upon the non-random walk semiparametric model. We denote this model as BRWM(4,3+5). Table 4-5 reports the comparison measures for the random walk models.

Estimation of median incomes of four person families for different states of the U.S. (here playing the role of small areas) is of interest to the U.S. Bureau of the Census. Towards this end, the Bureau of the Census collected annual median income estimates of 3, 4 and 5 person families for all the states and the District of Columbia for every year. But the methodology used by the Census Bureau does not take into account the longitudinal nature of the state-specific median income observations.

Ghosh et al. (1996). We also extended the basic semiparametric framework by incorporating a time series (random walk) component to account for the within-state dependence in the successive income observations. The class of random walk models seemed to improve upon their non-random walk counterparts, but more studies are required before reaching a definite conclusion about their relative performance. Overall, we strongly think that semiparametric procedures hold a lot of promise for small area estimation problems, specifically in situations where multiple time varying observations of some characteristic are available for the small areas.
In my dissertation, I have concentrated on the application of semiparametric methodologies in analyzing unorthodox data scenarios originating in diverse fields like case control studies and small area estimation. In the former scenario, I have used penalized splines to model longitudinal exposure profiles and their influence pattern on the current disease status for a group of cases and controls. In doing so, I have come to the conclusion that past exposure observations may have a significant effect on the present disease status. Our modeling framework is quite general and flexible in the sense that it can be used to model any possible pattern of exposure profiles, and it can also capture complex time varying patterns of influence of the exposure history on the current disease status. We applied our modeling framework to a nested case control study of prostate cancer where the exposure was the Prostate Specific Antigen (PSA). In the second scenario, we have used semiparametric procedures to model the income trajectories of different small areas and have used that information to estimate the median incomes of those small areas at a given time point in the future. Our model based estimates seemed to perform better than the usual Bureau of the Census estimates, which are based on the income observations from a particular time point and hence are non-longitudinal in nature. We have also extended the semiparametric modeling framework to the bivariate scenario in estimating the median income of varying family sizes for each small area. In both these cases, the semiparametric income estimates not only improve on the census estimates but are also comparable to estimates based on time series models. Thus, we can conclude that semiparametric methodology, if properly applied, holds a lot of promise for complicated data-driven situations arising in diverse statistical settings like the ones mentioned above.
The flexibility and power of the nonparametric and semiparametric procedures immediately imply that a multitude of interesting and useful extensions can be carried

1.4, selection and proper positioning of knots is a vital aspect in any smoothing procedure involving splines. Traditionally, knots are placed at equally spaced sample quantiles of the independent variables, and that is what we have done in both the case control and small area scenarios. But this procedure has its fair share of drawbacks. This was evident in the univariate small area problem, where the original placement of the knots failed to account for the low density region of the data pattern where the non-linearity was mostly concentrated. This was probably because of the quantile dependent placement procedure of the knots.

Recently, there has been some research on data-driven or adaptive knot placement procedures in which the number and locations of the knots are controlled by the data itself rather than being pre-specified. The advantage of this procedure is that fewer knots would be required, and these would be placed in optimal locations along the domain. Thus, the resulting spline fit will be flexible enough to capture any underlying heterogeneity in the data pattern. Both Frequentist and Bayesian approaches have been proposed towards this end. Some Frequentist contributions include Friedman (1991) and Stone et al. (1997), who used forward and backward knot selection schemes until the best model is identified. Zhou and Shen (2001) used an alternative algorithm which led to the addition of knots at locations which already possessed some knots. Bayesian treatment of this problem revolves around the notion of treating the knot number and knot locations as free parameters. Some notable Bayesian contributions include

(1998), who placed priors on the number and locations of the knots. Then they sampled from the full posteriors of the parameters (including knot locations and numbers) using reversible jump MCMC methods (Green 1995). However, they restricted the knots to be located only at the design points of the independent variable. DiMatteo et al. (2001) followed the same basic procedure as Denison et al.
(1998) but they did not restrict the knots to be located only at the design points of the experiment. They also penalized models with an unnecessarily large number of knots. Botts and Daniels (2008) proposed a flexible approach for fitting multiple curves to sparse functional data. In doing so, they treated the numbers and locations of knots of the population averaged and subject-specific curves as distinct random variables and sampled from their posterior distributions using reversible jump MCMC methods. They used free-knot B-splines to model the population averaged and subject-specific curves. In all the above contributions, Poisson priors are placed on the knot numbers while flat priors are placed on the knot positions. The usefulness and flexibility of the Bayesian approach lies in the fact that the number and locations of knots are automatically determined from the MCMC scheme. Thus, this methodology is often known as Bayesian Adaptive Regression Splines. However, the sampling procedure is quite intensive since the parameter dimension varies at every iteration. Botts and Daniels substantially reduced the computational burden by dealing with the approximate posterior distribution of only the number and positions of the knots, integrating out the other parameters by using Laplace approximations.

An immediate but worthwhile extension to what we have already done would be to incorporate an adaptive knot selection scheme into both the case control and small area modeling frameworks. For the former setup, this would correspond to deciphering the optimal number of knots for the population mean PSA trajectory and the influence function. So, depending on the particular study or the dataset at hand, any underlying pattern in the influence profile (of the exposure trajectory on the disease state) can be

Some other interesting extensions to our work can be

1. Incorporating informative (non-ignorable) missingness (Little and Rubin 1987) in the longitudinal exposure (case control) or income (small area) profiles.
2. Incorporating non-parametric distributional structures like mixtures of Dirichlet processes (MacEachern and Muller 1998) or Polya trees (Hanson and Johnson 2000) on the subject (or area) specific random effects.
3.
Extending the semi-parametric case control modeling framework to situations involving multiple (>2) or even categorical disease states.

Now, I briefly explain some work that we are currently engaged in.

5.2.1 Introduction and Brief Literature Review

Little and Rubin 1987). Broadly these are of three types, viz.: 1. 2. 3.

Little and Rubin (1987). These approaches differ in the way they factor the joint distribution of the missing data and the response. In the former approach, the population is first stratified by the pattern of dropout, resulting in a model for the whole population that is a mixture over the patterns. On the other hand, the selection modelling approach first models the hypothetical complete data, and then a model for the missing data process (conditional on the hypothetical complete data) is appended to the complete data model. In this study we will focus on the pattern mixture (PM) modeling approach.

Suppose our study consists of $N$ subjects, each of whom can be measured at $T$ time points. Let $Y_i$ and $D_i$ respectively denote the response vector and dropout time for the $i$th subject. $D_i$ is such that

$$D_i = \begin{cases} t & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ T + 1 & \text{if the } i\text{th subject is a completer.} \end{cases}$$

So, for the $i$th subject, $y_i$ and $D_i$ are assumed to be associated or dependent. Thus, in this approach models are built for $[Y_i \mid D_i]$ but inferences are based on $f(y) = \sum_D f(y \mid D)P(D)$.

An important but realistic situation that may arise in longitudinal studies is that the number of unique dropout times $T$ (vis-a-vis the number of times a subject is measured) may be large. As a result, the number of subjects having a particular dropout time may be quite small. Thus, stratification by dropout pattern may lead to sparse

Hogan and Laird (1998) suggested parameters to be shared across patterns. Hogan et al. (2004) suggested ways to group the $T$ dropout times into $m$

Diggle et al. 2002) used to capture the serial dependence in the response process.

There exists another class of models known as marginalized latent variable models, which take care of the exchangeable or non-diminishing dependence pattern among the repeated response observations using random intercepts.
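The pattern mixture identity used above, $f(y) = \sum_D f(y \mid D)P(D)$, carries over directly to means: a marginal mean is the dropout-probability-weighted average of the pattern-specific means. The sketch below illustrates this with made-up numbers; the function name, the three dropout patterns and all numerical values are hypothetical.

```python
def marginal_mean(pattern_means, pattern_probs):
    """Average pattern-specific means E(Y | D = d) over the dropout distribution P(D = d)."""
    total = sum(pattern_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("dropout probabilities must sum to 1")
    return sum(pattern_means[d] * pattern_probs[d] for d in pattern_probs)


# Hypothetical example with T = 3: dropout times 2 and 3, completers coded T + 1 = 4.
pattern_means = {2: 0.6, 3: 0.5, 4: 0.3}   # E(Y | D = d), made-up values
pattern_probs = {2: 0.2, 3: 0.3, 4: 0.5}   # P(D = d), made-up values

overall = marginal_mean(pattern_means, pattern_probs)
```

With many distinct dropout times, each pattern-specific mean rests on few subjects, which is the sparsity problem that motivates grouping dropout times into a smaller number of classes.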
Schildcrout and Heagerty (2007) combined the marginalized transition and latent variable models by proposing a unifying model that takes into account both serial and long range dependence among the response observations. Their model can be used in situations with a moderate to large number of repeated measurements per subject where both serial (short range) and exchangeable (long range) response correlation can be identified.

In this study, we combine the methodologies proposed in Heagerty (2002), Schildcrout and Heagerty (2007) and Roy and Daniels (2008) and propose a new model which accounts for both serial (short term) and long-range dependence among the response observations in situations where the number of unique dropout times is large. We group the dropout times using a latent variable approach, taking into account the uncertainty in the number of groups. We also model the marginal covariate effects of interest.

Heagerty (1999) proposed marginally specified logistic models which lead to direct modeling of the marginal covariate effects. Let $Y_{it}$ and $X_{it}$ respectively be the response observation and the covariate vector corresponding to the $i$th individual at the $t$th time point, $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$. Let $E(Y_{it} \mid X_{it}, \beta^M)$ be the marginal mean of $Y_{it}$. It is specified as

The above structure is the marginal regression model. Now, in order to specify the dependence among $(Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, the following conditional model is specified

where $b_i \sim N(0, \sigma^2)$. $\Delta_{it}$ can be computed by solving the following convolution equation

Thus $\Delta_{it}$ is a function of $\beta^M$ and $\sigma$. In this study we will be proposing a model which will marginalize over the random effects and the drop-out distribution to directly model the marginal covariate effects of interest, taking into account both the serial and exchangeable dependence structure among the $Y_{it}$'s.

Let us briefly go over the necessary notation with respect to subject $i$. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$ be the response vector. Let the $T$ unique dropout times be grouped into $m$ classes by the latent indicators $S_i = (S_{i1}, \ldots, S_{im})$. Here $S_{ij}$ is an indicator for class $j$, $j = 1, \ldots, m$ ($m$

1. Dependence between response and dropout time modeled by the latent classes.
2. Short range (serial) dependence between $Y_{it}$ and $(Y_{it-1}, \ldots, Y_{it-p})$ modelled by an MTM($p$).
3.
Long range or non-diminishing dependence among the $Y_{it}$'s modelled by the subject-specific random effects $b_i$, $i = 1, \ldots, N$.

We first specify the marginal model as

The above model marginalizes over the subject-specific random effects and over the latent class distribution (implicitly over the dropout distribution) as well. In order to fully specify the association due to repeated measurements and nonignorability in the missingness process, we specify a conditional model in addition to the marginal model. By conditional, we mean conditioned on the random effects and latent classes. We assume that the relevant information in the dropout times is captured by the latent variable $S$. This is natural because the specific latent class a subject belongs to would depend solely on his/her dropout time. Thus, we specify a mixture distribution over these latent classes, as opposed to over $D$ itself.

Before delving into the model, it is important to note that the conditional model parameters are not of main interest, and in fact will be viewed as nuisance parameters. This is because we are not interested in estimating either subject-specific effects (i.e., effects conditional on the random effects) or class-specific covariate effects (i.e., effects of covariates on $Y$ given a particular dropout class). Moreover, the conditional model should be specified so that it is compatible with the marginal model (5). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model

We assume that $Y_{it}$, conditional on the random effects $b_i$ and latent class $S_i$, are from an exponential family with distribution

where, in the most general case, $[b_i \mid S_{ij} = 1, X_i] \sim N(0, \sigma_j^2(X_i))$ and $\gamma_{it,k}(S_{ij} = 1) = V_{it,k}'\alpha_{jk}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, p$, where $V_{it}$ and $Z_{it}$ are both subsets of $X_{it}$. Thus, the variance of $b_i$ may depend on the latent class and the covariate vector for the $i$th subject. Moreover, $(\alpha_{1k}, \alpha_{2k}, \ldots, \alpha_{mk})$ determines how the dependence between $Y_{it}$ and $Y_{it-k}$ varies as a function of the covariates $V_{it,k}$ conditional on the latent classes. We also impose a sum-to-zero constraint for the purpose of identifiability, i.e., the class-specific coefficients $\beta^{(j)}$, $j = 1, \ldots, m$, sum to zero. Lastly, in this conditional model, each subject has its own intercept, and the effect of each covariate is allowed to differ by dropout class via the regression coefficients $\beta^{(j)}$.
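The probabilities of the latent classes given the drop-out times are specified as a proportional odds model (Agresti 2002) given by

where $\lambda_{0,1} \leq \lambda_{0,2} \leq \ldots \leq \lambda_{0,M-1}$ and $\lambda_1$ are unknown parameters. Thus the class probabilities are assumed to be a monotone function of dropout time (in fact, linear on the logit scale).

Lastly, the drop-out times $D_i$ are assumed to follow a multinomial distribution with mass at each possible drop-out time, parameterized by $\varphi$. Here we make the important assumption that $Y_{it}$ is independent of $D_i$ given $S_i$. Our main targets of inference are the covariate effects averaged over the classes, i.e., $\beta^M$ averaged over $M$. The intercept $\Delta_{it}$ in (5) is determined by the following relationship between the marginal and conditional models:

$$E(Y_{it} \mid \cdot) = \sum_{D} \sum_{S} p(S_i \mid D_i) P(D_i) \int \sum_{A} \left\{ E(Y_{it} \mid y_{it-1}, \ldots, y_{it-p}, b_i, S_i)\, p(y_{it-1}, \ldots, y_{it-p} \mid b_i, S_i) \right\} p(b_i \mid S_i)\, db_i$$

Proportionality in (5) holds because we assume that the missing and observed responses from subject $i$ are independent, given $S_i$ and $b_i$ (i.e., $[Y_i^m \mid Y_i, b_i, S_i] = [Y_i^m \mid b_i, S_i]$). Following the OPEF formulation, we have

$$L_i(Y_i \mid Y_{\{i\}}, S_{ij} = 1, b_i, \beta^{(j)}, \alpha) = \exp\left\{ \frac{\sum_{t=1}^{T} y_{it}\theta_{it} - \sum_{t=1}^{T} \psi(\theta_{it})}{m_i \phi} + \sum_{t=1}^{T} h(Y_{it}, \phi) \right\}$$

We can avoid the integral (w.r.t. $b_i$) in (5) if we also sample the $b_i$'s along with the other parameters from the full posterior (5). In that case, the full posterior may be rewritten as

where

For the most general case, we have assumed an OPEF structure for each $Y_{it}$ conditional on the past. Since the outcomes are binary, we can simplify it to a Bernoulli distribution, i.e.,

where $\mu^c_{it} = E(Y_{it} \mid y_{it-1}, y_{it-2}, \ldots, y_{it-p}, b_i, S_{ij} = 1) = g^{-1}\left\{ \Delta_{it} + b_i + \sum_{j=1}^{M} S_{ij} Z_{ij}'\beta^{(j)} + \sum_{k=1}^{p} \gamma_{it,k}\, y_{it-k} \right\}$.

$$P(S_{ij} = 1 \mid D_i) = \frac{e^{\lambda_{0j} + \lambda_1 D_i}}{1 + e^{\lambda_{0j} + \lambda_1 D_i}} - \frac{e^{\lambda_{0,j-1} + \lambda_1 D_i}}{1 + e^{\lambda_{0,j-1} + \lambda_1 D_i}}$$

Now, as mentioned earlier, $D_i$ is the dropout time for the $i$th subject. Also, there are $T$ unique dropout times. Let, for $t = 1, 2, \ldots, T$,

$$\delta_{it} = \begin{cases} 1 & \text{if the } i\text{th subject drops out between the } (t-1)\text{th and } t\text{th observation times,} \\ 0 & \text{otherwise.} \end{cases}$$

1. Let $\beta^M \sim N_q(\beta_0, \Sigma_0)$, assuming that for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$, $X_{it}$ is $q$ dimensional.
2. Let $\beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(m)} \overset{\text{iid}}{\sim} N_r(\beta_0^{*}, \Sigma_0^{*})$, where $r \leq q$ since $Z_{it} \subseteq X_{it}$ for all $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$.
3. Let $\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2 \overset{\text{iid}}{\sim} U(a, b)$ where 0

7. For the time being we keep the prior of the remaining parameter unspecified.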
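The latent class probabilities described in this section can be sketched numerically under a cumulative-logit (proportional odds) form. The sketch assumes $\text{logit}\,P(S_i \leq j \mid D_i) = \lambda_{0j} + \lambda_1 D_i$ with ordered cutpoints; this parameterization, the function names and the numerical values are assumptions made for illustration and are not confirmed by the surviving text.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def class_probabilities(dropout_time, cutpoints, slope):
    """P(S = j | D) for j = 1..M under an assumed cumulative-logit model.

    logit P(S <= j | D) = cutpoints[j-1] + slope * D for j = 1..M-1,
    with cutpoints in non-decreasing order.
    """
    cumulative = [expit(c + slope * dropout_time) for c in cutpoints] + [1.0]
    probs, prev = [], 0.0
    for c in cumulative:
        probs.append(c - prev)  # difference of adjacent cumulative probabilities
        prev = c
    return probs


# Hypothetical values: M = 3 classes, two ordered cutpoints, negative slope.
early = class_probabilities(dropout_time=1, cutpoints=[-1.0, 1.0], slope=-0.5)
late = class_probabilities(dropout_time=6, cutpoints=[-1.0, 1.0], slope=-0.5)
```

With a negative slope, later dropout times shift probability mass away from the first class and towards the last one, which is the monotone dependence on dropout time that the proportional odds structure imposes.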
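Now, combining ( 5 5 ) and the priors specified above, we can write down the full posterior distribution of $\beta^{m}$ and $w$, $\pi(w, \beta^{m} \mid Y, X, D)$, up to a constant. Thus, we can get the full conditional distribution of all the relevant parameters and proceed with sample generation using MCMC.

The assumption of conditional independence between $Y_i$ and $D_i$ given $S_i$ and the covariates can be verified by performing a likelihood ratio test (Frequentist) or using Bayes factors (Bayesian). The null model is given by ( 5 ) and the alternative model may be written as

[display equation lost in extraction]

where $f(D_i)$ may be a smooth but unspecified function of $D_i$. Thus, the null hypothesis of conditional independence (between $Y_i$ and $D_i$ given $S_i$ and $X_i$) would be simply $f(D_i) = 0$. The test can be carried out by first fitting the null model (??). Then, the posterior probability of class membership for each subject can be estimated by

$\hat{P}(S_{ij} = 1 \mid D_i, Y_i, X_i, \hat{w}) = \dfrac{\int L_i(Y_i \mid Y_{\{i\}}, S_{ij} = 1, b_i, \hat{\beta}^{(j)}, \hat{\alpha})\, p(S_{ij} = 1 \mid D_i; \hat{\lambda})\, p(D_i \mid \hat{\varphi})\, dF(b_i \mid S_{ij}, \hat{\sigma}_j^2)}{\sum_{k=1}^{M} \int L_i(Y_i \mid Y_{\{i\}}, S_{ik} = 1, b_i, \hat{\beta}^{(k)}, \hat{\alpha})\, p(S_{ik} = 1 \mid D_i; \hat{\lambda})\, p(D_i \mid \hat{\varphi})\, dF(b_i \mid S_{ik}, \hat{\sigma}_k^2)}$

[…] ( 5 ) using a weighted likelihood (the weights being the above posterior probability of class membership). An alternative way of doing the above conditional independence tests would be to use score tests based on smoothing splines as used in proportional hazards models by Lin et al. ( 2006 ).

[…] ( 5 ) has the most general form. We can simplify it by assuming a linear effect of drop-out time, in which case the alternative (simpler) model would be

[display equation lost in extraction]

where each $h_j(\cdot)$ is a known function and the $\tau$'s are parameters. The null hypothesis would be $H_0: \tau_1 = \ldots = \tau_J = 0$. The linear drop-out effect would imply $J = 1$ and $h(D_i) = D_i$. The LRT can then be performed as before by fitting models ( 5 ) and ( 5 ) using the same weights given above. We can also use Bayes factors for carrying out these analyses.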
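The posterior class membership probabilities that serve as weights in the weighted likelihood follow from Bayes' rule once each class-specific likelihood contribution has been integrated over b_i. The sketch below assumes those integrated log-likelihoods are already available (the numbers in the usage note are made up), and `posterior_class_weights` is a hypothetical helper, not part of the original analysis. The factor p(D_i | phi) is the same for every class, so it cancels in the normalization and is omitted.

```python
import math


def posterior_class_weights(log_liks, class_probs):
    """Posterior P(S_ij = 1 | D_i, Y_i, ...) for one subject, j = 1..M.

    log_liks    -- log of the integrated likelihood contribution per class
    class_probs -- P(S_ij = 1 | D_i) per class, e.g. from the proportional
                   odds model evaluated at the fitted parameters
    """
    if len(log_liks) != len(class_probs):
        raise ValueError("one log-likelihood and one probability per class")
    # Work on the log scale; a class with zero prior mass gets weight zero.
    log_terms = [
        ll + math.log(p) if p > 0 else -math.inf
        for ll, p in zip(log_liks, class_probs)
    ]
    largest = max(log_terms)
    if largest == -math.inf:
        raise ValueError("all classes have zero probability")
    # Subtract the maximum before exponentiating to avoid underflow.
    unnormalized = [math.exp(t - largest) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

As a usage example, `posterior_class_weights([-10.0, -12.0], [0.5, 0.5])` gives the first class a weight of 1 / (1 + exp(-2)). Working on the log scale is a design choice: products of many Bernoulli terms over a long follow-up easily underflow if exponentiated directly.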
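Heagerty ( 1999 ) proposed Marginally Specified Logistic Normal models for longitudinal binary data. He proposed two models: the first one was a marginal logistic regression model which links the average response to the covariates by the following equation:

$\text{logit}\, E(Y_{ij} \mid X_{ij}) = X_{ij}'\beta$

Here $Y_{ij}$ and $X_{ij}$ respectively denote the binary response and the exogenous covariate vector recorded at time $j$ for the $i$th subject, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, n_i$. The second model is a conditional model which explains the within-subject dependence among […]

$\text{logit}\, E(Y_{ij} \mid b_i, X_i) = \Delta_{ij} + b_{ij}$

An important assumption that is made is that conditional on $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, the components of $Y_i$ are independent. Finally, it is assumed that $(b_i \mid X_i) \sim N(0, \Sigma_i)$ where $\Sigma_i$ models the dependence among the $b_i$'s (and thus, indirectly among the $Y_i$'s) and can be obtained as a function of the observation times $t_i = (t_{i1}, t_{i2}, \ldots, t_{in_i})$ and a parameter vector $\alpha$. Heagerty ( 1999 ) referred to the models given in ( 5 ) and ( 5 ) as the marginally specified logistic normal models.

Under the above modelling framework, the parameter $\Delta_{ij}$ can be expressed as a function of both the marginal linear predictor $\eta_{ij} = X_{ij}'\beta$ and $\sigma_{ij}$, the standard deviation of $b_{ij}$. Writing $b_{ij}$ as $\sigma_{ij} z$ where $z \sim N(0, 1)$, $\Delta_{ij}$ can be obtained as the solution to the following convolution equation:

$h(\eta_{ij}) = \int h(\Delta_{ij} + \sigma_{ij} z)\, \phi(z)\, dz$

where $h(\cdot)$ is the inverse of the logit link and $\phi(\cdot)$ is the standard normal density function. Given $(\eta_{ij}, \sigma_{ij})$, the above equation can be solved for $\Delta_{ij}$ using numerical integration and Newton-Raphson iteration.

[…] ( 5 ) will be a function of the marginal mean parameters $\beta$ and the random effects covariance parameters $\alpha$ and should be computed for both the maximum likelihood and estimating equation methodology ( Heagerty 1999 ). For maximum likelihood estimation, the contribution of the $i$th subject to the observed data likelihood is ascertained by first assuming a linear transformation of the form $b_i = C_i z_i$ where $C_i$ is an $n_i \times q$ matrix and $z_i \sim N_q(0, I_{q \times q})$. The above transformation effectively links up $b_i$ to a lower dimensional random effect $z_i$. The contribution of the $i$th subject (to the observed data likelihood) can now be expressed as a mixture over the random effects:

$L_i(\beta, \alpha) = \int \prod_{j=1}^{n_i} P(Y_{ij} = y_{ij} \mid z_i, X_i)\, \phi_q(z_i)\, dz_i$

where $\phi_q(z_i) = \prod_{k=1}^{q} \phi(z_{ik})$. Since $L_i(\beta, \alpha)$ cannot be evaluated analytically, numerical procedures are required to find its value.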
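The convolution equation, in which the inverse logit of the marginal linear predictor equals the integral of the inverse logit of (Delta_ij + sigma_ij z) against the standard normal density, can be solved as described with numerical integration and Newton-Raphson iteration. The following is a minimal sketch under stated assumptions: it uses a plain equally spaced grid on [-8, 8] for the integral rather than any particular quadrature rule from the cited papers, and the names `marginal_mean_and_slope` and `solve_delta` are hypothetical.

```python
import math


def expit(x):
    """Inverse logit h(.), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def marginal_mean_and_slope(delta, sigma, n_grid=2001, z_max=8.0):
    """Approximate the integral of h(delta + sigma*z) phi(z) dz and its
    derivative with respect to delta on an equally spaced grid."""
    step = 2.0 * z_max / (n_grid - 1)
    mean = slope = 0.0
    for i in range(n_grid):
        z = -z_max + i * step
        w = std_normal_pdf(z) * step
        p = expit(delta + sigma * z)
        mean += w * p
        slope += w * p * (1.0 - p)  # d/d(delta) of expit is p * (1 - p)
    return mean, slope


def solve_delta(eta, sigma, tol=1e-10, max_iter=50):
    """Newton-Raphson solution of h(eta) = E[h(delta + sigma*z)] for delta."""
    target = expit(eta)
    delta = eta  # the marginal predictor is a natural starting value
    for _ in range(max_iter):
        mean, slope = marginal_mean_and_slope(delta, sigma)
        new_delta = delta - (mean - target) / slope
        if abs(new_delta - delta) < tol:
            return new_delta
        delta = new_delta
    return delta
```

With `sigma = 0` the solution is `eta` itself, and with `eta = 0` it is zero for any `sigma` by symmetry. For other values, for instance `solve_delta(1.0, 2.0)`, the conditional intercept is larger in magnitude than the marginal predictor, which reflects the usual attenuation of marginal effects relative to subject-specific effects in logistic-normal models.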
Heagerty ( 2002 ) used Gauss-Hermite quadrature to perform the calculation but assumed $q = 1$. With increasing values of $q$, the computational burden increases exponentially and is not feasible at all. We are currently trying to develop alternative and less computationally intensive methodologies to accomplish the above objectives. We are working with multivariate logistic and multivariate $t$ distributions against a Bayesian framework as in O'Brien and Dunson ( 2004 ). We hope that this methodology will provide a better alternative to the arduous numerical methods mentioned below.

logdj=log+dlog#+logj+d0Z0cZj(t)(t)dt Thus, the likelihood will be A ) we have Differentiating ( A ) w.r.t and # and solving the resulting equations we have A ) and then exponentiating, we obtain the expression of L(,) in ( 2 ). Again, differentiating ( A ) w.r.t j, we have It is easy to show that if we replace ( A ) in ( A ) and then exponentiate, we get the expression for L(#,) in ( 2 ). Since the order of maximization is immaterial, it follows that L(,) and L(#,), once maximized over the nuisance parameters (# and Replacing the expression of dj from ( 2 ), we have 2 ).

(ii) First, we perform the transformation from to (,), where =JXj=1j. Thus, j=j, j=1,...,J. The Jacobian of the transformation will be J1. Using this transformation in ( A ) and after some manipulation, we have A ) w.r.t # we obtain Integration of ( A ) w.r.t yields ( 2 ) after some minor manipulation.

(iii) The order in which p(#,,jy) is integrated w.r.t the parameters does not make any difference in the marginal posterior density of p(). Thus, integration of p(w,jy) w.r.t w or p(,jy) w.r.t will yield the same marginal posterior density p(jy) of.

1. As in Seaman and Richardson (2004), the assumption of existence and finiteness of E0Z0cZq(t)(t)dt and E0Z0cZr(t)(t)dt is automatically satisfied provided the prior density p() ensures that E() exists and is finite.
2. The posterior propriety of p(#,,jy) in ( ) can be shown in a similar way to that in Seaman and Richardson (2001).
3. The prior distribution p() of induces a prior distribution on the influence function f(t),ct0g in the logistic case-control model in ( 23 ) since (t)=0(t),ct0.
LetP(D=djX(t)=Zk(t),ct0)=pdk,(d=0,1,...,r;k=1,...,K)andP(X(t)=Zk(t),ct0jD=0)=k=PKl=1l.LetndkbethenumberofindividualswithD=dandfX(t)=Zk(t),ct0g.ItcanbeshownthatP(X(t)=Zk(t),ct0jD=d)=kpdk=p0k KXl=1lpdl=p0l PAGE 125 KXl=1lpdl=p0l1CCCCCAndk. KXl=1ldl1CCCCCAndk. TheaugmentedmodelisgivenbyZdkjdkpoisson(dk)wherelog(dk)=log(#d)+log(dk)+log(k),log(0k)=log(k),d=1,...,r;k=1,...,K. 125 PAGE 126 NotingthatZ10expk(1+rXd=1#ddk)!(k)Prd=0ndk1dk/1+rXd=1#ddk!Prd=0ndk,wehave,byintegratingoutin( A ), Now,integratingout(#1,...,#r)from( A ),wehave Next,wemakethetransformationk='kand'=KXl=1lhavingjacobian'1.Hencethepriordistributionin( A )becomes(,#,',)/rYd=1#1d!'1KYk=11k!(). PAGE 127 A )canberewrittenas Integratingout'from( A ),wehave KXl=1ldl1CCCCCAndkKYk=11k!() From( A )and( A ),itisclearthatposteriorinferencefortheparameterofinterest,remainsthesameundereithertheprospectivelikelihoodLportheretrospectivelikelihoodLRaslongastheposteriorisproper.Itcanbeshownthattheposteriorwillbeproperforanyproperpriorforifn0k18k=1,...,K. 127 PAGE 128 WehavetoshowthatIMwhereMisanynitepositiveconstant. Integratingrstw.r.t,wehave 2Xi(iXiZibi1)01(iXiZibi1)d=jXiX0i1Xij1=2exp1 2XiW0i1Wi+Q 2PiW0i1XiPiX0i1Xi1PiX0i1Wi,Wi=iZibi1and1=diag(21,22,...,2t). Now,W0i1Wi=W0i1=21=2Wi=S0iSiwhereSi=1=2Wi.Similarly,W0i1Xi=S0iTi,X0i1Wi=T0iSiandX0i1Xi=T0iTiwhereTi=1=2Xi. 128 PAGE 129 B )becomes1 2XiS0iSiXiS0iTiXiT0iTi1XiT0iSi=1 2S0SS0T(T0T)1T0S=1 2S0IT(T0T)1T0S=Q,say whereS=(S01,...,S0m)0andT=(T01,...,T0m)0.Since(IT(T0T)1T0)isidempotent,S0IT(T0T)1T0Sisnon-negative,implyingQ0andthusexp(Q)1. Next,weconsiderintegrationw.r.t2i.e Assumingmax=max(1,...,t),wehave,8j=1,...,t,2j2max)Xij2jX0ijXij2maxX0ij)Pi,jXij2jX0ij2maxPi,jXijX0ijandthus Combining( B )and( B ),wehaveIjXi,jXijX0ijj1=2Z...Z(2max)(p+1)=2tYj=1(2j)m=2cj+1exp(dj=2j)d21...d2t 2expdk (B) 129 PAGE 130 Combining( B )and( B ),wehave where=().Sinceallthecomponentsoftheintegrandin( B )haveproperdistributions,theaboveintegralwouldbenitethusprovingposteriorpropriety. 
Fortherandomwalkmodel,theintegrandin( B )willhaveanadditionallikelihoodtermQtj=1L(vjjvj1,2v)andapriorterm(2v).Thederivationwouldthenproceedexactlyasaboveandtheintegrandin( B )willalsocontaintheseadditionalterms.Butsincebothoftheseareproperdistributions(normalandinversegammarespectively),Iwillstillbeniteundertheconditionsstatedinthetheorem. 2Xi,j(ijX0ijZ0ijbi)01j(ijX0ijZ0ijbi)d<1(B) inordertoproveposteriorpropriety. Usingthesametypeofalgebraicmanipulationsasintheunivariatecase,theL.H.Sof( B )canbeshowntobe 2Xi,jW0ij1jWij+1 2Q 130 PAGE 131 Asbefore,theexpressionwithintheexponentin( B )canberewrittenasK=1 2Xi,jS0ijSijXi,jS0ijTijXi,jT0ijTijXi,jT0ijSij=1 2S0IT(T0T)1T0S0. Thus, exp1 2Xi,jW0ij1jWij+1 2Q1 So,inordertoproveposteriorpropriety,wehavetoshow Hereristheorderofj,j=1,2,...,t.(r=2inourcase). Letj1,j2,...,jrbethedistincteigenvaluesof1j,j=1,2,...,t.Sincejisavariancecovariancematrix,itispositivedeniteandsymmetric.Hence,1jalsohasthesameproperties.Thus,jk>0,8k=1,2,...,r. Now,8j=1,2,...,r, PAGE 132 2jXi,jXijX0ijj1 2 Sincej1jj=rYk=1jk,8j=1,...,t, 2=rYk=1(jk)(m+djr1) 2 Now,replacing( B )and( B )intheexpressionofIin( B ),wehave 2Z..Z(min)p+q+2 2tYj=1rYk=1(jk)(m+djr1) 2exp"TV1j1j whereTdenotestrace.Letmin=lm,l2[1,...,t];m2[1,...,r]. Then,II1I2where 2ZrYfk=1,k6=mg(lk)(m+dlr1) 2(lm)(m+dlpq2)r1 2expTV1l1l 2ZrYfk=1,k6=mg(lk)p+q+2 2j1lj(m+dlpq2)r1 2expTV1l1l 2exp"TV1j1j 2jVjjm+dj whichisnite. Thus,inordertoshowposteriorpropriety,wehavetoprovethatI2<1. 132 PAGE 133 2j1lj(m+dlpq2)r1 2expTV1l1l BytheAM-GMinequality,wehave, 21 2 21 2=1 2=1 2 where(l)kkdenotesthekthdiagonalelementof1l. Since1lhasaWishartdistribution,(l)kkkk2dl,(k=1,...,r)implyingthatPrk=1(l)kk<1. 
Combining ( B ) and ( B ), we have, IZ1 2j1lj(m+dlpq2)r1 2expTV1l1l 2ZrXk=1(l)kk(r1)(p+q+2) 2j1lj(m+dlpq2)r1 2expTV1l1l 2 where C=1 2 Now, 2(r1)(p+q+2) 2r1rXk=1((l)kk)(r1)(p+q+2) 2 2(r1)(p+q+2) 2r1ErXk=1((l)kk)(r1)(p+q+2) 2 which is finite because 2<18k=1,...,r)rXk=1((l)kk)(r1)(p+q+2) 2<1)ErXk=1((l)kk)(r1)(p+q+2) 2<1)ErXk=1(l)kk(r1)(p+q+2) 2<1 Thus I is finite, implying posterior propriety.

1. and is the p+K+1 order prior variance-covariance matrix of.
2.
3. and b is the q+M+1 order variance-covariance matrix of b.
4. and is the r+K+2 order variance-covariance matrix of (,).
5. 2,+(Zi0Mib0iQi) 2 where
6. 2NXi=1b2ij, j=0,...,q.
2NXi=1ni+1,1 2NXi=1niXj=1yijp,(aij)0q,(aij)0bi2.
8. 2KXk=12p+k.
9. 2NXi=1q+MXj=q+1b2ij.
10. 2KXk=12r+k.

Here, G(x,y) denotes a Gamma density with shape parameter x and rate parameter y respectively.

C.2.1 Semiparametric Univariate Small Area Model

1. 20+d 2mXi=1(ijX0ijZ0ijbi)2+dj 2mXi=1b2i+d
1.
10.

Agresti, A. (2002). Categorical data analysis. Wiley.
Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669.
Altham, P. (1971). The analysis of matched proportions. Biometrika 58, 561.
Ashby, D., Hutton, J., and McGee, M. (1993). Simple Bayesian analyses for case-controlled studies in cancer epidemiology. Statistician 42, 385.
Battese, G., Harter, R., and Fuller, W. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83, 28.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. Bulletin of the International Statistical Institute.
Botts, C. and Daniels, M. (2008). A flexible approach to Bayesian multiple curve fitting. Computational Statistics and Data Analysis 52, 5100.
Bradlow, E. and Zaslavsky, A. (1997). Case influence analysis in Bayesian inference. Journal of Computational and Graphical Statistics 6, 314.
Breslow, E. T. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1. International Agency for Research on Cancer, Lyon.
Breslow, E. T., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation of multiple relative risk functions in matched case-control studies. American Journal of Epidemiology 108, 299.
Breslow, N. (1996). Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14.
Carroll, R. J., Wang, S., and Wang, C. Y. (1995). Prospective analysis of logistic case-control studies. Journal of the American Statistical Association 90, 157.
Catalona, W., Partin, A., Slawin, K., and Brawer, M. (1998). Use of the percentage of free prostate-specific antigen to enhance differentiation of prostate cancer from benign prostatic disease: A prospective multicenter clinical trial. Journal of the American Medical Association 19, 1542.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11, 1269.
Cornfield, J., Gordon, T., and Smith, W. W. (1961). Quantal response curves for experimentally uncontrolled variables. Bulletin of the International Statistical Institute 38, 97.
Denison, D., Mallick, B., and Smith, A. (1998). Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B 60, 333.
Diggle, P., Heagerty, P., Liang, K., and Zeger, S. (2002). The analysis of longitudinal data, 2nd Edition. New York: Oxford University Press.
Diggle, P., Morris, S., and Wakefield, J. (2000). Point source modeling using matched case-control data. Biostatistics 1, 89.
DiMatteo, I., Genovese, C., and Kass, R. (2001). Bayesian curve fitting with free-knot splines. Biometrika 88, 1055.
Durban, M., Harezlak, J., Wand, M., and Carroll, R. (2004). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine 00, 1.
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science 11, 89.
Ericksen, E. and Kadane, J. (1985). Estimating the population in census year: 1980 and beyond (with discussion). Journal of the American Statistical Association 80, 98.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242.
Eubank, R. (1988). Spline smoothing and nonparametric regression. New York: Marcel Dekker.
Eubank, R. (1999). Nonparametric regression and spline smoothing. New York: Marcel Dekker.
Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. Chapman and Hall.
Fay, R. (1987). Application of multivariate regression to small domain estimation, in R. Platek, J. N. K. Rao, C. E. Särndal, and M. P. Singh (Eds). Small Area Statistics.
Fay, R. and Herriot, R. (1979). Estimation of income from small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1.
Gelfand, A. and Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika 85, 1.
Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398.
Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7, 457.
Ghosh, M. and Chen, M.-H. (2002). Bayesian inference for matched case-control studies. Sankhya, B 64, 107.
Ghosh, M., Nangia, N., and Kim, D. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association 91, 1423.
Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9, 55.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277.
Green, P. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711.
Green, P. and Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman and Hall/CRC.
Gustafson, P., Le, N., and Vallée, M. (2002). A Bayesian approach to case-control studies with errors in covariables. Biostatistics 3, 229.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1987). Robust statistics: The approach based on influence functions. Wiley.
Hanson, T. and Johnson, W. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Heagerty, P. (1999). Marginally specified logistic normal models for longitudinal binary data. Biometrics 55, 688.
Heagerty, P. (2002). Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 58, 342.
Hogan, J. and Laird, N. (1998). Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16, 239.
Hogan, J., Roy, J., and Korkontzelou, C. (2004). Tutorial in biostatistics: Handling drop-out in longitudinal studies. Statistics in Medicine 23, 1455.
Jiang, J. and Lahiri, P. (2006). Mixed model prediction and small area estimation. Test 15, 1.
Johnson, V. (2004). A Bayesian χ² test for goodness-of-fit. Annals of Statistics 32, 2361.
Lewis, M., Heinemann, L., MacRae, K., Bruppacher, R., and Spitzer, W. (1996). The increased risk of venous thromboembolism and the use of third generation progestagens: Role of bias in observational research. Contraception 54, 5.
Lin, J., Zhang, D., and Davidian, M. (2006). Smoothing spline based score tests for proportional hazards models. Biometrics 62, 803.
Lindstrom, M. (1999). Penalized estimation of free-knot splines. Journal of Computational and Graphical Statistics 8, 333.
Lipsitz, S., Parzen, M., and Ewell, M. (1998). Inference using conditional logistic regression with missing covariates. Biometrics 54, 295.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.
MacEachern, S. and Muller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 2, 223.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719.
Marshall, R. (1988). Bayesian analysis of case-control studies. Statistics in Medicine 7, 1223-1230.
Morris, C. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association 78, 47.
Muller, P., Parmigiani, G., Schildkraut, J., and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55, 858.
Muller, P. and Roeder, K. (1997). A Bayesian semiparametric model for case-control studies with errors in variables. Biometrika 84, 523.
O'Brien, S. and Dunson, D. (2004). Bayesian multivariate logistic regression. Biometrics 60, 739.
Opsomer, J., Claeskens, G., Ranalli, M., and Breidt, F. (2008). Non-parametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265.
Paik, M. and Sacco, R. (2000). Matched case-control data analyses with missing covariates. Applied Statistics 49, 145.
Park, E. and Kim, Y. (2004). Analysis of longitudinal data in case-control studies. Biometrika 91, 321.
Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403.
Rao, J. N. K. (2003). Small Area Estimation. Wiley InterScience, New York.
Rathouz, P., Satten, G., and Carroll, R. (2002). Semiparametric inference in matched case-control studies with missing covariate data. Biometrika 89, 905.
Robinson, G. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15.
Roeder, K., Carroll, R., and Lindsay, B. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722.
Roy, J. (2003). Modeling longitudinal data with non-ignorable dropouts using a latent dropout class model. Statistics in Medicine 59, 829.
Roy, J. and Daniels, M. (2008). A general class of pattern mixture models for nonignorable dropouts with many possible dropout times. Biometrics 64, 538.
Rubin, D. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 11, 735.
Ruppert, D. and Carroll, R. (2000). Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 2, 205.
Ruppert, D., Wand, M., and Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge, U.K.
Satten, G. and Carroll, R. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, 384.
Schildcrout, J. and Heagerty, P. (2007). Marginalized models for moderate to long series of longitudinal binary response data. Biometrics 63, 322.
Seaman, S. R. and Richardson, S. (2001). Bayesian analysis of case-control studies with categorical covariates. Biometrika 88, 1073.
Seaman, S. R. and Richardson, S. (2004). Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies. Biometrika 91, 15.
Sinha, S., Mukherjee, B., and Ghosh, M. (2004). Bayesian semiparametric modeling for matched case-control studies with multiple disease states. Biometrics 60, 41.
Sinha, S., Mukherjee, B., Ghosh, M., Mallick, B., and Carroll, R. (2005). Semiparametric Bayesian analysis of matched case-control studies with missing exposure. Journal of the American Statistical Association 100, 591.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics 25, 1371.
Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics.
Wand, M. (2003). Smoothing and mixed models. Computational Statistics 18, 223.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall.
Zelen, M. and Parker, R. (1986). Case control studies and Bayesian inference. Statistics in Medicine 5, 261-269.
Zhang, D., Lin, X., and Sowers, M. (2007). Two stage functional mixed models for evaluating the effect of longitudinal covariate profiles on a scalar outcome. Biometrics 63, 351.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association 96, 247.

Dhiman Bhadra received his Bachelor of Science in statistics from Presidency College, Calcutta (India) in 2002 and Master of Science in statistics from Calcutta University in 2004. He joined the Department of Statistics at University of Florida in January 2005 for pursuing a PhD in statistics. He plans to graduate in August 2010. |