OPTIMAL MATING DESIGNS AND OPTIMAL TECHNIQUES FOR ANALYSIS OF QUANTITATIVE TRAITS IN FOREST GENETICS

By

DUDLEY ARVLE HUBER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1993

ACKNOWLEDGEMENTS

I express my gratitude to Drs. T. L. White, G. R. Hodge, R. C. Littell, M. A. DeLorenzo and D. L. Rockwood for their time and effort in the pursuit of this work. Their guidance and wisdom proved invaluable to the completion of this project. I further acknowledge Dr. Bruce Bongarten for his encouragement to continue my academic career. I am grateful to Dr. T. L. White and the School of Forest Resources and Conservation at the University of Florida for funding this work. I extend special thanks to George Bryan and Dr. M. A. DeLorenzo of the Dairy Science Department and Greg Powell of the School of Forest Resources and Conservation for the use of computing facilities, programming help and aid in running the simulations required. Most importantly, I thank my family, Nancy, John and Heather, for their understanding and encouragement in this endeavor.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1  INTRODUCTION

CHAPTER 2  THE EFFICIENCY OF HALF-SIB, HALF-DIALLEL AND CIRCULAR MATING DESIGNS IN THE ESTIMATION OF GENETIC PARAMETERS WITH VARIABLE NUMBERS OF PARENTS AND LOCATIONS
  Introduction
  Methods
    Assumptions Concerning Block Size
    The Use of Efficiency (i)
    General Methodology
    Levels of Genetic Determination
    Covariance Matrix for Variance Components
    Covariance Matrix for Linear Combinations of Variance Components and Variance of a Ratio
    Comparison Among Estimates of Variances of Ratios
  Results
    Heritability
    Type B Correlation
    Dominance to Additive Variance Ratio
  Discussion
    Comparison of Mating Designs
    A General Approach to the Estimation Problem
    Use of the Variance of a Ratio Approximation
  Conclusions

CHAPTER 3  ORDINARY LEAST SQUARES ESTIMATION OF GENERAL AND SPECIFIC COMBINING ABILITIES FROM HALF-DIALLEL MATING DESIGNS
  Introduction
  Methods
    Linear Model
    Ordinary Least Squares Solutions
    Sum-to-Zero Restrictions
    Components of the Matrix Equation
    Estimation of Fixed Effects
  Numerical Examples
    Balanced Data (Plot-mean Basis)
    Missing Plot
    Missing Cross
    Several Missing Crosses
  Discussion
    Uniqueness of Estimates
    Weighting of Plot Means and Cross Means in Estimating Parameters
    Diallel Mean
    Variance and Covariance of Plot Means
    Comparison of Prediction and Estimation Methodologies
  Conclusions

CHAPTER 4  VARIANCE COMPONENT ESTIMATION TECHNIQUES COMPARED FOR TWO MATING DESIGNS WITH FOREST GENETIC ARCHITECTURE THROUGH COMPUTER SIMULATION
  Introduction
  Methods
    Experimental Approach
    Experimental Design for Simulated Data
    Full-Sib Linear Model
    Half-Sib Linear Model
    Data Generation and Deletion
    Variance Component Estimation Techniques
    Comparison Among Estimation Techniques
  Results and Discussion
    Variance Components
    Ratios of Variance Components
  General Discussion
    Observational Unit
    Negative Estimates
    Estimation Technique
    Recommendation

CHAPTER 5  GAREML: A COMPUTER ALGORITHM FOR ESTIMATING VARIANCE COMPONENTS AND PREDICTING GENETIC VALUES
  Introduction
  Algorithm
  Operating GAREML
  Interpreting GAREML Output
    Variance Component Estimates
    Predictions of Random Variables
    Asymptotic Covariance Matrix of Variance Components
    Fixed Effect Estimates
    Error Covariance Matrices
  Example
    Data
    Analysis
    Output
  Conclusions

CHAPTER 6  CONCLUSIONS

APPENDIX  FORTRAN SOURCE CODE FOR GAREML

REFERENCE LIST
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 2-1. Parametric variance components
Table 3-1. Data set for numerical examples
Table 3-2. Numerical results for examples
Table 4-1. Abbreviation for and description of variance component estimation methods
Table 4-2. Sets of true variance components
Table 4-3. Sampling variance for the estimates
Table 4-4. Bias for the estimates
Table 4-5. Probability of nearness
Table 5-1. Data for example

LIST OF FIGURES

Figure 2-1. Efficiency (i) for h²
Figure 2-2. Efficiency (i) for r_B
Figure 2-3. Efficiency (i) for γ
Figure 3-1. The overparameterized linear model
Figure 3-2. The linear model for a four-parent half-diallel
Figure 3-3. Intermediate result in SCA submatrix generation
Figure 3-4. Weights on overall cross means
Figure 4-1. Distribution of 1000 MIVQUE estimates

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OPTIMAL MATING DESIGNS AND OPTIMAL TECHNIQUES FOR ANALYSIS OF QUANTITATIVE TRAITS IN FOREST GENETICS

By

Dudley Arvie Huber

May 1993

Chairperson: Timothy L. White
Major Department: School of Forest Resources and Conservation

First, the asymptotic covariance matrix of the variance component estimates is used to compare three common mating designs for efficiency (maximizing the variance-reducing property of each observation) in estimating genetic parameters across numbers of parents and locations and varying genetic architectures. It is determined that the circular mating design is always superior in efficiency to the half-diallel design. For single-tree heritability, the half-sib design is most efficient. For estimating Type B correlation, maximum efficiency is achieved by either the half-sib or the circular mating design, and the change in rank for efficiency is determined by the underlying genetic architecture.

Another intent of this work is comparing analysis methodologies for determining parental worth. The first of these investigations examines ordinary least squares assumptions in the estimation of parental worth for the half-diallel mating design with balanced and unbalanced data. The conclusion from comparison of ordinary least squares to alternative analysis methodologies is that best linear unbiased prediction and best linear prediction are more appropriate to the problem of determining parental worth.

The next analysis investigation contrasts variance component estimation techniques across levels of imbalance for the half-diallel and half-sib mating designs for the estimation of genetic parameters, with plot means and individuals used as the unit of observation. The criteria for discrimination are variance of the estimates, mean square error, bias and probability of nearness. For all estimation techniques, individuals as the unit of observation produced estimates with the most desirable properties. Of the estimation techniques examined, restricted maximum likelihood is the most robust to imbalance.
The computer program used to produce restricted maximum likelihood estimates of variance components was modified to form a user-friendly analysis package. Both the algorithm and the outputs of the program are documented. Outputs available from the program include variance component estimates, generalized least squares estimates of fixed effects, the asymptotic covariance matrix for variance components, best linear unbiased predictions for general and specific combining abilities, and the error covariance matrix for predictions and estimates.

CHAPTER 1
INTRODUCTION

Analysis of quantitative traits in forest genetic experiments has traditionally been approached as a two-part problem. Parental worth would be estimated as fixed effects, with parental effects later considered as random effects for the determination of genetic architecture. While traditional, this approach is most probably suboptimal given the proliferation of alternative analysis approaches with enhanced theoretical properties (White and Hodge 1989). In this dissertation, emphasis is placed on the half-diallel mating design because of its omnipresence and the uniqueness of the analysis problem this mating design presents. The half-diallel mating design has been and continues to be used in the plant sciences (Sprague and Tatum 1942, Gilbert 1958, Matzinger et al. 1959, Burley et al. 1966, Squillace 1973, Weir and Zobel 1975, Wilcox et al. 1975, Snyder and Namkoong 1978, Hallauer and Miranda 1981, Singh and Singh 1984, Greenwood et al. 1986, and Weir and Goddard 1986). The unique feature of the half-diallel mating system which hinders analysis with many statistical packages is that a single observation contains two levels of the same main effect.
Optimality of mating design for the estimation of commonly needed genetic parameters (single-tree heritability, Type B correlation and dominance to additive variance ratio) is examined utilizing the asymptotic covariance matrix of the variance components (Kendall and Stuart 1963, Giesbrecht 1983 and McCutchan et al. 1989). Since genetic field experiments are composed of both a mating design and a field design, the central consideration in this investigation is which mating design with what field design (how many parents and across what number of locations within a randomized complete block design) is most efficient. The criterion for discernment among designs is the efficiency of the individual observation in reducing the variance of the estimate (Pederson 1972). This question is considered under a range of genetic architectures which spans that reported for coniferous growth traits (Campbell 1972, Stonecypher et al. 1973, Snyder and Namkoong 1978, Foster 1986, Foster and Bridgwater 1986, Hodge and White [in press]).

The investigation into optimal analysis proceeds by considering the ordinary least squares (OLS) treatment of estimating parental worth for the half-diallel mating design. OLS assumptions are examined in detail through the use of matrix algebra for both balanced and unbalanced data. The use of matrix algebra illustrates both the uniqueness of the problem and the interpretation of the OLS assumptions. Comparisons among OLS, generalized least squares (GLS), best linear unbiased prediction (BLUP) and best linear prediction (BLP) are made on a theoretical basis.

Although consideration of the field and mating design of future experiments is essential, the problem of optimal analysis of current data remains. In response to this need, simulated data with differing levels of imbalance, genetic architecture and mating design are utilized as a basis for discriminating among variance component estimation techniques in the determination of genetic architecture.
The levels of imbalance simulated represent those commonly seen in forest genetic data, such as less than 100% survival, missing crosses for full-sib mating designs, and only subsets of parents in common across locations for half-sib mating designs. The two mating designs are half-sib and half-diallel, with a subset of the previously used genetic architectures. The field design is a randomized complete block with fifteen families per block and six trees per family per block. The four criteria used to discriminate among variance component estimation techniques are probability of nearness (Pitman 1937), bias, variance of the estimates and mean square error (Hogg and Craig 1978).

The techniques compared for variance component estimation are minimum variance quadratic unbiased estimation (Rao 1971b), minimum norm quadratic unbiased estimation (Rao 1971a), restricted maximum likelihood (Patterson and Thompson 1971), maximum likelihood (Hartley and Rao 1967) and Henderson's Method 3 (Henderson 1953). These techniques are compared using the individual and plot means as the unit of observation. Further, three alternatives are explored for dealing with negative variance component estimates: accepting and living with negative estimates, setting negative estimates to zero, and re-solving the system with negative components set to zero.

The algorithm for the method which provided estimates with optimal properties across experimental levels was converted to a user-friendly program. This program, providing restricted maximum likelihood variance component estimates, uses Giesbrecht's algorithm (1983). Documentation of the algorithm and explanation of the program's output are provided along with the Fortran source code (appendix).
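The four discrimination criteria named above can be sketched concretely. The following is a minimal illustration (function names and toy numbers are mine, not the study's): bias, variance, and mean square error are computed from repeated estimates of a known true value, and the Pitman-style probability of nearness compares two techniques pairwise by how often each lands closer to the truth.

```python
import statistics

def bias(estimates, true_value):
    """Mean estimate minus the true parameter value."""
    return statistics.mean(estimates) - true_value

def variance(estimates):
    """Variance of the estimates around their own mean."""
    return statistics.pvariance(estimates)

def mse(estimates, true_value):
    """Mean square error = variance + bias^2."""
    return variance(estimates) + bias(estimates, true_value) ** 2

def prob_nearness(est_a, est_b, true_value):
    """Fraction of paired replicates in which technique A's estimate
    is closer to the true value than technique B's."""
    closer = sum(abs(a - true_value) < abs(b - true_value)
                 for a, b in zip(est_a, est_b))
    return closer / len(est_a)

# Hypothetical heritability estimates (true value 0.25) from two techniques:
reml_like = [0.24, 0.27, 0.23, 0.26]
ml_like = [0.20, 0.31, 0.19, 0.29]
nearness = prob_nearness(reml_like, ml_like, 0.25)
```

In a real comparison the estimate lists would come from the 1000-replicate simulations described in Chapter 4.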
CHAPTER 2
THE EFFICIENCY OF HALF-SIB, HALF-DIALLEL AND CIRCULAR MATING DESIGNS IN THE ESTIMATION OF GENETIC PARAMETERS WITH VARIABLE NUMBERS OF PARENTS AND LOCATIONS

Introduction

In forest tree improvement, genetic tests are established for four primary purposes: 1) ranking parents, 2) selecting families or individuals, 3) estimating genetic parameters, and 4) demonstrating genetic gain (Zobel and Talbert 1984). While the four purposes are not mutually exclusive, a test design optimal for one purpose is most probably not optimal for all (Burdon and Shelbourne 1971, White 1987). A breeder then must prioritize the purposes for which a given test is established and choose a design based on these priorities.

Within a genetic test design there are two primary components: mating design and field design. There have been several investigations of optimal designs for these two components, either separately or simultaneously, under various criteria. These criteria have included the efficient and/or precise estimation of heritability (Pederson 1972, Namkoong and Roberds 1974, Pepper and Namkoong 1978, McCutchan et al. 1985, McCutchan et al. 1989), precise estimation of variance components (Braaten 1965, Pepper 1983), and efficient selection of progeny (van Buijtenen 1972, White and Hodge 1987, van Buijtenen and Burdon 1990, Loo-Dinkins et al. 1990). Incorporated within this body of research has been a wide range of genetic and environmental variance parameters and field and mating designs. However, the models in previous investigations have been primarily constrained to consideration of testing in a single environment with a corresponding limited number of factors in the model, i.e., genotype by environment interaction and/or dominance variance are usually not considered.
This chapter focuses on optimal mating designs through consideration of three common mating designs (half-sib, half-diallel, and circular with four crosses per parent) for estimation of genetic parameters with a field design extending across multiple locations. In this chapter the approach to the optimal design problem is to maintain the basic field design within locations as randomized complete block with four blocks and a six-tree row-plot representing each genetic entry within a block (noted as one of the most common field designs by Loo-Dinkins et al. 1990). The number of families in a block, number of locations, mating design and number of parents within a mating design are allowed to change. Since optimality, besides being a function of the field and mating designs, is also a function of the underlying genetic parameters, all designs are examined across a range of levels of genetic determination (as varying levels of heritability, genotype by environment interaction and dominance) reflecting estimates for many economically important traits in conifers (Campbell 1972, Stonecypher et al. 1973, Snyder and Namkoong 1978, Foster 1986, Foster and Bridgwater 1986, Hodge and White (in press)).

For each design and level of genetic determination, a Minimum Variance Quadratic Unbiased Estimation (MIVQUE) technique and an approximation of the variance of a ratio (Kendall and Stuart 1963, Giesbrecht 1983 and McCutchan et al. 1989) are applied to estimate the variance of estimates of heritability, additive to additive plus additive by environment variance ratio, and dominance to additive variance ratio. These techniques use the true covariance matrix of the variance component estimates (utilizing only the known parameters and the test design, and precluding the need for simulated or real data) and a Taylor series approximation of the variance of a ratio.
The relative efficiencies of different test designs are compared on the basis of i (the efficiency of an individual observation in reducing the variance of an estimate, Pederson 1972). Thus this research explores which mating design, number of parents and number of locations is most efficient per unit of observation in estimating heritability, additive to additive plus additive by environment variance ratio, and dominance to additive variance ratio for several variance structures representative of coniferous growth traits.

Methods

Assumptions Concerning Block Size

As opposed to McCutchan et al. (1985), where block sizes were held constant and including more families resulted in fewer observations per family per block, in this chapter the blocks are allowed to expand to accommodate increasing numbers of families. This expansion is allowed without increasing either the variance among blocks or the variance within blocks. For the three mating designs discussed, the addition of one parent to the half-sib design increases block size by 6 trees (the plot for one half-sib family), the addition of a parent to the circular design increases block size by 12 trees (two plots for full-sib families), and the addition of a parent to the half-diallel design increases block size by 6p trees (where p is the number of parents before the addition, i.e., there are p new full-sib families per block). Therefore, block size is determined by the mating design and the number of parents. All comparisons among mating designs and numbers of locations are for equal block sizes, i.e., equal numbers of observations per location. This results in comparing mating designs with unequal numbers of parents in the designs and comparing two-location experiments against five-location experiments with equal numbers of observations per location but unequal total numbers of observations.
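The block-size arithmetic above can be made concrete. A minimal sketch (the function name is mine, not the author's): one half-sib family per parent gives 6p trees, the circular design with four crosses per parent gives 2p crosses (each cross shared by two parents) or 12p trees, and a p-parent half-diallel gives p(p-1)/2 crosses, so adding a parent adds 6p trees.

```python
def trees_per_block(design, p, plot_size=6):
    """Trees in one block for p parents under each mating design.
    Circular design assumed to make four crosses per parent (2p crosses)."""
    if design == "half-sib":
        return plot_size * p                  # one half-sib family plot per parent
    if design == "circular":
        return plot_size * 2 * p              # 2p full-sib crosses
    if design == "half-diallel":
        return plot_size * p * (p - 1) // 2   # p(p-1)/2 crosses
    raise ValueError(f"unknown design: {design}")

# Adding one parent to a 6-parent half-diallel adds 6p = 36 trees per block:
growth = trees_per_block("half-diallel", 7) - trees_per_block("half-diallel", 6)
```

This quadratic growth of the half-diallel block is why equal block sizes imply unequal numbers of parents across the three designs.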
The Use of Efficiency (i)

Efficiency is the tool by which comparisons are made and is the efficacy of the individual observations in an experiment in lowering the variance of parameter estimates. An increasing efficiency indicates that, for increasing experimental size, the additional observations have enhanced the variance-reducing property of all observations. Efficiency is calculated as

i = 1 / (N · Var(x))    (2-1)

where N is the total number of observations and Var(x) is the variance of a generic parameter estimate. Increasing N always results in a reduction of the variance of estimation, all other things being equal. Yet the change in efficiency with increasing N depends on whether the reduction in variance is adequate to offset the increase in N which caused the reduction. Comparing a previous efficiency i_o with the new efficiency i_n obtained by increasing N (i.e., increasing the number of parents in a mating design or increasing the number of locations in which an experiment is planted): since i_o = 1 / (N · Var(x)), then N · Var(x) = 1 / i_o and (N + ΔN)(Var(x) − ΔVar(x)) = 1 / i_n;

if i_o = i_n, then ΔVar(x) / Var(x) = ΔN / (N + ΔN);
if i_o < i_n, then ΔVar(x) / Var(x) > ΔN / (N + ΔN); and
if i_o > i_n, then ΔVar(x) / Var(x) < ΔN / (N + ΔN);

where Δ denotes the magnitude of the change. Viewing equation 2-1, if N is held constant and one design has a higher efficiency (i), that design must also produce parameter estimates which have a lower variance.

General Methodology

Sets of true variance components are calculated in accordance with a stated level of genetic control, and the design matrix is generated in correspondence with the field and mating design. Knowing the design matrix and a set of true variance components, a true covariance matrix of the variance component estimates is generated.
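The bookkeeping behind equation 2-1 and the comparison rule can be illustrated numerically; the observation counts and variances below are hypothetical, chosen only to show the mechanics:

```python
def efficiency(n_obs, var_estimate):
    """i = 1 / (N * Var(x)), equation 2-1."""
    return 1.0 / (n_obs * var_estimate)

# Hypothetical design: 480 observations with Var(estimate) = 0.004.
i_old = efficiency(480, 0.004)

# Doubling N means delta_N / (N + delta_N) = 480/960 = 50%.
# If the variance falls by exactly half, efficiency is unchanged:
i_same = efficiency(960, 0.002)

# If the variance falls by more than half, efficiency increases:
i_new = efficiency(960, 0.0015)
```

Here the larger design is worth the extra observations only in the third case, which is exactly the distinction the efficiency criterion is designed to capture.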
Once the covariance matrix of the variance components is in hand, the variances of and covariances between any linear combinations of the variance component estimates are calculated. From the covariance matrix for linear combinations, the variances of genetic ratios, as ratios of linear combinations of variance components, are approximated by a Taylor series expansion. Since the definition of a set of variance components and the formation of the design matrix are dependent on the linear model employed, discussion of specific methodology begins with linear models.

Linear Models

Half-diallel and circular designs

The scalar linear model employed for half-diallel and circular mating designs is

y_ijklm = μ + t_i + b_ij + g_k + g_l + s_kl + tg_ik + tg_il + ts_ikl + p_ijkl + w_ijklm    (2-2)

where y_ijklm is the mth observation of the klth cross in the jth block of the ith test; μ is the population mean; t_i is the random variable test environment ~ NID(0, σ²_t); b_ij is the random variable block ~ NID(0, σ²_b); g_k is the random variable female general combining ability (gca) ~ NID(0, σ²_gca); g_l is the random variable male gca ~ NID(0, σ²_gca); s_kl is the random variable specific combining ability (sca) ~ NID(0, σ²_sca); tg_ik is the random variable test by female gca interaction ~ NID(0, σ²_tg); tg_il is the random variable test by male gca interaction ~ NID(0, σ²_tg); ts_ikl is the random variable test by sca interaction ~ NID(0, σ²_ts); p_ijkl is the random variable plot ~ NID(0, σ²_p); w_ijklm is the random variable within plot ~ NID(0, σ²_w); and there is no covariance between the random variables in the model.
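A data vector following this scalar model can be simulated by drawing each random variable from its normal distribution. The sketch below is mine, not the dissertation's Fortran: it covers a single test and block (so the test-by-gca and test-by-sca interactions, which are confounded with gca and sca when only one test exists, are omitted), and it uses the level-one variances from Table 2-1 plus a hypothetical mean of 10.0.

```python
import random

def simulate_half_diallel(mu, sigma2, parents, trees_per_plot=6, rng=random):
    """Simulate within-plot observations for every cross of a half-diallel,
    one test and one block, following the scalar model of Eq. 2-2."""
    t = rng.gauss(0, sigma2["t"] ** 0.5)    # test environment effect
    b = rng.gauss(0, sigma2["b"] ** 0.5)    # block within test
    g = {k: rng.gauss(0, sigma2["gca"] ** 0.5) for k in parents}
    y = {}
    for k in parents:
        for l in parents:
            if l <= k:
                continue                     # half-diallel: k < l, no selfs, no reciprocals
            s = rng.gauss(0, sigma2["sca"] ** 0.5)   # sca of cross (k, l)
            p = rng.gauss(0, sigma2["p"] ** 0.5)     # plot effect
            y[(k, l)] = [mu + t + b + g[k] + g[l] + s + p +
                         rng.gauss(0, sigma2["w"] ** 0.5)   # within-plot deviation
                         for _ in range(trees_per_plot)]
    return y

# Variances from Table 2-1, level 1 (h2 = 0.1, r_B = 0.8, gamma = 1.0):
sigma2 = {"t": 1.0, "b": 0.5, "gca": 0.25, "sca": 0.25, "p": 0.6344, "w": 8.4281}
obs = simulate_half_diallel(10.0, sigma2, parents=[1, 2, 3, 4])
```

A four-parent half-diallel yields 4·3/2 = 6 crosses, so `obs` holds six plots of six trees each.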
This linear model in matrix notation is (dimensions below each model component):

y = 1μ + Z_T e_T + Z_B e_B + Z_G e_G + Z_S e_S + Z_TG e_TG + Z_TS e_TS + Z_P e_P + e_w    (2-3)
(n×1) (n×1) (n×t)(t×1) (n×b)(b×1) (n×g)(g×1) (n×s)(s×1) (n×tg)(tg×1) (n×ts)(ts×1) (n×p)(p×1) (n×1)

where y is the observation vector; Z_i is the portion of the design matrix for the ith random variable; e_i is the vector of unobservable random effects for the ith random variable; 1 is a vector of 1's; and n, t, b, g, s, tg, ts, and p are the numbers of observations, tests, blocks, gca's, sca's, test by gca interactions, test by sca interactions and plots, respectively.

Utilizing customary assumptions in half-diallel mating designs (Method 4, Griffing 1956), the variance of an individual observation is

Var(y_ijklm) = σ²_t + σ²_b + 2σ²_gca + σ²_sca + 2σ²_tg + σ²_ts + σ²_p + σ²_w;    (2-4)

and in matrix notation the covariance matrix for the observations is

Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_gca + Z_S Z_S'σ²_sca + Z_TG Z_TG'σ²_tg + Z_TS Z_TS'σ²_ts + Z_P Z_P'σ²_p + I_n σ²_w    (2-5)

where ' indicates the transpose operator, all matrices of the form Z_i Z_i' are n×n, and I_n is an n×n identity matrix.

Half-sib design

The scalar linear model for half-sib mating designs is

y_ijkm = μ + t_i + b_ij + g_k + tg_ik + p*_ijk + w*_ijkm    (2-6)

where y_ijkm is the mth observation of the kth half-sib family in the jth block of the ith test; μ, t_i, b_ij, g_k, and tg_ik retain their definitions from Eq. 2-2; p*_ijk is the random variable plot, containing different genotype by environment components than Eq. 2-2, ~ NID(0, σ²_p*); w*_ijkm is the random variable within plot, containing different levels of genotypic and genotype by environment components than Eq. 2-2, ~ NID(0, σ²_w*); and there is no covariance between the random variables in the model. The matrix notation model is

y = 1μ + Z_T e_T + Z_B e_B + Z_G e_G + Z_TG e_TG + Z_P e_P* + e_w*    (2-7)
(n×1) (n×1) (n×t)(t×1) (n×b)(b×1) (n×g)(g×1) (n×tg)(tg×1) (n×p)(p×1) (n×1)

The variance of an individual observation in half-sib designs is

Var(y_ijkm) = σ²_t + σ²_b + σ²_gca + σ²_tg + σ²_p* + σ²_w*;    (2-8)

and

Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_gca + Z_TG Z_TG'σ²_tg + Z_P Z_P'σ²_p* + I_n σ²_w*.    (2-9)

Levels of Genetic Determination

Eight levels of genetic determination are derived from a factorial combination of two levels of each of three genetic ratios: heritability,

h² = 4σ²_gca / (2σ²_gca + σ²_sca + 2σ²_tg + σ²_ts + σ²_p + σ²_w) for full-sib models and
h² = 4σ²_gca / (σ²_gca + σ²_tg + σ²_p* + σ²_w*) for half-sib models;

additive to additive plus additive by environment variance ratio (r_B = σ²_gca / (σ²_gca + σ²_tg), the Type B correlation of Burdon 1977); and dominance to additive variance ratio (γ = σ²_sca / σ²_gca). The levels employed for each ratio are h² = 0.1 and 0.25; r_B = 0.5 and 0.8; and γ = 0.25 and 1.0.

To generate sets of true variance components (Table 2-1) for half-diallel and circular mating designs from the factorial combinations of genetic parameters, the denominator of h² is set to 10 (arbitrarily, but without loss of generality), which, given the level of h², leads to the solution for σ²_gca. Solving for σ²_gca and knowing γ yields the value for σ²_sca. Knowing the level of r_B and σ²_gca allows the equation for r_B to be solved for σ²_tg. An assumption that γ equals the ratio σ²_ts / σ²_tg permits a solution for σ²_ts. A further assumption that σ²_p is seven percent of σ²_p + σ²_w yields a solution for both σ²_p and σ²_w. Finally, σ²_t and σ²_b are set to 1.0 and 0.5, respectively, for all treatment levels.

Table 2-1. Parametric variance components for the factorial combination of heritability (0.1 and 0.25), Type B correlation (0.5 and 0.8) and dominance to additive variance ratio (0.25 and 1.0) for full- and half-sib designs. σ²_t and σ²_b were maintained at 1.0 and 0.5, respectively, for all levels and designs. For half-sib designs the plot and within-plot entries are σ²_p* and σ²_w*.

Design  Level     h²    r_B   γ     σ²_gca  σ²_sca  σ²_tg   σ²_ts   σ²_p    σ²_w
Full    1         .1    .8    1.0   .2500   .2500   .0625   .0625   .6344   8.4281
Full    2         .1    .5    1.0   .2500   .2500   .2500   .2500   .5950   7.9050
Full    3         .1    .8    .25   .2500   .0625   .0625   .0156   .6508   8.6461
Full    4         .1    .5    .25   .2500   .0625   .2500   .0625   .6212   8.2538
Full    5         .25   .8    1.0   .6250   .6250   .1562   .1562   .5359   7.1203
Full    6         .25   .5    1.0   .6250   .6250   .6250   .6250   .4376   5.8125
Full    7         .25   .8    .25   .6250   .1562   .1562   .0391   .5769   7.6649
Full    8         .25   .5    .25   .6250   .1562   .6250   .1562   .5031   6.6844
Half    1 and 3   .1    .8    --    .2500   --      .0625   --      .4844   9.2031
Half    2 and 4   .1    .5    --    .2500   --      .2500   --      .4750   9.0250
Half    5 and 7   .25   .8    --    .6250   --      .1562   --      .4609   8.7579
Half    6 and 8   .25   .5    --    .6250   --      .6250   --      .4375   8.3125

In order to facilitate comparisons of half-sib mating designs with full-sib mating designs, σ²_gca and σ²_tg retain the same values for given levels of h² and r_B, and the denominator of heritability is again set to 10. To solve for σ²_p* and σ²_w*, the assumption that σ²_p* is five percent of σ²_p* + σ²_w* permits a solution for σ²_p* and σ²_w* and maintains σ²_p* approximately equal to and no larger than σ²_p of the full-sib mating designs (Namkoong et al. 1966) for the same levels of h² and r_B. Under the previous definitions, any effect of differences in γ on the magnitudes of σ²_p* and σ²_w* is disallowed. Thus, there are only four parameter sets for the half-sib mating design (Table 2-1).

Covariance Matrix for Variance Components

The base algorithm used to produce the covariance matrix for variance component estimates is from Giesbrecht (1983) and was rewritten in Fortran for ease of handling the study data. In using this algorithm, we assume that all random variables are independent and normally distributed and that the true variances of the random variables are known.
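The derivation of the full-sib rows of Table 2-1 from the three target ratios can be written out directly. A minimal sketch (the function name is mine): the h² denominator is fixed at 10, σ²_ts/σ²_tg is assumed equal to γ, and σ²_p is taken as 7% of σ²_p + σ²_w.

```python
def fullsib_components(h2, rb, gamma, denom=10.0, plot_frac=0.07):
    """Solve for the full-sib variance components of Table 2-1
    given heritability, Type B correlation and the sca/gca ratio."""
    gca = h2 * denom / 4.0               # h2 = 4*gca / denom
    sca = gamma * gca                    # gamma = sca / gca
    tg = gca * (1.0 - rb) / rb           # rb = gca / (gca + tg)
    ts = gamma * tg                      # assumed: ts/tg = gamma
    rest = denom - (2 * gca + sca + 2 * tg + ts)   # plot + within-plot
    plot = plot_frac * rest              # plot is 7% of plot + within
    within = rest - plot
    return gca, sca, tg, ts, plot, within

# Level 1 (h2 = 0.1, r_B = 0.8, gamma = 1.0) should reproduce
# (.2500, .2500, .0625, .0625, .6344, 8.4281) to table precision.
level1 = fullsib_components(0.1, 0.8, 1.0)
```

Running the function over all eight factorial combinations regenerates the full-sib half of Table 2-1.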
Under these assumptions, Minimum Norm Quadratic Unbiased Estimation (MINQUE, Rao 1972) using the true variance components as priors (the starting point for the algorithm) becomes MIVQUE (Rao 1971b), which requires normality and the true variance components as priors (Searle 1987), and for a given design the covariance matrix of the variance component estimates becomes fixed. A sketch of the steps from the MIVQUE equation (Eq. 2-10; Giesbrecht 1983, Searle 1987) to the true covariance matrix for variance component estimates is

{tr(Q V_i Q V_j)} σ̂² = {y'Q V_i Q y}    (2-10)
     (r×r)      (r×1)      (r×1)

then

σ̂² = {tr(Q V_i Q V_j)}⁻¹ {y'Q V_i Q y}

and

Var(σ̂²) = {tr(Q V_i Q V_j)}⁻¹ Var({y'Q V_i Q y}) {tr(Q V_i Q V_j)}⁻¹

where {a_ij} denotes a matrix whose elements are a_ij, with i = 1 to 8 and j = 1 to 8 in the full-sib designs, i.e., there is a row and column for every random variable in the linear model; tr is the trace operator, the sum of the diagonal elements of a matrix; Q = V⁻¹ − V⁻¹X(X'V⁻¹X)⁻X'V⁻¹, with V the covariance matrix of y and X the design matrix for fixed effects; V_i = Z_i Z_i', where i indexes the random variables test, block, etc.; σ̂² is the vector of variance component estimates; and r is the number of random variables in the model.

The variance of a quadratic form (where A is any nonnegative definite matrix of proper dimension) under normality is Var(y'Ay) = 2tr(AVAV) + 4μ'AVAμ (Searle 1987); however, the MINQUE derivation (Rao 1971a) requires that AX = 0, which in our case is A1 = 0, so the term in μ vanishes. Thus

Var({y'Q V_i Q y}) = 2{tr(Q V_i Q V_j)};    (2-11)

and using Eq. 2-10 and Eq. 2-11

Var(σ̂²) = {tr(Q V_i Q V_j)}⁻¹ 2{tr(Q V_i Q V_j)} {tr(Q V_i Q V_j)}⁻¹

and therefore

Var(σ̂²) = V_c = 2{tr(Q V_i Q V_j)}⁻¹.    (2-12)

From Eq. 2-12 it is seen that the MIVQUE covariance matrix of the variance component estimates is dependent only on the design matrix (the result of the field design and mating design) and the true variance components; a data vector is not needed.
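Equation 2-12 is easy to verify on a toy design. The sketch below is my construction, not the study's Fortran code: it builds V_c = 2{tr(Q V_i Q V_j)}⁻¹ for a balanced one-way random model (g groups of n trees, the mean as the only fixed effect), for which the resulting variance of the residual component has the known closed form 2σ⁴_w / (N − g).

```python
import numpy as np

def mivque_covariance(Z_list, sigma2, n_obs):
    """True covariance matrix of MIVQUE variance component estimates,
    Eq. 2-12: V_c = 2 * {tr(Q V_i Q V_j)}^{-1}, with X = 1 (mean only).
    The residual component (V_i = I) is appended automatically."""
    V_i = [Z @ Z.T for Z in Z_list] + [np.eye(n_obs)]
    V = sum(s * Vi for s, Vi in zip(sigma2, V_i))        # true Var(y)
    X = np.ones((n_obs, 1))
    Vinv = np.linalg.inv(V)
    Q = Vinv - Vinv @ X @ np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv
    r = len(V_i)
    S = np.array([[np.trace(Q @ V_i[i] @ Q @ V_i[j]) for j in range(r)]
                  for i in range(r)])
    return 2.0 * np.linalg.inv(S)

# Balanced one-way design: g = 5 groups of n = 4 observations,
# sigma2_group = 2.0, sigma2_residual = 3.0.
g, n = 5, 4
Z = np.kron(np.eye(g), np.ones((n, 1)))                  # group incidence matrix
Vc = mivque_covariance([Z], sigma2=[2.0, 3.0], n_obs=g * n)
# Vc[1, 1] is Var(residual estimate) = 2 * 3.0**2 / (20 - 5) = 1.2
```

For the dissertation's designs, `Z_list` would hold the incidence matrices for test, block, gca, sca, and the interaction terms instead of a single grouping factor.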
Covariance Matrix for Linear Combinations of Variance Components and Variance of a Ratio

Once the covariance matrix for the variance component estimates (Eq. 2-12) is created, the covariance matrix of linear combinations of these variance components is formed as

V_L = L' V_c L    (2-13)
(2×2) (2×r)(r×r)(r×2)

where L specifies the linear combinations of the variance components, i.e., the combinations of variance components in the denominator and numerator of the genetic ratio being estimated. A Taylor series expansion (first approximation) for the variance of a ratio, using the variances of and covariance between numerator and denominator, is then applied to the elements of V_L to produce the approximate variance of each of the three ratio estimates (Kendall and Stuart 1963):

Var(ratio) ≈ (1/D²)V_L(1,1) − 2(N/D³)V_L(1,2) + (N²/D⁴)V_L(2,2)    (2-14)

where the generic ratio is N/D and N and D are the parametric values; V_L(1,1) is the variance of N; V_L(1,2) is the covariance between N and D; and V_L(2,2) is the variance of D.

Comparison Among Estimates of Variances of Ratios

The approximate variances of the three ratio estimates (h², r_B, and γ) are compared across mating designs with equal (or approximately equal) numbers of observations, across numbers of locations, and across numbers of parents within a mating design, all within a level of genetic determination. The standard for comparison is i. Results are presented by the genetic ratio estimated so that direct comparisons may be made among the mating designs for equal numbers of observations within a number of locations for varying levels of genetic control. Number of genetic entries (number of crosses for full-sib designs and number of half-sib families for half-sib designs) is used as a proxy for number of observations since, for all designs, the number of observations equals twenty-four times the number of locations times the number of genetic entries.
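Equations 2-13 and 2-14 combine into a few lines. In the sketch below (variable names mine, numbers purely illustrative), `L` has one column per ratio term, so `L.T @ Vc @ L` is the 2×2 covariance matrix of (numerator, denominator), to which the first-order Taylor formula is applied:

```python
import numpy as np

def var_of_ratio(Vc, L, num, den):
    """First-order Taylor series variance of a ratio N/D, Eq. 2-13 and 2-14.
    Vc: r x r covariance matrix of variance component estimates.
    L:  r x 2 matrix; columns give numerator and denominator combinations.
    num, den: parametric values of numerator N and denominator D."""
    VL = L.T @ Vc @ L          # 2x2 covariance of (numerator, denominator)
    return (VL[0, 0] / den**2
            - 2.0 * num * VL[0, 1] / den**3
            + num**2 * VL[1, 1] / den**4)

# Illustrative numbers: ratio N/D = 2/10 with Var(N) = 0.5,
# Cov(N, D) = 0.1, Var(D) = 0.3 (L = identity, so Vc is already VL).
Vc = np.array([[0.5, 0.1],
               [0.1, 0.3]])
approx = var_of_ratio(Vc, np.eye(2), num=2.0, den=10.0)   # 0.005 - 0.0004 + 0.00012
```

For heritability, for example, the numerator column of L would pick out 4σ²_gca and the denominator column the sum of components in the h² denominator.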
Further, by plotting the two levels of numbers of locations on a single figure, a comparison is made of the utility of replicating a design across increasing numbers of locations. Efficiency plots also permit contrasts of the absolute magnitude of variance of estimation among designs. For a given number of genetic entries and locations, the design with the highest efficiency is the most precise (lowest variance of estimation). Increasing the number of genetic entries or locations always results in greater precision (lower variance of estimation), but is not necessarily as efficient (the reduction in variance may not be sufficient to offset the increase in number of observations). A primary justification for using the efficiency of a design as a criterion is that a more precise estimate of a genetic ratio is obtained by using the mean of two estimates from replication of the small design as two disconnected experiments as opposed to the estimate from a single large design. This is true when 1) the number of observations in the large design (N) equals twice the number of observations in the small design (n₁), 2) the small design is more efficient, and 3) the variances are homogeneous. This is proven below. Since N = n₁ + n₂ and n₁ = n₂, then N = 2n₁. By definition i = 1/(N·Var(Ratio)), and Var(Ratio) = 1/(i·N). The proposition is

(Var₁(Ratio) + Var₂(Ratio))/4 < Var_L(Ratio);

substitution gives

((1/(n₁·i_s)) + (1/(n₂·i_s)))/4 < 1/(N·i_L).

Simplification yields 1/(2n₁·i_s) < 1/(N·i_L); and multiplication by N produces

1/i_s < 1/i_L,    (Eq. 2-15)

which is strictly true so long as i_s > i_L, where i_s is the efficiency of the smaller experiment and i_L is the efficiency of the larger experiment.

[Figure 2-1]
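The proposition in Eq. 2-15 can be illustrated numerically. All numbers below are invented for the example: a small design that is more efficient than a large design of twice the size, for which the mean of two disconnected small-design estimates beats the single large-design estimate.

```python
def efficiency(n_obs, var_ratio):
    """i = 1 / (N * Var(ratio)), as defined in the text."""
    return 1.0 / (n_obs * var_ratio)

# Hypothetical values: small design n = 120 observations, Var = 0.004;
# large design N = 2n = 240 observations, Var = 0.0025.
n, var_small = 120, 0.004
N, var_large = 240, 0.0025
i_small = efficiency(n, var_small)   # ~2.083
i_large = efficiency(N, var_large)   # ~1.667

# Variance of the mean of two independent small-design estimates:
var_mean_of_two = (var_small + var_small) / 4.0   # = 0.002
assert i_small > i_large              # premise 2) of the proposition
assert var_mean_of_two < var_large    # the claimed gain in precision
```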
Results

Heritability

Halfsib designs are almost globally superior to the two fullsib designs in precision of heritability estimates (results not shown for variance but may be seen from the efficiencies in Figure 2-1). For designs of equal size, halfsib designs excel with the exception of genetic level three (Figure 2-1c; h² = 0.1, r_B = 0.8, and γ = 0.25). In genetic level three, the circular design provides the most precise estimate of h² for two-location designs; however, when the design is extended across five locations, the halfsib mating design again provides the most precise estimates. The circular mating design is superior in precision to the halfdiallel design across all levels of genetic control and location, even with a relatively large number of crosses per parent (four). Halfsib designs are, in general (seven genetic control levels out of eight, Figure 2-1), more efficient, with the exception of level three across two locations (Figure 2-1c). For the circular and halfsib mating designs considered, increasing the number of genetic entries always improves the efficiency of the design. However, definite optima exist for the halfdiallel mating design for number of genetic entries, i.e., crosses, which convert to a specific number of parents. These optima are not constant but tend to be six parents or less, lower with increasing h² or number of locations. The six-parent halfdiallel is never far from the halfdiallel optima, and increasing the number of parents past the optimum results in decreased efficiency. For halfsib designs with h² = 0.1, five locations are more efficient than two locations; however, at h² = 0.25 two locations are most efficient. Further, the number of locations required to efficiently estimate h² for halfsib designs is determined only by the level of h² and does not depend on the levels of the other ratios.
Although estimates over larger numbers of observations are more precise (five-location estimates are more precise than two-location estimates), the efficiency (increase in precision per unit observation) declines. So if h² = 0.25 and estimates of a certain precision are required, disconnected sets of two-location experiments are preferred to five-location experiments. The relative efficiency of five locations versus two locations is enhanced with decreasing r_B (increasing genotype by environment interaction) within a level of h² (compare Figures 2-1a to 2-1b and 2-1c to 2-1d for h² = 0.1, and 2-1e to 2-1f and 2-1g to 2-1h for h² = 0.25). Yet this enhancement is not sufficient to cause a change in efficiency ranking between the location levels. The fullsib designs differ markedly from this pattern (Figure 2-1) in that, for these parameter levels, it is never more efficient to increase the number of locations from two to five for heritability estimation. As observed with halfsib designs, for fullsib designs the relative efficiency status of five locations improves with decreasing r_B. To further contrast mating designs, note that the efficiency status of fullsib designs relative to the halfsib design improves with decreasing γ and increasing r_B (Figures 2-1b versus 2-1c and 2-1f versus 2-1g).

Type B Correlation

As opposed to h² estimation, no mating design performs at or near the optima for precision of r_B estimates across all levels of genetic control (Figure 2-2). However, the circular mating designs produce globally more precise estimates than those of the halfdiallel mating design. In general, the utility of fullsib versus halfsib designs is dependent on the level of r_B. The lower r_B value favors halfsib designs while the higher r_B tends to favor fullsib designs (compare Figures 2-2a to 2-2b, 2-2c to 2-2d, 2-2e to 2-2f, and 2-2g to 2-2h).
Decreasing γ and lowering h² always improves the relative efficiency of fullsib designs to halfsib designs (compare Figures 2-2c and 2-2d to 2-2e and 2-2f).

[Figure 2-3. Efficiency (i) for γ plotted against number of genetic entries for four levels of genetic control for circular, halfdiallel, and halfsib mating designs across levels of location, where i = 1/(N(Var(γ))) and N = the total number of observations. Panels: (3a) h² = .1, r_B = .5, γ = 1.0; (3b) h² = .1, r_B = .5, γ = .25; (3c) h² = .25, r_B = .8, γ = 1.0; (3d) h² = .25, r_B = .8, γ = .25.]

For estimation of r_B, fullsib designs are more efficient than halfsib designs except in the three cases of low r_B (0.5) and high γ (1.0) for h² = 0.1 (Figure 2-2b) and low r_B for h² = 0.25 (Figures 2-2f and 2-2h). Within fullsib designs the circular design is globally superior to the halfdiallel. As with h² estimation, halfdiallel designs have optimal levels for numbers of parents. The six-parent halfdiallel is again close to these optima for all genetic levels and numbers of locations. At low h² for fullsib designs, planting in two locations is always more efficient than five locations.
For halfsib designs at low h², the relative efficiency of two versus five locations is dependent on the level of r_B, with lower r_B favoring replication across more locations. At h² = 0.25, halfsib designs are more efficient when replicated across five locations. At the higher h² value, fullsib design efficiency across locations is dependent on the level of r_B. With r_B = 0.5 and h² = 0.25, replication of fullsib designs is for the first time more efficient across five locations than across two; however, at the higher r_B level two locations is again the preferred number.

Dominance to Additive Variance Ratio

In comparing the two fullsib designs for relative efficiency in estimating γ, the circular design is always approximately equal to or, in most cases, superior to the halfdiallel design (Figure 2-3). The relative superiority of the circular design is enhanced by decreasing γ and r_B (not shown). The halfdiallel design again demonstrates optima for number of parents, with the six-parent design being near optimal. Within a mating design the use of two locations is always more efficient than the use of five locations. The magnitude of this superiority escalates with increasing r_B and h² (Figures 2-3a and 2-3b versus 2-3c and 2-3d).

Discussion

Comparison of Mating Designs

A priori knowledge of genetic control is required to choose the optimal mating and field design for estimation of h², r_B, and γ. Given that such knowledge may not be available, the choices are then based on the most robust mating designs and field designs for the estimation of certain of the genetic ratios. If h² is the only ratio desired, then the halfsib mating design is best. Estimation of both h² and r_B requires a choice between the halfsib and circular designs. If there is no prior knowledge, then the selection of a mating design depends on which ratio has the highest priority.
For experiments in which h² receives the highest weighting, the halfsib design is preferred; in the alternative case the circular design is the better choice. In the last scenario, information on all three ratios is desired from the same experiment, and in this case the circular design is the better selection, since the circular design is almost globally more efficient than the halfdiallel design. After choosing a mating design, the next decision is how many locations per experiment are required to optimize efficiency. For the halfsib design, the number of locations required to optimize efficiency depends on both the ratio being estimated and the level of genetic control. A broad inference is that for h² estimation a two-location experiment is more efficient and for r_B a five-location experiment has the better efficiency. Estimation of any of the three ratios with a fullsib design is almost globally more efficient in two-location experiments. The disparity between the behavior of the halfsib and fullsib designs with respect to the efficiency of location levels can be explained in terms of the genetic connectedness offered by the different designs. Genetic connectedness can be viewed as commonality of parentage among genetic entries: the more entries having a common parent, the more connectedness is present. The halfsib design is connected across locations only by the one common parent of a halfsib family in each replication. Fullsib designs are connected across locations in each replication by the fullsib cross plus, for each of the two parents in a cross, the number of parents minus two (halfdiallel) or three (circular). The connectedness in a fullsib design means each observation provides information about many other observations. The result of this connectedness is that, in general, fewer observations (numbers of locations) are required for maximum efficiency.
A General Approach to the Estimation Problem

The estimation problem may be viewed in a broader context than the specific solutions in this chapter. The technique for comparing mating designs and numbers of locations across levels of genetic determination may be construed, for the case of h² estimation, as the effect of these factors on the variance of σ²_g estimates. Viewing the variance approximation formula, the conclusion may be reached that the variance of σ²_g estimates is the controlling factor in the variance of h² estimates, since the other factors at these heritability levels are multiplied by constants which reduce their impact dramatically. Given this conclusion, the variance of h² estimates is essentially the (3,3) element of 2{tr(Q V_i Q V_j)}⁻¹ (Eq. 2-12). Further, since the covariances of the other variance component estimates with σ²_g estimates are small, the variance of σ²_g estimates is basically determined by the magnitude of the (3,3) element of {tr(Q V_i Q V_j)}, which is tr(Q V_g Q V_g). Thus, the variance of h² estimates is minimized by maximizing tr(Q V_g Q V_g); h² is used as the illustration because this simplification is possible. Considering the impact of changing levels of genetic control while holding the mating and field designs constant, V_g is fixed, the diagonal elements of V are fixed at 11.5 because of our assumptions, and only the off-diagonal elements of V change with genetic control levels. Since Q is a direct function of V⁻¹, what we observe in Figure 2-1 comparing a design across levels of genetic control are changes in V⁻¹ brought about by changes in the magnitude of the off-diagonal elements of V (covariances among observations). The effect of positive off-diagonal elements (the linear model specifies that all off-diagonal elements in V are zero or positive) on V⁻¹ is to reduce the magnitude of the diagonal elements and often also to produce negative off-diagonal elements.
If one increases the magnitude of the off-diagonal elements in V, then the magnitude of the diagonal elements of V⁻¹ is reduced and the magnitude of the negative off-diagonal elements is increased. Since tr(Q V_g Q V_g) is the sum of the squared elements of the product of a direct function of V⁻¹ and a matrix of nonnegative constants (V_g), as the diagonal elements of V⁻¹ are reduced and the off-diagonal elements become more negative, tr(Q V_g Q V_g) must become smaller and the variance of h² estimates increases. Mating designs may be compared by the same type of reasoning. Within a constant field design, changes in mating design produce alterations in V. Of the three designs, the halfsib produces a V matrix with the most zero off-diagonal elements, the circular design the next most, and the halfdiallel the fewest. Knowing the effect of off-diagonal elements on the variance of h² estimates, one could surmise that the variance of estimates is reduced in the order of least to most nonzero off-diagonal elements. This tenet is in basic agreement with the results in Figures 2-1 through 2-3. The effects of r_B and γ on the variance of h² estimates can also be interpreted using the above approach. In the results section of this chapter it is noted that decreasing the magnitude of r_B and/or γ causes fullsib designs to rise in efficiency relative to the halfsib design. In accordance with our previous arguments, this would be expected, since decreasing the magnitude of those two ratios decreases the magnitude of the off-diagonal elements. More precisely, decreasing γ reduces off-diagonal elements in V for the fullsib designs while not affecting the halfsib design, and decreasing r_B reduces off-diagonal elements in V for both fullsib and halfsib designs.
Relative increases in efficiency of fullsib designs result from the elements due to location by additive interaction occurring much less frequently in the halfsib designs; thus, the relative impact of a reduction in r_B in halfsib designs is less than that for fullsibs.

Use of the Variance of a Ratio Approximation

Use of Kendall and Stuart's (1963) first approximation (first-term Taylor series approximation) of the variance of a ratio has two major caveats. The approximation depends on large sample properties to approach the true variance of the ratio, i.e., with a small number of levels for the random variables the approximation does not necessarily closely approximate the true variance of the ratio. Work by Pederson (1972) suggests that for approximating the variance of h² at least ten parents are required in diallels before the approximation will converge to the true variance, even after including Taylor series terms past the first derivative. Pederson's work also suggests that the approximation is progressively worse for increasing heritability with low numbers of parents. Using the field design in this chapter (two locations, four blocks, and six-tree row-plots), simulation work (10,000 data sets) has demonstrated that, with a heritability of 0.1 using four parents in a halfdiallel across two locations, the variance of a ratio approximation yields a variance estimate for h² of 0.1, while the convergent value for the simulation was 0.08 (Huber, unpublished data). One should remember the dependence of the first approximation of the variance of a ratio on large sample properties when applying the technique to real data. The second caveat is that the range of estimates of the denominator of the ratio cannot pass through zero (Kendall and Stuart 1963). This constraint is of no concern for h²; however, the structure of the r_B
and γ denominators allows unbiased minimum variance estimates of those denominators to pass through zero, which means at one point in the distribution of the estimates of the ratios they are undefined (the distributions of these ratio estimates are not continuous). Simulation has shown that the variances of r_B and γ are much greater than the approximation would indicate (Huber, unpublished data). The discrepancy in the variance of the estimates could be partially alleviated by using a variance component estimation technique which restricts estimates to the parameter space 0 ≤ σ² < ∞. Nevertheless, because of the two caveats, approximations of the variance of h², r_B, and γ estimates should be viewed only on a relative basis for comparisons among designs and not on an absolute scale. Additionally, the expectation of a ratio does not equal the ratio of the expectations (Hogg and Craig 1978). If a value of a genetic ratio is sought such that the value equals the ratio of the expectations, then the appropriate way to calculate the ratio is to take the mean of variance components or linear combinations of variance components across many experiments and then take the ratio. If the value sought for h² is the expectation of the ratio, then taking the mean of many h² estimates is the appropriate approach. Returning to the results from simulated data (10,000 data sets) where the h² value was set at 0.1: using the ratio of the means of variance components rendered a value of 0.1 for h², the mean of the h² estimates returned a value of 0.08, and a Taylor series approximation of the mean of the ratio yielded 0.07 (Pederson 1972).

Conclusions

Results from this study should be interpreted as relative comparisons of the levels of the factors investigated. However, viewing the optimal design problem as illustrated in the discussion section of this chapter can provide insight into the more general problem.
There is no globally most efficient number of locations, number of parents, or mating design for the three ratios estimated, even within the restricted range of this study; yet some general conclusions can be drawn. For estimating h², the halfsib design is always optimal or close to optimal in terms of variance of estimation and efficiency. In the estimation of r_B and γ, the circular mating design is always optimal or near optimal in variance reduction and efficiency. Across numbers of parents within a mating design, only the halfdiallel shows optima for efficiency. The other mating designs have nondecreasing efficiency plots over the number of parents, so that while there is an optimal number of locations for a level of genetic control, the number of genetic entries per location is limited more by operational than by efficiency constraints. Two locations is a near-global optimum over five locations for the fullsib mating designs. Within the halfsib mating design, optimality depends on the levels of h² and r_B: 1) for h² estimation the optimal number of locations is inversely related to the level of h², i.e., at the higher level two tests were optimal and at the lower level five tests were optimal; and 2) for r_B estimation for the halfsib design, the optimal number of locations was also inversely related to the level of r_B. Means of estimates from disconnected sets provide lower variance of estimation where the smaller experiments have higher efficiencies. Thus, disconnected sets are preferred according to number of locations for all mating designs and according to number of parents for the halfdiallel mating design. In practical consideration of the optimal mating design problem, the results of this study indicate that if h² estimation is the primary use of a progeny test, then the halfsib mating design is the proper choice. Further, the circular mating design is an appropriate choice if the estimation of r_B is more important than h².
Finally, if a fullsib design is required to furnish information about dominance variance, the circular design provides almost globally better efficiencies for h², r_B, and γ than the halfdiallel.

CHAPTER 3
ORDINARY LEAST SQUARES ESTIMATION OF GENERAL AND SPECIFIC COMBINING ABILITIES FROM HALFDIALLEL MATING DESIGNS

Introduction

The diallel mating system is an altered factorial design in which the same individuals (or lines) are used as both male and female parents. A full diallel contains all crosses, including reciprocal crosses and selfs, resulting in a total of p² combinations, where p is the number of parents. Assumptions that reciprocal effects, maternal effects, and paternal effects are negligible lead to the use of the halfdiallel mating system (Griffing 1956, method 4), which has p(p−1)/2 parental combinations and is the mating system addressed in this chapter. Halfdiallels have been widely used in crop and tree breeding (Sprague and Tatum 1942, Gilbert 1958, Matzinger et al. 1959, Burley et al. 1966, and Squillace 1973), and the widespread use of this mating system continues today (Weir and Zobel 1975, Wilcox et al. 1975, Snyder and Namkoong 1978, Hallauer and Miranda 1981, Singh and Singh 1984, Greenwood et al. 1986, and Weir and Goddard 1986). Most of the statistical packages available treat fixed effect estimation as the objective of the program, with random variables representing nuisance variation. Within this context, a common analysis of halfdiallel experiments is conducted by first treating genetic parameters as fixed effects for estimation of general (GCA) and specific (SCA) combining abilities, and subsequently as random variables for variance component estimation (used for estimating heritabilities, genetic correlations, and general to specific combining ability variance ratios for determining breeding strategies). This chapter focuses on the estimation of GCA's and SCA's as fixed effects.
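The combination counts above are easily verified; the two helper functions below are ours, written only to illustrate the p² versus p(p−1)/2 contrast.

```python
def full_diallel(p):
    """All p^2 parental combinations, including reciprocals and selfs."""
    return p * p

def half_diallel(p):
    """Griffing (1956) method 4: no selfs, no reciprocal crosses."""
    return p * (p - 1) // 2

# For six parents: 36 full-diallel combinations but only 15 crosses
# in the half diallel.
assert full_diallel(6) == 36
assert half_diallel(6) == 15
```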
The treatment of GCA and SCA as fixed effects in OLS (ordinary least squares) is an entirely appropriate analysis if the comparisons are among parents and crosses in a particular experiment. If, as forest geneticists often wish to do, GCA estimates from disconnected experiments are to be compared, then methods such as checklots must be used to place the estimates on a common basis. Formulae (Griffing 1956, Falconer 1981, Hallauer and Miranda 1981, and Becker 1975) for hand calculation of general and specific combining abilities are based on a solution to the OLS equations for halfdiallels created by sum-to-zero restrictions, i.e., the sum of all effect estimates for an experimental factor equals zero. These formulae will yield correct OLS solutions for sum-to-zero genetic parameters provided the data have no missing cells. If cell (plot) means are used as the basis for the estimation of effects, there must be at least one observation per cell (plot), where a cell is a subclassification of the data defined by one level of every factor (Searle 1987). An example of a cell is the group of observations denoted by A_iB_j for a randomized complete block design with factor A across blocks (B). If the above formulae are applied without accounting for missing cells, incorrect and possibly misleading solutions can result. The matrix algebra approach is described in this chapter for these reasons: 1) in forest tree breeding applications, data sets with missing cells are extremely common; 2) many statistical packages do not allow direct specification of the halfdiallel model; 3) the use of a linear model and matrix algebra can yield relevant OLS solutions for any degree of data imbalance; and 4) viewing the mechanics of the OLS approach is an aid to understanding the properties of the estimates.
The objectives of this chapter are to (1) detail the construction of ordinary least squares (OLS) analysis of halfdiallel data sets to estimate genetic parameters (GCA and SCA) as fixed effects, (2) recount the assumptions and mathematical features of this type of analysis, (3) facilitate the reader's implementation of OLS analyses for diallels of any degree of imbalance and suggest a method for combining estimates from disconnected experiments, and (4) aid the reader in ascertaining what method is an appropriate analysis for a given data set.

Methods

Linear Model

Plot means are used as the unit of observation for this analysis, with unequal numbers of observations per plot. Plot (cell) means are always estimable as long as there is one observation per plot, and linear combinations of these means (least squares means) provide the most efficient way of estimating OLS fixed effects (Yates 1934). Throughout this chapter, estimates are denoted by lower case letters, parameters are designated by upper case letters, and matrices are in bold print. Using plot means as observations, a common scalar linear model for an analysis of a halfdiallel mating design with p(p−1)/2 crosses planted at a single location in a randomized complete block design with one plot per block is

y_ijk = μ + B_i + GCA_j + GCA_k + SCA_jk + e_ijk    (Eq. 3-1)

where y_ijk is the mean of the i-th block for the jk-th cross; μ is an overall mean; B_i is the fixed effect of block i for i = 1 to b; GCA_j (GCA_k) is the fixed general combining ability effect of the j-th female parent or k-th male parent, j or k = 1, ..., p (j ≠ k); SCA_jk is the fixed specific combining ability effect of parents j and k; and e_ijk is the random error associated with the observation of the jk-th cross in the i-th block, where e_ijk ~ (0, σ²). Cross by block interaction, as genotype by environment interaction, is treated as confounded with between-plot variation, as for contiguous plots.
The model in matrix notation is

y = Xβ + e    (Eq. 3-2)

where y is the vector of observations (nx1 = n rows and 1 column), where n equals the number of observations; X is the design matrix (nxm) whose function is to select the appropriate parameters for each observation, where m equals the number of fixed effect parameters in the model; β is the vector (mx1) of fixed effect parameters ordered in a column; and e is the vector (nx1) of deviations (errors) from the expectation associated with each observation.

Ordinary Least Squares Solutions

The matrix representation of an OLS fixed effects solution is

b = (X'X)⁻X'y    (Eq. 3-3)

where b is the vector of estimated fixed effect parameters, i.e., an estimate of β, and X is the design matrix either made full rank by reparameterization, or a generalized inverse of X'X may be used. Inherent in this solution is the ordinary least squares assumption that the variance-covariance matrix (V) of the observations (y) is equal to Iσ²_e, where I is an nxn identity matrix. The elements of an identity matrix are 1's on the main diagonal and all other elements are 0. Multiplying I by σ²_e places σ²_e on the main diagonal. In the covariance matrix of the observations, the variance of the observations appears on the main diagonal and the covariance between observations appears in the off-diagonal elements. Thus, V = Iσ²_e states that the variance of each observation equals σ²_e and that there are no covariances between the observations (which is one direct result of considering genetic parameters as fixed effects).

Sum-to-Zero Restrictions

The design matrix presented in this chapter is reparameterized by sum-to-zero restrictions to (1) reduce the dimension of the matrices to a minimal size, and (2) yield estimates of fixed effects with the same solution as common formulae in the balanced case.
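Eq. 3-3 can be sketched with a generalized inverse. The toy design below is invented (a mean column plus a three-level factor, so X'X is singular): the solution vector b is not unique, but estimable functions such as the fitted values Xb are invariant to the choice of generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overparameterized design: 6 observations, a mean column
# plus three group-indicator columns that sum to the mean column.
X = np.column_stack([np.ones(6), np.kron(np.eye(3), np.ones((2, 1)))])
y = rng.normal(size=6)

# b = (X'X)^- X'y, Eq. 3-3, using the Moore-Penrose generalized inverse.
b = np.linalg.pinv(X.T @ X) @ X.T @ y

# The fitted values are the cell (group) means, whatever generalized
# inverse is chosen -- an estimable function of the observations.
fitted = X @ b
cell_means = y.reshape(3, 2).mean(axis=1).repeat(2)
assert np.allclose(fitted, cell_means)
```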
Other restrictions, such as set-to-zero, could also be applied, so the discussion that follows treats sum-to-zero restrictions as a specific solution to the more general problem, which is finding an inverse for X'X. The subscripts 'o' and 's' refer to the overparameterized model and the reparameterized model with sum-to-zero restrictions, respectively. The matrix X_o of Figure 3-1 is the design matrix for an overparameterized linear model (Milliken and Johnson 1984, page 96). Overparameterization means that the equations are written in more unknowns (parameters, in this case 13) than there are equations (number of observations minus degrees of freedom for error, in this case 12 − 5 = 7) with which to estimate the parameters. Reparameterization as a sum-to-zero matrix overcomes this dilemma by reducing the number of parameters through making some of the parameters linear combinations of others. Sum-to-zero restrictions make the resulting parameters and estimates sum to zero even though the unrestricted parameters (for example, the true GCA values as applied to a broader population) do not necessarily sum to zero within a diallel. This is the problem of comparability of GCA estimates from disconnected experiments.

       μ  B1 B2 GCA1 GCA2 GCA3 GCA4 SCA12 SCA13 SCA14 SCA23 SCA24 SCA34
y112   1  1  0   1    1    0    0     1     0     0     0     0     0
y113   1  1  0   1    0    1    0     0     1     0     0     0     0
y114   1  1  0   1    0    0    1     0     0     1     0     0     0
y123   1  1  0   0    1    1    0     0     0     0     1     0     0
y124   1  1  0   0    1    0    1     0     0     0     0     1     0
y134   1  1  0   0    0    1    1     0     0     0     0     0     1
y212   1  0  1   1    1    0    0     1     0     0     0     0     0
y213   1  0  1   1    0    1    0     0     1     0     0     0     0
y214   1  0  1   1    0    0    1     0     0     1     0     0     0
y223   1  0  1   0    1    1    0     0     0     0     1     0     0
y224   1  0  1   0    1    0    1     0     0     0     0     1     0
y234   1  0  1   0    0    1    1     0     0     0     0     0     1

y = X_o β_o

Figure 3-1. The overparameterized linear model for a four-parent halfdiallel planted on a single site in two blocks, displayed as matrices. The design matrix (X_o) and parameter vector (β_o = [μ, B1, B2, GCA1, GCA2, GCA3, GCA4, SCA12, SCA13, SCA14, SCA23, SCA24, SCA34]') are shown in overparameterized form.
1's and 0's denote the presence or absence of a parameter in the model for the observed means (data vector, y). The parameters displayed above the design matrix label the appropriate column for each parameter. Error vector not exhibited.

[Figure 3-2. The linear model for a four-parent halfdiallel planted on a single site in two blocks, displayed as matrices. The design matrix (X_s) and the parameter vector (β_s = [μ, B1, GCA1, GCA2, GCA3, SCA12, SCA13]') are presented in sum-to-zero format. The parameters displayed above the design matrix label the appropriate column for each parameter.]

To illustrate the concept of sum-to-zero estimates versus population parameters, we use the expectation of a common formula. Becker (1975) gives equation 3-4 (which for balanced cases is equivalent to g_j = ((p−1)/(p−2))(Z̄_j. − Z̄..)) as the estimate of general combining ability for the j-th line, with p equalling the number of parents and Z_jk equalling the site mean of the j x k cross. This equation yields the same solution as the matrix equations with no missing plots or crosses and with a design matrix which contains the sum-to-zero restrictions. An evaluation of this formula in a four-parent halfdiallel planted in b blocks for the GCA of parent 1 is obtained by substituting the expectation of the linear model (equation 3-1) for each observation:

g_j = (1/(p(p−2)))(p Z_j. − 2 Z..)    (Eq. 3-4)

E{g_1} = E{(1/(p(p−2)))(p Z_1. − 2 Z..)}

E{g_1} = 3/4(GCA1) − 1/4(GCA2 + GCA3 + GCA4) + 1/4(SCA12 + SCA13 + SCA14) − 1/4(SCA23 + SCA24 + SCA34).

The result of equation 3-4 is obviously not GCA1 from the unrestricted model (equation 3-1). Thus g_1, an estimable function and an estimate of the parameter GCA_1s (the estimate of the GCA of parent 1 given the sum-to-zero restrictions), does not have the same meaning as GCA1 in the unrestricted model.
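The expectation argument can be checked numerically. The sketch below uses invented sum-to-zero parameter values and error-free balanced data (for clarity only); it generates the four-parent, two-block plot means of the example and shows that Eq. 3-4 then recovers the sum-to-zero GCA values exactly.

```python
import numpy as np
from itertools import combinations

p, blocks = 4, 2
crosses = list(combinations(range(p), 2))        # ordering as in Figure 3-1

# Invented sum-to-zero "true" parameters for an error-free illustration.
mu = 10.0
B = np.array([0.5, -0.5])                        # blocks sum to zero
G = np.array([1.0, 0.4, -0.3, -1.1])             # GCAs sum to zero
S = {(0, 1): 0.2, (0, 2): 0.1, (0, 3): -0.3,     # each parent's SCAs sum to 0
     (1, 2): -0.3, (1, 3): 0.1, (2, 3): 0.2}

# Plot means ordered block-major, crosses within block (as in the text).
y = np.array([mu + B[i] + G[j] + G[k] + S[(j, k)]
              for i in range(blocks) for (j, k) in crosses])

# Becker's (1975) Eq. 3-4: g_j = (1/(p(p-2))) * (p*Z_j. - 2*Z..),
# with Z_jk the site mean of cross j x k.
cross_means = y.reshape(blocks, len(crosses)).mean(axis=0)
Zj = np.array([sum(m for (j, k), m in zip(crosses, cross_means)
                   if parent in (j, k)) for parent in range(p)])
g = (p * Zj - 2 * cross_means.sum()) / (p * (p - 2))
assert np.allclose(g, G)   # the formula recovers the sum-to-zero GCAs
```

With real (noisy) data the formula no longer returns the true values, of course, but in the balanced case it still returns the same solution as the sum-to-zero matrix equations.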
An estimable function is a linear combination of the observations; but in order for an individual parameter in a model to be estimable, one must devise a linear combination of the observations such that the expectation has a weight of one on the parameter one wishes to estimate while having a weight of zero on all other parameters. A solution such as this does not exist for the individual parameters in the overparameterized model (equation 3-1). So, although the sum-to-zero restricted GCA parameters and estimates are forced to sum to zero for the sample of parents in a given diallel, the unrestricted GCA parameters only sum to zero across the entire population (Falconer 1981), and an evaluation of GCA1s demonstrates that the estimate contains other model parameters. The result of sum-to-zero restrictions is that the degrees of freedom for a factor equals the number of columns (parameters) for that factor in Xs (Figure 3-2). Thus, a generalized inverse for Xs'Xs is not required, since the number of columns in the sum-to-zero Xs matrix for each factor equals the degrees of freedom for that factor in the model (Xs is full column rank and provides a solution to equation 3-3).

Components of the Matrix Equation

The equational components of 3-2 are now considered in greater detail.

Data vector, y

Observations (plot means) in the data vector are ordered in the manner demonstrated in Figure 3-1. For our example, Figure 3-1 is the matrix equation of a four-parent half-diallel mating design planted in two randomized complete blocks on a single site. There are six crosses present in the two blocks for a total of 12 observations in the data vector, y. The observations are first sorted by block. Second, within each block the observations should be in the same sequence (for simplicity of presentation only).
This sequence is obtained by assigning numbers 1 through p to each of the p parents and then sorting all crosses containing parent 1 (whether as male or female) as the primary index, in ascending numerical order by the other parent of the cross as the secondary index. Next, all crosses containing parent 2 (primary index, as male or female) in which the other parent in the cross (secondary index) has a number greater than 2 are then also sorted in ascending order by the secondary index. This procedure is followed through using parent p-1 as the primary index.

Design matrix and parameter vector, X and B

The design matrix for a model is conceptually a listing of the parameters present in the model for each observation (Searle 1987, page 243). In Figure 3-1, y and Bo are exhibited, and the parameters in Bo are displayed at the tops of the columns of Xo (a visually correct interpretation of the multiplication of a matrix by a vector). For each observation in y, the scalar model (equation 3-1) may be employed to obtain the listing of parameters for that observation (the row of the design matrix corresponding to the particular observation). The convention for design matrices is that the columns for the factors occur in the same order as the factors in the linear model (equation 3-1 and Figure 3-1). Since design matrices can be devised by first creating the columns pertinent to each factor in the model (submatrices) and then horizontally and/or vertically stacking the submatrices, the discussion of the reparameterized design matrix formulation will proceed by factor.

Mean

The first column of Xs is for u and is a vector of 1's with the number of rows equalling the number of observations (Figure 3-2). The linear model (equation 3-1) indicates that all observations contain u, and the deviation of the observations from u is explained in terms of the factors and interactions in the model plus error.
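The cross ordering described above can be stated as a one-line rule; a minimal sketch (the helper name is ours), which for four parents yields the sequence y112, y113, y114, y123, y124, y134 of Figure 3-1 within a block:

```python
def halfdiallel_cross_order(p):
    """Ordered (primary, secondary) parent pairs for a p-parent half-diallel:
    primary index 1 through p-1, secondary index ascending above the primary."""
    return [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
```

For example, `halfdiallel_cross_order(4)` returns [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)], and a six-parent half-diallel yields the expected p(p-1)/2 = 15 crosses.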
Block

The number of columns for block is equal to the number of blocks minus one (column 2, Xs). Each row of a block submatrix consists of 1's, 0's, or -1's according to the identity of the observation for which the row is being formed. The normal convention is that the first column represents block 1, the second column block 2, etc., through block b-1. Since we have used a sum-to-zero solution (sum of the bi = 0), the effect due to block b is a linear combination of the other b-1 effects, i.e., bb = -(b1 + ... + b(b-1)), which in our example is 0 = b1 + b2, so b2 = -b1. Thus, the row of the block submatrix for an observation in block b (the last block) has a -1 in each of the b-1 columns, signifying that the block b effect is indeed a linear combination of the other b-1 block effects. Columns 2 and 3 of Xo (Figure 3-1) have become column 2 of Xs (Figure 3-2).

General combining ability

This submatrix of Xs is slightly more complex than those of the previous factors as a result of having two levels of a main effect present per observation, i.e., the deviation of an observation from u is modeled as the result of the GCA's of both the male and female parents (equation 3-1). Again we have imposed a restriction, the sum of the gcaj = 0. Since GCA has p-1 degrees of freedom, the submatrix for GCA should have p-1 columns, i.e., gcap = -(gca1 + ... + gca(p-1)). The GCA submatrix for Xs (columns 3 through 5 in Figure 3-2) is formed from Xo (columns 4 through 7 in Figure 3-1) in the same manner as the block submatrix: (1) add minus one to the elements in the other columns along each row containing a one for gcap (p = 4 in our example); and (2) delete the column from Xo corresponding to gcap. The GCA submatrix has p(p-1)/2 rows (the number of crosses), which, with no missing cells (plots), equals the number of observations per block. To form the GCA factor submatrix for a site, the GCA submatrix is vertically concatenated (stacked on itself) b times. This completes the portion of the Xs matrix for GCA.
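The two submatrix rules above can be sketched directly; the helper names are ours, and the construction reproduces the block and GCA columns of Figure 3-2 for the four-parent, two-block example.

```python
import numpy as np

def block_submatrix(b, n_per_block):
    """b-1 columns; rows for the last block are all -1's (b_b = -sum of the others)."""
    rows = []
    for i in range(1, b + 1):
        r = -np.ones(b - 1) if i == b else np.eye(b - 1)[i - 1]
        rows.extend([r] * n_per_block)
    return np.array(rows)

def gca_submatrix(p):
    """p-1 columns; crosses with parent p get -1's per gca_p = -sum of the others."""
    crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
    X = np.zeros((len(crosses), p - 1))
    for r, (j, k) in enumerate(crosses):
        if k < p:                      # neither parent is parent p
            X[r, j - 1] = X[r, k - 1] = 1.0
        else:                          # cross involves parent p
            X[r, :] = -1.0
            X[r, j - 1] = 0.0
    return X
```

For p = 4 the GCA rows come out as in Figure 3-2 ([1,1,0], [1,0,1], [0,-1,-1], [0,1,1], [-1,0,-1], [-1,-1,0]), and every column sums to zero, which is the orthogonality to the mean exploited later.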
Specific combining ability

In order to facilitate construction of the SCA submatrix, a horizontal direct product should be defined. A horizontal direct product, as applied to two column vectors, is the element-by-element product between the two vectors (SAS/IML User's Guide 1985) such that the element in the i-th row of the resulting product vector is the product of the elements in the i-th rows of the two initial vectors. The resultant product vector has dimension n x 1. A horizontal direct product is useful for the formation of interaction or nested factor submatrices, where the initial matrices represent the main factors and the resulting matrix represents an interaction or a nested factor (product rule, Searle 1987). (SAS/IML is the registered trademark of the SAS Institute Inc., Cary, North Carolina.)

The SCA submatrix can be formulated from the horizontal direct products of the columns of the GCA submatrix in Xs (Figure 3-2). The results from the GCA columns require manipulation to become the SCA submatrix (since degrees of freedom for SCA do not equal those of an interaction for a half-diallel analysis), but the GCA column products provide a convenient starting point. The column of the SCA submatrix representing the cross between the j-th and the k-th parents (SCAjk) is formed as the product between the GCAj and GCAk columns (Figure 3-3). The GCA columns in Figure 3-2 are multiplied in this order: column 1 times column 2 forming the first SCA column, column 1 times column 3 forming the second SCA column, and column 2 times column 3 forming the third SCA column (Figure 3-3). With four parents (six crosses) there are three degrees of freedom for GCA (p-1) and two degrees of freedom for SCA (6 crosses - 3 for GCA - 1 for the mean). Since SCA has only two degrees of freedom, a sum-to-zero design matrix can have only two columns for SCA.
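The horizontal direct product is simply an element-wise vector product; a minimal sketch (the function name is ours):

```python
import numpy as np

def hdp(u, v):
    """Horizontal direct product: element-by-element product of two column vectors."""
    return np.asarray(u) * np.asarray(v)
```

Applied to the GCA1 and GCA2 columns of Figure 3-2 within a block, [1, 1, 0, 0, -1, -1] and [1, 0, -1, 1, 0, -1], it yields [1, 0, 0, 0, 0, 1], the first intermediate SCA column of Figure 3-3.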
Imposing the restriction that the sum of the SCA's across all parents equals zero is equivalent to making the last column of the SCA submatrix (Figure 3-3) a linear combination of the others (Figure 3-2). The procedure for deleting the third column product is identical to that for the GCA submatrix: add minus one to every element in the rows of the remaining SCA columns in which a one appears in the column which is to be deleted (Figure 3-2, columns 6 and 7). The number of rows in the SCA submatrix equals the number of observations in a block, and the submatrix must be vertically concatenated b times to create the SCA submatrix for a site.

An algebraic evaluation of the SCA sum-to-zero restrictions requires that the SCA's sum to zero over the crosses of each parent and over all crosses; thus, for observations in the i-th block, with i serving to denote the row of the SCA submatrix in block i, sca_i14 = -sca_i12 - sca_i13, and the entries in the submatrix row for y_i14 are -1's. The estimate for sca_i14 is the negative of the sum of the independently estimated SCA's (sca_i12 and sca_i13) from the restriction that the sum of the SCA's across all parents equals zero. Likewise sca_i23 = -sca_i12 - sca_i13, so the row for y_i23 is also -1's. By sum-to-zero definition sca_i24 = -sca_i12 - sca_i23, and by substitution sca_i24 = -sca_i12 + (sca_i12 + sca_i13) = sca_i13. By the same protocol, it can be shown that sca_i34 = sca_i12. The elements in the rows of the SCA submatrix are 1's, -1's and 0's in accordance with the algebraic evaluation. Thus, while it may seem that there should be 6 SCA values (one for each cross), only 2 can be independently estimated and the remaining 4 are linear combinations of the independently estimated SCA's. Again the SCA sum-to-zero estimates are not equal to the parametric population SCA's. An analogous illustration for SCA to that for GCA would show that the estimable function (linear combination of observations) for a given SCAjk contains a variety of other parameters.
  Obs.   GCA1 x GCA2    GCA1 x GCA3    GCA2 x GCA3      SCA12  SCA13  SCA14
  y12    (1)(1)  = 1    (1)(0)  = 0    (1)(0)  = 0        1      0      0
  y13    (1)(0)  = 0    (1)(1)  = 1    (0)(1)  = 0        0      1      0
  y14    (0)(-1) = 0    (0)(-1) = 0    (-1)(-1) = 1       0      0      1
  y23    (0)(1)  = 0    (0)(1)  = 0    (1)(1)  = 1        0      0      1
  y24    (-1)(0) = 0    (-1)(-1) = 1   (0)(-1) = 0        0      1      0
  y34    (-1)(-1) = 1   (-1)(0) = 0    (-1)(0) = 0        1      0      0

Figure 3-3. Intermediate result in SCA submatrix generation (SCA columns as horizontal direct products of the GCA1, GCA2, and GCA3 columns within a block). The SCAjk column is the horizontal direct product of the columns for GCAj and GCAk.

Estimation of Fixed Effects

GCA parameters

The GCA parameters can be estimated (without mean, block, and SCA in the design matrix) through the use of equation 3-3, if there are no missing cell means (plots) for any cross and no missing crosses. The design matrix consists only of the GCA submatrix. This design matrix has p-1 columns (for GCA's: the third through the fifth columns of Xs). The b vector is an estimate of the GCA portion of Bs as in Figure 3-2, and the linear combination for the estimation of gcap is gcap = -(gca1 + ... + gca(p-1)). Parameters for any of the factors can be estimated independently using the pertinent submatrix as long as there are no missing cell means (plots) and no missing crosses; this uses a property known as orthogonality. Orthogonality requires that the dot product between two vectors equals zero (Schneider 1987, page 168). The dot product (a scalar) is the sum of the values in the vector obtained from the horizontal direct product of two vectors. For two factors to be orthogonal, the dot products of all the column vectors making up the section of the design matrix for one factor with the column vectors making up the portion of the design matrix for the second must be zero. If all factors in the model are orthogonal, then the Xs'Xs matrix is block diagonal. A block-diagonal Xs'Xs
matrix is composed of square factor submatrices (degrees of freedom x degrees of freedom) along the diagonal, with all off-diagonal elements not in one of the square factor submatrices equalling zero. A property of block-diagonal matrices is that the inverse can be calculated by inverting each block separately and replacing the original block in the full X'X matrix by the inverted block. Because the blocks can be inverted separately and all other off-diagonal elements of the inverse are zero, the effects for factors which are orthogonal to all other factors may be estimated separately, i.e., there are no functions of other sum-to-zero factors in the sum-to-zero estimates.

Mean, block, GCA and SCA parameters

All parameters are estimated simultaneously by horizontally concatenating the mean, block, GCA, and SCA submatrices to create Xs. Equation 3-3 is again utilized to solve the system of equations. The b vector for the four-parent example is an estimate of Bs of Figure 3-2. Again, one parameter is estimated for each column in the Xs matrix, and all parameter estimates not present are linear combinations of the parameter estimates in the b vector. So bb is equal to -(b1 + ... + b(b-1)), and gcap is equal to -(gca1 + ... + gca(p-1)). The linear combinations for SCA effects can be obtained by reading along the row of the SCA submatrix associated with the observation containing the parameter, i.e., in Figure 3-2 the observation y_i14 contains the effect sca14, which is estimated as the linear combination -sca12 - sca13. This completes the estimation of fixed effect parameters from a data set which is balanced on a plot-mean basis. Since field data sets with such completeness are a rarity in forestry applications, the next step is OLS analysis for various types of data imbalance. Calculations of solutions based on a complete data set and simulated data sets with common types of imbalance are demonstrated in numerical examples.
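The whole assembly and solution can be sketched end to end for the four-parent, two-block example. This is an illustrative reconstruction under the stated sum-to-zero rules (the helper names and the "true" effect values are invented for the demonstration): build Xs by horizontal concatenation, generate noiseless data from known sum-to-zero effects, and recover them with equation 3-3.

```python
import numpy as np

p, b = 4, 2
crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]

def gca_row(j, k):
    """Sum-to-zero GCA row for cross (j, k)."""
    r = [0.0] * (p - 1)
    if k < p:
        r[j - 1] = r[k - 1] = 1.0
    else:
        r = [-1.0] * (p - 1)
        r[j - 1] = 0.0
    return r

# SCA rows: independent columns for crosses (1,2) and (1,3); the dependent
# crosses are the +/-1 combinations derived algebraically in the text.
sca_rows = {(1, 2): [1, 0], (1, 3): [0, 1], (1, 4): [-1, -1],
            (2, 3): [-1, -1], (2, 4): [0, 1], (3, 4): [1, 0]}

X = []
for blk in range(1, b + 1):
    for (j, k) in crosses:
        block_part = [1.0] if blk == 1 else [-1.0]   # b - 1 = 1 column
        X.append([1.0] + block_part + gca_row(j, k)
                 + [float(v) for v in sca_rows[(j, k)]])
X = np.array(X)                                      # 12 x 7: u, B1, GCA1-3, SCA12-13

beta_true = np.array([3.0, 0.5, 0.8, -0.2, 0.1, 0.3, -0.4])  # invented values
y = X @ beta_true
b_hat = np.linalg.solve(X.T @ X, X.T @ y)            # equation 3-3 (Xs full rank)
```

Printing X.T @ X confirms the block-diagonal structure claimed above: the mean, block, GCA, and SCA blocks sit on the diagonal and all between-factor dot products are zero, so each factor could equally be estimated from its own submatrix.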
Numerical Examples

The data set analyzed in the numerical examples is from a five-year-old, six-parent half-diallel slash pine (Pinus elliottii var. elliottii Engelm.) progeny test planted on a single site in four complete blocks. Each cross is represented by a five-tree row plot within each block. Total height in meters and diameter at breast height (dbh in centimeters) are the traits selected for analysis. The data set is presented in Table 3-1 so that the reader may reconstruct the analysis and compare answers with the examples. The numbers 1 through 6 were arbitrarily assigned to the parents for analysis. Because of unequal survival within plots, plot means are used as the unit of observation.

Balanced Data (Plot-mean Basis)

The sum-to-zero design matrix for the balanced data set has (4 blocks) x (15 crosses) = 60 rows (which equals the number of observations in y) and has the following columns: one column for u, three columns for blocks (b-1), five columns for GCA (p-1), and nine columns for SCA (15 crosses - 5 - 1) for a total of 18 columns. With sixty plot means (degrees of freedom) and 18 degrees of freedom in the model, subtracting 18 from 60 yields 42 degrees of freedom for error, which matches the degrees of freedom for cross by block interaction, thus verifying that the degrees of freedom concur with the number of columns in the sum-to-zero design matrix. To illustrate the principle of orthogonality in the balanced case, the Xs'Xs and (Xs'Xs)^-1 matrices may be printed to show that they are block diagonal. In further illustration, the effects within a factor may also be estimated without any other factors in the design matrix and compared to the estimates from the full design matrix. The vectors of parameter estimates for height and dbh (Table 3-2) were calculated from the same Xs matrix because height and dbh measurements were taken on the same trees.
In other words, if a height measurement was taken on a tree, a dbh measurement was also taken, so the design matrices are equivalent.

Missing Plot

To illustrate the problem of a missing plot, the cross of parent two by parent three was arbitrarily deleted in block one (as if observation y123 were missing). This deletion prompts adjustments to the factor submatrices in order to analyze the new data set. The new vector of observations (y) now has 59 rows. This necessitates deletion of the row of the design matrix (Xs) in block 1 which would have been associated with cross 2 x 3. This is the only matrix alteration required for the analysis. Thus, the resultant Xs matrix has 60 - 1 = 59 rows and 18 columns. With 59 means in y and 18 columns in Xs, the degrees of freedom for error is 41. Comparisons between results of the analyses (Table 3-2) of the full data set and the data set missing observation y123 reveal that for this case the estimates of parameters have been relatively unaffected by the imbalance (magnitudes of GCA's changed only slightly and rankings by GCA were unaffected).

Table 3-1. Data set for numerical examples. Five-year-old slash pine progeny test with a six-parent half-diallel mating design present on a single site with four randomized complete blocks and a five-tree row plot per cross per block.
                         Mean      Mean     Within-Plot   Within-Plot   Trees
                         Height    DBH      Variance      Variance      per
  Block  Female  Male    (m)       (cm)     Height (m2)   DBH (cm2)     Plot
  1      1       2       2.6899    3.810    0.9800        3.484         4
  1      1       3       1.9080    2.134    1.4277        3.893         5
  1      1       5       3.1242    4.445    0.4487        1.656         4
  1      1       6       2.4933    3.200    0.8488        5.664         5
  1      2       5       1.4783    1.588    0.6556        2.167         4
  1      2       6       2.7026    3.471    0.1136        0.344         3
  1      3       2       3.0480    4.699    0.2341        0.968         4
  1      3       5       3.4991    5.131    0.0945        0.271         5
  1      3       6       2.4003    2.794    0.5149        1.548         4
  1      4       1       3.3955    4.928    0.1489        0.761         5
  1      4       2       3.4290    5.144    0.7943        3.285         4
  1      4       3       2.5298    2.984    0.9557        4.188         4
  1      4       5       2.4155    3.175    0.5936        2.946         4
  1      4       6       3.2004    4.521    1.7034        7.594         5
  1      5       6       2.2403    2.794    1.0433        6.280         4
  2      1       2       3.5662    5.080    0.9560        2.903         5
  2      1       3       2.6335    3.353    0.7695        3.497         5
  2      1       5       3.6942    5.893    0.0573        0.432         5
  2      1       6       3.4808    4.928    0.9222        2.890         5
  2      2       5       3.4260    4.877    0.7017        2.432         5
  2      2       6       2.4282    3.302    0.0616        0.452         3
  2      3       2       3.0480    4.064    0.0192        0.301         4
  2      3       5       2.8895    4.013    0.1957        0.690         5
  2      3       6       1.9406    1.863    0.0560        0.408         3
  2      4       1       3.0114    3.962    1.9753        6.342         5
  2      4       2       3.6454    5.283    0.1731        0.787         5
  2      4       3       2.9566    3.861    0.0506        0.174         5
  2      4       5       2.8118    4.382    1.1336        5.435         4
  2      4       6       3.2674    4.318    1.1211        4.354         5
  2      5       6       3.7917    5.893    0.0848        0.497         5
  3      1       2       2.2961    2.625    0.3914        1.699         3
  3      1       3       2.8956    4.128    1.2926        4.532         4
  3      1       5       2.5359    3.607    0.8284        4.303         5
  3      1       6       2.9032    3.937    0.8252        4.064         4
  3      2       5       2.7737    4.064    0.9829        3.226         2
  3      2       6       1.2040    0.635    0.4464        0.806         2
  3      3       2       2.9870    4.191    0.9049        2.989         4
  3      3       5       2.8407    3.962    0.7309        3.632         5
  3      3       6       1.3564    0.000    0.1677        0.000         2
  3      4       1       2.6746    3.620    0.8463        2.984         4
  3      4       2       2.7066    3.353    0.5590        1.787         5
  3      4       3       3.4198    4.623    0.3509        0.690         5
  3      4       5       3.3299    4.953    0.4102        1.226         4
  3      4       6       3.4564    4.978    0.8369        3.503         5
  3      5       6       3.2614    4.826    .
  4      1       2       1.8974    2.476    1.0160        3.629         4
  4      1       3       1.3005    0.508    0.2019        0.774         3
  4      1       5       2.0726    2.540    1.2235        5.097         3
  4      1       6       1.8821    1.778    0.4728        3.312         4
  4      2       5       1.64      1.334    0.5354        2.382         4
  4      2       6       1.5392    0.635    0.0376        0.806         2
  4      3       2       1.8898    2.032    0.7364        1.892         4
  4      3       5       2.5146    3.620    0.0876        0.446         4
  4      3       6       1.8389    2.201    0.0941        0.280         3
  4      4       1       2.3348    2.591    0.3816        2.722         5
  4      4       2       1.7272    1.693    2.1640        8.602         3
  4      4       3       1.6581    1.524    0.0537        0.903         5
  4      4       5       2.1184    2.286    0.3137        2.366         4
  4      4       6       1.5545    1.422    0.4803        1.019         5
  4      5       6       1.4122    1.693    0.0338        0.150         3

Table 3-2. Numerical results for examples of data imbalance using the OLS techniques presented in the text (values as printed in the source digitization).

  Estimate   Balanced          Missing Plot (b)   Missing Cross (c)   Five Missing
  of (a)                                                              Crosses (d, e)
             DBH     Height    DBH     Height     DBH     Height      DBH     Height
  u          3.362   2.5787    3.346   2.5830     3.260   2.5386      3.149   2.4980
  B1         0.292   0.1074    0.245   0.1203     0.245   0.1074      0.309   0.1393
  B2         0.976   0.5274    0.992   0.5230     1.023   0.5386      1.140   0.6041
  B3         0.205   0.1308    0.220   0.1264     0.187   0.1180      0.087   0.0689
  GCA1       0.144   0.0760    0.163   0.0706     0.270   0.1260      0.232   0.1361
  GCA2       .180    .1186     .220    .1077      .434    .2186       .493    .2371
  GCA3       .347    .1426     .386    .1316      .601    .2426       .952    .3972
  GCA4       0.398   0.2544    0.417   0.2489     0.524   0.3044      0.804   0.4241
  GCA5       0.489   0.1320    0.509   0.1265     0.616   0.1820      0.646   0.1746
  SCA12      0.172   0.0763    0.208   0.0665     0.400   0.1663
  SCA13      .628    .3277     .592    .3374      .400    .2377
  SCA14      .128    .0550     .152    .0484      .280    .1150
  SCA15      0.126   0.0700    0.102   0.0766     .026    0.0100
  SCA23      0.912   0.3600    0.771   0.3995
  SCA24      0.289   0.1627    0.324   0.1528     0.517   0.2527
  SCA25      .706    .3084     .670    .3185      .478    .2187
  SCA34      0.164   .0493     0.129   .0592      0.064   0.0406
  SCA35      0.677   0.3679    0.712   0.3580     0.905   0.4793

For the five-missing-crosses example the four independently estimated SCA's are (Height, DBH): (.2041, .410), (0.0480, 0.094), (0.1920, 0.408), and (0.1163, 0.246).

(a) where (numerical values are for height) b4 = -(b1 + b2 + b3) = -.7697; gca6 = -(gca1 + ... + gca5) = -.2067; scaj6 = -(sum of the scajk) for j = 1, 2, 3, giving sca16 = -.2428, sca26 = -.3002, and sca36 = -.3608; sca46 = -.2898; and the remaining two SCA's, .2446 and .1737, are linear combinations of the nine independently estimated SCA's.
(b) where the linear combinations for parameter estimates are identical to the balanced example.
(c) where scaj6 = -(sum of the scajk) for j = 1 to 3, and the two remaining SCA's are linear combinations of the eight independently estimated SCA's.
(d) where the SCA's for the remaining crosses that are not independently estimated are linear combinations of the four independently estimated SCA's (e.g., sca16 = -sca14 - sca15), the last being the negative of the sum of the four.
(e) where for all cases the linear combinations for block and gca are the same as in the balanced case.

Missing Cross

Another common form of imbalance in diallel data sets, the missing cross, is examined through arbitrary deletion of the 2 x 3 cross from all blocks, i.e., y123, y223, y323, and y423 are missing from the data vector. This type of imbalance is representative of a particular cross that could not be made and is therefore missing from all blocks. The matrix manipulations required for this analysis are again presented by factor. For appropriate SCA restrictions, the data vector and design matrix should be ordered so that the p-th parent has no missing crosses. Since the labeling of a parent as parent p is entirely subjective, any parent with all crosses present may be designated as parent p. The previous labelling directions are necessary since we generate the SCA submatrix as horizontal direct products of the columns of the GCA submatrix; to account for missing crosses, the horizontal direct products for the missing parental combinations are not calculated, which sets the missing SCA's to zero. If there is a cross missing from those of the p-th parent, we cannot account for the missing cross with this technique (Searle 1987, page 479). For the mean, block, and GCA submatrices, the adjustment for the missing cross dictates deleting the rows in the submatrices which would have corresponded to the y_i23 observations. The SCA submatrix must be reformed since a degree of freedom for SCA, and hence a column of the submatrix, has been lost.
The SCA submatrix is reinstituted from the GCA horizontal direct products (remembering that one cross, 2 x 3, no longer exists and therefore that the product GCA2 x GCA3 is inappropriate). Dropping the column for SCA23 is equivalent to setting SCA23 to zero (Searle 1987) so that the remaining SCA's will sum to zero. After that, the reformation is according to the established pattern. With one missing cross there are now 56 observations and hence 56 degrees of freedom available. The columns of the Xs matrix are now: one for the mean, three for block, five for GCA, and eight for SCA for a total of 17 columns. The remaining degrees of freedom for error is 39, matching the correct degrees of freedom ((14 - 1) x (4 - 1) = 39). For the missing cross example the estimate of u is no longer equivalent to the mean of the plot means, since the estimate of u is 2.5386 while the mean of the plot means is 2.5715 with N = 56 plot means. This is the result of GCA effects which are no longer orthogonal to the mean. Check the Xs'Xs matrix, or try estimating factors separately and compare to the estimates when all factors are included in Xs. If formulae for balanced data (Becker 1975, Falconer 1981, and Hallauer and Miranda 1981) are applied to unbalanced data (plot-mean basis), estimates of parameters are no longer appropriate because factors in the model are no longer independent (orthogonal). Applying Becker's formula, which uses totals of cross means for a site, to the missing cross example yields: gca1 = .2992, gca2 = .5649, gca3 = .5888, gca4 = .4665, gca5 = .3552, and gca6 = .0219. These answers are very different in magnitude from those in Table 3-2 for this example, and gca6 also has a different sign. Employing these formulae in the analysis of unbalanced data is analogous to matrix estimation of GCA's without the other factors in the model, which is inappropriate.
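The loss of orthogonality can be demonstrated with the four-parent GCA submatrix used earlier (a small sketch under our own naming; the point carries over to the six-parent example). In the balanced case the dot product of the mean column with every sum-to-zero GCA column is zero; delete one cross and some of those dot products become nonzero, so the factors can no longer be estimated separately.

```python
import numpy as np

# Sum-to-zero GCA rows for p = 4, keyed by cross (from Figure 3-2).
gca_rows = {(1, 2): [1, 1, 0], (1, 3): [1, 0, 1], (1, 4): [0, -1, -1],
            (2, 3): [0, 1, 1], (2, 4): [-1, 0, -1], (3, 4): [-1, -1, 0]}

def mean_gca_dots(crosses):
    """Dot products of the mu column (all 1's) with each GCA column."""
    G = np.array([gca_rows[c] for c in crosses], dtype=float)
    return np.ones(len(crosses)) @ G

balanced_dots = mean_gca_dots(list(gca_rows))                       # all zero
missing_dots = mean_gca_dots([c for c in gca_rows if c != (2, 3)])  # not all zero
```

With the 2 x 3 cross deleted, the dot products against the second and third GCA columns are -1 rather than 0, which is exactly why the estimate of u drifts away from the mean of the plot means.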
Several Missing Crosses

The concluding example (Table 3-2) is a drastically unbalanced data set resulting from the arbitrary deletion of five crosses (1 x 2, 1 x 3, 2 x 3, 3 x 5, and 4 x 5). The matrix manipulation for this example is an extension of the previous one-cross deletion example. Rows corresponding to y_i12, y_i13, y_i23, y_i35, and y_i45 are deleted from the mean, block and GCA submatrices for all blocks. The SCA submatrix (now 4 columns = 10 crosses - 5 - 1 = 4 degrees of freedom) is again reformed with only the relevant products of the GCA columns. Counting degrees of freedom (columns of the sum-to-zero design matrix), the mean has one, block has three, GCA has five, and SCA has four degrees of freedom for a total of 13. Error has (4 - 1)(10 - 1) = 27 degrees of freedom. Totaling degrees of freedom for modeled effects and error yields 40, which equals the number of plot means. In increasingly unbalanced cases (Table 3-2), the spread among the GCA estimates tends to increase with increasing imbalance (loss of information). This is a general feature of OLS analyses, and the basis for the feature is that the spread among the GCA estimates is due both to the innate spread from additive genetic effects and to the error in estimation of the GCA's. When there is less information, GCA estimates tend to be more widely spread due to the increase in the error variance associated with their estimation. This feature has been noted (White and Hodge 1989, page 54) as the tendency to pick as parental winners those individuals in a breeding program which are the most poorly tested.

Discussion

After developing the OLS analysis and describing its inherent assumptions, there are four important factors to consider in the interpretation of sum-to-zero OLS solutions: (1) the lack of uniqueness of the parameter estimates; (2) the weights given to plot means (y_ijk) and in turn site means (y_.jk)
for crosses in data sets with missing crosses in parameter estimation; (3) the arbitrary nature of using a diallel mean (perforce a narrow genetic base) as the mean about which the GCA's sum to zero; and (4) the assumption that the covariance matrix for the observations (V) is I times the error variance.

Uniqueness of Estimates

Sum-to-zero restrictions furnish what would appear to be unique estimates of the individual parameters, e.g. GCA1, when, in fact, these individual parameters are not estimable (Graybill 1976, Freund and Littell 1981, and Milliken and Johnson 1984). The lack of estimability is again analogous to attempting to solve a set of equations in n unknowns with t equations where n is greater than t. Therefore, an infinite number of solutions exist for B. There are quantities in this system of equations that are unique (estimable), i.e., the estimate is invariant regardless of the restriction (sum-to-zero or set-to-zero) or generalized inverse (no restrictions) used (Milliken and Johnson 1984), and the estimable functions include sum-to-zero GCA and SCA estimates since they are linear combinations of the observations; but these estimable quantities do not estimate the individual parametric GCA's and SCA's of the overparameterized model (equation 3-1) since there is no unique solution for those parameters.

Weighting of Plot Means and Cross Means in Estimating Parameters

With at least one measurement tree in each plot and with plot means as the unit of observation, use of the matrix approach produces the same results as the basic formulae. The weight placed on each plot mean in the estimation of a parameter can be determined by calculating (Xs'Xs)^-1 Xs', which can be viewed as a matrix of weights W so that equation 3-3 can be written as b = Wy. The matrix W has these dimensions: the number of rows equals the number of parameters in Bs, and the number of columns equals the number of plot means in y.
The i-th row of W contains the weights applied to y to estimate the i-th parameter in b. In the discussion which follows, gca1 is utilized as the example. If there are no missing plots, the cross mean in every block (y_ijk) has the same weighting, and weights can be combined across blocks to yield the weight on the overall cross mean (y_.jk). It can be shown that for the balanced numerical example gca1 is calculated by weighting the overall cross means containing parent 1 by 1/6 (.16667) and weighting all overall cross means not containing parent 1 by -1/12 (-.08333); the marginal weights on the GCA parameters are 5/6 on GCA1 and -1/6 on each of GCA2 through GCA6.

Figure 3-4. Weights on overall cross means (y_.jk) for the three numerical examples for estimation of GCA1. The weights for the balanced example (above the diagonal) are presented in both fractional and decimal form. The weights for the one-cross-missing and the five-crosses-missing examples are presented as the upper number and lower number, respectively, in cells below the diagonal. The marginal weights on GCA parameters (right margin) do not change although cells are missing. (The individual cell weights of the original figure are not fully legible in this digitization.)

Figure 3-4 (above the diagonal) demonstrates the weightings on the overall cross means for the balanced numerical example as well as the marginal weighting on the GCA parameters. These marginal weightings are obtained by summing along a row and/or column as one would to obtain the marginal totals for a parent (Becker 1975).
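The balanced weights quoted above can be reproduced numerically. The sketch below (a hypothetical helper of ours) computes W = (G'G)^-1 G' for the GCA submatrix alone, which is legitimate in the balanced case because GCA is orthogonal to the other factors; the gca1 row of W puts 1/p on each cross mean containing parent 1 and -2/(p(p-2)) on each cross mean that does not, i.e., 1/6 and -1/12 for the six-parent example.

```python
import numpy as np

def gca_weight_row(p, parent=1):
    """Cross list and the row of W = (G'G)^-1 G' for one parent's sum-to-zero GCA."""
    crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
    G = np.zeros((len(crosses), p - 1))
    for r, (j, k) in enumerate(crosses):
        if k < p:
            G[r, j - 1] = G[r, k - 1] = 1.0
        else:                      # crosses with parent p: gca_p = -sum of others
            G[r] = -1.0
            G[r, j - 1] = 0.0
    W = np.linalg.solve(G.T @ G, G.T)
    return crosses, W[parent - 1]
```

Note that the weights on the crosses containing parent 1 sum to 5/6 and, with the -1/12 weights, every weight on u cancels, matching the marginal weightings of Figure 3-4.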
One feature of sum-to-zero solutions is that these marginal weightings will be maintained no matter the imbalance due to missing crosses, as can be seen by considering the numerical examples for a missing cross (Figure 3-4, below the diagonal, upper number) and five missing crosses (Figure 3-4, below the diagonal, lower number). The marginal weights remain the same as in the balanced case, while the weights on the cross means differ among the crosses containing parent 1 and also among the crosses not containing parent 1. In the five-missing-crosses example, some cross means not containing parent 1 even receive a positive weighting where in the prior examples they had negative weighting. The expected value in all three examples is GCA1s (for sum-to-zero) despite the apparently nonsensical weightings on cross means with missing crosses; however, the evaluation of the estimates in terms of the original model changes with each new combination of missing cells. Whether this type of estimation is desirable with missing cell (cross) means has been the subject of some discussion (Speed, Hocking and Hackney 1978, Freund 1980, and Milliken and Johnson 1984). The data analyst should be aware of the manner in which sum-to-zero restrictions treat the data with missing cell means and decide whether the particular linear combination of cross means estimating the parameter is one of interest, realizing that the meaning of the estimates in terms of the original model is changing.

Diallel Mean

The use of the mean for a half-diallel as the mean around which GCA's sum to zero is not satisfactory in that the diallel mean is the mean of a rather narrow genetically based population, and in particular the comparisons of interest are not usually confined to the specific parents in a specific diallel on a particular site.
A checklot can be employed to represent a base population against which comparison of half- or full-sib families can be made to provide for comparison of GCA estimates from other tests (van Buijtenen and Bridgwater 1986). Mathematically, when effects are forced to sum to zero around their own mean, the absolute value of the GCA's is reflective of their value relative to the mean of the group. Even if the parents involved in the particular diallel were all far superior to the population mean for GCA, GCA's calculated on an OLS basis would show that some of these GCA's were negative. If the GCA's of the diallel parents were in fact all below the population mean, the opposite and equally undesirable result ensues. For disconnected diallels together on a single site, an OLS analysis would yield GCA estimates that sum to zero within each diallel, since parents are nested within diallels. Unless the comparisons of interest are only in the combination of the parents in a specific diallel on a specific site, the checklot alternative is desirable. A method for obtaining the desired goal of comparable GCA's from disconnected experiments, disregarding the problem of heteroscedasticity, is to form a function from the data which yields GCA estimates properly located on the number scale. Such a function can be formed (using GCA1 as an example) from gca1s, the diallel mean, and the checklot mean. From expectations of the scalar linear model (equation 3-1):

E{gca1s} = ((p-1)/p)GCA1 - (1/p) (sum over j = 2 to p of GCAj) + (1/p) (sum over k = 2 to p of SCA1k) - (2/(p(p-2))) (sum over 2 <= j < k <= p of SCAjk);     (3-5)

E{diallel mean} = u + (sum over i = 1 to b of Bi)/b + (2/p) (sum over j = 1 to p of GCAj) + (2/(p(p-1))) (sum over all j < k of SCAjk); and

E{checklot mean} = u + (sum over i = 1 to b of Bi)/b + T;

where T represents the fixed genetic parameter of the checklot. The function used to properly locate gca1s (the subscript rel denotes the relocated GCA) is gca1rel = gca1s + (1/2)(diallel mean - checklot mean).
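The relocation function is simple arithmetic; a minimal numeric sketch (all numbers invented for illustration):

```python
def relocate_gca(gca_s, diallel_mean, checklot_mean):
    """gca_rel = gca_s + (1/2)(diallel mean - checklot mean)."""
    return gca_s + 0.5 * (diallel_mean - checklot_mean)

# e.g. a sum-to-zero estimate of +0.10 m from a diallel whose mean exceeds the
# checklot mean by 0.50 m relocates to +0.35 m against the checklot base.
relocated = relocate_gca(0.10, 3.00, 2.50)
```

Applying the same shift to every parent in every disconnected diallel places all relocated GCA's on a common scale anchored by the checklot.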
The expectation of gca1,rel with negligible SCA is E{gca1,rel} = GCA1 - τ/2; and since breeding value equals twice GCA, BV1,rel = BV1 - τ. If SCA is non-negligible, then the expectation is

E{gca1,rel} = GCA1 + (1/(p-1))Σ(k=2 to p)SCA1k - (1/((p-1)(p-2)))Σ(j=2 to p-1)Σ(k=j+1 to p)SCAjk - τ/2.  3-6

In either case the function provides a reasonable manner by which GCA estimates from disconnected diallels are centered at the same location on the number scale and are then comparable.

Variance and Covariance of Plot Means

The variances of plot means with unequal numbers of trees per plot are by definition unequal, i.e., Var(ȳijk) = σ²p + σ²w/nijk, where σ²p is the plot variance, σ²w is the within-plot variance and nijk is the number of observations per plot. Also, if blocks are considered random, there is an additional source of variance for plot means due to blocks (as well as a covariance between plot means in the same block), and this can be incorporated into the V matrix with Var(ȳijk) = σ²b + σ²p + σ²w/nijk. Since the variances of the means in the observation vector are not equal, and there is a covariance between the means if blocks are considered random, best linear unbiased estimates (BLUE) would be secured by weighting each mean by its true associated variance (Searle 1987, page 316). This is the generalized least squares (GLS) approach,

b = (X'V⁻¹X)⁻¹X'V⁻¹y.

The GLS approach relaxes the OLS assumptions of equal variances of, and no covariance between, the observations (plot means) while still treating genetic parameters as fixed effects. The entries along the diagonal of the V matrix are the variances of the plot means (Var(ȳijk)) in the same order as the means in the data vector. The off-diagonal elements of V are either 0 or σ²b (the variance due to the random variable block) for elements corresponding to observations in the same block. BLUE requires exact knowledge of V; if estimates of σ²b, σ²p, and σ²w are utilized in the V matrix, estimable functions of b only approximate BLUE.
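The weighting implied by b = (X'V⁻¹X)⁻¹X'V⁻¹y can be sketched as follows. This is a minimal Python illustration (not from the dissertation): the design, variance components, and survival counts are invented, V is diagonal as when blocks are treated as fixed, and each plot mean is implicitly down-weighted in proportion to its variance σ²p + σ²w/nijk:

```python
import numpy as np

# Minimal GLS sketch with assumed components: plot means with unequal numbers
# of surviving trees have unequal variances sigma2_p + sigma2_w / n, so each
# mean is weighted by the inverse of its variance in the GLS normal equations.
rng = np.random.default_rng(0)
sigma2_p, sigma2_w = 0.5, 8.0
n_trees = np.array([6, 6, 4, 2, 5, 3])                 # surviving trees per plot
X = np.column_stack([np.ones(6), [0, 0, 0, 1, 1, 1]])  # mean plus a crude group effect
beta_true = np.array([10.0, 1.5])
V = np.diag(sigma2_p + sigma2_w / n_trees)             # diagonal V (blocks fixed)
y = X @ beta_true + rng.multivariate_normal(np.zeros(6), V)

Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # b = (X'V^-1 X)^-1 X'V^-1 y
```

With random blocks, σ²b would be added to the diagonal and placed in the off-diagonal cells of V for plot means sharing a block; the solve line is unchanged.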
The OLS assumption that SCA and GCA are fixed effects can also be relaxed to allow for covariances due to genetic relatedness. In particular, the information that means are from the same half- or full-sib family can be included in the V matrix. Relaxation of the zero-covariance assumption implies that GCA and SCA are random variables; if GCA and SCA are treated as random variables, then the application of best linear prediction (BLP) or best linear unbiased prediction (BLUP) to the problem is more appropriate (White and Hodge 1989, page 64). The treatment of the genetic parameters as random variables is consistent with that used in estimating genetic correlations and heritabilities. The V matrix of such an application would include, in addition to the features of the GLS V matrix, the covariance between full-sib or half-sib families added to the off-diagonal elements of V; i.e., if the first and second plot means in the data vector had a covariance due to relationship, then that covariance is inserted twice in the V matrix, appearing as the second element of the first row and the first element of the second row (V is a symmetric matrix). Also, the diagonal elements of V would increase by 2σ²gca (the variance due to treating GCA as a random variable) plus σ²sca (the variance due to treating SCA as a random variable).

Comparison of Prediction and Estimation Methodologies

Which methodology (OLS, GLS, BLP, or BLUP) to apply to an individual data base is somewhat a subjective decision. The decision can be based both on the computational or conceptual complexity of the method and on the magnitude of the data base with which the analyst is working. To aid in this decision, this discussion highlights the differences in the inherent properties and assumptions of the techniques. For all practical purposes the answers from the four techniques will never be equal; however, there are two caveats.
First, OLS estimates equal GLS estimates if all the cell means are known with the same precision (variance) (Searle 1987, page 490); otherwise, GLS discounts the means that are known with less precision and different estimates result. The second caveat is that if the amount of data is infinite, i.e., all cross means are known without error, then all four techniques are equivalent (White and Hodge 1989, pages 104-106). In all other cases BLP and BLUP shrink predictions toward the location parameter(s) and produce predictions which differ from OLS or GLS estimates even with balanced data. During calculation, GLS, BLP and BLUP place less weight on observations known with less precision, which is intuitively pleasing. With OLS and GLS, forest geneticists treat GCA's and SCA's as fixed effects for estimation and then as random variables for genetic correlations and heritabilities; BLP and BLUP provide a consistent treatment of GCA's and SCA's as random variables while differing in their assumptions about location parameters (fixed effects). In BLP, fixed effects are assumed known without error (although they are usually estimated from the data), while with BLUP fixed effects are estimated by GLS. BLP and BLUP also carry the assumption that the covariance matrix of the observations is known without error (most often the variances must be estimated). In many BLUP applications (Henderson 1974), mixed model equations are utilized iteratively to estimate fixed effects and to predict random variables from a data set. The BLUP treatment of fixed effects allows any connectedness between experiments to be utilized in their estimation. This provides an intuitive advantage of BLUP over BLP where connectedness among genetic experiments is available or where the data are so unbalanced that treating the fixed effects as known is less desirable than a GLS estimate of the fixed effects.
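The mixed-model-equation route to BLUP mentioned above can be illustrated with a toy one-random-factor model. This Python sketch is not from the dissertation: the dimensions, variance components, and design are invented, and the genetic effect is a single GCA-like family effect rather than the full diallel model. Solving Henderson's system yields the GLS estimate of the fixed effects and shrunken predictions of the random effects simultaneously:

```python
import numpy as np

# Hedged sketch of Henderson's mixed model equations:
#   [X'R^-1X   X'R^-1Z       ] [b]   [X'R^-1y]
#   [Z'R^-1X   Z'R^-1Z + G^-1] [u] = [Z'R^-1y]
# All dimensions and variances below are illustrative.
rng = np.random.default_rng(1)
n, p, q = 30, 1, 5                 # observations, fixed effects, random effects
X = np.ones((n, p))                # overall mean only
Z = np.zeros((n, q))
Z[np.arange(n), rng.integers(0, q, n)] = 1.0   # random family assignment
sigma2_g, sigma2_e = 0.25, 8.0
u_true = rng.normal(0, np.sqrt(sigma2_g), q)
y = X @ [10.0] + Z @ u_true + rng.normal(0, np.sqrt(sigma2_e), n)

Rinv = np.eye(n) / sigma2_e
Ginv = np.eye(q) / sigma2_g
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
b_hat, u_hat = sol[:p], sol[p:]    # u_hat is shrunk toward zero by the G^-1 term
```

The G⁻¹ term on the diagonal is what produces the shrinkage toward the location parameter that distinguishes BLUP predictions from OLS or GLS estimates.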
An ordering of computational and conceptual complexity from least to most complex of the four methods is OLS, GLS, BLP and BLUP. The latter three methods require the estimation of the covariance matrix of the observations either separately (a priori) or iteratively with the fixed effects. Precise estimation of the covariance matrix of the observations requires a great number of observations, and the precision of GLS, BLP and BLUP estimates or predictions is affected by the error of estimation of the components of V. Selection of a method can then be based on weighing the computational complexity and the size of the available data base against the advantages offered by each method. Thus, if complexity of the computational problem is of paramount concern, the analyst would necessarily choose OLS. With a small data base (one that does not allow reasonable estimates of variances), the analyst would again choose OLS. With a large data base and no qualms about computational complexity, the analyst can choose between BLP and BLUP based on whether there is sufficient connectedness or imbalance among the experiments to make BLUP advantageous.

Conclusions

Methods of solving for GCA and SCA estimates for balanced (plot-mean basis) and unbalanced data have been presented along with the inherent assumptions of the analysis. The use of plot means and the matrix equations will produce sum-to-zero OLS estimates of GCA and SCA for all types of imbalance. Formulae in the literature which yield OLS solutions for balanced data can yield misleading solutions for unbalanced data because of the loss of orthogonality and because the weightings on site means for crosses (or totals) are constants. GCA's and SCA's obtained through the sum-to-zero restriction are not truly estimates of parametric population GCA's and SCA's; there are an infinite number of solutions for GCA's and SCA's from the system of equations as a result of the over-parameterized linear model.
Yet, if the only comparisons of interest are among the specific parents on a particular site, then the estimates calculated under sum-to-zero restrictions are appropriate. Checklots may be used to provide comparability among estimates derived from disconnected sets. Knowledge of the innate mathematical features of OLS analysis discussed here should help the data analyst decide if OLS is the most desirable technique for the data at hand. It may be desirable to relax the OLS assumptions, which are in all likelihood invalid for the covariance matrix of the observations; this could make GLS, BLP or BLUP better alternatives.

CHAPTER 4
VARIANCE COMPONENT ESTIMATION TECHNIQUES COMPARED FOR TWO MATING DESIGNS WITH FOREST GENETIC ARCHITECTURE THROUGH COMPUTER SIMULATION

Introduction

In many applications of quantitative genetics, geneticists are commonly faced with the analysis of data containing a multitude of flaws (e.g., non-normality, imbalance, and heteroscedasticity). Imbalance, as one of these flaws, is intrinsic to quantitative forest genetics research because of the difficulty of making crosses for full-sib tests and the biological realities of long-term field experiments. Few definitive studies have been conducted to establish optimal methods for estimating variance components from unbalanced data. Simulation studies using simple models (one-way or two-way random models) have been conducted for certain data structures, i.e., combinations of imbalance, experimental design, and variance parameters (Corbeil and Searle 1976, Swallow 1981, Swallow and Monahan 1984; interpretations by Littell and McCutchan 1986). The results from these studies indicate that technique optimality is a function of the data structure. In practice (both historically and still commonly), estimation of variance components in forest genetics applications has been achieved by using sequentially adjusted sums of squares as an application of Henderson's Method 3 (HM3, Henderson 1953).
Under normality and with balanced data, this technique has the desirable property of being the minimum variance unbiased estimator. If the data are unbalanced, the only property retained by HM3 estimation is unbiasedness (Searle 1971; Searle 1987, pp. 492-493, 498). Other estimators have been shown to be locally superior to HM3 in variance or mean square error properties in certain cases (Klotz et al. 1969, Olsen et al. 1976, Swallow 1981, Swallow and Monahan 1984). Over the last 25 years there has been a proliferation of variance component estimation techniques, including minimum norm quadratic unbiased estimation (MINQUE, Rao 1971a), minimum variance quadratic unbiased estimation (MIVQUE, Rao 1971b), maximum likelihood (ML, Hartley and Rao 1967), and restricted maximum likelihood (REML, Patterson and Thompson 1971). The practical application of these techniques has been impeded by their computational complexity; however, with continuing advances in computer technology and the appearance of better computational algorithms, the application of these procedures continues to become more tractable (Harville 1977, Giesbrecht 1983, Meyer 1989). Whether these methods of analysis are superior to HM3 for many genetics applications remains to be shown. With balanced data and disregarding negative estimates, all previously mentioned techniques except ML produce the same estimates (Harville 1977). With unbalanced data, each technique produces a different set of variance component estimates, and criteria must then be adopted to discriminate among techniques. Candidate criteria for discrimination include unbiasedness (large-number convergence on the parametric value), minimum variance (estimator with the smallest sampling variance), minimum mean square error (minimum of sampling variance plus squared bias, Hogg and Craig 1978), and probability of nearness (probability that sample estimates occur in a certain interval around the parametric value, Pitman 1937).
Negative estimates are also problematic in the estimation of variance components. Five alternatives for dealing with estimates less than zero (outside the natural parameter space of zero to infinity) are (Searle 1971): 1) accept and use the negative estimate; 2) set the negative estimate to zero (producing biased estimates); 3) re-solve the system with the offending component set to zero; 4) use an algorithm which does not allow negative estimates; and 5) use the negative estimate to infer that the wrong model was utilized. The purpose of this research was to determine whether the criteria of unbiasedness, minimum variance, minimum mean square error, and probability of nearness discriminated among several variance component estimation techniques, while exploring various alternatives for dealing with negative variance component estimates. To make such comparisons, a large number of data sets was required for each experimental level. Using simulated data, this chapter compares variance component estimation techniques for plot-mean and individual observations, two mating systems (modified half-diallel and half-sib) and two sets of parametric variance components. Types of imbalance and levels of factors were chosen to reflect common situations in forest genetics.

Methods

Experimental Approach

For each experimental level, 1000 data sets were generated and analyzed by various techniques (Table 4-1), producing numerous sets of variance component estimates for each data set. This workload resulted in enormous computational time being associated with each experimental level.
The overall experimental design for the simulation was originally conceived as a factorial with two types of mating design (half-diallel and half-sib), two sets of true variance components (Table 4-2), two kinds of observations (individual and plot mean) and three types of imbalance: 1) survival level (80% representing moderate survival and 60% representing poor survival); 2) for full-sib designs, three levels of missing crosses (0, 2, and 5 out of 15 crosses); and 3) for half-sib designs, two levels of connectedness among tests (15 and 10 common families between tests out of 15 families per test). Because of the computational time

Table 4-1. Abbreviations for and descriptions of the variance component estimation methods utilized for analyses based on individual observations (if utilized for plot-mean analysis the abbreviation is prefixed with a 'P').

ML, PML — Maximum likelihood; estimates not restricted to the parameter space (individual and plot-mean analysis). Hartley and Rao 1967; Shaw 1987.
MODML — Maximum likelihood; negative estimates set to zero after convergence (individual analysis). Hartley and Rao 1967.
NNML — Maximum likelihood; if negative estimates appeared at convergence, they were set to zero and the system re-solved (individual analysis). Hartley and Rao 1967; Miller 1973.
REML, PREML — Restricted maximum likelihood; estimates not restricted to the parameter space (individual and plot-mean analysis). Patterson and Thompson 1971; Shaw 1987; Harville 1977.
MODREML — Restricted maximum likelihood; negative estimates set to zero after convergence (individual analysis). Patterson and Thompson 1971.
NNREML, PNNREML — Restricted maximum likelihood; if negative estimates appeared at convergence, they were set to zero and the system re-solved (individual and plot-mean analysis). Patterson and Thompson 1971; Miller
1983.
MIVQUE, PMIVQUE — Minimum variance quadratic unbiased; non-iterative, with the true (parametric) values of the variance components as priors (individual and plot-mean analysis). Rao 1971b.
MINQUE1, PMINQUE1 — Minimum norm quadratic unbiased; non-iterative, with ones as priors for all variance components (individual and plot-mean analysis). Rao 1971a.
TYPE3, PTYPE3 — Sequentially adjusted sums of squares; Henderson's Method 3 (individual and plot-mean analysis). Henderson 1953.
MIVPEN — MIVQUE with a penalty algorithm to prevent negative estimates (individual analysis). Harville 1977.

constraint, the experiment could not be run as a complete factorial, and the investigation continued as a partial factorial. In general, the approach was to run levels at opposite ends of the imbalance spectrum within a variance component level, i.e., 80% survival and no missing crosses versus 60% survival and 5 missing crosses. If results were consistent across these treatment combinations, intermediate levels were not run. A treatment combination is designated by a five-character alphanumeric field. The first character is either "H" (half-sib) or "D" (half-diallel). The second character denotes the set of parametric variance components, where "1" designates the set associated with a heritability of 0.1 and "2" the set associated with a heritability of 0.25 (Table 4-2). The third character is an "S", indicating that the last two characters determine the imbalance level. The fourth character designates the survival level, either "6" for 60% or "8" for 80%. The final character specifies the number of missing crosses (half-diallel) or the lack of connectedness (half-sib). Thus the treatment combination 'H1S80' denotes a half-sib mating design (H), the set of variance components associated with a heritability of 0.1 (1), 80% survival (8), and 15 common families across tests (0). Table 4-2.
Sets of true variance components for the half-diallel and half-sib mating designs generated from the specification of two levels of single-tree heritability (h²), type B correlation (rB), and non-additive to additive variance ratio (d/a). The genetic ratios are defined as h² = 4σ²g / σ²phenotypic; rB = 4σ²g / (4σ²g + 4σ²tg); and d/a = σ²D / σ²A = 4σ²s / 4σ²g; see the variance component definitions in Equation 4-1.

h² = 0.1, rB = 0.5, d/a = 1.0, full-sib: σ²t = 1.0, σ²b = 0.5, σ²g = 0.25, σ²s = 0.25, σ²tg = 0.25, σ²ts = 0.25, σ²p = 0.595, σ²w = 7.905
h² = 0.1, rB = 0.5, d/a = NA, half-sib: σ²t = 1.0, σ²b = 0.5, σ²g = 0.25, σ²s = NA, σ²tg = 0.25, σ²ts = NA, σ²p = 0.475, σ²w = 7.9964
h² = 0.25, rB = 0.8, d/a = 0.25, full-sib: σ²t = 1.0, σ²b = 0.5, σ²g = 0.625, σ²s = 0.1562, σ²tg = 0.1562, σ²ts = 0.0391, σ²p = 0.5769, σ²w = 7.6649

Experimental Design for Simulated Data

The mating design for the simulation was either a six-parent half-diallel (no selfs) or a fifteen-parent half-sib. The randomized complete block field design comprised three locations (i.e., separate field tests) with four complete blocks per location and six trees per family in a block, where family is a full-sib family for the half-diallel or a half-sib family for the half-sib design. This field design and the mating designs reflect typical designs in forestry applications (Squillace 1973, Wilcox et al. 1975, Bridgwater et al. 1983, Weir and Goddard 1986, Loo-Dinkins et al. 1991) and are also commonly used in other disciplines (Matzinger et al. 1959, Hallauer and Miranda 1981, Singh and Singh 1984). The six trees per family could be considered as contiguous or noncontiguous plots without affecting the results or inferences.
Full-Sib Linear Model

The scalar linear model employed for half-diallel individual observations is

yijklm = μ + ti + bij + gk + gl + skl + tgik + tgil + tsikl + pijkl + wijklm,  4-1

where yijklm is the mth observation of the klth cross in the jth block of the ith test; μ is the population mean; ti is the random variable test location ~ NID(0, σ²t); bij is the random variable block ~ NID(0, σ²b); gk is the random variable female general combining ability (gca) ~ NID(0, σ²g); gl is the random variable male gca ~ NID(0, σ²g); skl is the random variable specific combining ability (sca) ~ NID(0, σ²s); tgik is the random variable test by female gca interaction ~ NID(0, σ²tg); tgil is the random variable test by male gca interaction ~ NID(0, σ²tg); tsikl is the random variable test by sca interaction ~ NID(0, σ²ts); pijkl is the random variable plot ~ NID(0, σ²p); wijklm is the random variable within-plot ~ NID(0, σ²w); and there is no covariance between random variables in the model. This linear model in matrix notation is

y = μ1 + ZTeT + ZBeB + ZGeG + ZSeS + ZTGeTG + ZTSeTS + ZPeP + ew  4-2

where y (n×1) is the observation vector; Zi is the n-row portion of the design matrix for the ith random variable; ei is the vector of unobservable random effects for the ith random variable; 1 is an n×1 vector of 1's; and n, t, b, g, s, tg, ts, and p are the numbers of observations, tests, blocks, gca's, sca's, test by gca interactions, test by sca interactions and plots, respectively, giving the Zi and ei their corresponding dimensions. Utilizing customary assumptions in half-diallel mating designs (Method 4, Griffing 1956), the variance of an individual observation is

Var(yijklm) = σ²t + σ²b + 2σ²g + σ²s + 2σ²tg + σ²ts + σ²p + σ²w;  4-3

and in matrix notation the covariance matrix of the observations is

Var(y) = ZTZ'Tσ²t + ZBZ'Bσ²b + ZGZ'Gσ²g + ZSZ'Sσ²s + ZTGZ'TGσ²tg + ZTSZ'TSσ²ts + ZPZ'Pσ²p + Inσ²w  4-4

where ' indicates the transpose operator, all matrices of the form ZZ' are n×n, and In is an n×n identity matrix.
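The assembly of Var(y) in Equation 4-4 can be sketched on a toy layout. This Python fragment is not from the dissertation: it uses an invented eight-tree design and, to stay short, only a subset of the random terms (test, cross/sca, and within-plot), with two of the σ² values borrowed from Table 4-2 for flavor:

```python
import numpy as np

# Sketch of equation 4-4 on a tiny, hypothetical layout: the covariance matrix
# of the observations is assembled as a sum of Z_i Z_i' * sigma2_i terms, one
# per random variable, plus I_n * sigma2_w for the within-plot term.
def incidence(levels):
    """0/1 design matrix Z mapping each observation to its level."""
    q = max(levels) + 1
    Z = np.zeros((len(levels), q))
    Z[np.arange(len(levels)), levels] = 1.0
    return Z

# 8 trees: 2 tests x 2 full-sib crosses x 2 trees per cross
test  = [0, 0, 0, 0, 1, 1, 1, 1]
cross = [0, 0, 1, 1, 0, 0, 1, 1]
sigma2 = {"t": 1.0, "s": 0.25, "w": 7.905}   # illustrative component values

V = (incidence(test) @ incidence(test).T * sigma2["t"]
     + incidence(cross) @ incidence(cross).T * sigma2["s"]
     + np.eye(8) * sigma2["w"])
# diagonal: total variance of one tree; off-diagonals: covariances from shared terms
```

The full model simply adds the analogous Z Z' terms for block, the two gca incidences, the interactions, and plot; each shared factor between two trees contributes its σ² to their covariance.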
Half-Sib Linear Model

The scalar linear model for half-sib individual observations is

yijkm = μ + ti + bij + gk + tgik + pijk + wijkm,  4-5

where yijkm is the mth observation of the kth half-sib family in the jth block of the ith test; μ, ti, bij, gk, and tgik retain the definitions in Equation 4-1; pijk is the random variable plot, containing different genotype by environment components than the corresponding term in Equation 4-1, ~ NID(0, σ²ph); wijkm is the random variable within-plot, containing different levels of genotypic and genotype by environment components than the corresponding term in Equation 4-1, ~ NID(0, σ²wh); and there is no covariance between random variables in the model. The matrix notation model is (with dimensions as in Equation 4-2)

y = μ1 + ZTeT + ZBeB + ZGeG + ZTGeTG + ZPeP + ew  4-6

The variance of an individual observation in half-sib designs is

Var(yijkm) = σ²t + σ²b + σ²g + σ²tg + σ²ph + σ²wh  4-7

and

Var(y) = ZTZ'Tσ²t + ZBZ'Bσ²b + ZGZ'Gσ²g + ZTGZ'TGσ²tg + ZPZ'Pσ²ph + Inσ²wh.  4-8

For an observation vector based on plot means, the plot and within-plot random variables were combined by taking the arithmetic mean across the observations within a plot. The resulting plot-means model has a new term, σ²p. (or σ²ph.), which is a composite of the plot and within-plot variance terms of the individual-observation model. Three ratios among variance components were estimated: 1) single-tree heritability adjusted for test location and block as h² = 4σ²g / σ²phenotypic, where σ²phenotypic is the estimated variance of an individual observation from Equations 4-3 and 4-7 with the variance components for test location and block deleted; 2) type B correlation as rB = 4σ²g / (4σ²g + 4σ²tg); and 3) dominance to additive variance ratio as d/a = 4σ²s / 4σ²g, all computed from the estimated components.
Data Generation and Deletion

Data generation was accomplished by using a Cholesky upper-lower decomposition of the covariance matrix of the observations (Goodnight 1979) and a vector of pseudo-random standard normal deviates generated using the Box-Muller transformation of pseudo-random uniform deviates (Knuth 1981, Press et al. 1989). The upper-lower decomposition creates an upper triangular matrix U with the property that Var(y) = U'U. The vector of pseudo-random standard normal deviates z has a covariance matrix equal to an identity matrix In, where n is the number of observations. The vector of observations is created as y = U'z; then Var(y) = U'(Var(z))U and, since Var(z) = In, Var(y) = U'InU = U'U. Analyses of survival patterns using data from the Cooperative Forest Genetic Research Program (CFGRP) at the University of Florida were used to develop survival distributions for the simulation. The data sets chosen for survival analysis were from full-sib slash pine (Pinus elliottii var. elliottii Engelm.) tests planted in randomized complete block designs with the families in row plots; they were selected because the survival levels were approximately either 60% or 80%. Survival levels for most crosses (full-sib families) clustered around the expected value, i.e., approximately 60% for an average survival level of 60%; however, there were always a few crosses with much poorer survival than average and a small number of crosses with much better survival than average. This survival pattern was consistent across the 50 experiments analyzed. Thus, a lower-than-average survival level was arbitrarily assigned to certain crosses, a higher-than-average survival level to certain crosses, and the average survival level to most crosses. This modeling of survival pattern was also extended to the half-sib mating design. At 80% survival no missing plots were allowed, and at 60% survival missing plots occurred at random.
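The generation step described at the start of this section can be sketched as follows. This Python fragment is not the dissertation's simulation code (which operated on the full Var(y) of Equations 4-4 and 4-8); the 3×3 covariance matrix here is purely illustrative:

```python
import numpy as np

# Sketch of the data-generation step: factor Var(y) = U'U and transform
# standard normal deviates z (generated here via the Box-Muller transform
# from uniforms) into y = U'z with the desired covariance.
rng = np.random.default_rng(2)
V = np.array([[9.0, 1.0, 1.0],
              [1.0, 9.0, 1.0],
              [1.0, 1.0, 9.0]])
U = np.linalg.cholesky(V).T        # upper triangular factor with V = U'U

u1, u2 = rng.random(3), rng.random(3)
z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)  # Box-Muller normals

y = U.T @ z   # Var(y) = U' Var(z) U = U'U = V, since Var(z) = I
```

Because NumPy's `cholesky` returns the lower factor L with V = LL', its transpose plays the role of U in the text, and U'z = Lz reproduces the required covariance.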
Full-sib family deletion simulated crosses which could not be made and were therefore missing from the experiment. When deleting five crosses, the deletion was restricted to a maximum of four crosses per parent to prevent the loss of all crosses in which a single parent appeared, since this would have changed a six-parent half-diallel into a five-parent half-diallel. Tests having only subsets of the half-sib families in common are a frequent occurrence in data analysis at CFGRP. This partial connectedness was simulated by generating data in which only 10 of the 15 families present in a test were common to either of the other two tests comprising a data set.
The basic equation for variance component estimation under normality (Giesbrecht 1983) for MIVQUE, MINQUE and REML is {tr(QViQV4)}^2 = {y'QViQy} 49 rxr rxl rxl then = {tr(QVQVj)}l{y'QViQy}; and for ML (tr(V'VV'Vj)}~i = {y'QVQy} 410 rxr rxl rxl where {tr(QViQVj)} is a matrix whose elements are tr(QViQVj) where in the fullsib designs i= 1 to 8 and j=l to 8, i.e., there is a row and column for every random variable in the linear model; tr is the trace operator that is the sum of the diagonal elements of a matrix; Q = V1 V'X(X'V'X)X'V for V as the covariance matrix of y and X as the design matrix for fixed effects; V, = ZZ'i where i = the random variables test, block, etc.; y is the vector of variance component estimates; and r is the number of random variables in the model. The MINQUE estimator used was MINQUE1 i.e., ones as priors for all variance components; calculated by applying Giesbrecht's algorithm noniteratively. MINQUE1 was chosen because of results demonstrating MINQUEO (prior of 1 for the error term and of 0 for all others) to be an inferior estimation technique for many cases (Swallow and Monahan 1984, R.C. Littell unpublished data). With normallydistributed uncorrelated random variables, the use of the true values of the variance components as priors in a noniterative application of Giesbrecht's algorithm produced the MIVQUE solutions (equation 45). Obtaining true MIVQUE estimation is a luxury of computer simulation and would not be possible in practice since the true variance components are required (Swallow and Searle 1978). This estimator was included to provide a standard of comparison for other estimators.. An additional MIVQUEtype estimator, referred to as MIVPEN, was also included. MIVPEN was also a noniterative application of the algorithm with the true variance components as priors; however, this estimator was conditioned on the variance component parameter space and did not allow negative estimates. 
The non-negative conditioning of MIVPEN was accomplished by adding a penalty algorithm to MIVQUE such that no variance component was allowed to be less than 1×10⁻⁷. Estimates from MIVPEN were equal to MIVQUE for data sets in which there were no negative MIVQUE variance component estimates; when negative MIVQUE estimates occurred, the two techniques were no longer equivalent. The penalty algorithm operated by using Δ = γ - σ², the vector of departures of the unrestricted estimates from the true values (σ²), and by choosing a scalar weight w such that no element of γcon = σ² + wΔ is less than 1×10⁻⁷, where 1×10⁻⁷ is an arbitrary constant and γcon is the vector of estimated variance components conditioned on non-negativity. REML estimates were obtained from repeated application of Giesbrecht's algorithm (Equation 4-9) in which the estimates from the kth iteration become the priors for the (k+1)th iteration. The iterations were stopped when the difference between the estimates from the kth and (k+1)th iterations met the convergence criterion; the estimates of the (k+1)th iteration then became the REML estimates. The convergence criterion utilized was Σ(i=1 to r) |σ²i(k) - σ²i(k+1)| < 1×10⁻⁴, which imposed convergence to the fourth decimal place for all variance components. Since it was desired that the simulation run with little analyst intervention and in as few iterations as possible, the robustness of REML solutions obtained from Giesbrecht's algorithm to the priors (starting points) was explored. The difference in solutions starting from two distinct points (a vector of ones and the true values) was compared over 2000 data sets of different structures (imbalance, true variance components, and field design). The results (agreeing with those of Swallow and Monahan 1984) indicated that the difference between the two solutions was entirely dependent on the stringency of the convergence criterion and not on the starting point (priors).
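The MIVPEN conditioning step described above can be sketched directly. This Python fragment is illustrative only (the vectors are invented, not taken from the simulation): given the true components σ² and an unrestricted estimate with a negative element, it chooses the largest w in [0, 1] such that σ² + wΔ stays at or above the 1×10⁻⁷ floor:

```python
import numpy as np

# Sketch of the MIVPEN-style penalty step: shrink the unrestricted estimate
# back toward the true values along delta = estimate - sigma2 with a scalar
# weight w so that no component falls below the 1e-7 floor. Values invented.
sigma2 = np.array([1.0, 0.5, 0.25, 8.0])       # true components used as priors
estimate = np.array([1.2, 0.6, -0.05, 7.9])    # unrestricted, one negative
delta = estimate - sigma2

floor = 1e-7
# Largest w in [0, 1] keeping every element of sigma2 + w * delta >= floor.
bounds = [(floor - s) / d for s, d in zip(sigma2, delta) if d < 0]
w = min(1.0, min(bounds)) if bounds else 1.0
constrained = sigma2 + w * delta
```

When no element of the unrestricted estimate is negative, w stays at 1 and the constrained estimate equals the MIVQUE estimate, matching the equivalence noted in the text.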
Also, the number of iterations required for convergence was greatly decreased by using the true values as priors; thus, all REML estimates were calculated starting with the true values as priors. Three alternatives for coping with negative estimates after convergence were used for REML solutions: accept and use the negative estimates (Shaw 1987); arbitrarily set negative estimates to zero; and re-solve the system setting negative estimates to zero (Miller 1973). The first two alternatives are self-explanatory; the latter is accomplished by re-analyzing those data sets in which the initial unrestricted REML estimates included one or more negative estimates. During re-analysis, if a variance component became negative, it was set to zero (and could thereafter never take any value other than zero) and the iterations continued. This procedure persisted until the convergence criterion was met with a solution in which all variance components were either positive or zero. Harville (1977) suggested several adaptations of Henderson's mixed model equations (Henderson et al. 1959) which do not allow variance component estimates to become negative, although the estimates can become arbitrarily close to zero. After trial of these techniques against the set-negative-estimates-to-zero-and-re-solve approach, comparison of results using the same data sets indicated little practical advantage (although the Harville approach is more desirable theoretically); the differences between the sets of estimates obtained by the two methods (solving the system with a variance component set to zero versus arbitrarily close to zero) are extremely minor. ML solutions, as iterative applications of Equation 4-10, were calculated from the same starting points and with the same convergence criterion as REML solutions.
The three negative variance component alternatives explored for ML were to accept and use the negative estimates, to arbitrarily set negative estimates to zero after convergence, and (for half-sib data only) to re-solve the system setting negative variance components to zero. The algorithm used to calculate HM3 solutions (sequentially adjusted sums of squares) was based on the upper triangular G2 sweep (Goodnight 1979) and Hartley's method of synthesis (Hartley 1967). The equation solved was {E(MS)} γ = MS for the vector of variance component estimates γ, where MS is the vector of mean squares and {E(MS)} is the matrix of coefficients of the variance components in the expected mean squares. The alternative used for negative estimates was to accept and use them.

Comparison Among Estimation Techniques

For the simulation, MIVQUE estimates were the basis for all comparisons because MIVQUE is by definition the minimum variance quadratic unbiased estimator. The results of comparing the mean of 1000 MIVQUE estimates for an experimental level to the means for other techniques were termed "apparent bias"; "apparent bias" denotes that 1000 data sets were not sufficient to achieve complete convergence to the true values of the variance components. Sampling variances of estimation were calculated from the 1000 observations within an experimental level and estimation technique for variance components and genetic ratios (single-tree heritability, type B correlation and dominance to additive variance ratio). Mean square error then equalled variance plus squared "apparent bias". While mean square error was investigated, there was never sufficient bias for mean square error to lead to a different decision concerning techniques than sampling variance of the estimates, so mean square error is deleted from the remainder of this discussion. Probability of nearness is the probability that an estimate will lie within a certain interval around the true parameter. The three total interval widths utilized were one-half, equal to, and twice the parameter size.
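The probability-of-nearness criterion is easy to state in code. This Python sketch is illustrative only (the 1000 estimates here are drawn from an arbitrary normal rather than produced by any of the estimation techniques): it counts the fraction of estimates falling in intervals centred on the true parameter with total widths of one-half, one, and two times the parameter:

```python
import numpy as np

# Sketch of the probability-of-nearness criterion: the fraction of estimates
# within an interval centred on the true parameter whose total width is a
# multiple of the parameter. The estimates below are stand-in values.
rng = np.random.default_rng(4)
true_value = 0.25
estimates = rng.normal(true_value, 0.1, 1000)   # stand-in for 1000 estimates

def nearness(estimates, truth, width_factor):
    half = 0.5 * width_factor * truth           # half of the total interval width
    return float(np.mean(np.abs(estimates - truth) <= half))

probs = {f: nearness(estimates, true_value, f) for f in (0.5, 1.0, 2.0)}
```

By construction the three probabilities are non-decreasing in the interval width, so a technique is favored when it packs more of its estimates into the narrowest interval.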
The percentage of the 1000 estimates falling within these intervals was calculated for each estimation technique within an experimental level, for variance components and ratios, and used as the estimate of the probability of nearness. Results are presented by variance component or genetic ratio as a percentage of the MIVQUE result (except in the case of probability of nearness): MIVQUE represents 100%, estimates with greater variance have values larger than 100%, and "apparently biased" estimates have values different from 100%. The percentages were calculated as 100 times the estimate divided by the MIVQUE value. For the criterion of variance, the lower the percentage the better the estimator performed; for bias, values equalling 100% (zero bias) are preferred; and for probability of nearness, larger percentages (probabilities) are favored since they indicate a greater density of estimates near the parametric value.

Results and Discussion

Variance Components

Sampling variance of the estimators

For all variance components estimated, the sampling variances of the REML and ML estimators were consistently equal to or less than those of MIVQUE (Table 4-3). The variance among estimates from these techniques was further reduced by setting the negative components to zero (MODML and MODREML) or by setting negative estimates to zero and re-solving the system (NNREML, NNML, and PNNREML). Variance among MINQUE1 estimates is always equal to or greater than that for MIVQUE, as one might expect, since in this application they are the same technique with MIVQUE having perfect priors (the true values). Variances for the HM3 estimators (TYPE3 and PTYPE3) are equal to or greater than those of MIVQUE (HM3 estimates have progressively larger relative variance with higher levels of imbalance).
MIVPEN, although impractical because of the need for the true priors, had much more precise estimates of variance components than the other techniques, illustrating what could be accomplished given the true values as priors plus maintaining estimates within the parameter space. In general, the spread among the percentages for variance of estimation is highly dependent on the degree of imbalance and the type of mating system. With increasing imbalance, the likelihood-based estimators realized a greater advantage in sampling variance over HM3 for both mating systems. The most advantageous application of likelihood-based estimators is in the H1S65 case, where the imbalance is not only random deletion of individuals but also incomplete connectedness across locations, i.e., the same families are not present in each test (akin to incomplete blocks within a test).

Table 4-3. Sampling variance for the estimates of σ̂²_gca (upper number), σ̂²_t×gca (second number), and ĥ² (third number, where calculated) as a percentage of the MIVQUE estimate, by type of estimator and treatment combination; NA = not applied. Values greater than 100 indicate larger variance among the 1000 estimates.

Estimator    D1S80   D1S65   D2S65   H1S80   H1S65
REML          99.9   102.6   101.5    99.6   106.3
             100.2   100.0   104.1    99.7    98.0
             100.0   101.0   101.4    99.6   105.8
ML            77.3    78.2    76.4    95.9   103.9
             106.9   104.8   110.7   100.8    99.1
              82.5    82.9    86.4    96.2   103.8
MINQUE1      100.0   104.2   104.0   104.0   146.7
             101.2   118.8   123.6   112.5   139.7
             100.3   105.8   103.9   104.0   145.8
NNREML        80.8    71.6    95.2    88.0    68.6
              67.9    48.3    54.9    78.7    48.6
              76.8    64.2    92.2    87.3    67.7
NNML            NA      NA      NA    83.3    65.3
                NA      NA      NA    79.4    48.9
                NA      NA      NA    83.1    64.7
MODML         58.2    50.0    69.5    84.7    74.6
              12.8    81.4    81.6    86.6    68.5
              58.1    46.1    72.0    83.8    71.4
MODREML       81.5    74.5    96.1    88.9    78.1
              89.1    74.0    73.7    85.4    66.9
              76.4    63.5    88.9    87.7    74.3
TYPE3        101.0   101.0   105.5   100.6   121.0
             101.1   101.0   115.5   100.9   125.6
             100.5   108.4   102.9   100.4   121.6
PREML        100.3   106.3   101.7   107.5   146.9
             102.7   113.5   119.8   122.0   150.7
PML           77.6    81.9    77.1   103.6   143.4
             109.7   117.3   127.2   123.3   151.9
PMINQUE1     100.3   107.6   105.4   107.5   179.3
             102.7   129.0   137.3   122.0   180.6
PNNREML       80.9    71.1    93.9    92.7    86.6
              69.8    53.2    60.5    94.0    68.1
PTYPE3       100.3   106.6   105.4   107.5   168.1
             102.7   124.7   133.3   122.0   184.9
             100.6   110.8   104.1   106.9   168.0
MIVPEN          NA    36.2    29.1    80.0    45.6
                NA    26.6    20.0    74.3    39.6
                NA    34.7    30.2    79.8    45.4
PMIVQUE      100.3   104.2   102.4   107.5   146.9
             102.7   114.4   117.8   122.0   150.7

An analysis of variance was conducted to determine the importance of the treatment of negative variance component estimates in the variance of estimation for REML and ML. The model, with sampling variance of the estimates as a function of mating design, imbalance level, treatment of negative estimates, and size of the variance component, demonstrated consistently (for all variance components except error) that the treatment of negative estimates is an important contributor to the variance of the estimates (p < .05). The model accounted for up to 99% of the variation in the variance of the variance component estimates, with 1) accepting and using negative estimates producing the highest variance; 2) setting the negative components to zero being intermediate; and 3) re-solving the system with negative estimates set to zero providing the lowest variance. For all estimation techniques, lower variance among estimates was obtained by using individual observations rather than plot means, and the advantage of individual over plot-mean observations increased with increasing imbalance.

Bias

The most consistent performer for bias (Table 4-4) across all variance components was TYPE3, known from its inherent properties to be unbiased. The consistent convergence of the TYPE3 value to the MIVQUE value indicated that the number of data sets used (1000 per technique and experimental level) was suitable for the purpose of examining bias. The other two consistent performers were REML and MINQUE1.
PTYPE3 (HM3 based on plot means) was unbiased when no plot means were missing, but produced "apparently biased" estimates when plot means were missing.

Table 4-4. Bias for the estimates of σ̂²_gca (upper number), σ̂²_t×gca (second number), and ĥ² (third number, where calculated) as a percentage of the MIVQUE estimate, by type of estimator and experimental combination; NA = not applied. Values different from 100 denote "apparent" bias.

Estimator    D1S80   D1S65   D2S65   H1S80   H1S65
REML          99.9   101.5    98.7    99.9   102.8
              99.9   102.2    99.8    99.9    98.9
              99.9   101.3    98.6    99.9   102.6
ML            74.6    61.6    76.0    96.2    98.2
             106.5   114.6   109.7   101.3   101.8
              75.5    61.8    77.9    96.3    98.2
MINQUE1       99.7    96.4    99.0    99.4   102.0
             100.1   100.8   101.3   100.8    98.3
              99.7    96.6    98.9    99.4   101.3
NNREML       107.9   116.5    98.1   101.9   107.8
              93.1    92.9    92.9   100.5   102.3
             108.7   118.4    98.2   102.2   107.7
NNML            NA      NA      NA   101.9   107.8
                NA      NA      NA   100.5   102.3
                NA      NA      NA    98.2   103.8
MODML         86.6    90.4    79.0    98.1   114.1
             109.9   129.9   127.4   101.3   122.9
              87.8    91.5    79.4    99.6   112.6
MODREML      109.5   124.2   100.6   103.1   117.8
             103.7   119.8   119.2   104.6   120.6
             109.5   123.2    98.4   102.9   116.2
TYPE3        100.1    99.4    99.6   100.2    99.6
             100.2   101.0   102.4   100.2   100.9
             100.0    99.5    99.3   100.2    99.7
PREML         99.7    98.7    97.7    99.5   110.6
             100.1   103.6   100.2   102.4    98.3
PML           74.2    58.5    73.6    95.9   105.2
             106.9   116.2   111.5   103.2   102.0
PMINQUE1      99.7    95.2    98.8    99.5   106.5
             100.1   102.1   102.9   102.4   114.8
PNNREML      107.9   114.5    96.7   101.8   115.6
              92.9    94.0    95.0   104.5   110.2
PTYPE3        99.7    96.8    99.0    99.5   104.5
             100.1    97.2    96.0   102.4   108.7
              99.8    98.0    98.8    99.6   104.1
MIVPEN          NA   107.5    98.6   102.0   103.2
                NA    99.0    91.7   101.4   105.1
                NA   112.6   103.9   102.1   103.4
PMIVQUE       99.7    97.4    99.2    99.5   106.8
             100.1   101.7   100.5   102.4    98.8

Table 4-5. Probability of nearness for σ̂²_gca (upper number), σ̂²_t×gca (second number), and ĥ² (third number, where calculated). The probability interval is equal to the magnitude of the parameter.
Estimator    D1S80   D1S65   D2S65   H1S80   H1S65
REML          32.8    24.3    41.8    45.3    28.6
              43.0    26.2    25.7    36.6    27.1
              34.2    25.3    45.4    45.0    28.3
ML            33.6    22.3    40.7    45.4    29.2
              42.9    26.4    24.8    36.2    26.7
              34.6    22.3    45.0    45.7    28.2
MINQUE1       32.6    24.6    41.0    45.1    26.1
              43.1    24.3    25.4    34.2    23.2
              33.7    25.0    44.6    44.7    25.6
NNREML        33.4    23.4    41.7    45.1    29.3
              44.9    28.1    25.6    38.0    28.9
              34.3    24.3    46.1    45.2    29.5
NNML            NA      NA      NA    45.9    29.7
                NA      NA      NA    37.9    29.1
                NA      NA      NA    46.0    29.0
TYPE3         34.0    23.2    42.5    45.3    27.1
              42.6    27.1    24.8    37.3    25.0
              35.3    23.8    45.8    45.9    27.3
PREML         32.1    20.0    41.6    43.7    24.6
              42.7    26.8    24.6    32.3    20.4
PML           33.5    19.8    39.7    44.0    24.4
              41.0    26.3    23.6    31.6    21.1
PMINQUE1      32.1    21.4    40.4    43.7    24.5
              42.7    24.8    23.1    32.3    21.9
PNNREML       31.9    19.2    41.0    43.4    26.0
              43.3    28.0    23.3    33.1    21.3
PTYPE3        32.1    23.3    41.7    43.7    25.2
              42.7    25.4    24.1    32.3    22.4
              32.6    24.1    46.0    44.6    24.6
MIVQUE        33.6    25.7    43.7    45.1    29.2
              42.9    28.6    26.4    36.9    26.3
              34.8    26.8    47.7    45.4    29.4
MIVPEN          NA    41.1    78.5    48.4    35.6
                NA    47.0    60.3    39.2    31.2
                NA    42.4    80.5    48.7    35.3
PMIVQUE       32.1    20.0    41.8    43.7    25.9
              42.7    28.5    26.8    32.3    20.8

Among the estimators which displayed bias, the maximum likelihood estimators (ML and PML) are known to be inherently biased (Harville 1977, Searle 1987), with the amount of bias proportional to the number of degrees of freedom for a factor relative to the number of levels of the factor. The other biases resulted from the method of dealing with negative estimates: accepting and using the negative estimates produced the estimators with the least bias; setting negative variance components to zero resulted in the greatest bias; and re-solving the system with negative components set to zero was intermediate in bias.

Probability of nearness

Results for probability of nearness proved to be largely nondiscriminatory among techniques (Table 4-5). The low levels of probability density near the parametric values are indicative of the nature of the variance component estimation problem.
Figure 4-1 illustrates the distribution of MIVQUE variance component estimates of ĥ² (4-1a) and σ̂²_gca (4-1b) for level D1S80. The distributions for all unconstrained variance component estimates have the appearance of a chi-square distribution: positively skewed, with the expected value (mean) occurring to the right of the peak probability density and a proportion of the estimates occurring below zero (except for error). With increasing imbalance, the variance among estimates increases and the probability of nearness decreases for all interval widths.

[Figure 4-1. Distribution of 1000 MIVQUE estimates of ĥ² (4-1a) and σ̂²_gca (4-1b) for experimental level D1S80, illustrating the positive skew and the similarity of the distributions. The true values are 0.1 for ĥ² and 0.25 for σ̂²_gca. The interval width of the bars is one-half the parametric value.]

Ratios of Variance Components

Single-tree heritability

Results for estimates of single-tree heritability adjusted for locations and blocks are shown in Tables 4-3 and 4-4 (third number from the top in each cell, where calculated). For these relatively low heritabilities (0.1 and 0.25), the bias and variance properties of the estimated ratio are similar to those of the σ̂²_gca estimates (Figure 4-1). This implies that knowing the properties of the numerator of heritability reveals the properties of the ratio (especially true of ratios with expected values of 0.1 and 0.25; Kendall and Stuart 1963, Ch. 10). Variance component estimation techniques which performed well for bias and/or variance among estimates for σ̂²_gca also performed well for ĥ².
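The skewed, partly negative shape of these sampling distributions is easy to reproduce on a toy design. The following sketch is illustrative only (the balanced one-way ANOVA estimator, design sizes, and true values are hypothetical stand-ins, not the dissertation's designs):

```python
# Illustrative reproduction of the behaviour shown in Figure 4-1: variance
# component estimates from a small design are positively skewed and a
# proportion of them fall below zero.
import numpy as np

rng = np.random.default_rng(7)
q, n = 10, 5                      # q families, n trees each (assumed sizes)
s2_g, s2_e = 0.25, 1.0            # assumed true family and error components
ests = []
for _ in range(1000):             # 1000 simulated data sets
    g = rng.normal(0.0, np.sqrt(s2_g), q)
    y = g[:, None] + rng.normal(0.0, np.sqrt(s2_e), (q, n))
    msb = n * np.var(y.mean(axis=1), ddof=1)      # between-family mean square
    msw = np.mean(np.var(y, axis=1, ddof=1))      # within-family mean square
    ests.append((msb - msw) / n)                  # ANOVA estimate of s2_g
ests = np.array(ests)
print(ests.mean(), (ests < 0).mean())             # mean near 0.25; some negatives
```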
Type B correlation and dominance to additive variance ratio

Estimates of the Type B correlation (Tables 4-3 and 4-4) and of the dominance to additive variance ratio (not shown) both proved too unstable (extremely large variance among estimates) in their original formulations to be useful for discriminating among variance component estimation techniques. This high variance is due to the estimates of the denominators of these ratios approaching zero and to the high variance of the denominators themselves (Table 4-2). These ratios were therefore reformulated with the numerators of interest (4σ̂²_t×gca for additive genetic by test interaction and 4σ̂²_sca for dominance variance, respectively) and a denominator equal to the estimate of the phenotypic variance. With this reformulation, the variance and bias properties of estimates of the altered ratios are approximated by the properties of estimates of their numerators. With increasing imbalance, likelihood-based estimation offers an increasing advantage over HM3, and for all techniques individual observations offer an increasing advantage over plot-mean observations for the variance of the estimates of these ratios. Bias, apart from that of the inherently biased methods (ML), is associated with the probability of negative estimates, which increases with imbalance. This assertion is supported by comparing the biases of the REML, NNREML, and MODREML estimates across imbalance levels.

General Discussion

Observational Unit

Some general conclusions regarding the choice of a variance component estimation methodology can be drawn from the results of this investigation. For any degree of imbalance, the use of individual observations is superior to the use of plot means for the estimation of variance components or ratios of variance components.
If the data are nearly balanced (close to 100% survival with no missing plots, missing crosses (full-sib) or lack of connectedness (half-sib)), the properties of the estimation techniques based on individual and on plot-mean observations become similar; so if the departure from balance is nominal, plot means can be used effectively. However, using individual observations obviates the need for a survey of the imbalance in the data, since individual observations produce better results than plot means under any of the estimation techniques examined.

Negative Estimates

Drawing on the results of this investigation, the discussion of practical solutions for the negative estimates problem revolves around two alternatives: 1) accept and use the negative estimates; or 2) re-solve the system with negative estimates set to zero. Given that the property of interest is the true value of a variance component or genetic ratio, often estimated as a mean across data sets, negativity constraints come into play when the component of interest is small in comparison to the other underlying variance components in the data, or when the variance of the estimates is high due to an experimental design inadequate for variance component estimation; both factors increase the number of negative estimates. If the data structure is such that negative estimates would occur frequently, then accepting the negative estimates is a good alternative. If negative estimates tend to occur infrequently, or bias is of less concern than variance among the estimates, then re-solving the system when convergence yields negative estimates is the preferable solution: this tactic reduces both bias and variance among estimates below those of arbitrarily setting negative estimates to zero.

Estimation Technique

The primary competitors among the practically achievable estimation techniques are REML and TYPE3 (HM3).
Both techniques produce estimates with little or no bias; however, REML estimates for the most part have slightly less sampling variance than TYPE3 estimates, and if only subsets of the parents are in common across tests, as in case H1S65, REML has a distinct advantage over TYPE3 in variance among estimates. REML has three additional advantages over TYPE3: 1) REML provides generalized least squares estimation of the fixed effects, while TYPE3 provides only ordinary least squares estimation; 2) best linear unbiased predictions (BLUP) of the random variables are inherent in the REML solutions, i.e., gca predictions are available, so that in solving for the variance components with REML the fixed effects are estimated and the random variables are predicted simultaneously (Harville 1977); and 3) REML offers greater flexibility in model specification, in both univariate and multivariate forms, as well as heterogeneous or correlated error terms. Further, although the likelihood equations for common REML applications are based on normality, the technique has been shown to be robust against departures from the underlying distribution (Westfall 1987, Banks et al. 1985).

Recommendation

If one were to choose, from among those tested, a single variance component estimation technique which could be applied to any data set with confidence that the estimates had desirable properties (variance, MSE, and bias), that technique would be REML, and the basic unit of observation would be the individual. This combination (REML plus individual observations) performed well across mating designs and across types and levels of imbalance. The treatment of negative estimates would be determined by the proposed use of the estimates, that is, whether unbiasedness (accept and use the negative estimates) is more important than sampling variance (re-solve the system with negative estimates set to zero). A primary disadvantage of REML with individual observations is that both are computationally expensive (computer memory and time).
HM3 estimation could replace REML for many data sets, and plot means could replace individual observations for some; but general application of these substitutes without regard to the data at hand results, in many instances, in a loss of desirable properties of the estimates. The computational expense of REML with individual observations buys estimates with desirable properties over a broad scope of applications. With the advent of bigger and faster computers and the evolution of better REML algorithms, what was not feasible in the past on most mainframe computers can now be accomplished on personal computers.

CHAPTER 5
GAREML: A COMPUTER ALGORITHM FOR ESTIMATING VARIANCE COMPONENTS AND PREDICTING GENETIC VALUES

Introduction

The computer program described in this chapter, called GAREML (for Giesbrecht's algorithm of restricted maximum likelihood estimation, REML), is useful both for estimating variance components and for predicting genetic values. GAREML applies the methodology of Giesbrecht (1983) to the problems of REML estimation (Patterson and Thompson 1971) and best linear unbiased prediction (BLUP; Henderson 1973) for univariate (single-trait) genetic models. GAREML can be applied to half-sib (open-pollinated or polymix) and full-sib (partial diallel, factorial, half-diallel [no selfs], or disconnected sets of half-diallels) mating designs planted in single or multiple locations with single or multiple replications per location. When used for variance component estimation, the program has been shown to provide estimates with desirable properties across the types of imbalance commonly encountered in forest genetics field tests (Huber et al. in press) and with varying underlying distributions (Banks et al. 1985, Westfall 1987). GAREML is also useful for determining the efficiencies of alternative field and mating designs for the estimation of variance components.
Utilizing the power of mixed-model methodology (Henderson 1984), GAREML provides BLUP of parental general (gca) and specific (sca) combining abilities as well as generalized least squares (GLS) solutions for the fixed effects. The application of BLUP to forest genetics problems has been addressed by White and Hodge (1988, 1989). Under certain assumptions, the desirable properties of BLUP predictions include maximizing the probability of obtaining correct parental rankings from the data and minimizing the error associated with using the predicted parental values in future applications. GLS fixed effect estimation weights the observations comprising the estimates by their associated variances, approximating best linear unbiased estimation (BLUE) of the fixed effects (Searle 1987, pp. 489-490). The purpose of this chapter is to describe the theory and use of GAREML in enough detail to facilitate use by other investigators. The program is written in FORTRAN and is not dependent on other analysis programs. An interactive version can be obtained as a stand-alone executable file from the senior author; this file will run on any IBM-compatible PC under the DOS or Windows operating systems. The size of the problem an investigator can solve will depend on the amount of extended memory and hard disk space (for swap files) available to the program. In addition, the FORTRAN source code can be obtained by analysts wishing to compile the program for use on other systems (e.g., mainframe computers).

Algorithm

GAREML proceeds by reading the data and forming a design matrix based on the number of levels of the factors in the model. Portions of the design matrix for nested factors or interactions are formed by horizontal direct product. Columns of zeroes in the design matrix (the result of imbalance) are then deleted.
The design matrix columns are in the order specified by Giesbrecht's algorithm: columns for the fixed effects first, followed by the data vector, with the last section of the matrix for the random effects. The design matrix is the only fully formed matrix in the program. All other matrices are symmetric; therefore, to save computational space and time, only the diagonal and above-diagonal portions of matrices are formed and utilized (i.e., they are half-stored). A half-stored matrix of the dot products of the design columns is formed and either kept in common memory or stored in temporary disk space so that it is available for recall in the iterative solution process. (Windows is a trademark of the Microsoft Corporation, Redmond, WA.)

The algorithm proceeds by modifying the matrix of dot products so that the inverse of the covariance matrix of the observations (V) is enclosed by the column specifiers of the dot products, e.g., X'X becomes X'V⁻¹X. This transformation is completed without inversion of the full V matrix. The identity used to accomplish it is: if V_h = a_h Z_h Z_h' + V_(h+1), where V_h is nonsingular, then

V_h⁻¹ = V_(h+1)⁻¹ − a_h V_(h+1)⁻¹ Z_h (I_h + a_h Z_h' V_(h+1)⁻¹ Z_h)⁻¹ Z_h' V_(h+1)⁻¹.    (5-1)

A compact form of equation 5-1 is obtained by premultiplying by Z_i' and postmultiplying by Z_j, where h = 1, ..., k−1 (k = the total number of random factors), a_h is the prior associated with random variable h, V_k = a_k I, V_1 = V, and Z_i is the portion of the design matrix for random variable i (Giesbrecht 1983). A partitioned matrix is formed in order to update V_(h+1)⁻¹ until V_1⁻¹, i.e., V⁻¹, is obtained. This matrix is of the form

[ I_h + a_h Z_h' V_(h+1)⁻¹ Z_h                     Z_h' V_(h+1)⁻¹ (X | y | Z_1 | ... | Z_(k-1)) ]    (5-2)
[ (X | y | Z_1 | ... | Z_(k-1))' V_(h+1)⁻¹ Z_h     T_(h+1)                                     ]

where T_(h+1) = (X | y | Z_1 | ... | Z_(k-1))' V_(h+1)⁻¹ (X | y | Z_1 | ... | Z_(k-1)). The sweep operator of Goodnight (1979) is applied to the upper left partition of the matrix (equation 5-2), and the result of equation 5-1 is obtained. The matrix is sequentially updated and swept until

T_1 = (X | y | Z_1 | ... | Z_(k-1))' V⁻¹ (X | y | Z_1 | ... | Z_(k-1))

is obtained. T_1 is then swept on the columns for the fixed effects (X'V⁻¹X). This sweep operation produces the generalized least squares estimates of the fixed effects, results which can be scaled into predictions of the random variables, the residual sum of squares, and all the ingredients necessary for assembling the equations for the variance components. The system to be solved for the variance components is

{tr(Q V_i Q V_j)} σ̂² = {y' Q V_i Q y}
      (r×r)       (r×1)      (r×1)

so that

σ̂² = {tr(Q V_i Q V_j)}⁻¹ {y' Q V_i Q y},    (5-3)

where {tr(Q V_i Q V_j)} is the matrix whose elements are tr(Q V_i Q V_j) for i = 1 to r and j = 1 to r, i.e., there is a row and a column for every random variable in the linear model; tr is the trace operator, the sum of the diagonal elements of a matrix; Q = V⁻¹ − V⁻¹X(X'V⁻¹X)⁻X'V⁻¹, for V the covariance matrix of y and X the design matrix for the fixed effects; V_i = Z_iZ_i', where the i's index the random variables; σ̂² is the vector of variance component estimates; and r is the number of random variables in the model (k−1). The entire procedure, from forming T_1 to solving for the variance components, is repeated until the variance component estimates of the last iteration differ from those of the previous iteration by no more than the convergence criterion. The fixed effect estimates and the predictions of the random variables are then those of the final iteration. The asymptotic covariance matrix of the variance components,

Var(σ̂²) = 2 {tr(Q V_i Q V_j)}⁻¹,    (5-4)

is obtained from intermediate results of the solution for the variance components. The coefficient matrix of Henderson's mixed model equations is then formed in order to calculate the covariance matrix of the fixed and random effects; the covariance matrix of the observations is constructed using the variance component estimates from Giesbrecht's algorithm.
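The central computation, equation 5-3 together with the asymptotic covariance of equation 5-4, can be illustrated on a toy model with one random factor plus error. This sketch is hypothetical (design sizes, priors, and direct matrix inversion are stand-ins; GAREML's half-stored, sweep-based bookkeeping differs):

```python
# One iteration of equation 5-3 on a toy model y = X*beta + Z1*u + e,
# followed by the asymptotic covariance of equation 5-4.
import numpy as np

rng = np.random.default_rng(3)
n, q = 30, 5
X = np.ones((n, 1))                              # intercept only (assumed)
Z1 = np.kron(np.eye(q), np.ones((n // q, 1)))    # one random factor with q levels
y = X @ np.array([2.0]) + Z1 @ rng.normal(0.0, 1.0, q) + rng.normal(0.0, 0.5, n)

V_parts = [Z1 @ Z1.T, np.eye(n)]                 # V_i = Z_i Z_i'; error last
priors = [1.0, 1.0]                              # starting values (assumed)

V = sum(s * Vi for s, Vi in zip(priors, V_parts))
Vinv = np.linalg.inv(V)
Q = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)

# left side {tr(Q Vi Q Vj)} and right side {y' Q Vi Q y} of equation 5-3
A = np.array([[np.trace(Q @ Vi @ Q @ Vj) for Vj in V_parts] for Vi in V_parts])
rhs = np.array([y @ Q @ Vi @ Q @ y for Vi in V_parts])
sigma_hat = np.linalg.solve(A, rhs)              # one update of the components
cov = 2.0 * np.linalg.inv(A)                     # equation 5-4 at these priors
print(sigma_hat)
```

In an actual run, `sigma_hat` would replace `priors` and the loop would repeat until the convergence criterion is met.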
The coefficient matrix is

[ X'R⁻¹X     X'R⁻¹Z         ]    (5-5)
[ Z'R⁻¹X     Z'R⁻¹Z + D⁻¹  ]

where R is the error covariance matrix, which in this application is Iσ²_w, with σ²_w the variance of the within-plot random variable w (equations 5-6 and 5-7); X is the fixed effects design matrix; Z is the random effects design matrix; and D is the covariance matrix of the random variables, which in this application has the variance components on the diagonal and zeroes off the diagonal (no covariance among random variables). The generalized inverse of the coefficient matrix (equation 5-5) is the error covariance matrix of the fixed effect estimates and random variable predictions, assuming the covariance matrix of the observations is known without error.

Operating GAREML

While GAREML will run in either batch or interactive mode, we focus on the interactive PC version, which begins by prompting the analyst to answer questions determining the factors to be read from the data. Specifically, the analyst answers yes or no to these questions: 1) are there multiple locations? 2) are there multiple blocks? 3) are there disconnected sets of full-sibs (usually referring to disconnected half-diallels)? and 4) is the mating design half-sib or full-sib? The program then determines the proper variables to read from the data as well as the most complicated (number of main factors plus interactions) scalar linear model allowed. The most complicated linear model allowed for full-sib observations is

y_ijklm = μ + t_i + b_ij + set_s + g_k + g_l + s_kl + tg_ik + tg_il + ts_ikl + p_ijkl + w_ijklm    (5-6)

where y_ijklm is the m-th observation of the kl-th cross in the j-th block of the i-th test; μ is the population mean; t_i is the random or fixed variable test environment; b_ij is the random or fixed variable block; set_s is the random or fixed variable set, i.e., a variable created so that disconnected sets of half-diallels planted in the same experiment can be analyzed in the same run, or so that provenances and families within provenance can be analyzed with provenance as set (sets are assumed to cut across test environments and blocks, families are nested within sets, and interactions with set are assumed unimportant); g_k is the random variable female general combining ability (gca); g_l is the random variable male gca; s_kl is the random variable specific combining ability (sca); tg_ik is the random variable test by female gca interaction; tg_il is the random variable test by male gca interaction; ts_ikl is the random variable test by sca interaction; p_ijkl is the random variable plot; w_ijklm is the random variable within-plot; and there is no covariance between the random variables in the model. The assumptions utilized are that the female and male variances are equal (σ²_gf = σ²_gm = σ²_gca) and that the female and male environmental interactions are equal (σ²_tgf = σ²_tgm = σ²_tg). The most complicated scalar linear model allowed for half-sib observations is

y_ijkn = μ + t_i + b_ij + set_s + g_k + tg_ik + p_ijk + w_ijkn    (5-7)

where y_ijkn is the n-th observation of the k-th half-sib family in the j-th block of the i-th test; μ, t_i, b_ij, set_s, g_k, and tg_ik retain their definitions from the full-sib equation; p_ijk is the random variable plot, containing different genotype by environment components than in the full-sib model; w_ijkn is the random variable within-plot, containing different levels of genotypic and genotype by environment components than in the full-sib model; and there is no covariance between the random variables in the model. The analyst builds the linear model by answering further prompts.
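Data following the half-sib linear model above can be simulated with known variance components, which is a convenient way to exercise a variance component program. The sketch below is hypothetical (design sizes and true components are invented for illustration):

```python
# Simulate observations under the half-sib model: overall mean plus family
# (gca), test-by-family, plot, and within-plot effects, each drawn with an
# assumed true variance. Test and block effects are omitted for brevity.
import numpy as np

rng = np.random.default_rng(5)
tests, blocks, fams, trees = 2, 3, 10, 4         # assumed design sizes
s2 = dict(g=0.25, tg=0.10, plot=0.15, w=1.0)     # assumed true components

g  = rng.normal(0.0, np.sqrt(s2['g']),  fams)            # family (gca) effects
tg = rng.normal(0.0, np.sqrt(s2['tg']), (tests, fams))   # test-by-family effects
records = []
for i in range(tests):
    for j in range(blocks):
        for k in range(fams):
            plot = rng.normal(0.0, np.sqrt(s2['plot']))  # one plot per family/block
            for m in range(trees):
                w = rng.normal(0.0, np.sqrt(s2['w']))    # within-plot deviation
                records.append((i, j, k, 10.0 + g[k] + tg[i, k] + plot + w))
print(len(records))   # tests * blocks * fams * trees observations
```

Feeding such simulated records to an estimation routine and checking the recovered components against `s2` is exactly the kind of validation reported for GAREML in Chapter 4.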
If test, block and/or set are in the model, each must be declared a fixed or a random effect. When any of these three effects is declared random, the analyst must furnish a prior value for its variance; if no prior value is known, 1.0 may be used. Using 1.0's as priors will not affect the resulting variance component estimates, within the constraints of the convergence criterion, but there may be a time penalty from an increased number of iterations to convergence. All remaining factors in the model are treated as random variables. To complete the definition of the model, the analyst chooses to include or exclude each possible factor by answering yes or no when prompted; after each yes answer, the program asks for a prior value for the variance (again, 1.0 may be substituted when no prior is known). After the model has been specified, the program counts the numbers of fixed and random effects and asks whether they fit the model expected. A yes answer proceeds through the program, while a no returns the program to the beginning. GAREML is now ready to read the data file (which must be an ASCII file) in this order: test, block, set, female, male, and the response variable. The analyst is prompted to furnish a proper FORTRAN format statement for the data. Test, block, set, female and male are read as character variables (A fields) with as many as eight characters per field, while the response variable is read as a double precision variable (F field). An example format statement for a full-sib mating design across locations and blocks is "(4A8,F10.5)", which reads four character variables sequentially occupying eight columns each and a response variable beginning in column 33 and ending in column 42 with five decimal places. After reading the data, GAREML begins to furnish information to the analyst, which should be scanned to make sure the data were read correctly.
This information includes the number of parents, the number of full-sib crosses, the number of observations, the maximum number of fixed effect design matrix columns, and the maximum number of random effect design matrix columns. If there is an error at this point, use CTRL-BREAK to exit the program. Probable causes of errors are data not in the format specified, missing values, blank lines or similar problems in the data file, or an incorrectly specified model. At this point there are three other prompts concerning the analysis: the number of iterations, the convergence criterion, and the treatment of negative variance components. The maximum number of iterations is arbitrarily set to 30 and can be changed at the analyst's discretion. No warning is issued when the maximum number of iterations has been reached; however, the current iteration number and variance component estimates are written to the screen at the beginning of each iteration. The convergence criterion used is the sum of the absolute values of the differences between the variance component estimates of consecutive iterations. The criterion is set to 1 × 10⁻⁴, meaning that convergence is required to the fourth decimal place for all variance components. The convergence criterion should be modified to suit the magnitude of the variances under consideration as well as the practical need for enhanced resolution; enhanced resolution is obtained at the cost of an increased number of iterations to convergence. The analyst must also decide whether to accept and use negative estimates or to set negative estimates to zero and re-solve the system. The latter results in variance component estimates with lower sampling variance but slight bias; if one is interested in unbiased estimates of variance components that have a high probability of being negative, then accepting and using the negative estimates may be the proper course to take.
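The convergence rule just described can be stated in a few lines. This is an illustrative Python paraphrase of the criterion, not GAREML's FORTRAN code:

```python
# Convergence rule: stop when the sum of absolute changes in the variance
# component estimates between consecutive iterations falls below the
# criterion (default 1e-4, i.e., fourth-decimal agreement).
def converged(prev, curr, criterion=1e-4):
    return sum(abs(p - c) for p, c in zip(prev, curr)) < criterion

print(converged([0.25, 1.00], [0.25004, 0.99997]))   # total change 7e-5: converged
print(converged([0.25, 1.00], [0.26, 1.01]))         # total change 2e-2: not yet
```

Because the changes are summed across all components, the criterion should be scaled up for traits measured in large units, exactly as the text advises.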
Interpreting GAREML Output

The analysis is now underway. The priors and the iteration number are printed to the screen for each iteration, and GAREML iterates until the convergence criterion is met or the maximum number of iterations is reached. The next time analyst intervention is required is to provide a name for the output file of variance component estimates. The file name follows normal DOS file-naming protocol; however, alternative directories may not be specified, i.e., all output files are placed in the same directory as the data file. The program then asks the analyst whether additional outputs are desired: gca predictions, sca predictions (if applicable), the asymptotic covariance matrix of the variance components, the generalized least squares fixed effect estimates, the error covariance matrix of the gca predictions, and the error covariance matrix of the fixed effects. A yes answer for an output results in a prompt for a file name. In addition, for the gca and sca predictions the analyst may input a different value of σ̂²_gca or σ̂²_sca with which to scale the predictions. The discussion which follows furnishes more detailed information concerning the GAREML outputs.

Variance Component Estimates

Setting aside concerns about convergence to a global maximum and about negative values, the variance component estimates are the restricted maximum likelihood estimates of Patterson and Thompson (1971). The estimates are robust against the starting values (priors), i.e., the same estimates, within the limits of the convergence criterion, can be obtained from diverse priors; however, priors close to the true values will in general reduce the number of iterations required to reach convergence. The value of the convergence criterion must be less than or equal to the desired precision of the variance components.
The REML variance component estimates from this program have been shown to have more desirable properties (variance and bias) than other commonly used estimation techniques (maximum likelihood, minimum norm quadratic unbiased estimation, and Henderson's Method 3) over a wide range of data imbalance. The properties of the estimates are further enhanced by using individual observations as data rather than plot means. The output is labelled by the variance component estimated.

Predictions of Random Variables

The predictions output are general and specific combining abilities, which are approximate best linear unbiased predictions (BLUP) of the random variables. BLUP predictions have several optimal properties: 1) the correlation between the predicted and true values is maximized; and 2) if the distribution is multivariate normal, BLUP maximizes the probability of obtaining the correct rankings (Henderson 1973) and so maximizes the probability of selecting the better candidate from any pair of candidates (Henderson 1977). Predictions are of the form

û = D̂ Z' V̂⁻¹ (y − Xβ̂)    (5-8)

where û is the vector of predictions; D̂ is the estimated covariance matrix of the random variables, built from the REML variance component estimates (see equation 5-5); Z' is the transpose of the design matrix for the random variables; y is the data vector; X is the design matrix for fixed effects;
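Prediction equation 5-8 and the mixed model equations of 5-5 are two routes to the same answer, which makes a small numerical check possible. The sketch below is illustrative (a toy one-random-factor model with assumed component values, not a GAREML analysis):

```python
# Compute BLUP two ways: directly from equation 5-8 via V = Z D Z' + R, and
# from Henderson's mixed model equations (5-5); the two agree exactly.
import numpy as np

rng = np.random.default_rng(6)
n, q = 12, 3
X = np.ones((n, 1))
Z = np.kron(np.eye(q), np.ones((n // q, 1)))
D = np.eye(q) * 1.0                     # assumed covariance of the random factor
R = np.eye(n) * 0.25                    # assumed error covariance, I * sigma2_w
y = 5.0 + Z @ rng.normal(0.0, 1.0, q) + rng.normal(0.0, 0.5, n)

# Route 1: equation 5-8
V = Z @ D @ Z.T + R                     # covariance matrix of the observations
Vinv = np.linalg.inv(V)
beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # GLS fixed effects
u = D @ Z.T @ Vinv @ (y - X @ beta)                      # BLUP (equation 5-8)

# Route 2: the mixed model equations (5-5)
Rinv, Dinv = np.linalg.inv(R), np.linalg.inv(D)
C = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
              [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Dinv]])
sol = np.linalg.solve(C, np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y]))
print(np.allclose(sol[:1], beta), np.allclose(sol[1:], u))
```

The mixed-model-equation route never inverts the full n × n matrix V, which is why programs such as GAREML favor it for large data sets.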