Optimal mating designs and optimal techniques for analysis of quantitative traits in forest genetics


Material Information

Title:
Optimal mating designs and optimal techniques for analysis of quantitative traits in forest genetics
Physical Description:
ix, 151 leaves : ill. ; 29 cm.
Language:
English
Creator:
Huber, Dudley Arvle, 1948-
Publication Date:
1993
Subjects

Subjects / Keywords:
Forest Resources and Conservation thesis Ph. D
Dissertations, Academic -- Forest Resources and Conservation -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1993.
Bibliography:
Includes bibliographical references (leaves 145-150).
Statement of Responsibility:
by Dudley Arvle Huber.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001915152
oclc - 30335713
notis - AJZ0671
System ID:
AA00003661:00001

Full Text










OPTIMAL MATING DESIGNS AND OPTIMAL TECHNIQUES FOR ANALYSIS OF
QUANTITATIVE TRAITS IN FOREST GENETICS

















By

DUDLEY ARVLE HUBER


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1993












ACKNOWLEDGEMENTS


I express my gratitude to Drs. T. L. White, G. R. Hodge, R. C. Littell, M. A.

DeLorenzo and D. L. Rockwood for their time and effort in the pursuit of this work. Their

guidance and wisdom proved invaluable to the completion of this project.

I further acknowledge Dr. Bruce Bongarten for his encouragement to continue my

academic career. I am grateful to Dr. T. L. White and the School of Forest Resources and

Conservation at the University of Florida for funding this work.

I extend special thanks to George Bryan and Dr. M. A. DeLorenzo of the Dairy Science

Department and Greg Powell of the School of Forest Resources and Conservation for the use of

computing facilities, programming help and aid in running the simulations required.

Most importantly, I thank my family, Nancy, John and Heather, for their understanding

and encouragement in this endeavor.












TABLE OF CONTENTS



ACKNOWLEDGEMENTS ........................................ ii

LIST OF TABLES ................................. ........... vi

LIST OF FIGURES ............................................. vii

ABSTRACT ................................................ viii

CHAPTER 1
INTRODUCTION .................... 1

CHAPTER 2
THE EFFICIENCY OF HALF-SIB, HALF-DIALLEL
AND CIRCULAR MATING DESIGNS IN THE ESTIMATION
OF GENETIC PARAMETERS WITH VARIABLE NUMBERS OF
PARENTS AND LOCATIONS ................ 4
Introduction ............................................... 4
Methods .................................................... 6
    Assumptions Concerning Block Size ....................... 6
    The Use of Efficiency (i) ............................... 7
    General Methodology ..................................... 8
    Levels of Genetic Determination ......................... 10
    Covariance Matrix for Variance Components ............... 12
    Covariance Matrix for Linear Combinations of Variance Components
        and Variance of a Ratio ............................. 13
    Comparison Among Estimates of Variances of Ratios ....... 14
Results .................................................... 17
    Heritability ............................................ 17
    Type B Correlation ...................................... 18
    Dominance to Additive Variance Ratio .................... 21
Discussion ................................................. 22
    Comparison of Mating Designs ............................ 22
    A General Approach to the Estimation Problem ............ 23
    Use of the Variance of a Ratio Approximation ............ 25
Conclusions ................................................ 26






CHAPTER 3


ORDINARY LEAST SQUARES ESTIMATION OF GENERAL
AND SPECIFIC COMBINING ABILITIES FROM
HALF-DIALLEL MATING DESIGNS ................ 28
Introduction ............................................... 28
Methods
    Linear Model
    Ordinary Least Squares Solutions
    Sum-to-Zero Restrictions
    Components of the Matrix Equation
    Estimation of Fixed Effects
Numerical Examples
    Balanced Data (Plot-mean Basis)
    Missing Plot
    Missing Cross
    Several Missing Crosses
Discussion
    Uniqueness of Estimates
    Weighting of Plot Means and Cross Means in Estimating Parameters
    Diallel Mean
    Variance and Covariance of Plot Means
    Comparison of Prediction and Estimation Methodologies
Conclusions

CHAPTER 4
VARIANCE COMPONENT ESTIMATION TECHNIQUES
COMPARED FOR TWO MATING DESIGNS
WITH FOREST GENETIC ARCHITECTURE
THROUGH COMPUTER SIMULATION
Introduction
Methods
    Experimental Approach
    Experimental Design for Simulated Data
    Full-Sib Linear Model
    Half-sib Linear Model
    Data Generation and Deletion
    Variance Component Estimation Techniques
    Comparison Among Estimation Techniques
Results and Discussion
    Variance Components
    Ratios of Variance Components
General Discussion
    Observational Unit
    Negative Estimates
    Estimation Technique
    Recommendation







CHAPTER 5
GAREML: A COMPUTER ALGORITHM FOR
ESTIMATING VARIANCE COMPONENTS AND
PREDICTING GENETIC VALUES ............... 82
Introduction ............................................. 82
Algorithm .............................................. 83
Operating GAREML ...................................... 86
Interpreting GAREML Output ................................ 90
Variance Component Estimates ........................... 90
Predictions of Random Variables .......................... 91
Asymptotic Covariance Matrix of Variance Components ........... 92
Fixed Effect Estimates ............................... 93
Error Covariance Matrices ............................. 93
Example ............... ......................... .... 94
Data .................. ........ .... .............. 94
Analysis ..................................... ... 94
Output ........................................... 98
Conclusions .............................................. 103

CHAPTER 6
CONCLUSIONS ..................... 104

APPENDIX
FORTRAN SOURCE CODE FOR GAREML ............ 107

REFERENCE LIST ........................................... 145

BIOGRAPHICAL SKETCH ...................................... 151













LIST OF TABLES


Table 2-1. Parametric variance components ......................... 11

Table 3-1. Data set for numerical examples ........................ 43

Table 3-2. Numerical results for examples ......................... 44

Table 4-1. Abbreviation for and description of variance component
    estimation methods ............................................ 60

Table 4-2. Sets of true variance components ....................... 61

Table 4-3. Sampling variance for the estimates .................... 72

Table 4-4. Bias for the estimates ................................. 74

Table 4-5. Probability of nearness ................................ 75

Table 5-1. Data for example ....................................... 95













LIST OF FIGURES


Figure 2-1. Efficiency (i) for h² ................................. 16

Figure 2-2. Efficiency (i) for rB ................................. 19

Figure 2-3. Efficiency (i) for γ .................................. 20

Figure 3-1. The overparameterized linear model .................... 33

Figure 3-2. The linear model for a four-parent half-diallel ........ 33

Figure 3-3. Intermediate result in SCA submatrix generation ........ 39

Figure 3-4. Weights on overall cross means ........................ 49

Figure 4-1. Distribution of 1000 MIVQUE estimates ................. 77













Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

OPTIMAL MATING DESIGNS AND OPTIMAL TECHNIQUES FOR ANALYSIS OF
QUANTITATIVE TRAITS IN FOREST GENETICS

By

Dudley Arvle Huber
May 1993

Chairperson: Timothy L. White
Major Department: School of Forest Resources and Conservation

First, the asymptotic covariance matrix of the variance component estimates is used to

compare three common mating designs for efficiency (maximizing the variance reducing property

of each observation) for genetic parameters across numbers of parents and locations and varying

genetic architectures. It is determined that the circular mating design is always superior in

efficiency to the half-diallel design. For single-tree heritability, the half-sib design is most

efficient. For estimating type B correlation, maximum efficiency is achieved by either the half-sib or circular mating design, and the change in efficiency rank between them is determined by the underlying genetic architecture.

Another intent of this work is to compare analysis methodologies for determining parental worth. The first of these investigations examines ordinary least squares assumptions in the estimation of parental worth for the half-diallel mating design with balanced and unbalanced data. The

conclusion from comparison of ordinary least squares to alternative analysis methodologies is that

best linear unbiased prediction and best linear prediction are more appropriate to the problem of

determining parental worth.







The next analysis investigation contrasts variance component estimation techniques across

levels of imbalance for the half-diallel and half-sib mating designs for the estimation of genetic

parameters with plot means and individuals used as the unit of observation. The criteria for

discrimination are variance of the estimates, mean square error, bias and probability of nearness.

For all estimation techniques individuals as the unit of observation produced estimates with the

most desirable properties. Of the estimation techniques examined, restricted maximum likelihood

is the most robust to imbalance.

The computer program used to produce restricted maximum likelihood estimates of

variance components was modified to form a user-friendly analysis package. Both the algorithm

and the outputs of the program are documented. Outputs available from the program include

variance component estimates, generalized least squares estimates of fixed effects, asymptotic

covariance matrix for variance components, best linear unbiased predictions for general and

specific combining abilities and the error covariance matrix for predictions and estimates.












CHAPTER 1
INTRODUCTION


Analysis of quantitative traits in forest genetic experiments has traditionally been

approached as a two-part problem. Parental worth would be estimated as fixed effects and later

considered as random effects for the determination of genetic architecture. While traditional, this

approach is most probably sub-optimal given the proliferation of alternative analysis approaches

with enhanced theoretical properties (White and Hodge 1989).

In this dissertation emphasis is placed on the half-diallel mating design because of its

omnipresence and the uniqueness of the analysis problem this mating design presents. The half-

diallel mating design has been and continues to be used in plant sciences (Sprague and Tatum

1942, Gilbert 1958, Matzinger et al. 1959, Burley et al. 1966, Squillace 1973, Weir and Zobel

1975, Wilcox et al. 1975, Snyder and Namkoong 1978, Hallauer and Miranda 1981, Singh and

Singh 1984, Greenwood et al. 1986, and Weir and Goddard 1986). The unique feature of the

half-diallel mating system which hinders analysis with many statistical packages is that a single

observation contains two levels of the same main effect.

Optimality of mating design for the estimation of commonly needed genetic parameters

(single-tree heritability, type B correlation and dominance to additive variance ratio) is examined

utilizing the asymptotic covariance of the variance components (Kendall and Stuart 1963,

Giesbrecht 1983 and McCutchan et al. 1989). Since genetic field experiments are composed of

both a mating design and a field design, the central consideration in this investigation is which

mating design with what field design (how many parents and across what number of locations







2

within a randomized complete block design) is most efficient. The criterion for discernment

among designs is the efficiency of the individual observation in reducing the variance of the

estimate (Pederson 1972). This question is considered under a range of genetic architectures

which spans that reported for coniferous growth traits (Campbell 1972, Stonecypher et al. 1973,

Snyder and Namkoong 1978, Foster 1986, Foster and Bridgwater 1986, Hodge and White [in

press]).

The investigation into optimal analysis proceeds by considering the ordinary least squares

(OLS) treatment of estimating parental worth for the half-diallel mating design. OLS assumptions

are examined in detail through the use of matrix algebra for both balanced and unbalanced data.

The use of matrix algebra illustrates both the uniqueness of the problem and the interpretation

of the OLS assumptions. Comparisons among OLS, generalized least squares (GLS), best linear

unbiased prediction (BLUP) and best linear prediction (BLP) are made on a theoretical basis.

Although consideration of field and mating design of future experiments is essential, the

problem of optimal analysis of current data remains. In response to this need, simulated data

with differing levels of imbalance, genetic architecture and mating design is utilized as a basis

for discriminating among variance component estimation techniques in the determination of

genetic architecture. The levels of imbalance simulated represent those commonly seen in forest

genetic data as less than 100% survival, missing crosses for full-sib mating designs and only

subsets of parents in common across location for half-sib mating designs. The two mating

designs are half-sib and half-diallel with a subset of the previously used genetic architectures.

The field design is a randomized complete block with fifteen families per block and six trees per

family per block. The four criteria used to discriminate among variance component estimation

techniques are probability of nearness (Pitman 1937), bias, variance of the estimates and mean

square error (Hogg and Craig 1978).







3

The techniques compared for variance component estimation are minimum variance

quadratic unbiased estimation (Rao 1971b), minimum norm quadratic unbiased estimation (Rao

1971a), restricted maximum likelihood (Patterson and Thompson 1971), maximum likelihood

(Hartley and Rao 1967) and Henderson's method 3 (Henderson 1953). These techniques are

compared using the individual and plot means as the unit of observation. Further, three

alternatives are explored for dealing with negative variance component estimates: accepting the negative estimates, setting negative estimates to zero, and re-solving the system with the negative components constrained to zero.

The algorithm used for the method which provided estimates with optimal properties

across experimental levels was converted to a user-friendly program. This program, which provides restricted maximum likelihood variance component estimates, uses Giesbrecht's (1983) algorithm.

Documentation of the algorithm and explanation of the program's output are provided along with

the Fortran source code (appendix).













CHAPTER 2
THE EFFICIENCY OF HALF-SIB, HALF-DIALLEL
AND CIRCULAR MATING DESIGNS IN THE ESTIMATION
OF GENETIC PARAMETERS WITH VARIABLE NUMBERS OF
PARENTS AND LOCATIONS


Introduction


In forest tree improvement, genetic tests are established for four primary purposes:

1) ranking parents, 2) selecting families or individuals, 3) estimating genetic parameters, and 4)

demonstrating genetic gain (Zobel and Talbert 1984). While the four purposes are not mutually

exclusive, a test design optimal for one purpose is most probably not optimal for all (Burdon and

Shelbourne 1971, White 1987). A breeder then must prioritize the purposes for which a given

test is established and choose a design based on these priorities. Within a genetic test design

there are two primary components: mating design and field design. There have been several

investigations of optimal designs for these two components either separately or simultaneously

under various criteria. These criteria have included the efficient and/or precise estimation of

heritability (Pederson 1972, Namkoong and Roberds 1974, Pepper and Namkoong 1978,

McCutchan et al. 1985, McCutchan et al. 1989), precise estimation of variance components

(Braaten 1965, Pepper 1983), and efficient selection of progeny (van Buijtenen 1972, White and

Hodge 1987, van Buijtenen and Burdon 1990, Loo-Dinkins et al. 1990).

Incorporated within this body of research has been a wide range of genetic and

environmental variance parameters and field and mating designs. However, the models in

previous investigations have been primarily constrained to consideration of testing in a single






5

environment with a corresponding limited number of factors in the model, i.e., genotype by

environment interaction and/or dominance variance are usually not considered. This chapter

focuses on optimal mating designs through consideration of three common mating designs (half-

sib, half-diallel, and circular with four crosses per parent) for estimation of genetic parameters

with a field design extending across multiple locations.

In this chapter the approach to the optimal design problem is to maintain the basic field

design within locations as randomized complete block with four blocks and a six-tree row-plot

representing each genetic entry within a block (noted as one of the most common field designs

by Loo-Dinkins et al. 1990). The number of families in a block, number of locations, mating

design and number of parents within a mating design are allowed to change. Since optimality,

besides being a function of the field and mating designs, is also a function of the underlying

genetic parameters, all designs are examined across a range of levels of genetic determination (as

varying levels of heritability, genotype by environment interaction and dominance) reflecting

estimates for many economically important traits in conifers (Campbell 1972, Stonecypher et al.

1973, Snyder and Namkoong 1978, Foster 1986, Foster and Bridgwater 1986, Hodge and White

(in press)).

For each design and level of genetic determination, a Minimum Variance Quadratic

Unbiased Estimation (MIVQUE) technique and an approximation of the variance of a ratio

(Kendall and Stuart 1963, Giesbrecht 1983 and McCutchan et al. 1989) are applied to estimate

the variance of estimates of heritability, additive to additive plus additive by environment variance

ratio, and dominance to additive variance ratio. These techniques use the true covariance matrix

of the variance component estimates (utilizing only the known parameters and the test design and

precluding the need for simulated or real data) and a Taylor series approximation of the variance

of a ratio. The relative efficiencies of different test designs are compared on the basis of i (the







6

efficiency of an individual observation in reducing the variance of an estimate, Pederson 1972).

Thus this research explores which mating design, number of parents and number of locations is

most efficient per unit of observation in estimating heritability, additive to additive plus additive

by environment variance ratio, and dominance to additive variance ratio for several variance

structures representative of coniferous growth traits.


Methods


Assumptions Concerning Block Size


As opposed to McCutchan et al. (1985), where block sizes were held constant and

including more families resulted in fewer observations per family per block, in this chapter the

blocks are allowed to expand to accommodate increasing numbers of families. This expansion is

allowed without increasing either the variance among blocks or the variance within blocks. For

the three mating designs which are discussed, the addition of one parent to the half-sib design

increases block size by 6 trees (plot for a half-sib family), the addition of a parent to the circular

design increases block size by 12 trees (two plots for full-sib families), and the addition of a

parent to the half-diallel design increases block size by 6p trees (where p is the number of parents before the addition, i.e., there are p new full-sib families per block). Therefore, block size is

determined by the mating design and the number of parents.
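As a concrete illustration of these block-size rules, the short Python sketch below (purely illustrative; the dissertation's own software is written in Fortran, and the function name and layout here are invented) computes trees per block for each mating design from the number of parents, assuming one six-tree row-plot per genetic entry per block, four crosses per parent for the circular design, and p(p-1)/2 crosses for the half-diallel.

    def trees_per_block(design, parents, trees_per_plot=6):
        # Genetic entries per block for the three mating designs discussed here.
        if design == "half-sib":
            entries = parents                          # one half-sib family per parent
        elif design == "circular":
            entries = 2 * parents                      # four crosses per parent, two parents per cross
        elif design == "half-diallel":
            entries = parents * (parents - 1) // 2     # all pairwise crosses, no selfs or reciprocals
        else:
            raise ValueError("unknown design")
        return entries * trees_per_plot

    # Adding one parent to a six-parent half-diallel adds 6p = 36 trees per block:
    print(trees_per_block("half-diallel", 7) - trees_per_block("half-diallel", 6))   # 36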

All comparisons among mating designs and numbers of locations are for equal block

sizes, i.e., equal numbers of observations per location. This results in comparing mating designs

with unequal numbers of parents in the designs and comparing two location experiments against

five location experiments with equal numbers of observations per location but unequal total

numbers of observations.








The Use of Efficiency (i)


Efficiency is the tool by which comparisons are made and is the efficacy of the individual

observations in an experiment in lowering the variance of parameter estimates. An increasing

efficiency indicates that for increasing experimental size the additional observations have

enhanced the variance reducing property of all observations. Efficiency is calculated as i = 1 / (N·Var(x)), where N is the total number of observations and Var(x) is the variance of a generic

parameter estimate. Increasing N always results in a reduction of the variance of estimation, all

other things being equal. Yet the change in efficiency with increasing N is dependent on whether

the reduction in variance is adequate to offset the increase in N which caused the reduction.

Comparing a previous efficiency with that obtained by increasing N, i.e., increasing the number

of parents in a mating design or increasing the number of locations in which an experiment is

planted:

since i_o = 1 / (N·Var(x)),        2-1

then N·Var(x) = 1 / i_o,

and (N + ΔN)(Var(x) - ΔVar(x)) = 1 / i_n;

if i_o (the old efficiency) = i_n (the new efficiency),

then ΔVar(x) / Var(x) = ΔN / (N + ΔN);

if i_o < i_n, then ΔVar(x) / Var(x) > ΔN / (N + ΔN);

and if i_o > i_n, then ΔVar(x) / Var(x) < ΔN / (N + ΔN);

where Δ denotes the magnitude of the change.

Viewing equation 2-1, if N is held constant and one design has a higher efficiency (i), the design

must also produce parameter estimates which have a lower variance.
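A minimal numerical sketch of this criterion (Python, purely illustrative; the observation counts and variances below are hypothetical, not taken from the study):

    def efficiency(n_obs, var_estimate):
        # i = 1 / (N * Var(x)) for a generic parameter estimate
        return 1.0 / (n_obs * var_estimate)

    # Hypothetical case: doubling N from 720 to 1440 lowers the variance of the
    # estimate from 0.010 to 0.006, so precision improves, but efficiency falls:
    # the added observations reduce variance less than proportionately.
    print(efficiency(720, 0.010))    # about 0.139
    print(efficiency(1440, 0.006))   # about 0.116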









General Methodology


Sets of true variance components are calculated in accordance with a stated level of

genetic control and the design matrix is generated in correspondence with the field and mating

design. Knowing the design matrix and a set of true variance components, a true covariance

matrix of variance component estimates is generated. Once the covariance matrix

of the variance components is in hand, the variance of and covariances between any linear

combinations of the variance component estimates are calculated. From the covariance matrix

for linear combinations, the variance of genetic ratios as ratios of linear combinations of variance

components are approximated by a Taylor series expansion. Since definition of a set of variance

components and formation of the design matrix are dependent on the linear model employed,

discussion of specific methodology begins with linear models.


Linear Models


Half-diallel and circular designs

The scalar linear model employed for half-diallel and circular mating designs is

y_ijklm = μ + t_i + b_ij + g_k + g_l + s_kl + tg_ik + tg_il + ts_ikl + p_ijkl + w_ijklm        2-2

where y_ijklm is the mth observation of the klth cross in the jth block of the ith test;

    μ is the population mean;

    t_i is the random variable test environment ~ NID(0, σ²_t);

    b_ij is the random variable block ~ NID(0, σ²_b);

    g_k is the random variable female general combining ability (gca) ~ NID(0, σ²_gca);

    g_l is the random variable male gca ~ NID(0, σ²_gca);

    s_kl is the random variable specific combining ability (sca) ~ NID(0, σ²_sca);

    tg_ik is the random variable test by female gca interaction ~ NID(0, σ²_tg);

    tg_il is the random variable test by male gca interaction ~ NID(0, σ²_tg);

    ts_ikl is the random variable test by sca interaction ~ NID(0, σ²_ts);

    p_ijkl is the random variable plot ~ NID(0, σ²_p);

    w_ijklm is the random variable within plot ~ NID(0, σ²_w); and

    there is no covariance between random variables in the model.

This linear model in matrix notation is (dimensions below each model component):

y   =  1μ   + Z_T e_T  + Z_B e_B  + Z_G e_G  + Z_S e_S  + Z_TG e_TG  + Z_TS e_TS  + Z_P e_P  + e_W        2-3
nx1    nx1    nxt tx1    nxb bx1    nxg gx1    nxs sx1    nxtg tgx1     nxts tsx1    nxp px1    nx1

where y is the observation vector;

    Z_i is the portion of the design matrix for the ith random variable;

    e_i is the vector of unobservable random effects for the ith random variable;

    1 is a vector of 1's; and

    n, t, b, g, s, tg, ts, and p are the number of observations, tests, blocks, gca's, sca's,
    test by gca interactions, test by sca interactions and plots, respectively.

Utilizing customary assumptions in half-diallel mating designs (Method 4, Griffing 1956), the
variance of an individual observation is

Var(y_ijklm) = σ²_t + σ²_b + 2σ²_gca + σ²_sca + 2σ²_tg + σ²_ts + σ²_p + σ²_w;        2-4

and in matrix notation the covariance matrix for the observations is

Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_gca + Z_S Z_S'σ²_sca + Z_TG Z_TG'σ²_tg
         + Z_TS Z_TS'σ²_ts + Z_P Z_P'σ²_p + I_n σ²_w        2-5

where ' indicates the transpose operator, all matrices of the form Z_i Z_i' are nxn, and I_n is an
nxn identity matrix.

Half-sib design

The scalar linear model for half-sib mating designs is

y_ijkm = μ + t_i + b_ij + g_k + tg_ik + p*_ijk + w*_ijkm        2-6

where y_ijkm is the mth observation of the kth half-sib family in the jth block of the ith test;

    μ, t_i, b_ij, g_k, and tg_ik retain the definitions in Eq. 2-2;

    p*_ijk is the random variable plot, containing different genotype by environment
    components than Eq. 2-2, ~ NID(0, σ²_p*);

    w*_ijkm is the random variable within plot, containing different levels of genotypic and
    genotype by environment components than Eq. 2-2, ~ NID(0, σ²_w*); and

    there is no covariance between random variables in the model.

The matrix notation model is

y   =  1μ   + Z_T e_T  + Z_B e_B  + Z_G e_G  + Z_TG e_TG  + Z_P e_P  + e_W        2-7
nx1    nx1    nxt tx1    nxb bx1    nxg gx1    nxtg tgx1     nxp px1    nx1

The variance of an individual observation in half-sib designs is

Var(y_ijkm) = σ²_t + σ²_b + σ²_gca + σ²_tg + σ²_p* + σ²_w*;        2-8

and Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_gca + Z_TG Z_TG'σ²_tg + Z_P Z_P'σ²_p* + I_n σ²_w*        2-9
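The covariance structures in Eqs. 2-5 and 2-9 can be assembled directly from the incidence matrices. The sketch below (NumPy; purely illustrative, with invented toy dimensions rather than the actual field design) builds V as the weighted sum of Z_i Z_i' matrices plus the within-plot identity term; note that in the full-sib gca incidence matrix each observation carries a 1 for both parents of its cross.

    import numpy as np

    def build_V(Z_list, sigma2_list, sigma2_within):
        # V = sum_i Z_i Z_i' * sigma2_i + I_n * sigma2_within   (form of Eqs. 2-5 and 2-9)
        n = Z_list[0].shape[0]
        V = np.eye(n) * sigma2_within
        for Z, s2 in zip(Z_list, sigma2_list):
            V += (Z @ Z.T) * s2
        return V

    # Toy example: 4 observations in 2 tests; 3 parents; each full-sib observation
    # has two 1's in the gca incidence matrix (female and male parent of the cross).
    Z_test = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
    Z_gca  = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]], dtype=float)
    V = build_V([Z_test, Z_gca], [1.0, 0.25], 8.0)
    print(V.shape)   # (4, 4)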


Levels of Genetic Determination


Eight levels of genetic determination are derived from a factorial combination of two

levels of each of three genetic ratios: heritability (h² = 4σ²_gca / (2σ²_gca + σ²_sca + 2σ²_tg + σ²_ts + σ²_p + σ²_w) for full-sib models and h² = 4σ²_gca / (σ²_gca + σ²_tg + σ²_p* + σ²_w*) for half-sib models); additive to additive plus additive by environment variance ratio (rB = σ²_gca / (σ²_gca + σ²_tg), Type B correlation of Burdon 1977); and dominance to additive variance ratio (γ = σ²_sca / σ²_gca). The levels employed for each ratio are h² = 0.1 and 0.25; rB = 0.5 and 0.8; and γ = 0.25 and 1.0.

To generate sets of true variance components (Table 2-1) for half-diallel and circular mating designs from the factorial combinations of genetic parameters, the denominator of h² is set to 10 (arbitrarily, but without loss of generality) which, given the level of h², leads to the solution for σ²_gca. Solving for σ²_gca and knowing γ yields the value for σ²_sca. Knowing the level of rB and σ²_gca allows the equation for rB to be solved for σ²_tg. An assumption that the ratio of γ equals the ratio of σ²_ts / σ²_tg permits a solution for σ²_ts. A further assumption that σ²_p is seven percent of σ²_p + σ²_w yields a solution for both σ²_p and σ²_w. Finally, σ²_t and σ²_b are set to 1.0 and 0.5, respectively, for all treatment levels.

Table 2-1. Parametric variance components for the factorial combination of heritability (.1 and .25), Type B correlation (.5 and .8) and dominance to additive variance ratio (.25 and 1.0) for full and half-sib designs. σ²_t and σ²_b were maintained at 1.0 and .5, respectively, for all levels and designs.

Design  Level    h²    rB   γ     σ²_gca  σ²_sca  σ²_tg   σ²_ts   σ²_p    σ²_w
Full    1        .1    .8   1.0   .2500   .2500   .0625   .0625   .6344   8.4281
        2        .1    .5   1.0   .2500   .2500   .2500   .2500   .5950   7.9050
        3        .1    .8   .25   .2500   .0625   .0625   .0156   .6508   8.6461
        4        .1    .5   .25   .2500   .0625   .2500   .0625   .6212   8.2538
        5        .25   .8   1.0   .6250   .6250   .1562   .1562   .5359   7.1203
        6        .25   .5   1.0   .6250   .6250   .6250   .6250   .4376   5.8125
        7        .25   .8   .25   .6250   .1562   .1562   .0391   .5769   7.6649
        8        .25   .5   .25   .6250   .1562   .6250   .1562   .5031   6.6844
Half    1 and 3  .1    .8         .2500           .0625           .4844   9.2031
        2 and 4  .1    .5         .2500           .2500           .4750   9.0250
        5 and 7  .25   .8         .6250           .1562           .4609   8.7579
        6 and 8  .25   .5         .6250           .6250           .4375   8.3125
(For the half-sib rows, the plot and within-plot columns are σ²_p* and σ²_w*, and σ²_sca and σ²_ts do not apply.)

In order to facilitate comparisons of half-sib mating designs with full-sib mating designs, σ²_gca and σ²_tg retain the same values for given levels of h² and rB, and the denominator of heritability again is set to 10. To solve for σ²_p* and σ²_w*, the assumption that σ²_p* is five percent of σ²_p* + σ²_w* permits a solution for σ²_p* and σ²_w* and maintains σ²_p* approximately equal to and no larger than σ²_p of the full-sib mating designs (Namkoong et al. 1966) for the same levels of h² and rB. Under the previous definitions, all consideration of differences in γ changing the magnitudes of σ²_p* and σ²_w* is disallowed. Thus, there are only four parameter sets for the half-sib mating design (Table 2-1).
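The back-solution from (h², rB, γ) to a full-sib parameter set can be written in a few lines. The sketch below (illustrative Python; the function name is invented) follows the rules just described, with the phenotypic denominator fixed at 10 and plot variance set to seven percent of plot plus within-plot variance, and it reproduces level 1 of Table 2-1.

    def full_sib_components(h2, rB, gamma, denom=10.0, plot_frac=0.07):
        gca = denom * h2 / 4.0                 # h2 = 4*gca / denom
        sca = gamma * gca                      # gamma = sca / gca
        tg  = gca * (1.0 - rB) / rB            # rB = gca / (gca + tg)
        ts  = gamma * tg                       # assumed: ts / tg = gamma
        rest = denom - (2.0 * gca + sca + 2.0 * tg + ts)
        return gca, sca, tg, ts, plot_frac * rest, (1.0 - plot_frac) * rest

    print(full_sib_components(0.10, 0.8, 1.0))
    # (0.25, 0.25, 0.0625, 0.0625, 0.634375, 8.428125)  -> level 1 of Table 2-1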


Covariance Matrix for Variance Components


The base algorithm to produce the covariance matrix for variance component estimates

is from Giesbrecht (1983) and was rewritten in Fortran for ease of handling the study data. In

using this algorithm, we assume that all random variables are independent and normally

distributed and that the true variances of the random variables are known. Under these

assumptions, Minimum Norm Quadratic Unbiased Estimation (MINQUE, Rao 1972) using the

true variance components as priors (the starting point for the algorithm) becomes MIVQUE (Rao

1971b), which requires normality and the true variance components as priors (Searle 1987), and

for a given design the covariance matrix of the variance component estimates becomes fixed. A

sketch of the steps from the MIVQUE equation (Eq.2-10, Giesbrecht 1983, Searle 1987) to the

true covariance matrix for variance component estimates is

{tr(Q V_i Q V_j)} σ̂²  =  {y'Q V_i Q y}        2-10
      rxr          rx1         rx1

then σ̂² = {tr(Q V_i Q V_j)}⁻¹ {y'Q V_i Q y}

and Var(σ̂²) = {tr(Q V_i Q V_j)}⁻¹ Var({y'Q V_i Q y}) {tr(Q V_i Q V_j)}⁻¹
      rxr            rxr                  rxr                   rxr

where {a_ij} is a matrix whose elements are a_ij, where in the full-sib designs i = 1 to 8 and
    j = 1 to 8, i.e., there is a row and column for every random variable in the linear model;

    tr is the trace operator, that is, the sum of the diagonal elements of a matrix;

    Q = V⁻¹ - V⁻¹X(X'V⁻¹X)⁻X'V⁻¹ for V = the covariance matrix of y and X as the design
    matrix for fixed effects;

    V_i = Z_i Z_i', where i indexes the random variables test, block, etc.;

    σ̂² is the vector of variance component estimates; and

    r is the number of random variables in the model.

The variance of a quadratic form (where A is any non-negative definite matrix of proper dimension) under normality is Var(y'Ay) = 2tr(AVAV) + 4μ'AVAμ, where μ = E(y) (Searle 1987); however, the MINQUE derivation (Rao 1971a) requires that AX = 0, which in our case is A1 = 0 and forces the mean term to zero, thus

Var({y'Q V_i Q y}) = 2{tr(Q V_i Q V_j)};        2-11

and using Eq. 2-10 and Eq. 2-11,

Var(σ̂²) = {tr(Q V_i Q V_j)}⁻¹ 2{tr(Q V_i Q V_j)} {tr(Q V_i Q V_j)}⁻¹

and therefore Var(σ̂²) = V_c = 2{tr(Q V_i Q V_j)}⁻¹.        2-12

From Eq.2-12 it is seen that the MIVQUE covariance matrix of the variance component estimates

is dependent only on the design matrix (the result of the field design and mating design) and the

true variance components; a data vector is not needed.
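A compact sketch of Eq. 2-12 (illustrative NumPy, not the dissertation's Fortran implementation of Giesbrecht's algorithm; X, the Z matrices and the true components would come from the actual field and mating design):

    import numpy as np

    def mivque_covariance(X, Z_list, sigma2_list, sigma2_within):
        # Var(sigma2_hat) = 2 * {tr(Q Vi Q Vj)}^-1  (Eq. 2-12); no data vector is needed.
        n = X.shape[0]
        V_parts = [Z @ Z.T for Z in Z_list] + [np.eye(n)]       # Vi = Zi Zi'; last term is I_n
        V = sum(s2 * Vp for s2, Vp in zip(list(sigma2_list) + [sigma2_within], V_parts))
        Vinv = np.linalg.inv(V)
        Q = Vinv - Vinv @ X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv
        r = len(V_parts)
        S = np.array([[np.trace(Q @ V_parts[i] @ Q @ V_parts[j]) for j in range(r)]
                      for i in range(r)])
        return 2.0 * np.linalg.inv(S)

    # For the designs in this chapter X is simply a vector of 1's (a fixed mean only),
    # so Q1 = 0; the (i, i) element of the result is the MIVQUE variance of component i.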


Covariance Matrix for Linear Combinations of Variance Components and Variance of a Ratio


Once the covariance matrix for the variance component estimates (Eq.2-12) is created,

then the covariance matrix of linear combinations of these variance components is formed as

V_L = L' V_c L        2-13
2x2   2xr rxr rx2







14

where L specifies the linear combinations of the variance components which are the

combinations of variance components in the denominator and numerator of the genetic ratio being

estimated. A Taylor series expansion (first approximation) for the variance of a ratio using the

variances of and covariance between numerator and denominator is then applied using the

elements of V_L to produce the approximate variance of the three ratio estimates as (Kendall and
Stuart 1963):

Var(ratio) ≈ (1/D²)(V_L(1,1)) - 2(N/D³)(V_L(1,2)) + (N²/D⁴)(V_L(2,2))        2-14

where the generic ratio is N/D and N and D are the parametric values;

    V_L(1,1) is the variance of N;

    V_L(1,2) is the covariance between N and D; and

    V_L(2,2) is the variance of D.
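In code, Eqs. 2-13 and 2-14 reduce to one quadratic form and three terms. The sketch below is illustrative only (the inputs are hypothetical): Vc is the component covariance matrix of Eq. 2-12, the columns of L give the numerator and denominator as linear combinations of the components, and N and D are their parametric values.

    import numpy as np

    def var_of_ratio(Vc, L, N, D):
        VL = L.T @ Vc @ L      # Eq. 2-13: 2 x 2 covariance of numerator and denominator
        return VL[0, 0] / D**2 - 2.0 * N * VL[0, 1] / D**3 + (N**2) * VL[1, 1] / D**4

    # Hypothetical two-component case (numerator = 4*gca, denominator = 2*gca + rest):
    Vc = np.array([[0.010, 0.002], [0.002, 0.400]])
    L  = np.array([[4.0, 2.0], [0.0, 1.0]])
    print(var_of_ratio(Vc, L, N=1.0, D=10.0))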


Comparison Among Estimates of Variances of Ratios


The approximate variances of the three ratio estimates (h², rB, and γ) are compared across

mating designs with equal (or approximately equal) numbers of observations, across numbers of

locations, and across numbers of parents within a mating design all within a level of genetic

determination. The standard for comparison is i. Results are presented by the genetic ratio

estimated so that direct comparisons may be made among the mating designs for equal numbers

of observations within a number of locations for varying levels of genetic control. Number of

genetic entries (number of crosses for full-sib designs and number of half-sib families for half-sib

designs) is used as a proxy for number of observations since, for all designs, number of

observations equals twenty-four times the number of locations times the number of genetic

entries. Further, by plotting the two levels of numbers of locations on a single figure, a







15

comparison is made of the utility of replication of a design across increasing numbers of

locations.

Efficiency plots also permit contrasts of the absolute magnitude of variance of estimation

among designs. For a given number of genetic entries and locations, the design with the highest

efficiency is the most precise (lowest variance of estimation). Increasing the number of genetic

entries or locations always results in greater precision (lower variance of estimation), but is not

necessarily as efficient (the reduction in variance was not sufficient to offset the increase in

numbers of observations). A primary justification for using the efficiency of a design as a

criterion is that a more precise estimate of a genetic ratio is obtained by using the mean of two

estimates from replication of the small design as two disconnected experiments as opposed to the

estimate from single large design. This is true when 1) the number of observations in the large

design (N) equals twice the number of observations in the small design (n), 2) the small design is

more efficient, and 3) the variances are homogeneous. This is proven below:

Since N = n1 + n2

and n1 = n2,

then N = 2n1.

By definition i = 1 / (N·Var(Ratio));

and Var(Ratio) = 1 / (i·N).

The proposition is (Var1(Ratio) + Var2(Ratio)) / 4.0 < VarL(Ratio);

substitution gives ((1/(n1·i1)) + (1/(n2·i2))) / 4.0 < 1/(N·iL).

Simplification yields 1/(2.0·n1·i1) < 1/(N·iL);

and multiplication by N produces 1/i1 < 1/iL,        2-15

which is strictly true so long as i1 > iL, where i1 is the efficiency of the smaller experiment and
iL is the efficiency of the larger experiment.
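A quick numerical check of the proposition (hypothetical numbers, not study results):

    # Small design: n = 720 observations at efficiency 0.14; large design: N = 1440 at 0.12.
    n, i_small = 720, 0.14
    N, i_large = 1440, 0.12
    var_small = 1.0 / (n * i_small)          # variance of one small-design estimate
    var_mean_of_two = var_small / 2.0        # mean of two disconnected small-design estimates
    var_large = 1.0 / (N * i_large)
    print(var_mean_of_two, var_large)        # about 0.00496 vs 0.00579
    print(var_mean_of_two < var_large)       # True: the disconnected small designs win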
















[Figure 2-1. Efficiency (i) for h² plotted against number of genetic entries for eight levels of genetic control for circular, half-diallel, and half-sib mating designs across levels of location, where i = 1/(N(Var(h²))) and N = the total number of observations.]









Results


Heritability


Half-sib designs are almost globally superior to the two full-sib designs in precision of

heritability estimates (results not shown for variance but may be seen from efficiencies in Figure

2-1). For designs of equal size, half-sib designs excel with the exception of genetic level three

(Figure 2-1c, h² = 0.1, rB = 0.8, and γ = 0.25). In genetic level three, the circular design

provides the most precise estimate of h2 for two location designs; however, when the design is

extended across five locations, the half-sib mating design again provides the most precise

estimates. The circular mating design is superior in precision to the half-diallel design across all

levels of genetic control and location, even with a relatively large number of crosses per parent

(four).

Half-sib designs are, in general, (seven genetic control levels out of eight, Figure 2-1)

more efficient with the exception of level three across two locations (Figure 2-1c). For the

circular and half-sib mating designs considered, increasing the number of genetic entries always

improves the efficiency of the design. However, definite optima exist for the half-diallel mating

design for number of genetic entries, i.e., crosses which convert to a specific number of parents.

These optima are not constant but tend to be six parents or fewer, decreasing with increasing h² or

number of locations. The six-parent half-diallel is never far from the half-diallel optima, and

increasing the number of parents past the optimum results in decreased efficiency.

For half-sib designs with h2 = 0.1, five locations are more efficient than two locations;

however, at h2 = 0.25 two locations are most efficient. Further, the number of locations

required to efficiently estimate h2 for half-sib designs is determined only by the level of h2 and

does not depend on the levels of the other ratios. Although estimates over larger numbers of







18

observations are more precise (five-location estimates are more precise than two-location

estimates), the efficiency (increase in precision per unit observation) declines. Thus, if h² = 0.25 and estimates of a certain precision are required, disconnected sets of two-location experiments are preferred to five-location experiments. The relative efficiency of five locations versus two locations is enhanced with decreasing rB (increasing genotype by environment interaction) within a level of h² (compare Figures 2-1a to 2-1b and 2-1c to 2-1d for h² = 0.1, and 2-1e to 2-1f and 2-1g to 2-1h for h² = 0.25). Yet, this enhancement is not sufficient to cause

a change in efficiency ranking between the location levels.

The full-sib designs differ markedly from this pattern (Figure 2-1) in that, for these

parameter levels, it is never more efficient to increase the number of locations from two to five

for heritability estimation. As observed with half-sib designs, for full-sib designs the relative

efficiency status of five locations improves with decreasing rB. To further contrast mating designs, note that the efficiency status of full-sib designs relative to the half-sib design improves with decreasing γ and increasing rB (Figures 2-1b versus 2-1c and 2-1f versus 2-1g).


Type B Correlation


As opposed to h2 estimation, no mating design performs at or near the optima for

precision of rB estimates across all levels of genetic control (Figure 2-2). However, the circular

mating designs produce globally more precise estimates than those of the half-diallel mating

design. In general, the utility of full-sib versus half-sib designs is dependent on the level of rB. The lower rB value favors half-sib designs while the higher rB tends to favor full-sib designs (compare Figures 2-2a to 2-2b, 2-2c to 2-2d, 2-2e to 2-2f and 2-2g to 2-2h). Decreasing γ and lowering h² always improves the efficiency of full-sib designs relative to half-sib designs (compare Figures 2-2c and 2-2d to 2-2e and 2-2f).











[Figure 2-2. Efficiency (i) for rB plotted against number of genetic entries for eight levels of genetic control for circular, half-diallel, and half-sib mating designs across levels of location, where i = 1/(N(Var(rB))) and N = the total number of observations.]


[Figure 2-3. Efficiency (i) for γ plotted against number of genetic entries for four levels of genetic control for circular, half-diallel, and half-sib mating designs across levels of location, where i = 1/(N(Var(γ))) and N = the total number of observations. Panels: (3a) h² = .1, rB = .5, γ = 1.0; (3b) h² = .1, rB = .5, γ = .25; (3c) h² = .25, rB = .8, γ = 1.0; (3d) h² = .25, rB = .8, γ = .25; horizontal axis: genetic entries.]

For estimation of rB, full-sib designs are more efficient than half-sib designs except in the three cases of low rB (0.5) and high γ (1.0) for h² = 0.1 (Figure 2-2b) and low rB for h² = 0.25

(Figures 2-2f and 2-2h). Within full-sib designs the circular design is globally superior to the

half-diallel. As with h2 estimation, half-diallel designs have optimal levels for numbers of

parents. The six-parent half-diallel is again close to these optima for all genetic levels and

numbers of locations.

At low h2 for full-sib designs, planting in two locations is always more efficient than five

locations. For half-sib designs at low h2, the relative efficiency of two versus five locations is

dependent on the level of rB, with lower rB favoring replication across more locations. At h² =

0.25, half-sib designs are more efficient when replicated across five locations. At the higher h2

value, full-sib design efficiency across locations is dependent on the level of rB. With rB = 0.5

and h2 = 0.25, replication of full-sib designs is for the first time more efficient across five

locations than across two locations; however, at the higher rB level two locations is again the

preferred number.


Dominance to Additive Variance Ratio


In comparing the two full-sib designs for relative efficiency in estimating γ, the circular

design is always approximately equal to or, for most cases, superior to the half-diallel design

(Figure 2-3). The relative superiority of the circular design is enhanced by decreasing γ and rB

(not shown). The half-diallel design again demonstrates optima for number of parents with the

six-parent design being near optimal. Within a mating design the use of two locations is always

more efficient than the use of five locations. The magnitude of this superiority escalates with

increasing rB and h2 (Figures 2-3a and 2-3b versus 2-3c and 2-3d).








Discussion


Comparison of Mating Designs


Prior knowledge of genetic control is required to choose the optimal mating and field design for estimation of h², rB and γ. Given that such knowledge may not be available, the

choices are then based on the most robust mating designs and field designs for the estimation of

certain of the genetic ratios. If h2 is the only ratio desired, then the half-sib mating design is

best. Estimation of both h2 and rB requires a choice between the half-sib and circular designs.

If there is no prior knowledge then the selection of a mating design is dependent on which ratio

has the highest priority. For experiments in which h2 received highest weighting, the half-sib

design is preferred and in the alternative case the circular design is the better choice. In the last

scenario information on all three ratios is desired from the same experiment and in this case the

circular design is the better selection since the circular design is almost globally more efficient

than the half-diallel design.

After choosing a mating design, the next decision is how many locations per experiment

are required to optimize efficiency. For the half-sib design the number of locations required to

optimize efficiency is dependent on both the ratio being estimated and the level of genetic control.

A broad inference is that for h2 estimation a two location experiment is more efficient and for rB

a five location experiment has the better efficiency. Estimation of any of the three ratios with

a full-sib design is almost globally more efficient in two location experiments.

The disparity between the behavior of the half-sib and full-sib designs with respect to the

efficiency of location levels can be explained in terms of the genetic connectedness offered by the

different designs. Genetic connectedness can be viewed as commonality of parentage among

genetic entries. The more entries having a common parent the more connectedness is present.






23

The half-sib design is only connected across locations by the one common parent in a half-sib

family in each replication. Full-sib designs are connected across locations in each replication by

the full-sib cross plus the number of parents minus two (half-diallel) or three (circular) for each

of the two parents in a cross. The connectedness in a full-sib design means each observation is

providing information about many other observations. The result of this connectedness is that,

in general, fewer observations (number of locations) are required for maximum efficiency.


A General Approach to the Estimation Problem


The estimation problems may be viewed in a broader context than the specific solutions

in this chapter. The technique for comparison of mating designs and numbers of locations across

levels of genetic determination may be construed, for the case of h2 estimation, to be the effect

of these factors on the variance of o2g, estimates. Viewing the variance approximation formula,

the conclusion may be reached that the variance of o2, estimates is the controlling factor in the

variance of h2 estimates since the other factors at these heritability levels are multiplied by

constants which reduce their impact dramatically. Given this conclusion, the variance of h2

estimates is essentially the (3,3) element in 2{tr(QVQVj)})1 (Eq. 2-11). Further, since the

covariances of the other variance component estimates with oe estimates are small, the variance

of o2g, estimates is basically determined by the magnitude of the (3,3) element of {tr(QVQVj)}

which is tr(QVQV). Thus, the variance of h2 estimates is minimized by maximizing

tr(QVgQVg with h2 used as an illustration because this simplification is possible.

Considering the impact of changing levels of genetic control, while holding the mating

and field designs constant, V, is fixed, the diagonal elements of V are fixed at 11.5 because of

our assumptions, and only the off-diagonal elements of V change with genetic control levels.

Since Q is a direct function of V', what we observe in Figure 2-1 comparing a design across






24

levels of genetic control are changes in V" brought about by changes in the magnitude of the off-

diagonal elements of V (covariances among observations). The effect of positive (the linear

model specifies that all off-diagonal elements in V are zero or positive) off-diagonal elements on

V' is to reduce the magnitude of the diagonal elements and often also result in negative off-

diagonal elements. If one increases the magnitude of the off-diagonal elements in V, then the

magnitude of the diagonal elements of V' is reduced and the magnitude of negative off-diagonal

elements is increased. Since tr(QVQV) is the sum of the squared elements of the product of

a direct function of VY and a matrix of non-negative constants (V), as the diagonal elements of

V' are reduced and the off-diagonal elements become more negative, tr(QVQV) must become

smaller and the variance of h2 estimates increases.

Mating designs may be compared by the same type of reasoning. Within a constant field

design changes in mating design produce alterations in V. Of the three designs the half-sib

produces a V matrix with the most zero off-diagonal elements, the circular design next, and the

half-diallel the fewest number of zero off-diagonal elements. Knowing the effect of off-diagonal

elements on the variance of h2 estimates, one could surmise that the variance of estimates is

reduced in the order of least to most non-zero off-diagonal elements. This tenet is in basic

agreement with the results in Figures 2-1 through 2-3.

The effects of rB and γ on the variance of h² estimates can also be interpreted utilizing

the above approach. In the results section of this chapter it is noted that decreasing the magnitude

of rB and/or γ causes full-sib designs to rise in efficiency relative to the half-sib design. In

accordance with our previous arguments this would be expected since decreasing the magnitude

of those two ratios causes a decrease in the magnitude of off-diagonal elements. More precisely,

decreasing γ results in the reduction of off-diagonal elements in V of the full-sib designs while not affecting the half-sib design, and decreasing rB results in the reduction of off-diagonal






25

elements in V of full-sib and half-sib designs. Relative increases in efficiency of full-sib designs

result from the elements due to location by additive interaction occurring much less frequently

in the half-sib designs; thus, the relative impact of reduction in rB in half-sib designs is less than

that for full-sibs.


Use of the Variance of a Ratio Approximation


Use of Kendall and Stuart's (1963) first approximation (first-term Taylor series

approximation) of the variance of a ratio has two major caveats. The approximation depends on

large sample properties to approach the true variance of the ratio, i.e., with a small number of

levels for random variables the approximation does not necessarily closely approximate the true

variance of the ratio. Work by Pederson (1972) suggests that for approximating the variance of

h2 at least ten parents are required in diallels before the approximation will converge to the true

variance even after including Taylor series terms past the first derivative. Pederson's work also

suggests that the approximation is progressively worse for increasing heritability with low

numbers of parents. Using the field design in this chapter (two locations, four blocks and six-tree

row-plots), simulation work (10,000 data sets) has demonstrated that, with a heritability of 0.1 using four parents in a half-diallel across two locations, the variance of a ratio approximation yields a variance estimate for h² of 0.1 while the convergent value for the simulation was 0.08

(Huber unpublished data). One should remember the dependence of the first approximation of

the variance of a ratio on large sample properties when applying the technique to real data.

The second caveat is that the range of estimates of the denominator of the ratio cannot

pass through zero (Kendall and Stuart 1963). This constraint is of no concern for h2; however,

the structure of the rB and γ denominators allows unbiased minimum variance estimates of those

denominators to pass through zero which means at one point in the distribution of the estimates






26

of the ratios they are undefined (the distributions of these ratio estimates are not continuous).

Simulation has shown that the variances of rB and γ are much greater than the approximation

would indicate (Huber unpublished data). The discrepancy in variance of the estimates could be

partially alleviated through using a variance component estimation technique which restricts

estimates to the parameter space 0 ≤ σ² < ∞. Nevertheless, because of the two caveats, approximations of the variance of h², rB and γ estimates should be viewed only on a relative basis

for comparisons among designs and not on an absolute scale.

Additionally, the expectation of a ratio does not equal the ratio of the expectations (Hogg

and Craig 1978). If a value of genetic ratios is sought so that the value equals the ratio of the

expectations, then the appropriate way to calculate the ratio would be to take the mean of

variance components or linear combinations of variance components across many experiments and

then take the ratio. If the value sought for h2 is the expectation of the ratio, then taking the mean

of many h2 estimates is the appropriate approach. Returning to the results from simulated data

(10,000 data sets) where the h2 value was set at 0.1, using the ratio of the means of variance

components rendered a value of 0.1 for h2, the mean of the h2 estimates returned a value of 0.08,

and a Taylor series approximation of the mean of the ratio yielded 0.07 (Pederson 1972).
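The general point, that the expectation of a ratio is not the ratio of the expectations, is easy to reproduce with made-up distributions (the sketch below does not attempt to recreate the dissertation's simulation; the distributions are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical noisy numerator and strictly positive denominator estimates.
    num = rng.normal(1.0, 0.3, size=100_000)
    den = rng.gamma(shape=10.0, scale=1.0, size=100_000)   # mean 10, always positive
    print(np.mean(num) / np.mean(den))   # ratio of the means: about 0.100
    print(np.mean(num / den))            # mean of the ratios: about 0.111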


Conclusions


Results from this study should be interpreted as relative comparisons of the levels of the

factors investigated. However, viewing the optimal design problem as illustrated in the

discussion section of this chapter can provide insight to the more general problem.

There is no globally most efficient number of locations, parents or mating design for the

three ratios estimated even within the restricted range of this study; yet, some general conclusions

can be drawn. For estimating h2 the half-sib design is always optimal or close to optimal in






27

terms of variance of estimation and efficiency. In the estimation of rB and γ, the circular mating

design is always optimal or near optimal in variance reduction and efficiency. Across numbers

of parents within a mating design only the half-diallel shows optima for efficiency. The other

mating designs have non-decreasing efficiency plots over the levels of number of parents, so that

while there is an optimal number of locations for a level of genetic control, the number of genetic

entries per location is limited more by operational than efficiency constraints.

Two locations is a near global optimum over five locations for the full-sib mating designs.

Within the half-sib mating design optimality depends on the levels of h2 and rB: 1) for h2

estimation the optimal number of locations is inversely related to the level of h2, i.e. at the higher

level two tests were optimal and at the lower level five tests were optimal; and 2) for rB

estimation for the half-sib design, the optimal number of locations was also inversely related to

the level of rB.

Means of estimates from disconnected sets provide lower variance of estimation where

the smaller experiments have higher efficiencies. Thus, disconnected sets are preferred according

to number of locations for all mating designs and according to number of parents for the half-

diallel mating design.

In practical consideration of the optimal mating design problem, the results of this study

indicate that if h2 estimation is the primary use of a progeny test then the half-sib mating design

is the proper choice. Further, the circular mating design is an appropriate choice if the

estimation of rB is more important than h². Finally, if a full-sib design is required to furnish

information about dominance variance, the circular design provides almost globally better

efficiencies for h², rB, and γ than the half-diallel.












CHAPTER 3
ORDINARY LEAST SQUARES ESTIMATION OF GENERAL
AND SPECIFIC COMBINING ABILITIES FROM
HALF-DIALLEL MATING DESIGNS


Introduction


The diallel mating system is an altered factorial design in which the same individuals (or

lines) are used as both male and female parents. A full diallel contains all crosses, including

reciprocal crosses and selfs, resulting in a total of p² combinations, where p is the number of

parents. Assumptions that reciprocal effects, maternal effects, and paternal effects are negligible

lead to the use of the half-diallel mating system (Griffing 1956, method 4) which has p(p-1)/2

parental combinations and is the mating system addressed in this chapter.

Half diallels have been widely used in crop and tree breeding (Sprague and Tatum 1942,

Gilbert 1958, Matzinger et al. 1959, Burley et al. 1966, and Squillace 1973) and the widespread

use of this mating system continues today (Weir and Zobel 1975, Wilcox et al. 1975, Snyder and

Namkoong 1978, Hallauer and Miranda 1981, Singh and Singh 1984, Greenwood et al. 1986,

and Weir and Goddard 1986).

Most of the statistical packages available treat fixed effect estimation as the objective of

the program with random variables representing nuisance variation. Within this context a

common analysis of half-diallel experiments is conducted by first treating genetic parameters as

fixed effects for estimation of general (GCA) and specific (SCA) combining abilities and

subsequently as random variables for variance component estimation (used for estimating

heritabilities, genetic correlations, and general to specific combining ability variance ratios for







determining breeding strategies). This chapter focuses on the estimation of GCA's and SCA's

as fixed effects. The treatment of GCA and SCA as fixed effects in OLS (ordinary least squares)

is an entirely appropriate analysis if the comparisons are among parents and crosses in a

particular experiment. If, as forest geneticists often wish to do, GCA estimates from

disconnected experiments are to be compared, then methods such as checklots must be used to

place the estimates on a common basis.

Formulae (Griffing 1956, Falconer 1981, Hallauer and Miranda 1981, and Becker 1975)

for hand calculation of general and specific combining abilities are based on a solution to the OLS

equations for half-diallels created by sum-to-zero restrictions, i.e., the sum of all effect estimates

for an experimental factor equals zero. These formulae will yield correct OLS solutions for sum-

to-zero genetic parameters provided the data have no missing cells. If cell (plot) means are used

as the basis for the estimation of effects, there must be at least one observation per cell (plot)

where a cell is a subclassification of the data defined by one level of every factor (Searle 1987).

An example of a cell is the group of observations denoted by AiBj for a randomized complete
block design with factor A across blocks (B).

accounting for missing cells, incorrect and possibly misleading solutions can result. The matrix

algebra approach is described in this chapter for these reasons: 1) in forest tree breeding

applications data sets with missing cells are extremely common; 2) many statistical packages do

not allow direct specification of the half-diallel model; 3) the use of a linear model and matrix

algebra can yield relevant OLS solutions for any degree of data imbalance; and 4) viewing the

mechanics of the OLS approach is an aid to understanding the properties of the estimates.

The objectives of this chapter are to (1) detail the construction of ordinary least squares

(OLS) analysis of half-diallel data sets to estimate genetic parameters (GCA and SCA) as fixed

effects, (2) recount the assumptions and mathematical features of this type of analysis, (3)







facilitate the reader's implementation of OLS analyses for diallels of any degree of imbalance and

suggest a method for combining estimates from disconnected experiments, and (4) aid the reader

in ascertaining what method is an appropriate analysis for a given data set.


Methods


Linear Model


Plot means are used as the unit of observation for this analysis with unequal numbers of

observations per plot. Plot (cell) means are always estimable as long as there is one observation

per plot, and linear combinations of these means (least squares means) provide the most efficient

way of estimating OLS fixed effects (Yates 1934). Throughout this chapter, estimates are

denoted by lower case letters while the parameters are designated by upper case letters and

matrices are in bold print.

Using plot means as observations, a common scalar linear model for an analysis of a half-

diallel mating design with p(p-1)/2 crosses planted at a single location in a randomized complete

block design with one plot per block is

yijk = μ + Bi + GCAj + GCAk + SCAjk + eijk                                                3-1



where yijk is the mean of the ith block for the jkth cross;
μ is an overall mean;
Bi is the fixed effect of block i for i = 1 to b;
GCAj is the fixed general combining ability effect of the jth female parent or
kth male parent, j or k = 1,...,p (j ≠ k);
SCAjk is the fixed specific combining ability effect of parents j and k; and
eijk is the random error associated with the observation of the jkth cross in
the ith block where eijk ~ (0, σ²e).

Cross-by-block interaction, as a genotype-by-environment interaction, is treated as confounded with
between-plot variation, as is the case for contiguous plots.

The model in matrix notation is
y = Xβ + e                                                                                3-2
where y is the vector of observations (n×1 = n rows and 1 column) where n equals
the number of observations;
X is the design matrix (n×m) whose function is to select the appropriate parameters
for each observation where m equals the number of fixed effect parameters in the
model;
β is the vector (m×1) of fixed effect parameters ordered in a column; and
e is the vector (n×1) of deviations (errors) from the expectation associated with each
observation.


Ordinary Least Squares Solutions


The matrix representation of an OLS fixed effects solution is

b = (X'X)⁻¹X'y                                                                            3-3

where b is the vector of estimated fixed effect parameters, i.e., an estimate of β, and
X is the design matrix either made full rank by reparameterization,
or a generalized inverse of X'X may be used.

Inherent in this solution is the ordinary least squares assumption that the variance-







covariance matrix (V) of the observations (y) is equal to Iσ²e, where I is an n×n identity matrix.
The elements of an identity matrix are 1's on the main diagonal and all other elements are 0.
Multiplying I by σ²e places σ²e on the main diagonal. In the covariance matrix for the
observations, the variance of the observations appears on the main diagonal and the covariance
between observations appears in the off-diagonal elements. Thus, V = Iσ²e states that the
variance of the observations is equal to σ²e for each observation and there are no covariances
between the observations (which is one direct result of considering genetic parameters as fixed
effects).
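As a computational aside (my own sketch, not part of the original analysis), equation 3-3 can be evaluated with any matrix language; the fragment below assumes a full-column-rank design matrix X and a vector of plot means y and is written in Python/NumPy purely for illustration.

import numpy as np

def ols_solution(X, y):
    # b = (X'X)^-1 X'y for a full-column-rank (e.g., sum-to-zero) design matrix X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Minimal check with an arbitrary two-parameter design matrix and data vector.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([2.7, 2.1, 3.0, 2.4])
print(ols_solution(X, y))   # an overall mean and a sum-to-zero two-level effect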


Sum-to-Zero Restrictions


The design matrix presented in this chapter is reparameterized by sum-to-zero restrictions

to (1) reduce the dimension of the matrices to a minimal size, and (2) yield estimates of fixed

effects with the same solution as common formulae in the balanced case. Other restrictions such

as set-to-zero could also be applied so the discussion that follows treats sum-to-zero restrictions

as a specific solution to the more general problem which is finding an inverse for X'X. The

subscripts 'o' and 's' refer to the overparameterized model and the reparameterized model with

sum-to-zero restrictions, respectively.

The matrix Xo of Figure 3-1 is the design matrix for an overparameterized linear model
(Milliken and Johnson 1984, page 96). Overparameterization means that the equations are written
in more unknowns (parameters, in this case 13) than there are equations (number of observations
minus degrees of freedom for error, in this case 12 − 5 = 7) with which to estimate the
parameters. Reparameterization as a sum-to-zero matrix overcomes this dilemma by reducing
the number of parameters through making some of the parameters linear combinations of others.

Sum-to-zero restrictions make the resulting parameters and estimates sum to zero even though









the unrestricted parameters (for example, the true GCA values as applied to a broader population)

do not necessarily sum-to-zero within a diallel. This is the problem of comparability of GCA

estimates from disconnected experiments.



         μ  B1 B2 GCA1 GCA2 GCA3 GCA4 SCA12 SCA13 SCA14 SCA23 SCA24 SCA34
 y112    1  1  0   1    1    0    0    1     0     0     0     0     0         μ
 y113    1  1  0   1    0    1    0    0     1     0     0     0     0         B1
 y114    1  1  0   1    0    0    1    0     0     1     0     0     0         B2
 y123    1  1  0   0    1    1    0    0     0     0     1     0     0         GCA1
 y124    1  1  0   0    1    0    1    0     0     0     0     1     0         GCA2
 y134  = 1  1  0   0    0    1    1    0     0     0     0     0     1         GCA3
 y212    1  0  1   1    1    0    0    1     0     0     0     0     0         GCA4
 y213    1  0  1   1    0    1    0    0     1     0     0     0     0         SCA12
 y214    1  0  1   1    0    0    1    0     0     1     0     0     0         SCA13
 y223    1  0  1   0    1    1    0    0     0     0     1     0     0         SCA14
 y224    1  0  1   0    1    0    1    0     0     0     0     1     0         SCA23
 y234    1  0  1   0    0    1    1    0     0     0     0     0     1         SCA24
                                                                               SCA34
   y   =                            Xo                                          βo

Figure 3-1. The overparameterized linear model for a four-parent half-diallel planted on a single
site in two blocks displayed as matrices. The design matrix (Xo) and parameter vector (βo) are
shown in overparameterized form. 1's and 0's denote the presence or absence of a parameter in
the model for the observed means (data vector, y). The parameters displayed above the design
matrix label the appropriate column for each parameter. Error vector not exhibited.


         μ  B1 GCA1 GCA2 GCA3 SCA12 SCA13
 y112    1  1   1    1    0     1     0            μ            e112
 y113    1  1   1    0    1     0     1            B1           e113
 y114    1  1   0   -1   -1    -1    -1            GCA1         e114
 y123    1  1   0    1    1    -1    -1            GCA2         e123
 y124    1  1  -1    0   -1     0     1            GCA3         e124
 y134  = 1  1  -1   -1    0     1     0            SCA12   +    e134
 y212    1 -1   1    1    0     1     0            SCA13        e212
 y213    1 -1   1    0    1     0     1                         e213
 y214    1 -1   0   -1   -1    -1    -1                         e214
 y223    1 -1   0    1    1    -1    -1                         e223
 y224    1 -1  -1    0   -1     0     1                         e224
 y234    1 -1  -1   -1    0     1     0                         e234

   y   =            Xs                              βs     +     e

Figure 3-2. The linear model for a four-parent half-diallel planted on a single site in two blocks
displayed as matrices. The design matrix (Xs) and the parameter vector (βs) are presented in
sum-to-zero format. The parameters displayed above the design matrix label the appropriate
column for each parameter.



To illustrate the concept of sum-to-zero estimates versus population parameters, we use

the expectation of a common formula. Becker (1975) gives equation 3-4 (which for balanced







cases is equivalent to gj = ((p−1)/(p−2))(Z̄j· − Z̄··)) as the estimate for general combining ability
for the jth line with p equalling the number of parents and Zjk equalling the site mean of the j x
k cross.

crosses and with a design matrix which contains the sum-to-zero restrictions. An evaluation of

this formula in a four-parent half-diallel planted in b blocks for the GCA of parent 1 is obtained

by substituting the expectation of the linear model (equation 3-1) for each observation:

gj = (1/(p(p−2)))(pZj· − 2Z··)                                                            3-4
E{g1} = E{(1/(p(p−2)))(pZ1· − 2Z··)}
E{g1} = 3/4(GCA1) − 1/4(GCA2 + GCA3 + GCA4) + 1/4(SCA12 + SCA13 + SCA14) −
1/4(SCA23 + SCA24 + SCA34).

The result of equation 3-4 is obviously not GCA1 from the unrestricted model (equation
3-1). Thus, g1, an estimable function and an estimate of parameter GCA1s (the estimate of the
GCA of parent 1 given the sum-to-zero restrictions), does not have the same meaning as GCA1
in the unrestricted model. An estimable function is a linear combination of the observations; but

in order for an individual parameter in a model to be estimable, one must devise a linear

combination of the observations such that the expectation has a weight of one on the parameter

one wishes to estimate while having a weight of zero on all other parameters. A solution such

as this does not exist for the individual parameters in the overparameterized model (equation 3-1).

So, although the sum-to-zero restricted GCA parameters and estimates are forced to sum-to-zero

for the sample of parents in a given diallel, the unrestricted GCA parameters only sum-to-zero

across the entire population (Falconer 1981) and an evaluation of GCA1s demonstrates that the
estimate contains other model parameters.

The result of sum-to-zero restrictions is that the degrees of freedom for a factor equals

the number of columns (parameters) for that factor in Xs (Figure 3-2). Thus, a generalized







inverse for Xs'Xs is not required since the number of columns in the sum-to-zero Xs matrix for
each factor equals the degrees of freedom for that factor in the model (Xs is full column rank and
provides a solution to equation 3-3).


Components of the Matrix Equation


The components of equation 3-2 are now considered in greater detail.

Data vector y

Observations (plot means) in the data vector are ordered in the manner demonstrated in

Figure 3-1. For our example Figure 3-1 is the matrix equation of a four parent half-diallel

mating design planted in two randomized complete blocks on a single site. There are six crosses

present in the two blocks for a total of 12 observations in the data vector, y. The observations

are first sorted by block. Second, within each block the observations should be in the same

sequence (for simplicity of presentation only). This sequence is obtained by assigning numbers

1 through p to each of the p parents and then sorting all crosses containing parent 1 (whether as

male or female) as the primary index in ascending numerical order by the other parent of the

cross as the secondary index. Next all crosses containing parent 2 (primary index, as male or

female) in which the other parent in the cross (secondary index) has a number greater than 2 are

then also sorted in ascending order by the secondary index. This procedure is followed through

using parent p-1 as the primary index.
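A small helper (mine, for illustration; not part of the original text) reproduces this ordering for an arbitrary number of parents; for p = 4 it gives exactly the cross sequence used within each block of Figure 3-1.

def cross_order(p):
    # Crosses of a half-diallel listed with the lower-numbered parent as the primary
    # index and the higher-numbered parent as the secondary index.
    return [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]

print(cross_order(4))   # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]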

Design matrix and parameter vector, X and β

The design matrix for a model is conceptually a listing of the parameters present in the

model for each observation (Searle 1987, page 243). In Figure 3-1, y and βo are exhibited and
the parameters in βo are displayed at the tops of the columns of Xo (a visually correct

interpretation of the multiplication of a matrix by a vector). For each observation in y, the scalar








model (equation 3-1) may be employed to obtain the listing of parameters for that observation

(the row of the design matrix corresponding to the particular observation). The convention for

design matrices is that the columns for the factors occur in the same order as the factors in the

linear model (equation 3-1 and Figure 3-1). Since design matrices can be devised by first

creating the columns pertinent to each factor in the model (submatrices) and then horizontally

and/or vertically stacking the submatrices, the discussion of the reparameterized design matrix

formulation will proceed by factor.

Mean

The first column of Xs is for μ and is a vector of 1's with the number of rows equalling
the number of observations (Figure 3-2). The linear model (equation 3-1) indicates that all
observations contain μ and the deviation of the observations from μ is explained in terms of the
factors and interactions in the model plus error.

Block

The number of columns for block is equal to the number of blocks minus one (column

2, Xs). Each row of a block submatrix consists of 1's and 0's or −1's according to the identity
of the observation for which the row is being formed. The normal convention is that the first
column represents block 1 and the second column block 2, etc. through block b−1. Since we
have used a sum-to-zero solution (Σbi = 0), the effect due to block b is a linear combination of
the other b−1 effects, i.e., bb = −Σi=1..b−1 bi, which in our example is 0 = b1 + b2 and b2 = −b1.
Thus, the row of the block submatrix for an observation in block b (the last block) has a −1 in
each of the b−1 columns signifying that the block b effect is indeed a linear combination of the
other b−1 block effects. Columns 2 and 3 of Xo (Figure 3-1) have become column 2 of Xs
(Figure 3-2).








General combining ability

This submatrix of Xs is slightly more complex than previous factors as a result of having
two levels of a main effect present per observation, i.e., the deviation of an observation from μ
is modeled as the result of the GCA's of both the male and female parents (equation 3-1). Again
we have imposed a restriction, Σj gcaj = 0. Since GCA has p−1 degrees of freedom, the submatrix
for GCA should have p−1 columns, i.e., gcap = −Σj=1..p−1 gcaj. The GCA submatrix for Xs (columns
3 through 5 in Figure 3-2) is formed from Xo (columns 4 through 7 in Figure 3-1) in
the same manner as the block matrix: (1) add minus one to the elements in the other columns
along each row containing a one for gcap (p = 4 in our example); and (2) delete the column from
Xo corresponding to gcap. The GCA submatrix has p(p−1)/2 rows (the number of crosses). This,
with no missing cells (plots), equals the number of observations per block. To form the GCA
factor submatrix for a site, the GCA submatrix is vertically concatenated (stacked on itself) b
times. This completes the portion of the Xs matrix for GCA.
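The block and GCA submatrices can be generated mechanically. The fragment below (an illustrative Python/NumPy sketch; the dissertation's own computations used SAS/IML) builds them for the four-parent, two-block example and reproduces columns 2 through 5 of Xs in Figure 3-2.

import numpy as np

p, b = 4, 2
crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
nc = len(crosses)                       # p(p-1)/2 = 6 crosses per block

# Block submatrix: b-1 columns; rows for the last block carry -1's (sum-to-zero).
block_sub = np.zeros((b * nc, b - 1))
for i in range(b):
    rows = slice(i * nc, (i + 1) * nc)
    if i < b - 1:
        block_sub[rows, i] = 1.0
    else:
        block_sub[rows, :] = -1.0

# GCA submatrix for one block: p-1 columns; parent p is coded as -1 in every column.
gca_sub = np.zeros((nc, p - 1))
for row, (j, k) in enumerate(crosses):
    for parent in (j, k):
        if parent < p:
            gca_sub[row, parent - 1] += 1.0
        else:
            gca_sub[row, :] -= 1.0

gca_site = np.vstack([gca_sub] * b)     # stacked b times for the site
print(gca_sub)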

Specific combining ability

In order to facilitate construction of the SCA submatrix, a horizontal direct product

should be defined. A horizontal direct product, as applied to two column vectors, is the element

by element product between the two vectors (SAS/IML¹ User's Guide 1985) such that the
element in the ith row of the resulting product vector is the product of the elements in the ith rows
of the two initial vectors. The resultant product vector has dimension n × 1. A horizontal direct

product is useful for the formation of interaction or nested factor submatrices where the initial

matrices represent the main factors and the resulting matrix represents an interaction or a nested

factor (product rule, Searle 1987).


¹SAS/IML is the registered trademark of the SAS Institute Inc., Cary, North Carolina.








The SCA submatrix can be formulated from the horizontal direct products of the columns

of the GCA submatrix in Xs (Figure 3-2). The results from the GCA columns require

manipulation to become the SCA submatrix (since degrees of freedom for SCA do not equal those

of an interaction for a half-diallel analysis), but the GCA column products provide a convenient

starting point. The column of the SCA submatrix representing the cross between the jth and the
kth parents (SCAjk) is formed as the product between the GCAj and GCAk columns (Figure 3-3).

The GCA columns in Figure 3-2 are multiplied in this order: column 1 times column 2 forming

the first SCA column, column 1 times column 3 forming the second SCA column, and column

2 times column 3 forming the third SCA column (Figure 3-3). With four parents (six crosses)

there are three degrees of freedom for GCA (p-1) and two degrees of freedom for SCA (6 crosses

− 3 for GCA − 1 for the mean). Since SCA has only two degrees of freedom, a sum-to-zero

design matrix can have only two columns for SCA. Imposing the restriction that the sum of the

SCA's across all parents equals zero is equivalent to making the last column for the SCA

submatrix (Figure 3-3) a linear combination of the others (Figure 3-2). The procedure for

deleting the third column product is identical to that for the GCA submatrix: add minus one to

every element in the rows of the remaining SCA columns in which a one appears in the column

which is to be deleted (Figure 3-2, columns 6 and 7). The number of rows in the SCA submatrix

equals the number of observations in a block and must be vertically concatenated b times to create

the SCA submatrix for a site.

An algebraic evaluation of SCA sum-to-zero restrictions requires that Σj scajk = 0 for
each k and that ΣjΣk scajk = 0; thus, for observations in the ith block, with i serving to denote the
row of the SCA submatrix in block i, scai14 = −scai12 − scai13 and entries in the submatrix row for
yi14 are −1's. The estimate for scai23 equals scai14 because scai23 is the negative of the sum of the
independently estimated SCA's (scai12 and scai13) from the restriction that the sum of the SCA's








across all parents equals zero. Similarly, by sum-to-zero definition scai24 = −scai12 − scai23 and by
substitution scai24 = −(−scai12 − scai13) − scai12 = scai13. By the same protocol, it can be shown that
scai34 = scai12. The elements in the rows of the SCA submatrix are 1's, −1's and 0's in

accordance with the algebraic evaluation. Thus, while it may seem that there should be 6 SCA

values (one for each cross), only 2 can be independently estimated and the remaining 4 are linear

combinations of the independently estimated SCA's. Again the SCA sum-to-zero estimates are

not equal to the parametric population SCA's. An analogous illustration for SCA to that for

GCA would show that the estimable function (linear combination of observations) for a given

SCAjk contains a variety of other parameters.


OBS.     GCA1×GCA2    GCA1×GCA3    GCA2×GCA3       SCA12   SCA13   SCA23
yi12     (1)(1)=1     (1)(0)=0     (1)(0)=0          1       0       0
yi13     (1)(0)=0     (1)(1)=1     (0)(1)=0          0       1       0
yi14     (0)(-1)=0    (0)(-1)=0    (-1)(-1)=1        0       0       1
yi23     (0)(1)=0     (0)(1)=0     (1)(1)=1          0       0       1
yi24     (-1)(0)=0    (-1)(-1)=1   (0)(-1)=0         0       1       0
yi34     (-1)(-1)=1   (-1)(0)=0    (-1)(0)=0         1       0       0

Figure 3-3. Intermediate result in SCA submatrix generation (SCA columns as horizontal direct
products of the GCA1, GCA2, and GCA3 columns within a block). The SCAjk column is the
horizontal direct product of the columns for GCAj and GCAk.
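Continuing the same illustrative sketch (assumed Python/NumPy code, not the original SAS/IML program), the SCA submatrix of Figure 3-2 follows from the horizontal direct products of the GCA columns exactly as described above.

import numpy as np

p = 4
crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]

# GCA columns for one block, as built previously.
gca = np.zeros((len(crosses), p - 1))
for row, (j, k) in enumerate(crosses):
    for parent in (j, k):
        if parent < p:
            gca[row, parent - 1] += 1.0
        else:
            gca[row, :] -= 1.0

# Horizontal direct products GCA1*GCA2, GCA1*GCA3, GCA2*GCA3 (the columns of Figure 3-3).
prods = np.column_stack([gca[:, 0] * gca[:, 1],
                         gca[:, 0] * gca[:, 2],
                         gca[:, 1] * gca[:, 2]])

# Sum-to-zero restriction: drop the last product column and place -1 in the remaining
# columns of every row where the dropped column held a one.
sca = prods[:, :2].copy()
sca[prods[:, 2] == 1.0, :] -= 1.0
print(sca)    # rows ordered y12, y13, y14, y23, y24, y34; matches columns 6 and 7 of Figure 3-2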


Estimation of Fixed Effects


GCA parameters

The GCA parameters can be estimated (without mean, block, and SCA in the design

matrix) through the use of equation 3-3, if there are no missing cell means (plots) for any cross

and no missing crosses. The design matrix consists only of the GCA submatrix. This design

matrix has p−1 (for GCA's) columns (the third through the fifth columns of Xs). The b vector
is an estimate of the GCA portion of βs as in Figure 3-2 and the linear combination for the
estimation of gcap is gcap = −Σj=1..p−1 gcaj. Parameters for any of the factors can be estimated







independently using the pertinent submatrix as long as there are no missing cell means (plots) and

no missing crosses; this uses a property known as orthogonality.

Orthogonality requires that the dot product between two vectors equals zero (Schneider

1987, page 168). The dot product (a scalar) is the sum of the values in a vector obtained from

the horizontal direct product of two vectors. For two factors to be orthogonal, the dot products

of all the column vectors making up the section of the design matrix for one factor with the

column vectors making up the portion of the design matrix for the second must be zero. If all

factors in the model are orthogonal, then the Xs'Xs matrix is block diagonal. A block-diagonal
Xs'Xs matrix is composed of square factor submatrices (degrees of freedom × degrees of freedom)

along the diagonal with all off-diagonal elements not in one of the square factor submatrices

equalling zero. A property of block-diagonal matrices is that the inverse can be calculated by

inverting each block separately and replacing the original block in the full X'X matrix by the

inverted block. Because the blocks can be inverted separately and all other off-diagonal elements

of the inverse are zero, the effects for factors which are orthogonal to all other factors may be

estimated separately, i.e., there are no functions of other sum-to-zero factors in the sum-to-zero

estimates.

Mean, block, GCA and SCA parameters

All parameters are estimated simultaneously by horizontally concatenating the mean,

block, GCA, and SCA matrices to create Xs. Equation 3-3 is again utilized to solve the system
of equations. The b vector for the four-parent example is an estimate of βs of Figure 3-2.
Again, one parameter is estimated for each column in the Xs matrix and all parameter estimates
not present are linear combinations of the parameter estimates in the b vector. So bb is equal to
−Σi=1..b−1 bi and gcap is equal to −Σj=1..p−1 gcaj. The linear combinations for SCA effects can be obtained

by reading along the row of the SCA submatrix associated with the observation containing the







parameter, i.e., in Figure 3-2 the observation yi14 contains the effect sca14 which is estimated as
the linear combination −sca12 − sca13.
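Assembling the pieces, the following self-contained sketch (again illustrative Python/NumPy with an invented data vector, not the example data of Table 3-1) horizontally concatenates the mean, block, GCA, and SCA submatrices for the four-parent, two-block case, shows that Xs'Xs is block diagonal in the balanced case, and solves equation 3-3.

import numpy as np

p, b = 4, 2
crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
nc = len(crosses)

gca = np.zeros((nc, p - 1))
for r, (j, k) in enumerate(crosses):
    for parent in (j, k):
        if parent < p:
            gca[r, parent - 1] += 1.0
        else:
            gca[r, :] -= 1.0

prods = np.column_stack([gca[:, 0] * gca[:, 1], gca[:, 0] * gca[:, 2], gca[:, 1] * gca[:, 2]])
sca = prods[:, :2].copy()
sca[prods[:, 2] == 1.0, :] -= 1.0

mean_col = np.ones((b * nc, 1))
block_col = np.concatenate([np.ones(nc), -np.ones(nc)]).reshape(-1, 1)
Xs = np.hstack([mean_col, block_col, np.vstack([gca] * b), np.vstack([sca] * b)])

y = np.array([2.7, 1.9, 3.4, 3.0, 3.4, 2.2,    # block 1 plot means y12 ... y34 (invented)
              3.6, 2.6, 3.0, 3.0, 3.6, 3.8])   # block 2 plot means (invented)
print(Xs.T @ Xs)                                # block diagonal: factors are orthogonal
bhat = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)     # mu, b1, gca1..gca3, sca12, sca13
print(bhat)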

This completes the estimation of fixed effect parameters from a data set which is balanced

on a plot-mean basis. Since field data sets with such completeness are a rarity in forestry

applications, the next step is OLS analysis for various types of data imbalance. Calculations of

solutions based on a complete data set and simulated data sets with common types of imbalance

are demonstrated in numerical examples.



Numerical Examples


The data set analyzed in the numerical examples is from a five-year-old, six-parent half-

diallel slash pine (Pinus elliottii var. elliottii Engelm.) progeny test planted on a single site in

four complete blocks. Each cross is represented by a five-tree row plot within each block. Total

height in meters and diameter at breast height (dbh in centimeters) are the traits selected for

analysis. The data set is presented in Table 3-1 so that the reader may reconstruct the analysis

and compare answers with the examples. The numbers 1 through 6 were arbitrarily assigned to

the parents for analysis. Because of unequal survival within plots, plot means are used as the unit

of observation.


Balanced Data (Plot-mean Basis)


The sum-to-zero design matrix for the balanced data set has (4 blocks)x(15 crosses) = 60

rows (which equals the number of observations in y) and has the following columns: one column
for μ, three columns for blocks (b−1), five columns for GCA (p−1), and nine columns for SCA
(15 crosses − 5 − 1) for a total of 18 columns. With sixty plot means (degrees of freedom) and

18 degrees of freedom in the model, subtracting 18 from 60 yields 42 degrees of freedom for








error which matches the degrees of freedom for cross by block interaction, thus verifying that

degrees of freedom concur with the number of columns in the sum-to-zero design matrix.

To illustrate the principle of orthogonality in the balanced case, the X'X and (X'X)⁻¹ matrices

may be printed to show that they are block diagonal. In further illustration, the effects within

a factor may also be estimated without any other factors in the design matrix and compared to

the estimates from the full design matrix.

The vectors of parameter estimates for height and dbh (Table 3-2) were calculated from the

same Xs matrix because height and dbh measurements were taken on the same trees. In other

words, if a height measurement was taken on a tree, a dbh measurement was also taken, so the

design matrices are equivalent.


Missing Plot


To illustrate the problem of a missing plot, the cross, parent two by parent three, was

arbitrarily deleted in block one (as if observation y123 were missing). This deletion prompts

adjustments to the factor matrices in order to analyze the new data set. The new vector of

observations (y) now has 59 rows. This necessitates deletion of the row of the design matrix (Xs)
in block 1 which would have been associated with cross 2 x 3. This is the only matrix alteration
required for the analysis. Thus, the resultant Xs matrix has 60 − 1 = 59 rows and 18 columns.
With 59 means in y and 18 columns in Xs, the degrees of freedom for error is 41.

Comparisons between results of the analyses (Table 3-2) of the full data set and the data

set missing observation y123 reveal that for this case the estimates of parameters have been

relatively unaffected by the imbalance (magnitudes of GCA's changed only slightly and rankings

by GCA were unaffected).








Table 3-1. Data set for numerical examples. Five-year-old slash pine progeny test with a 6-
parent half-diallel mating design present on a single site with four randomized complete blocks
and a five-tree row plot per cross per block.

Block  Female  Male   Mean Height   Mean DBH      Within-Plot Variance   Within-Plot Variance   Trees per
                        (meters)   (centimeters)       Height (m²)            DBH (cm²)           Plot
1 1 2 2.6899 3.810 0.9800 3.484 4
1 1 3 1.9080 2.134 1.4277 3.893 5
1 1 5 3.1242 4.445 0.4487 1.656 4
1 1 6 2.4933 3.200 0.8488 5.664 5
1 2 5 1.4783 1.588 0.6556 2.167 4
1 2 6 2.7026 3.471 0.1136 0.344 3
1 3 2 3.0480 4.699 0.2341 0.968 4
1 3 5 3.4991 5.131 0.0945 0.271 5
1 3 6 2.4003 2.794 0.5149 1.548 4
1 4 1 3.3955 4.928 0.1489 0.761 5
1 4 2 3.4290 5.144 0.7943 3.285 4
1 4 3 2.5298 2.984 0.9557 4.188 4
1 4 5 2.4155 3.175 0.5936 2.946 4
1 4 6 3.2004 4.521 1.7034 7.594 5
1 5 6 2.2403 2.794 1.0433 6.280 4
2 1 2 3.5662 5.080 0.9560 2.903 5
2 1 3 2.6335 3.353 0.7695 3.497 5
2 1 5 3.6942 5.893 0.0573 0.432 5
2 1 6 3.4808 4.928 0.9222 2.890 5
2 2 5 3.4260 4.877 0.7017 2.432 5
2 2 6 2.4282 3.302 0.0616 0.452 3
2 3 2 3.0480 4.064 0.0192 0.301 4
2 3 5 2.8895 4.013 0.1957 0.690 5
2 3 6 1.9406 1.863 0.0560 0.408 3
2 4 1 3.0114 3.962 1.9753 6.342 5
2 4 2 3.6454 5.283 0.1731 0.787 5
2 4 3 2.9566 3.861 0.0506 0.174 5
2 4 5 2.8118 4.382 1.1336 5.435 4
2 4 6 3.2674 4.318 1.1211 4.354 5
2 5 6 3.7917 5.893 0.0848 0.497 5
3 1 2 2.2961 2.625 0.3914 1.699 3
3 1 3 2.8956 4.128 1.2926 4.532 4
3 1 5 2.5359 3.607 0.8284 4.303 5
3 1 6 2.9032 3.937 0.8252 4.064 4
3 2 5 2.7737 4.064 0.9829 3.226 2
3 2 6 1.2040 0.635 0.4464 0.806 2
3 3 2 2.9870 4.191 0.9049 2.989 4
3 3 5 2.8407 3.962 0.7309 3.632 5
3 3 6 1.3564 0.000 0.1677 0.000 2
3 4 1 2.6746 3.620 0.8463 2.984 4
3 4 2 2.7066 3.353 0.5590 1.787 5
3 4 3 3.4198 4.623 0.3509 0.690 5
3 4 5 3.3299 4.953 0.4102 1.226 4
3 4 6 3.4564 4.978 0.8369 3.503 5
3 5 6 3.2614 4.826 .
4 1 2 1.8974 2.476 1.0160 3.629 4
4 1 3 1.3005 0.508 0.2019 0.774 3
4 1 5 2.0726 2.540 1.2235 5.097 3
4 1 6 1.8821 1.778 0.4728 3.312 4
4 2 5 1. 64 1.334 0.5354 2.382 4
4 2 6 1.5392 0.635 0.0376 0.806 2
4 3 2 1.8898 2.032 0.7364 1.892 4
4 3 5 2.5146 3.620 0.0876 0.446 4
4 3 6 1.8389 2.201 0.0941 0.280 3
4 4 1 2.3348 2.591 0.3816 2.722 5
4 4 2 1.7272 1.693 2.1640 8.602 3
4 4 3 1.6581 1.524 0.0537 0.903 5
4 4 5 2.1184 2.286 0.3137 2.366 4
4 4 6 1.5545 1.422 0.4803 1.019 5
4 5 6 1.4122 1.693 0.0338 0.150 3










Table 3-2. Numerical results for examples of data imbalance using the OLS techniques presented
in the text.


                  Balanced(a)           Missing Plot(b)        Missing Cross(c)      Five Missing Crosses(d)
Estimate
of(e)           Height     DBH        Height     DBH        Height     DBH         Height     DBH
μ               2.5830     3.362      2.5787     3.346      2.5386     3.260       2.4980     3.149
B1              0.1203     0.292      0.1074     0.245      0.1074     0.245       0.1393     0.309
B2              0.5230     0.976      0.5274     0.992      0.5386     1.023       0.6041     1.140
B3              0.1264     0.205      0.1308     0.220      0.1180     0.187       0.0689     0.087
GCA1            0.0706     0.144      0.0760     0.163      0.1260     0.270       0.1361     0.232
GCA2            -.1077     -.180      -.1186     -.220      -.2186     -.434       -.2371     -.493
GCA3            -.1316     -.347      -.1426     -.386      -.2426     -.601       -.3972     -.952
GCA4            0.2489     0.398      0.2544     0.417      0.3044     0.524       0.4241     0.804
GCA5            0.1265     0.489      0.1320     0.509      0.1820     0.616       0.1746     0.646
SCA12           0.0665     0.172      0.0763     0.208      0.1663     0.400
SCA13           -.3374     -.628      -.3277     -.592      -.2377     -.400
SCA14           -.0484     -.128      -.0550     -.152      -.1150     -.280       -.2041     -.410
SCA15           0.0766     0.126      0.0700     0.102      0.0100     -.026       0.0480     0.094
SCA23           0.3995     0.912      0.3600     0.771
SCA24           0.1528     0.289      0.1627     0.324      0.2527     0.517       0.1920     0.408
SCA25           -.3185     -.706      -.3084     -.670      -.2187     -.478       0.1163     0.246
SCA34           -.0592     0.164      -.0493     0.129      0.0406     0.064
SCA35           0.3580     0.677      0.3679     0.712      0.4793     0.905

(a) where (numerical examples are for height):
    b4 = −(b1 + b2 + b3) = −.7697;
    gca6 = −(gca1 + gca2 + gca3 + gca4 + gca5) = −.2067;
    scaj6 = −Σk scajk (sums over the independently estimated sca's 1,...,9) for j = 1, 2, 3, giving
    sca16 = .2428, sca26 = −.3002, and sca36 = −.3608; sca46 = −(sca14 + sca24 + sca34 + sca45) = −.2898;
    sca45 = sca12 + sca13 + sca15 + sca23 + sca25 + sca35 = .2446; and
    sca56 = sca12 + sca13 + sca14 + sca23 + sca24 + sca34 = .1737.

(b) where the linear combinations for parameter estimates are identical to the balanced example.

(c) where scaj6 = −Σk scajk for j = 1 to 3 (sums over the independently estimated SCA's 1,...,8);
    sca46 = −(sca14 + sca24 + sca34 + sca45);
    sca45 = sca12 + sca13 + sca15 + sca25 + sca35; and
    sca56 = sca12 + sca13 + sca14 + sca24 + sca34.

(d) where sca16 = −sca14 − sca15, sca26 = −sca24 − sca25, sca36 = −sca34, sca46 = sca15 + sca25,
    sca56 = −sca15 − sca25, and sca34 = the negative of the sum of the four independently estimated sca's.

(e) where for all cases linear combinations for block and gca are the same as in the balanced case.








Missing Cross


Another common form of imbalance in diallel data sets, the missing cross, is examined

through arbitrary deletion of the 2 x 3 cross from all blocks, i.e., y123, y223, y323, and y423 are missing

in the data vector. This type of imbalance is representative of a particular cross that could not

be made and is therefore missing from all blocks. The matrix manipulations required for this

analysis are again presented by factor. For appropriate SCA restrictions, the data vector and

design matrix should be ordered so that the pth parent has no missing crosses. Since the labeling
of a parent as parent p is entirely subjective, any parent with all crosses may be designated as
parent p. The previous labelling directions are necessary since we generate the SCA submatrix
as horizontal direct products of the columns of the GCA submatrix; and to account for missing
crosses, the horizontal direct products for the missing parental combinations are not
calculated, which sets the missing SCA's to zero. If there is a cross missing from those of the
pth parent, we cannot account for the missing cross with this technique (Searle 1987, page 479).

For the mean, block, and GCA submatrices, the adjustment for the missing cross dictates

deleting the rows in the submatrices which would have corresponded to the yi23 observations. The

SCA submatrix must be reformed since a degree of freedom for SCA and hence a column of the

submatrix has been lost. The SCA submatrix is reinstituted from the GCA horizontal direct

products (remembering that one cross, 2x3, no longer exists and therefore that product GCA2 x

GCA3 is inappropriate). Dropping the column for SCA23 is equivalent to setting SCA23 to zero

(Searle 1987) so that the remaining SCA's will sum-to-zero. After that, the reformation is

according to the established pattern. With one missing cross there are now 56 observations and

hence 56 degrees of freedom available. The columns of the Xs matrix are now: one for the

mean, three for block, five for GCA, and eight for SCA for a total of 17 columns. The







remaining degrees of freedom for error is 39, matching the correct degrees of freedom
((14 − 1) × (4 − 1) = 39).
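The bookkeeping for a missing cross can also be sketched in code. The function below (an illustration of the procedure described in the text, with ordering assumptions of my own; it is not the original program) skips the horizontal direct product for any missing parental combination and rebuilds the design matrix, reproducing the 56 × 17 dimensions noted above for the 2 x 3 deletion.

import numpy as np

def half_diallel_design(p, blocks, missing=()):
    # Assumes parents are numbered so that parent p has no missing crosses and that
    # plot means (one per cross per block) are the observations.
    crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)
               if (j, k) not in missing]
    nc = len(crosses)

    gca = np.zeros((nc, p - 1))
    for r, (j, k) in enumerate(crosses):
        for parent in (j, k):
            if parent < p:
                gca[r, parent - 1] += 1.0
            else:
                gca[r, :] -= 1.0

    # Products only for parental combinations that are present; the last product column
    # is then absorbed by -1's as in the balanced construction.
    pairs = [(j, k) for (j, k) in crosses if k < p]
    prods = np.column_stack([gca[:, j - 1] * gca[:, k - 1] for (j, k) in pairs])
    sca = prods[:, :-1].copy()
    sca[prods[:, -1] == 1.0, :] -= 1.0

    mu = np.ones((blocks * nc, 1))
    blk = np.zeros((blocks * nc, blocks - 1))
    for i in range(blocks):
        rows = slice(i * nc, (i + 1) * nc)
        if i < blocks - 1:
            blk[rows, i] = 1.0
        else:
            blk[rows, :] = -1.0
    return np.hstack([mu, blk, np.vstack([gca] * blocks), np.vstack([sca] * blocks)])

print(half_diallel_design(6, 4, missing=[(2, 3)]).shape)   # (56, 17)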

For the missing cross example μ is no longer equivalent to the mean of the plot means
since μ = 2.5386 and Σyijk/N = 2.5715 where N = 56 (number of plot means). This is the
result of GCA effects which are no longer orthogonal to the mean. Check the Xs'Xs matrix or
try estimating factors separately and compare to the estimates when all factors are included in Xs.

If formulae for balanced data (Becker 1975, Falconer 1981, and Hallauer and Miranda

1981) are applied to unbalanced data (plot-mean basis) estimates of parameters are no longer

appropriate because factors in the model are no longer independent (orthogonal). Applying
Becker's formula, which uses totals of cross means for a site, to the missing cross example
yields: gca1 = .2992, gca2 = -.5649, gca3 = -.5888, gca4 = .4665, gca5 = .3552, and gca6 =

.0219. These answers are very different in magnitude from those in Table 3-2 for this example

and gca6 also has a different sign. Employing these formulae in the analysis of unbalanced data

is analogous to matrix estimation of GCA's without the other factors in the model which is

inappropriate.


Several Missing Crosses


The concluding example (Table 3-2) is a drastically unbalanced data set resulting from

the arbitrary deletion of five crosses (1 x 2, 1 x 3, 2 x 3, 3 x 5, and 4 x 5). The matrix

manipulation for this example is an extension of the previous one cross deletion example. Rows

corresponding to yi12, Yi13, yi2, yi, and y,5 are deleted from the mean, block and GCA

submatrices for all blocks. The SCA matrix (now 4 columns = 10 crosses 5 -1 = 4 degrees

of freedom) is again reformed with only the relevant products of the GCA columns. Counting

degrees of freedom (columns of the sum-to-zero design matrix), the mean has one, block has







three, GCA has five, and SCA has four degrees of freedom for a total of 13. Error has (4-1)(10-

1) = 27 degrees of freedom. Totaling degrees of freedom for modeled effects and error yields

40 which equals the number of plot means.

In increasingly unbalanced cases (Table 3-2), the spread among the GCA estimates tends

to increase with increasing imbalance (loss of information). This is a general feature of OLS

analyses and the basis for the feature is that the spread among the GCA estimates is due to both

the innate spread due to additive genetic effects as well as the error in estimation of the GCA's.

When there is less information, GCA estimates tend to be more widely spread due to the increase

in the error variance associated with their estimation. This feature has been noted (White and

Hodge 1989, page 54) as the tendency to pick as parental winners individuals in a breeding

program which are the most poorly tested.


Discussion


After developing the OLS analysis and describing the inherent assumptions of the

analysis, there are four important factors to consider in the interpretation of sum-to-zero OLS

solutions: (1) the lack of uniqueness of the parameter estimates; (2) the weights given to plot

means (yijk) and in turn site means (ȳ·jk) for crosses in data sets with missing crosses in parameter

estimation; (3) the arbitrary nature of using a diallel mean (perforce a narrow genetic base) as

the mean about which the GCA's sum-to-zero; and (4) the assumption that the covariance matrix

for the observations (V) is Iσ²e.


Uniqueness of Estimates


Sum-to-zero restrictions furnish what would appear to be unique estimates of the

individual parameters, e.g. GCA1, when, in fact, these individual parameters are not estimable







(Graybill 1976, Freund and Littell 1981, and Milliken and Johnson 1984). The lack of

estimability is again analogous to attempting to solve a set of equations in n unknowns with t

equations where n is greater than t. Therefore, an infinite number of solutions exist for β.

There are quantities in this system of equations that are unique (estimable), i.e., the

estimate is invariant regardless of the restriction (sum-to-zero or set-to-zero) or generalized

inverse (no restrictions) used (Milliken and Johnson 1984) and the estimable functions include

sum-to-zero GCA and SCA estimates since they are linear combinations of the observations; but,

these estimable quantities do not estimate the individual parametric GCA's and SCA's of the

overparameterized model (equation 3-1) since there is no unique solution for those parameters.


Weighting of Plot Means and Cross Means in Estimating Parameters


With at least one measurement tree in each plot and with plot means as the unit of

observation, use of the matrix approach produces the same results as the basic formulae. The

weight placed on each plot mean in the estimation of a parameter can be determined by

calculating (Xs'Xs)⁻¹Xs', which can be viewed as a matrix of weights W so that equation 3-3 can
be written as b = Wy. The matrix W has these dimensions: the number of rows equals the
number of parameters in βs and the number of columns equals the number of plot means in y.
The ith row of W contains the weights applied to y to estimate the ith parameter in b (bi). In
the discussion which follows gca1 is utilized as bi.
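As a sketch of this computation (my own illustration, using the same hypothetical four-parent, two-block design as the earlier fragments rather than the six-parent example), W can be formed directly and its gca1 row examined; summing a cross's two block weights gives the weight on the overall cross mean, which in the balanced four-parent case works out to 1/4 for crosses containing parent 1 and −1/4 for the others.

import numpy as np

p, b = 4, 2
crosses = [(j, k) for j in range(1, p) for k in range(j + 1, p + 1)]
nc = len(crosses)

gca = np.zeros((nc, p - 1))
for r, (j, k) in enumerate(crosses):
    for parent in (j, k):
        if parent < p:
            gca[r, parent - 1] += 1.0
        else:
            gca[r, :] -= 1.0
prods = np.column_stack([gca[:, 0] * gca[:, 1], gca[:, 0] * gca[:, 2], gca[:, 1] * gca[:, 2]])
sca = prods[:, :2].copy()
sca[prods[:, 2] == 1.0, :] -= 1.0
Xs = np.hstack([np.ones((b * nc, 1)),
                np.concatenate([np.ones(nc), -np.ones(nc)]).reshape(-1, 1),
                np.vstack([gca] * b), np.vstack([sca] * b)])

W = np.linalg.solve(Xs.T @ Xs, Xs.T)     # W such that b = Wy
gca1_row = W[2, :]                       # parameter order: mu, b1, gca1, gca2, gca3, sca12, sca13
print(gca1_row[:nc] + gca1_row[nc:])     # combined weights on the six overall cross means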

If there are no missing plots, the cross mean in every block (yijk) has the same weighting
and weights can be combined across blocks to yield the weight on the overall cross mean (ȳ·jk).
It can be shown that for the balanced numerical example gca1 is calculated by weighting the
overall cross means containing parent 1 by 1/6 and weighting all overall cross means not











[Figure 3-4 is a 6 × 6 half-matrix of the weights placed on the overall cross means (ȳ·jk) in the
estimation of GCA1; in the balanced case the weights are 1/6 for cross means containing parent 1
and −1/12 for all others, and the marginal weights on the GCA parameters are 5/6 on GCA1 and
−1/6 on each of GCA2 through GCA6.]

Figure 3-4. Weights on overall cross means (ȳ·jk) for the three numerical examples for
estimation of GCA1. The weights for the balanced example (above the diagonal) are presented
in both fractional and decimal form. The weights for the one-cross missing and the five-crosses
missing examples are presented as the upper number and lower number, respectively, in cells below
the diagonal. The marginal weights on GCA parameters (right margin) do not change although
cells are missing.








containing parent 1 by -1/12. Figure 3-4 (above the diagonal) demonstrates the weightings on

the overall cross means for the balanced numerical example as well as the marginal weighting on

the GCA parameters. These marginal weightings are obtained by summing along a row and/or

column as one would to obtain the marginal totals for a parent (Becker 1975). One feature of

sum-to-zero solutions is that these marginal weightings will be maintained no matter the

imbalance due to missing crosses, as will be seen by considering the numerical examples for a

missing cross (Figure 3-4 below the diagonal, upper number) and five missing crosses (Figure

3-4 below the diagonal, lower number). The marginal weights have remained the same as in the

balanced case while the weights on the cross means differ among the crosses containing parent

1 and also among the crosses not containing parent 1. In the five missing crosses example,

crosses y.2 and y.2 even receive a positive weighting where in the prior examples they had

negative weighting.

The expected value in all three examples is GCA1s (for sum-to-zero) despite the

apparently nonsensical weightings to cross means with missing crosses; however, the evaluation

of the estimates in terms of the original model changes with each new combination of missing

cells, i.e., cross means such as ȳ·26 have a positive weight in the five missing crosses example in GCA1

estimation. Whether this type of estimation is desirable with missing cell (cross) means has been

the subject of some discussion (Speed, Hocking and Hackney 1978, Freund 1980, and Milliken

and Johnson 1984). The data analyst should be aware of the manner in which sum-to-zero treats

the data with missing cell means and decide whether that particular linear combination of cross

means estimating the parameter is one of interest, realizing that the meaning of the estimates in

terms of the original model is changing.








Diallel Mean


The use of the mean for a half-diallel as the mean around which GCA's sum-to-zero is

not satisfactory in that the diallel mean is the mean of a rather narrow genetically based

population, and in particular that the comparisons of interest are not usually confined to the

specific parents in a specific diallel on a particular site. A checklot can be employed to represent

a base population against which comparison of half- or full-sib families can be made to provide

for comparison of GCA estimates from other tests (van Buijtenen and Bridgwater 1986).

Mathematically, when effects are forced to sum-to-zero around their own mean, the

absolute value of the GCA's is reflective of their value relative to the mean of the group. Even

if the parents involved in the particular diallel were all far superior to the population mean for

GCA, GCA's calculated on an OLS basis would show that some of these GCA's were negative.

If the GCA's of the diallel parents were in fact all below the population mean, the opposite and

equally undesirable result ensues. For disconnected diallels together on a single site, an OLS

analysis would yield GCA estimates that sum-to-zero within each diallel since parents are nested

within diallels. Unless the comparisons of interest are only in the combination of the parents in

a specific diallel on a specific site, the checklot alternative is desirable.

A method for obtaining the desired goal of comparable GCA's from disconnected

experiments, disregarding the problem of heteroscedasticity, is to form a function from the data

which yields GCA estimates properly located on the number scale. Such a function can be

formed (using GCA1 as an example) from gca1s, the diallel mean, and the checklot mean.

From expectations of the scalar linear model (equation 3-1),

GCA1s = ((p−1)/p)GCA1 − (1/p)Σj=2..p GCAj + (1/p)Σk=2..p SCA1k                            3-5
          − (2/(p(p−2)))Σj=2..p−1 Σk=j+1..p SCAjk;
E{diallel mean} = μ + (Σi=1..b Bi)/b + (2/p)Σj=1..p GCAj + (2/(p(p−1)))Σj=1..p−1 Σk=j+1..p SCAjk; and
E{checklot mean} = μ + (Σi=1..b Bi)/b + τ;
where the subscript j for GCA denotes either j or k and τ represents the fixed genetic parameter of
the checklot. The function used to properly locate GCA1 (the subscript rel denotes the relocated
GCA1) is gca1rel = gca1s + (1/2)(diallel mean − checklot mean). The expectation of gca1rel with
negligible SCA is GCA1rel = GCA1 − τ/2; and since breeding value equals twice GCA, BV1rel =
BV1 − τ. If SCA is non-negligible then the expectation is
GCA1rel = GCA1 + (1/(p−1))Σk=2..p SCA1k − (1/((p−1)(p−2)))Σj=2..p−1 Σk=j+1..p SCAjk − τ/2.    3-6

In either case the function provides a reasonable manner by which GCA estimates from

disconnected diallels are centered at the same location on a number scale and are then

comparable.
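A one-line function captures the relocation; the numbers in the usage line are invented solely to show the arithmetic.

def relocate_gca(gca_s, diallel_mean, checklot_mean):
    # gca_rel = gca_s + (1/2)(diallel mean - checklot mean)
    return gca_s + 0.5 * (diallel_mean - checklot_mean)

print(relocate_gca(0.12, 10.6, 10.2))   # 0.32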


Variance and Covariance of Plot Means


The variances of plot means with unequal numbers of trees per plot are by definition

unequal, i.e., Var(yijk) = σ²p + σ²w/nijk where σ²p is the plot variance, σ²w is the within-plot variance
and nijk is the number of observations per plot. Also, if blocks were considered random, there
would be an additional source of variance for plot means due to blocks (as well as a covariance
between plot means in the same block) and this could be incorporated into the V matrix with
Var(yijk) = σ²b + σ²p + σ²w/nijk. Since the variances of the means in the observation vector are

not equal and there is a covariance between the means if blocks are being considered random,

best linear unbiased estimates (BLUE) would be secured by weighting each mean by its true

associated variance (Searle 1987, page 316). This is the generalized least squares (GLS)

approach as


b = (Xs'V⁻¹Xs)⁻¹Xs'V⁻¹y







The GLS approach relaxes the OLS assumptions of equal variance of and no covariance between

the observations (plot means) while still treating genetic parameters as fixed effects. The entries

along the diagonal of the V matrix are the variances of the plot means (Var(yijk)) in the same
order as means in the data vector. The off-diagonal elements of V would be either 0 or σ²b (the
variance due to the random variable block) for elements corresponding to observations in the
same block. BLUE requires exact knowledge of V; if estimates of σ²b, σ²p, and σ²w are utilized
in the V matrix, estimable functions of β approximate BLUE.
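A brief sketch (assumptions mine: plot means as observations, random blocks, and variance components taken as known) of how the V matrix and the GLS solution described above might be formed in Python/NumPy:

import numpy as np

def plot_mean_covariance(block_ids, n_trees, s2_block, s2_plot, s2_within):
    # Var(y_ijk) = s2_block + s2_plot + s2_within / n_ijk on the diagonal;
    # s2_block for pairs of plot means in the same block; 0 elsewhere.
    n = len(block_ids)
    V = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                V[i, j] = s2_block + s2_plot + s2_within / n_trees[i]
            elif block_ids[i] == block_ids[j]:
                V[i, j] = s2_block
    return V

def gls_solution(X, V, y):
    # b = (X'V^-1 X)^-1 X'V^-1 y
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Example covariance matrix for four plot means in two blocks with unequal survival.
V = plot_mean_covariance([1, 1, 2, 2], [5, 3, 5, 4], s2_block=0.05, s2_plot=0.10, s2_within=0.60)
print(V)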

The OLS assumption that SCA and GCA are fixed effects can also be relaxed to allow

for covariances due to genetic relatedness. In particular, the information that means are from the

same half- or full-sib family could be included in the V matrix. Relaxation of the zero covariance

assumption implies that GCA and SCA are random variables. If GCA and SCA are treated as

random variables, then the application of best linear prediction (BLP) or best linear unbiased

prediction (BLUP) to the problem would be more appropriate (White and Hodge 1989, page 64).

The treatment of the genetic parameters as random variables is consistent with that used in

estimating genetic correlations and heritabilities. The V matrix of such an application would

include, in addition to the features of the GLS V matrix, the covariance between full-sib or half-

sib families added to the off-diagonal elements in V, i.e., if the first and second plot means in

the data vector had a covariance due to relationship, then that covariance is inserted twice in the

V matrix. The covariance would appear as the second element in the first row and the first

element in the second row of V (V is a symmetric matrix). Also the diagonal elements of V

would increase by 2σ²GCA (the variance due to treating GCA as a random variable) + σ²SCA (the

variance due to treating SCA as a random variable).








Comparison of Prediction and Estimation Methodologies


Which methodology (OLS, GLS, BLP, or BLUP) to apply to individual data bases is

somewhat a subjective decision. The decision can be based both on the computational or

conceptual complexity of the method and the magnitude of the data base with which the analyst

is working. To aid in this decision, this discussion highlights the differences in the inherent

properties and assumptions of the techniques.

For all practical purposes the answers from the four techniques will never be equal;

however, there are two caveats. First, OLS estimates equal GLS estimates if all the cell means

are known with the same precision (variance) (Searle 1987, page 490). Otherwise, GLS

discounts the means that are known with less precision in the calculations and different estimates

result. The second caveat is if the amount of data is infinite, i.e., all cross means are known

without error, then all four techniques are equivalent (White and Hodge 1989, pages 104-106).

In all other cases BLP and BLUP shrink predictions toward the location parameter(s) and produce

predictions which are different from OLS or GLS estimates even with balanced data. During

calculations GLS, BLP, and BLUP place less weight on observations known with less precision,

which is intuitively pleasing.

With OLS and GLS forest geneticists treat GCA's and SCA's as fixed effects for

estimation and then as random variables for genetic correlations and heritabilities. BLP and

BLUP provide a consistent treatment of GCA's and SCA's as random variables while differing

in their assumptions about location parameters (fixed effects). In BLP fixed effects are assumed

known without error (although they are usually estimated from the data) while with BLUP fixed

effects are estimated using GLS. BLP and BLUP techniques also contain the assumption that the

covariance matrix of the observations is known without error (most often variances must be

estimated). In many BLUP applications (Henderson 1974), mixed model equations are utilized







iteratively to estimate fixed effects and to predict random variables from a data set. A BLUP

treatment of fixed effects allows any connectedness between experiments to be utilized in the

estimation of the fixed effects. This provides an intuitive advantage of BLUP over BLP in

experimentation where connectedness among genetic experiments is available or where the data

are so unbalanced that treating the fixed effects as known is less desirable than a GLS estimate

of the fixed effects.

An ordering of computational complexity and conceptual complexity from least to most

complex of the four methods is OLS, GLS, BLP and BLUP. The latter three methods require

the estimation of the covariance matrix of the observations either separately (a priori) or

iteratively with the fixed effects. Precise estimation of the covariance matrix for observations

requires a great number of observations and the precision of GLS, BLP and BLUP estimations

or predictions is affected by the error of estimation of the components of V.

Selection of a method can then be based on weighing the computational complexity and

size of the available data base against the advantages offered by each method. Thus, if

complexity of the computational problem is of paramount concern, the analyst necessarily would

choose OLS. With a small data base (one that does not allow reasonable estimates of variances),

the analyst would again choose OLS. With a large data base and no qualms with computational

complexity, the analyst can choose between BLP and BLUP based on whether there is sufficient

connectedness or imbalance among the experiments to make BLUP advantageous.


Conclusions


Methods of solving for GCA and SCA estimates for balanced (plot-mean basis) and

unbalanced data have been presented along with the inherent assumptions of the analysis. The

use of plot means and the matrix equations will produce sum-to-zero OLS estimates for GCA and








SCA for all types of imbalance. Formulae in the literature which yield OLS solutions for

balanced data can yield misleading solutions for unbalanced data because of the loss of

orthogonality and because the weightings they place on site means for crosses (or totals) are constants.

GCA's and SCA's obtained through sum-to-zero restriction are not truly estimates of

parametric population GCA's and SCA's. There are an infinite number of solutions for GCA's

and SCA's from the system of equations as a result of the overparameterized linear model. Yet,

if the only comparisons of interest are among the specific parents on a particular site, then the

estimates calculated by sum-to-zero restrictions are appropriate. Checklots may be used to

provide comparability among estimates derived from disconnected sets.

Knowledge of the innate mathematical features of OLS analysis discussed here should
help the data analyst decide if OLS is the most desirable technique for the data

at hand. It may be desirable to relax OLS assumptions, which are in all likelihood invalid for

the covariance matrix of the observations. This could lead to GLS, BLP or BLUP as better

alternatives.












CHAPTER 4
VARIANCE COMPONENT ESTIMATION TECHNIQUES
COMPARED FOR TWO MATING DESIGNS
WITH FOREST GENETIC ARCHITECTURE
THROUGH COMPUTER SIMULATION


Introduction


In many applications of quantitative genetics, geneticists are commonly faced with the

analysis of data containing a multitude of flaws (e.g. non-normality, imbalance, and

heteroscedasticity). Imbalance, as one of these flaws, is intrinsic to quantitative forest genetics

research because of the difficulty in making crosses for full-sib tests and the biological realities

of long term field experiments. Few definitive studies have been conducted to establish optimal

methods for estimation of variance components from unbalanced data. Simulation studies using

simple models (one-way or two-way random models) have been conducted for certain data

structures, i.e., imbalance, experimental design, and variance parameters (Corbeil and Searle

1976, Swallow 1981, Swallow and Monahan 1984, interpretations by Littell and McCutchan

1986). The results from these studies indicate that technique optimality is a function of the data

structure.

In practice (both historically and still commonplace), estimation of variance components

in forest genetics applications has been achieved by using sequentially adjusted sums of squares

as an application of Henderson's Method 3 (HM3, Henderson 1953). Under normality and with

balanced data, this technique has the desirable property of being the minimum variance unbiased

estimator. If the data are unbalanced, then the only property retained by HM3 estimation is







unbiasedness (Searle 1971, Searle 1987 pp. 492,493,498). Other estimators have been shown

to be locally superior to HM3 in variance or mean square error properties in certain cases (Klotz

et al. 1969, Olsen et al. 1976, Swallow 1981, Swallow and Monahan 1984).

Over the last 25 years, there has been a proliferation of variance component estimation

techniques including minimum norm quadratic unbiased estimation (MINQUE, Rao 1971a),

minimum variance quadratic unbiased estimation (MIVQUE, Rao 1971b), maximum likelihood

(ML, Hartley and Rao 1967), and restricted maximum likelihood (REML, Patterson and

Thompson 1971). The practical application of these techniques has been impeded by their

computational complexity. However, with continuing advances in computer technology and the

appearance of better computational algorithms, the application of these procedures continues to

become more tractable (Harville 1977, Geisbrecht 1983, Meyer 1989). Whether these methods

of analysis are superior to HM3 for many genetics applications remains to be shown.

With balanced data and disregarding negative estimates, all previously mentioned

techniques except ML produce the same estimates (Harville 1977). With unbalanced data, each

technique produces a different set of variance component estimates. Criteria must then be

adopted to discriminate among techniques. Candidate criteria for discrimination include

unbiasedness (large number convergence on the parametric value), minimum variance (estimator

with the smallest sampling variance), minimum mean square error (minimum of sampling

variance plus squared bias, Hogg and Craig 1978), and probability of nearness (probability that

sample estimates occur in a certain interval around the parametric value, Pitman 1937).

Negative estimates are also problematic in the estimation of variance components. Five

alternatives for dealing with the dilemma of estimates less than zero (outside the natural parameter

space of zero to infinity) are (Searle 1971): 1) accept and use the negative estimate, 2) set the

negative estimate to zero (producing biased estimates), 3) re-solve the system with the offending







component set to zero, 4) use an algorithm which does not allow negative estimates, and 5) use

the negative estimate to infer that the wrong model was utilized.

The purpose of this research was to determine if the criteria of unbiasedness, minimum

variance, minimum mean square error, and probability of nearness discriminated among several

variance component estimation techniques while exploring various alternatives for dealing with

negative variance component estimates. In order to make such comparisons, a large number of

data sets were required for each experimental level. Using simulated data, this chapter compares

variance component estimation techniques for plot-mean and individual observations, two mating

systems (modified half-diallel and half-sib) and two sets of parametric variance components.

Types of imbalance and levels of factors were chosen to reflect common situations in forest

genetics.


Methods


Experimental Approach


For each experimental level 1000 data sets were generated and analyzed by various

techniques (Table 4-1) producing numerous sets of variance component estimates for each data

set. This workload resulted in enormous computational time being associated with each

experimental level. The overall experimental design for the simulation was originally conceived

as a factorial with two types of mating design (half-diallel and half-sib), two sets of true variance

components (Table 4-2), two kinds of observations (individual and plot mean) and three types of

imbalance: 1) survival levels (80% and 60%, with 80% representing moderate survival and 60%

representing poor survival); 2) for full-sib designs three levels of missing crosses (0, 2, and 5 out

of 15 crosses); and 3) for half-sib designs two levels of connectedness among tests (15 and 10

common families between tests out of 15 families per test). Because of the computational time









Table 4-1. Abbreviation for and description of variance component estimation methods utilized
for analyses based on individual observations (if utilized for plot-mean analysis the abbreviation
is modified by prefixing a 'P').

Abbreviation Description Citation

ML Maximum Likelihood: estimates not restricted to the parameter Hartley and Rao 1967;
PML space (individual and plot-mean analysis). Shaw 1987

MODML Maximum Likelihood: negative estimates set to zero after Hartley and Rao 1967
convergence (individual analysis).

NNML Maximum Likelihood: if negative estimates appeared at Hartley and Rao 1967;
convergence, they were set to zero and the system re-solved Miller 1973
(individual analysis).

REML Restricted Maximum Likelihood: estimates not restricted to the Patterson and
PREML parameter space (individual and plot-mean analysis). Thompson 1971; Shaw
1987; Harville 1977

MODREML Restricted Maximum Likelihood: negative estimates set to zero Patterson and
after convergence (individual analysis). Thompson 1971

NNREML Restricted Maximum Likelihood: if negative estimates appeared Patterson and
PNNREML at convergence, they were set to zero and the system re-solved Thompson 1971; Miller
(individual and plot-mean analysis). 1973

MIVQUE Minimum Variance Quadratic Unbiased: non-iterative with true Rao 1971b
PMIVQUE (parametric) values of the variance components as priors
(individual and plot-mean analysis).

MINQUE1 Minimum Norm Quadratic Unbiased: non-iterative with ones as Rao 1971a
PMINQUE1 priors for all variance components (individual and plot-mean
analysis).

TYPE3 Sequentially Adjusted Sums of Squares; Henderson's Method 3 Henderson 1953
PTYPE3 (individual and plot-mean analysis).

MIVPEN MIVQUE with a penalty algorithm to prevent negative estimates Harville 1977
(individual analysis).


constraint, the experiment could not be run as a complete factorial and the investigation continued

as a partial factorial. In general, the approach was to run levels which were at opposite ends of

the imbalance spectrum, i.e., 80% survival and no missing crosses versus 60% survival and 5

missing crosses, within a variance component level. If results were consistent across these


treatment combinations, intermediate levels were not run.







Designation of a treatment combination is by a five-character alphanumeric field. The first

character is either "H" (half-sib) or "D" (half-diallel). The second character denotes the set of

parametric variance components where "1" designated the set of variance components associated

with heritability of 0.1 and "2" designated the set of variance components associated with

heritability of 0.25 (Table 4-2). The third character is an "S" indicating that the last two

characters determine the imbalance level. The fourth character designates the survival level either

"6" for 60% or "8" for 80%. The final character specifies the number of missing crosses (half-

diallel) or lack of connectedness (half-sib). The treatment combination 'H1S80' is a half-sib

mating design (H), the set of variance components associated with heritability equalling 0.1 (1),

80% survival (8), and all 15 families in common across tests (0).


Table 4-2. Sets of true variance components for the half-diallel and half-sib mating designs
generated from specification of two levels of single-tree heritability (h²), type B correlation (r_B),
and non-additive to additive variance ratio (d/a).

Genetic ratios^a           Mating      True variance components^b
h²     r_B    d/a          design      σ²_t    σ²_b    σ²_g    σ²_s    σ²_tg   σ²_ts   σ²_p    σ²_w

0.1    0.5    1.0          full-sib    1.0     0.5     0.25    0.25    0.25    0.25    .595    7.905
0.1    0.5    1.0          half-sib    1.0     0.5     0.25    NA      0.25    NA      .475    7.9964
0.25   0.8    0.25         full-sib    1.0     0.5     0.625   .1562   .1562   .0391   .5769   7.6649

^a h² = 4σ²_g / σ²_phenotypic; r_B = 4σ²_g / (4σ²_g + 4σ²_tg); and σ²_D / σ²_A as d/a = 4σ²_s / 4σ²_g.
^b See definitions in equation 4-1.


Experimental Design for Simulated Data


The mating design for the simulation was either a six-parent half-diallel (no selfs) or a

fifteen-parent half-sib. The randomized complete block field design was in three locations (i.e.,

separate field tests) with four complete blocks per location and six trees per family in a block;

where family is a full-sib family for half-diallel or a half-sib family for the half-sib design. This







field design and the mating designs reflect typical designs in forestry applications (Squillace 1973,

Wilcox et al. 1975, Bridgwater et al. 1983, Weir and Goddard 1986, Loo-Dinkins et al. 1991)

and are also commonly used in other disciplines (Matzinger et al. 1959, Hallauer and Miranda

1981, Singh and Singh 1984). The six trees per family could be considered as contiguous or

non-contiguous plots without affecting the results or inferences.


Full-Sib Linear Model


The scalar linear model employed for half-diallel individual observations is

y_ijklm = μ + t_i + b_ij + g_k + g_l + s_kl + tg_ik + tg_il + ts_ikl + p_ijkl + w_ijklm,    4-1

where y_ijklm is the mth observation of the klth cross in the jth block of the ith test;

μ is the population mean;

t_i is the random variable test location ~ NID(0, σ²_t);

b_ij is the random variable block ~ NID(0, σ²_b);

g_k is the random variable female general combining ability (gca) ~ NID(0, σ²_g);

g_l is the random variable male gca ~ NID(0, σ²_g);

s_kl is the random variable specific combining ability (sca) ~ NID(0, σ²_s);

tg_ik is the random variable test by female gca interaction ~ NID(0, σ²_tg);

tg_il is the random variable test by male gca interaction ~ NID(0, σ²_tg);

ts_ikl is the random variable test by sca interaction ~ NID(0, σ²_ts);

p_ijkl is the random variable plot ~ NID(0, σ²_p);

w_ijklm is the random variable within-plot ~ NID(0, σ²_w); and
there is no covariance between random variables in the model.

This linear model in matrix notation is

y = μ1 + Z_T e_T + Z_B e_B + Z_G e_G + Z_S e_S + Z_TG e_TG + Z_TS e_TS + Z_P e_P + e_W    4-2

where y is the n×1 observation vector;

Z_i is the portion of the design matrix for the ith random variable (n×t for test, n×b for block,
n×g for gca, n×s for sca, n×tg for test by gca, n×ts for test by sca, and n×p for plot);

e_i is the corresponding vector of unobservable random effects for the ith random variable
(t×1, b×1, g×1, s×1, tg×1, ts×1, and p×1, respectively, with e_W being n×1);

1 is an n×1 vector of 1's; and

n, t, b, g, s, tg, ts, and p are the number of observations, tests, blocks, gca's, sca's, test

by gca interactions, test by sca interactions and plots, respectively.

Utilizing customary assumptions in half-diallel mating designs (Method 4, Griffing 1956), the

variance of an individual observation is

Var(y_ijklm) = σ²_t + σ²_b + 2σ²_g + σ²_s + 2σ²_tg + σ²_ts + σ²_p + σ²_w;    4-3

and in matrix notation the covariance matrix for the observations is

Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_g + Z_S Z_S'σ²_s + Z_TG Z_TG'σ²_tg + Z_TS Z_TS'σ²_ts + Z_P Z_P'σ²_p + I_n σ²_w    4-4

where ' indicates the transpose operator, all matrices of the form Z_i Z_i' are n×n, and I_n is an

n×n identity matrix.


Half-sib Linear Model


The scalar linear model for half-sib individual observations is

y_ijkm = μ + t_i + b_ij + g_k + tg_ik + p_ijk + w_ijkm,    4-5

where y_ijkm is the mth observation of the kth half-sib family in the jth block of the ith test;

μ, t_i, b_ij, g_k, and tg_ik retain the definitions in Eq. 4-1;

p_ijk is the random variable plot, containing different genotype by environment
components than the corresponding term in Eq. 4-1, ~ NID(0, σ²_p);

w_ijkm is the random variable within-plot, containing different levels of genotypic and
genotype by environment components than the corresponding term in Eq. 4-1, ~ NID(0, σ²_w); and

there is no covariance between random variables in the model.

The matrix notation model is

y = μ1 + Z_T e_T + Z_B e_B + Z_G e_G + Z_TG e_TG + Z_P e_P + e_W    4-6

with dimensions analogous to those in Eq. 4-2 (y and e_W are n×1; Z_T is n×t, Z_B is n×b, Z_G is n×g, Z_TG is n×tg, and Z_P is n×p, with the e_i vectors conformable).

The variance of an individual observation in half-sib designs is

Var(y_ijkm) = σ²_t + σ²_b + σ²_g + σ²_tg + σ²_p + σ²_w    4-7

and Var(y) = Z_T Z_T'σ²_t + Z_B Z_B'σ²_b + Z_G Z_G'σ²_g + Z_TG Z_TG'σ²_tg + Z_P Z_P'σ²_p + I_n σ²_w.    4-8

For an observational vector based on plot means, the plot and within-plot random

variables were combined by taking the arithmetic mean across the observations within a plot.

The resulting plot-means model has a new plot-level error term that is a composite of the

plot and within-plot variance terms of the individual-observation model.

Three estimates of ratios among variance components were determined: 1) single-tree

heritability adjusted for test location and block as h² = 4σ²_g / σ²_phenotypic, where σ²_phenotypic is the

estimate of the variance of an individual observation from equations 4-3 and 4-7 with the variance

components for test location and block deleted; 2) type B correlation as r_B = 4σ²_g / (4σ²_g +

4σ²_tg); and 3) dominance to additive variance ratio as d/a = 4σ²_s / 4σ²_g, all computed from the

estimated variance components.
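For illustration, these ratios follow directly from a set of variance components; the short Python sketch below uses the full-sib component set of Table 4-2 and recovers h² = 0.1, r_B = 0.5 and d/a = 1.0 (the dictionary keys are labels invented for the sketch).

    # Full-sib (half-diallel) variance component set from Table 4-2, first row.
    comps = {"test": 1.0, "block": 0.5, "gca": 0.25, "sca": 0.25,
             "test_gca": 0.25, "test_sca": 0.25, "plot": 0.595, "within": 7.905}

    # Phenotypic variance of an individual adjusted for test and block (equation 4-3
    # with the test and block components deleted); gca and test-by-gca enter twice
    # because both a female and a male term appear in the model.
    phen = (2 * comps["gca"] + comps["sca"] + 2 * comps["test_gca"]
            + comps["test_sca"] + comps["plot"] + comps["within"])

    h2 = 4 * comps["gca"] / phen                                        # single-tree heritability
    rB = 4 * comps["gca"] / (4 * comps["gca"] + 4 * comps["test_gca"])  # type B correlation
    d_a = 4 * comps["sca"] / (4 * comps["gca"])                         # dominance/additive ratio

    print(h2, rB, d_a)   # approximately 0.1, 0.5 and 1.0 up to floating-point rounding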


Data Generation and Deletion


Data generation was accomplished by using a Cholesky upper-lower decomposition of the

covariance matrix for the observations (Goodnight 1979) and a vector of pseudo-random standard

normal deviates generated using the Box-Muller transformation with pseudo-random uniform

deviates (Knuth 1981, Press et al. 1989). The upper-lower decomposition creates a matrix (U)

with the property that Var(y) = U'U. The vector of pseudo-random standard normal deviates







(z) has a covariance matrix equal to an n×n identity matrix (I_n), where n is the number of observations.

The vector of observations is created as y = U'z. Then Var(y) = U'(Var(z))U and since Var(z)

= I_n, Var(y) = U'I_nU = U'U.
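A minimal Python (numpy) sketch of this generation scheme for a toy structure follows; the design matrix and component values are placeholders, and numpy's normal generator stands in for the Box-Muller transformation used in the study.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy structure: 2 families, 3 trees per family, one genetic and one residual component.
    n_fam, n_per = 2, 3
    n = n_fam * n_per
    Zg = np.kron(np.eye(n_fam), np.ones((n_per, 1)))   # design matrix for family effects
    var_g, var_w = 0.25, 1.0                            # placeholder variance components

    V = var_g * Zg @ Zg.T + var_w * np.eye(n)           # covariance matrix of the observations
    U = np.linalg.cholesky(V).T                         # upper factor, so V = U'U
    z = rng.standard_normal(n)                          # standard normal deviates, Var(z) = I_n
    y = U.T @ z                                         # Var(y) = U'Var(z)U = U'U = V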

Analyses of survival patterns using data from the Cooperative Forest Genetic Research

Program (CFGRP) at the University of Florida were used to develop survival distributions for

the simulation. The data sets chosen for survival analysis were from full-sib slash pine (Pinus

elliottii var elliottii Engelm) tests planted in randomized complete block designs with the families

in row plots and were selected because the survival levels were either approximately 60% or

80%. Survival levels for most crosses (full-sib families) clustered around the expected value,

i.e., approximately 60% for an average survival level of 60%; however, there were always a few

crosses that had much poorer survival than average and also a small number of crosses that had

much better survival than average. This survival pattern was consistent across the 50 experiments

analyzed. Thus, a lower than average survival level was arbitrarily assigned to certain crosses,

a higher than average survival level was assigned to certain crosses, and the average survival

level assigned to most crosses. This modeling of survival pattern was also extended to the half-

sib mating design. At 80% survival no missing plots were allowed and at 60% survival missing

plots occurred at random.
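The survival pattern can be mimicked by assigning each cross a survival probability and deleting trees at random within crosses; the sketch below is a hypothetical illustration (the probabilities and counts are not the CFGRP values).

    import numpy as np

    rng = np.random.default_rng(2)

    target = 0.60                            # average survival level for the experiment
    n_fam, n_trees = 15, 72                  # crosses, and trees per cross across tests/blocks/plots

    # Most crosses near the target, a few much poorer and a few much better.
    fam_surv = np.full(n_fam, target)
    fam_surv[:2] = target - 0.25             # a few crosses with much poorer survival
    fam_surv[-2:] = min(target + 0.25, 1.0)  # a few crosses with much better survival

    # Keep each tree with its cross's survival probability (deletions random within a cross).
    keep = [rng.random(n_trees) < p for p in fam_surv]
    realized = float(np.concatenate(keep).mean())   # realized survival over the whole experiment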

Full-sib family deletion simulated crosses which could not be made and were therefore

missing from the experiment. When deleting five crosses, the deletion was restricted to a

maximum of four crosses per parent to prevent loss of all the crosses in which a single parent

appeared since this would have resulted in changing a six-parent to a five-parent half-diallel.

Tests having only subsets of the half-sib families in common are a frequent occurrence

in data analysis at CFGRP. This partial connectedness was simulated by generating data in which







only 10 of the 15 families present in a test were common to either one of the other two tests

comprising a data set.


Variance Component Estimation Techniques


Two algorithms were utilized for all estimation techniques: sequentially adjusted sums

of squares (Milliken and Johnson 1984, p 138) for HM3; and Giesbrecht's algorithm (Giesbrecht

1983) for REML, ML, MINQUE and MIVQUE. Giesbrecht's algorithm is primarily a gradient

algorithm (the method of scoring), and as such allows negative estimates (Harville 1977,

Giesbrecht 1983). Negative estimates are not a theoretical difficulty with MINQUE or MIVQUE;

however, for REML and ML, estimates should be confined to the parameter space. For this

reason estimators referred to as REML and ML in this chapter are not truly REML and ML when

negative estimates occur; further, there is the possibility that the iterative solution stopped at a

local maxima not the global maximum. These concerns are commonplace in REML and ML

estimation (Corbeil and Searle 1976, Harville 1977, Swallow and Monahan 1984); however,

ignoring these two points, these estimators are still referred to as REML and ML.

The basic equation for variance component estimation under normality (Giesbrecht 1983)

for MIVQUE, MINQUE and REML is

{tr(QV_iQV_j)} σ̂² = {y'QV_iQy}    4-9

then σ̂² = {tr(QV_iQV_j)}^{-1} {y'QV_iQy};

and for ML

{tr(V^{-1}V_iV^{-1}V_j)} σ̂² = {y'QV_iQy}    4-10

where the bracketed matrix is r×r and the two bracketed vectors are r×1;

{tr(QV_iQV_j)} is a matrix whose elements are tr(QV_iQV_j), where in the full-sib

designs i = 1 to 8 and j = 1 to 8, i.e., there is a row and column for

every random variable in the linear model;

tr is the trace operator, that is, the sum of the diagonal elements of a matrix;

Q = V^{-1} - V^{-1}X(X'V^{-1}X)^{-}X'V^{-1} for V as the covariance matrix of y and X as

the design matrix for fixed effects;

V_i = Z_iZ_i' where i indexes the random variables test, block, etc.;

σ̂² is the vector of variance component estimates; and

r is the number of random variables in the model.
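As a concrete, if computationally naive, illustration, the following Python sketch forms V, Q and the two bracketed quantities of equation 4-9 with dense numpy matrices and solves for the components. Giesbrecht's algorithm avoids building these n×n matrices explicitly, so this only shows the algebra; the function and argument names are inventions for the sketch.

    import numpy as np

    def quadratic_solve(y, X, Z_list, priors):
        """One pass of equation 4-9: solve {tr(Q Vi Q Vj)} s2 = {y'Q Vi Q y}.

        Z_list holds the design matrix for each random variable (the last entry can
        be an identity for the residual); priors are the variances used to build V.
        Returns the vector of variance component estimates."""
        V_i = [Z @ Z.T for Z in Z_list]                      # Vi = Zi Zi'
        V = sum(p * Vi for p, Vi in zip(priors, V_i))        # covariance matrix of y
        Vinv = np.linalg.inv(V)
        # Q = V^-1 - V^-1 X (X'V^-1 X)^- X'V^-1
        Q = Vinv - Vinv @ X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv
        r = len(Z_list)
        lhs = np.empty((r, r))
        rhs = np.empty(r)
        for i in range(r):
            QVi = Q @ V_i[i]
            rhs[i] = y @ QVi @ Q @ y                         # y' Q Vi Q y
            for j in range(r):
                lhs[i, j] = np.trace(QVi @ Q @ V_i[j])       # tr(Q Vi Q Vj)
        return np.linalg.solve(lhs, rhs)

Called once with ones as priors this sketch corresponds to MINQUE1, with the true components as priors to MIVQUE, and iterated to convergence to REML, as described in the paragraphs that follow.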

The MINQUE estimator used was MINQUE1, i.e., ones as priors for all variance

components, calculated by applying Giesbrecht's algorithm non-iteratively. MINQUE1 was

chosen because of results demonstrating MINQUE0 (prior of 1 for the error term and of 0 for

all others) to be an inferior estimation technique for many cases (Swallow and Monahan 1984,

R.C. Littell unpublished data).

With normally-distributed uncorrelated random variables, the use of the true values of

the variance components as priors in a non-iterative application of Giesbrecht's algorithm

produced the MIVQUE solutions (equation 4-9). Obtaining true MIVQUE estimation is a luxury

of computer simulation and would not be possible in practice since the true variance components

are required (Swallow and Searle 1978). This estimator was included to provide a standard of

comparison for other estimators. An additional MIVQUE-type estimator, referred to as

MIVPEN, was also included. MIVPEN was also a non-iterative application of the algorithm with

the true variance components as priors; however, this estimator was conditioned on the variance

component parameter space and did not allow negative estimates. The non-negative conditioning

of MIVPEN was accomplished by adding a penalty algorithm to MIVQUE such that no variance

component was allowed to be less than 1x10l7. Estimates from MIVPEN were equal to MIVQUE

for data sets for which there were no negative MIVQUE variance component estimates. When

negative MIVQUE estimates occurred, the two techniques were no longer equivalent. The penalty







algorithm operated by using Δ = σ² - σ̂²_MIVQUE and by choosing a scalar weight w such that no element

of σ̂²_pen is less than 1×10^-7. Then σ̂²_pen = σ̂²_MIVQUE + wΔ, where Δ is the vector of departure of the

MIVQUE estimates from the true values (σ²), 1×10^-7 is an arbitrary constant and σ̂²_pen is the vector of estimated variance

components conditioned on non-negativity.
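A sketch of that penalty step follows; the rule for w below (the smallest weight that lifts every offending component to the floor) follows the description above, though the study's implementation may differ in detail.

    import numpy as np

    def mivpen(mivque_est, true_values, floor=1e-7):
        """Pull MIVQUE estimates toward the true values just far enough that no
        component falls below the floor (sketch of the penalty described above)."""
        est = np.asarray(mivque_est, dtype=float)
        true_values = np.asarray(true_values, dtype=float)
        delta = true_values - est                  # vector of departures from the true values
        offending = est < floor
        if not offending.any():
            return est                             # identical to MIVQUE when all estimates are admissible
        # smallest weight w (0 < w <= 1) lifting every offending component to the floor;
        # delta is positive for those components because the true values exceed the floor
        w = np.max((floor - est[offending]) / delta[offending])
        return est + w * delta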

REML estimates were from repeated application of Giesbrecht's algorithm (equation 4-9)

in which the estimates from the kth iteration become the priors for the (k+1)th iteration. The

iterations were stopped when the difference between the estimates from the kth and (k+1)th

iterations met the convergence criterion; then the estimates of the (k+1)th iteration became the

REML estimates. The convergence criterion utilized was Σ_{i=1}^{r} |σ̂²_i(k) - σ̂²_i(k+1)| < 1×10^-4. This

criterion imposed convergence to the fourth decimal place for all variance components. Since

for this experimental workload it was desired that the simulation run with little analyst

intervention and in as few iterations as possible, the robustness of REML solutions obtained from

Giesbrecht's algorithm to priors (or starting points) was explored. The difference in solutions

starting from two distinct points (a vector of ones and the true values) was compared over 2000

data sets of different structures (imbalance, true variance components, and field design). The

results (agreeing with those of Swallow and Monahan 1984) indicated that the difference between

the two solutions was entirely dependent on the stringency of the convergence criterion and not

on the starting point (priors). Also the number of iterations required for convergence was greatly

decreased by using the true values as priors. Thus, all REML estimates were calculated starting

with the true values as priors.
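Expressed in terms of the quadratic_solve sketch given earlier (an assumption of this illustration, not the study's FORTRAN code), the iteration and stopping rule look roughly like:

    def reml(y, X, Z_list, start, tol=1e-4, max_iter=30):
        """Iterate the estimating equation until the summed absolute change in the
        variance components is below tol; a sketch of the scheme described above."""
        priors = list(start)
        for _ in range(max_iter):
            est = quadratic_solve(y, X, Z_list, priors)      # one pass of equation 4-9
            if sum(abs(p - e) for p, e in zip(priors, est)) < tol:
                return est                                   # converged
            priors = list(est)                               # estimates become next priors
        return est                                           # criterion not met within max_iter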

Three alternatives for coping with negative estimates after convergence were used for

REML solutions: accept and use the negative estimates (Shaw 1987), arbitrarily set negative

estimates to zero, and re-solve the system setting negative estimates to zero (Miller 1973). The

first two alternatives are self-explanatory and the latter is accomplished by re-analyzing those data







sets in which the initial unrestricted REML estimates included one or more negative estimates.

During re-analysis if a variance component became negative, it was set to zero (could never be

any value other than zero) and the iterations continued. This procedure persisted until the

convergence criterion was met with a solution in which all variance components were either

positive or zero.
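A simplified sketch of this re-solve alternative, again reusing the hypothetical quadratic_solve function from the earlier sketch:

    def reml_resolve(y, X, Z_list, start, tol=1e-4, max_iter=50):
        """Sketch of the re-solve alternative: any component that goes negative is set
        to zero and held there while iteration continues, so the final solution has
        all components positive or zero."""
        priors = list(start)
        held = [False] * len(start)
        for _ in range(max_iter):
            est = list(quadratic_solve(y, X, Z_list, priors))
            for i, e in enumerate(est):
                if held[i] or e < 0.0:
                    held[i], est[i] = True, 0.0          # clamp and keep clamped
            if sum(abs(p - e) for p, e in zip(priors, est)) < tol:
                return est
            priors = est
        return est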

Harville (1977) suggested several adaptations of Henderson's mixed model equations

(Henderson et al. 1959) which do not allow variance component estimates to become negative;

however, the estimates can become arbitrarily close to zero. After trial of these techniques

versus the approach of setting negative estimates to zero after convergence and re-solving the system,

comparison of results using the same data sets indicated that there is little practical advantage

(although more desirable theoretically) in using the approach suggested by Harville. The

differences between sets of estimates obtained by the two methods are extremely minor (solving

the system with a variance component set to zero versus arbitrarily close to zero).

ML solutions, as iterative applications of equation 4-10, were calculated from the same

starting points and with the same convergence criterion as REML solutions. The three negative

variance component alternatives explored for ML were to accept and use the negative estimates,

to arbitrarily set negative estimates to zero after converging to a solution for the former, and (for

half-sib data only) to re-solve the system setting negative variance components to zero.

The algorithm to calculate solutions for HM3 (sequentially adjusted sums of squares) was

based on the upper triangular G2 sweep (Goodnight 1979) and Hartley's method of synthesis

(Hartley 1967). The equation solved was E{MS} = MS, where MS is the vector of sequentially adjusted mean

squares and E{MS} is their expectation, a set of linear functions of the variance components. The alternative used for negative estimates was to accept

and use the negative estimates.
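Once the coefficients of the variance components in E{MS} have been obtained by synthesis, the HM3 step is a linear solve; a minimal sketch with hypothetical numbers (not values from this study):

    import numpy as np

    # Rows: sequentially adjusted mean squares; columns: coefficients of the variance
    # components in their expectations (obtained by Hartley's method of synthesis in practice).
    # Hypothetical two-component example: E{MS_family} = 18.7*s2_fam + s2_error,
    #                                     E{MS_error}  =               s2_error.
    E_coef = np.array([[18.7, 1.0],
                       [ 0.0, 1.0]])
    ms = np.array([9.4, 7.9])              # observed mean squares (hypothetical)

    sigma2 = np.linalg.solve(E_coef, ms)   # HM3 estimates: equate E{MS} to MS and solve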










Comparison Among Estimation Techniques


For the simulation MIVQUE estimates were the basis for all comparisons because

MIVQUE is by definition the minimum variance quadratic unbiased estimator. The results of

comparing the mean of 1000 MIVQUE estimates for an experimental level to the means for other

techniques were termed "apparent bias". "Apparent bias" denotes that 1000 data sets were not

sufficient to achieve complete convergence to the true values of the variance components.

Sampling variances of estimation were calculated from the 1000 observations within an

experimental level and estimation technique for variance components and genetic ratios (single

tree heritability, Type B correlation and dominance to additive variance ratio). Mean square

error then equalled variance plus squared "apparent bias". While mean square error was

investigated, there was never sufficient bias for mean square error to lead to a different decision

concerning techniques than sampling variance of the estimates; so mean square error was deleted

from the remainder of this discussion.

Probability of nearness is the probability that an estimate will lie within a certain interval

around the true parameter. The three total interval widths utilized were one-half, equal to, and

twice the parameter size. The percentage of 1000 estimates falling within these intervals were

calculated for the different estimation techniques within an experimental level for variance

components and ratios and utilized as an estimate of probability of nearness.
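A small sketch of that calculation (the helper name is an invention; any array of estimates may be supplied):

    import numpy as np

    def nearness(estimates, true_value, width_factor=1.0):
        """Fraction of estimates falling in an interval centred on the true value whose
        total width is width_factor times the parameter (0.5, 1.0 or 2.0 in the study)."""
        half = 0.5 * width_factor * true_value
        estimates = np.asarray(estimates)
        return float(np.mean((estimates > true_value - half) & (estimates < true_value + half)))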

Results are presented by variance component or genetic ratio estimated as a percentage

of MIVQUE (except in the case of probability of nearness). MIVQUE estimates represent 100%

with estimates with greater variance having values larger than 100% and "apparently biased"

estimates having values different from 100%. The percentages were calculated as equal to 100

times the estimate divided by the MIVQUE value. For the criterion of variance, the lower the







percentage the better the estimator performed; for bias, values equalling 100% (0 bias) are

preferred; and for probability of nearness, larger percentages (probabilities) are favored since

they are indicative of greater density of estimates near the parametric value.



Results and Discussion


Variance Components


Sampling variance of the estimators

For all variance components estimated, REML and ML estimation techniques were

consistently equal to or less than MIVQUE for sampling variance of the estimator (Table 4-3).

The variance among estimates from these techniques was further reduced by setting the negative

components to zero (MODML and MODREML) or setting negative estimates to zero plus re-

solving the system (NNREML, NNML, and PNNREML). Variance among MINQUE1 estimates

is always equal to or greater than for MIVQUE, as one might expect, since they are, in this

application, the same technique with MIVQUE having perfect priors (the true values). Variances

for HM3 estimators (TYPE3 and PTYPE3) are either equal to or greater than MIVQUE (HM3

estimates have progressively larger relative variance with higher levels of imbalance. MIVPEN,

although impractical because of the need for the true priors, had much more precise estimates of

variance components than other techniques illustrating what could be accomplished given the true

values as priors plus maintaining estimates within the parameter space.

In general, the spread among the percentages for variance of estimation for the estimation

techniques is highly dependent on the degree of imbalance and the type of mating system. With

increasing imbalance the likelihood-based estimators realized greater advantage for sampling

variance of the estimates over HM3 for both mating systems. The most advantageous application









Table 4-3. Sampling variance for the estimates of σ²_g (upper number), σ²_tg (second number), and
h² (third number where calculated) as a percentage of the MIVQUE estimate by type of estimator
and treatment combination; NA is not applied. Values greater than 100 indicate larger variance
among 1000 estimates.

Estimator    D1S80    D1S65    D2S65    H1S80    H1S65

REML 99.9 102.6 101.5 99.6 106.3
100.2 100.0 104.1 99.7 98.0
100.0 101.0 101.4 99.6 105.8
ML 77.3 78.2 76.4 95.9 103.9
106.9 104.8 110.7 100.8 99.1
82.5 82.9 86.4 96.2 103.8
MINQUE1 100.0 104.2 104.0 104.0 146.7
101.2 118.8 123.6 112.5 139.7
100.3 105.8 103.9 104.0 145.8
NNREML 80.8 71.6 95.2 88.0 68.6
67.9 48.3 54.9 78.7 48.6
76.8 64.2 92.2 87.3 67.7
NNML NA NA NA 83.3 65.3
79.4 48.9
83.1 64.7
MODML 58.2 50.0 69.5 84.7 74.6
12.8 81.4 81.6 86.6 68.5
58.1 46.1 72.0 83.8 71.4
MODREML 81.5 74.5 96.1 88.9 78.1
89.1 74.0 73.7 85.4 66.9
76.4 63.5 88.9 87.7 74.3
TYPE3 101.0 101.0 105.5 100.6 121.0
101.1 101.0 115.5 100.9 125.6
100.5 108.4 102.9 100.4 121.6
PREML 100.3 106.3 101.7 107.5 146.9
102.7 113.5 119.8 122.0 150.7
PML 77.6 81.9 77.1 103.6 143.4
109.7 117.3 127.2 123.3 151.9
PMINQUE1 100.3 107.6 105.4 107.5 179.3
102.7 129.0 137.3 122.0 180.6
PNNREML 80.9 71.1 93.9 92.7 86.6
69.8 53.2 60.5 94.0 68.1
PTYPE3 100.3 106.6 105.4 107.5 168.1
102.7 124.7 133.3 122.0 184.9
100.6 110.8 104.1 106.9 168.0
MIVPEN NA 36.2 29.1 80.0 45.6
26.6 20.0 74.3 39.6
34.7 30.2 79.8 45.4
PMIVQUE 100.3 104.2 102.4 107.5 146.9
102.7 114.4 117.8 122.0 150.7







of likelihood-based estimators is in the H1S65 case where the imbalance is not only random

deletions of individuals but also incomplete connectedness across locations, i.e. the same families

are not present in each test (akin to incomplete blocks within a test).

An analysis of variance was conducted to determine the importance of the treatment of

negative variance component estimates in the variance of estimation for REML and ML estimates.

The model of sampling variance of the estimates as a result of mating design, imbalance level,

treatment of negative estimates and size of the variance component demonstrated consistently (for

all variance components except error) that treatment of negative estimates is an important

component of the variance of the estimates (p < .05). The model accounted for up to 99% of

the variation in the variance of the variance component estimates with 1) accepting and using

negative estimates producing the highest variance; 2) setting the negative components to zero

being intermediate; and 3) re-solving the system with negative estimates set to zero providing the

lowest variance.

For all estimation techniques, lower variance among estimates was obtained by using

individual observations as compared to plot means. The advantage of individual over plot-mean

observations increased with increasing imbalance.

Bias

The most consistent performance for bias (Table 4-4) across all variance components was

TYPE3 known from inherent properties to be unbiased. The consistent convergence of the

TYPE3 value to the MIVQUE value indicated that the number of data sets used (1000 per

technique and experimental level) was suitable for the purpose of examining bias. The other two

consistent performers were REML and MINQUE1. PTYPE3 (HM3 based on plot means) was

unbiased when no plot means were missing, but produced "apparently biased" estimates when

plot means were missing.










Table 4-4. Bias for the estimates of σ²_g (upper number), σ²_tg (second number), and h² (third
number where calculated) as a percentage of the MIVQUE estimate by type of estimator and
experimental combination; NA is not applied. Values different from 100 denote "apparent" bias.

Estimator    D1S80    D1S65    D2S65    H1S80    H1S65

REML 99.9 101.5 98.7 99.9 102.8
99.9 102.2 99.8 99.9 98.9
99.9 101.3 98.6 99.9 102.6
ML 74.6 61.6 76.0 96.2 98.2
106.5 114.6 109.7 101.3 101.8
75.5 61.8 77.9 96.3 98.2
MINQUE 99.7 96.4 99.0 99.4 102.0
100.1 100.8 101.3 100.8 98.3
99.7 96.6 98.9 99.4 101.3
NNREML 107.9 116.5 98.1 101.9 107.8
93.1 92.9 92.9 100.5 102.3
108.7 118.4 98.2 102.2 107.7
NNML NA NA NA 101.9 107.8
100.5 102.3
98.2 103.8
MODML 86.6 90.4 79.0 98.1 114.1
109.9 129.9 127.4 101.3 122.9
87.8 91.5 79.4 99.6 112.6
MODREML 109.5 124.2 100.6 103.1 117.8
103.7 119.8 119.2 104.6 120.6
109.5 123.2 98.4 102.9 116.2
TYPE3 100.1 99.4 99.6 100.2 99.6
100.2 101.0 102.4 100.2 100.9
100.0 99.5 99.3 100.2 99.7
PREML 99.7 98.7 97.7 99.5 110.6
100.1 103.6 100.2 102.4 98.3
PML 74.2 58.5 73.6 95.9 105.2
106.9 116.2 111.5 103.2 102.0
PMINQUE 99.7 95.2 98.8 99.5 106.5
100.1 102.1 102.9 102.4 114.8
PNNREML 107.9 114.5 96.7 101.8 115.6
92.9 94.0 95.0 104.5 110.2
PTYPE3 99.7 96.8 99.0 99.5 104.5
100.1 97.2 96.0 102.4 108.7
99.8 98.0 98.8 99.6 104.1
MIVPEN NA 107.5 98.6 102.0 103.2
99.0 91.7 101.4 105.1
112.6 103.9 102.1 103.4
PMIVQUE 99.7 97.4 99.2 99.5 106.8
100.1 101.7 100.5 102.4 98.8










Table 4-5. Probability of nearness for σ²_g (upper number), σ²_tg (second number), and h² (third
number where calculated). The probability interval is equal to the magnitude of the parameter.

Estimator    D1S80    D1S65    D2S65    H1S80    H1S65

REML 32.8 24.3 41.8 45.3 28.6
43.0 26.2 25.7 36.6 27.1
34.2 25.3 45.4 45.0 28.3

ML 33.6 22.3 40.7 45.4 29.2
42.9 26.4 24.8 36.2 26.7
34.6 22.3 45.0 45.7 28.2

MINQUE 32.6 24.6 41.0 45.1 26.1
43.1 24.3 25.4 34.2 23.2
33.7 25.0 44.6 44.7 25.6

NNREML 33.4 23.4 41.7 45.1 29.3
44.9 28.1 25.6 38.0 28.9
34.3 24.3 46.1 45.2 29.5

NNML NA NA NA 45.9 29.7
37.9 29.1
46.0 29.0

TYPE3 34.0 23.2 42.5 45.3 27.1
42.6 27.1 24.8 37.3 25.0
35.3 23.8 45.8 45.9 27.3

PREML 32.1 20.0 41.6 43.7 24.6
42.7 26.8 24.6 32.3 20.4

PML 33.5 19.8 39.7 44.0 24.4
41.0 26.3 23.6 31.6 21.1

PMINQUE 32.1 21.4 40.4 43.7 24.5
42.7 24.8 23.1 32.3 21.9

PNNREML 31.9 19.2 41.0 43.4 26.0
43.3 28.0 23.3 33.1 21.3

PTYPE3 32.1 23.3 41.7 43.7 25.2
42.7 25.4 24.1 32.3 22.4
32.6 24.1 46.0 44.6 24.6

MIVQUE 33.6 25.7 43.7 45.1 29.2
42.9 28.6 26.4 36.9 26.3
34.8 26.8 47.7 45.4 29.4

MIVPEN NA 41.1 78.5 48.4 35.6
47.0 60.3 39.2 31.2
42.4 80.5 48.7 35.3

PMIVQUE 32.1 20.0 41.8 43.7 25.9
42.7 28.5 26.8 32.3 20.8







Among estimators which displayed bias, maximum likelihood estimators (ML and PML)

were known to be inherently biased (Harville 1977, Searle 1987) with the amount of bias

proportional to the number of degrees of freedom for a factor versus the number of levels for the

factor. Other biases resulted from the method of dealing with negative estimates. Living with

negative estimates produced the estimators with the least bias. Setting negative variance

components to zero resulted in the greatest bias. Intermediate in bias were the estimates resulting

from re-solving the system with negative components set to zero.

Probability of nearness

Results for probability of nearness proved to be largely non-discriminatory among

techniques (Table 4-5). The low levels of probability density near the parametric values are

indicative of the nature of the variance component estimation problem. Figure 4-1 illustrates the

distribution of MIVQUE variance component estimates for h2 (4-la) and a2g (4-lb) for level

D1S80. The distributions for all unconstrained variance component estimates have the appearance

of a chi-square distribution, positively skewed with the expected value (mean) occurring to the

right of the peak probability density and a proportion of the estimates occurring below zero

(except error). With increasing imbalance, the variance among estimates increases and the

probability of nearness decreases for all interval widths.


Ratios of Variance Components


Single tree heritability

Results for estimates of single tree heritability adjusted for locations and blocks are shown

in Tables 4-3 and 4-4 (third number from the top in each cell, if calculated). For these relatively

low heritabilities (0.1 and 0.25), the bias and variance properties of the estimated ratio are similar

to those for σ²_g estimates (Figure 4-1). This implies that knowing the properties of the numerator




























[Figure 4-1 appears here: histograms of MIVQUE estimates from 1000 data sets; panel 4-1a shows h² and panel 4-1b shows σ²_g.]

Figure 4-1. Distribution of 1000 MIVQUE estimates of h² (4-1a) and σ²_g (4-1b) for experimental
level D1S80 illustrating the positive skew and similarity of the distributions. The true values are
0.1 for h² and 0.25 for σ²_g. The interval width of the bars is one-half the parametric value.







of heritability reveals the properties of the ratio (especially true of ratios with expected values of

0.1 and 0.25, Kendall and Stuart 1963, Ch. 10). Variance component estimation techniques

which performed well for bias and/or variance among estimates for σ²_g also performed well for

h2.

Type B correlation and dominance to additive variance ratio

Type B correlation (Tables 4-3 and 4-4) and dominance to additive variance ratio

(not shown) estimates both proved to be too unstable (extremely large variance among estimates)

in their original formulations to be useful in discrimination among variance component estimation

techniques. This high variance is due to the estimates of the denominators of these ratios

approaching zero and to the high variance of the denominator estimates (Table 4-2). These ratios

were reformulated with numerators of interest (4σ²_tg for additive genetic by test interaction and

4σ²_s for dominance variance, respectively) and a denominator equal to the estimate of the

phenotypic variance. With this reformulation the variance and bias properties of estimates of the

altered ratios are approximated by the properties of estimates of the numerators.

For increasing imbalance maximum-likelihood-based estimation offers an increasing

advantage over HM3, and for all techniques individual observations offer increasing advantage

over plot-mean observations for variance of the estimates of these ratios. Bias, other than

inherently biased methods (ML), is associated with the probability of negative estimates which

is increased by increasing imbalance. This assertion is supported by comparing the biases of

REML, NNREML, and MODREML estimates across imbalance levels.








General Discussion


Observational Unit

Some general conclusions regarding the choice of a variance component estimation

methodology can be drawn from the results of this investigation. For any degree of imbalance

the use of individual observations is superior to the use of plot means for estimation of variance

component or ratios of variance components. If the data are nearly balanced (close to 100%

survival with no missing plots, crosses (full-sib) or lack of connectedness (half-sib)), the

properties of the estimation techniques based on individual and plot-mean observations become

similar; so if departure from balance is nominal, plot means can be used effectively. However,

using individual observations obviates the need for a survey of imbalance in the data since

individual observations produce better results than plot means for any of the estimation techniques

examined.


Negative Estimates


Drawing on the results of this investigation, the discussion of practical solutions for the

negative estimates problem will revolve around two solutions: 1) accepting and using the negative

estimates; and 2) re-solving the system with negative estimates set to zero.

Given that the property of interest is the true value of a variance component or genetic

ratio, often estimated as a mean across data sets, then negativity constraints come into play if the

component of interest is small in comparison to other underlying variance components in the data,

or the variance of estimates is high due to an inadequate experimental design for variance

component estimation. These factors lead to an increased number of negative estimates. If the

data structure is such that negative estimates would occur frequently, then accepting negative

estimates is a good alternative.







If negative estimates tend to occur infrequently or bias is of less concern than variance

among estimates, then re-solving the system when convergence yields negative estimates is the

preferable solution. This tactic reduces both bias and variance among estimates below that of

arbitrarily setting negative estimates to zero.


Estimation Technique


The primary competitors among estimation techniques that are practically achievable are

REML and TYPE3 (HM3). Both techniques produce estimates with little or no bias; however,

REML estimates for the most part have slightly less sampling variance than TYPE3 estimates.

If only subsets of the parents are in common across tests as in the case H1S65, REML has a

distinct advantage in variance among estimates over TYPE3.

REML does have three additional advantages over TYPE3 which are 1) REML offers

generalized least squares estimation of fixed effects while TYPE3 offers ordinary least squares

estimation; 2) Best Linear Unbiased Predictions (BLUP) of random variables are inherent in

REML solutions, i.e., gca predictions are available; and thus in solving for the variance

components with REML, fixed effects are estimated and random variables are predicted

simultaneously (Harville 1977); and 3) REML offers greater flexibility in the model specification

both in univariate and multivariate forms as well as heterogeneous or correlated error terms.

Further, although the likelihood equations for common REML applications are based on

normality, the technique has been shown to be robust against the underlying distribution (Westfall

1987, Banks et al. 1985).








Recommendation


If one were to choose a single variance component estimation technique from among

those tested which could be applied to any data set with confidence that the estimates had

desirable properties (variance, MSE, and bias), that technique would be REML and the basic unit

of observation would be the individual. This combination (REML plus individual observations)

performed well across mating design and types and levels of imbalance. Treatment of negative

estimates would be determined by the proposed use of the estimates, that is, whether unbiasedness

(accepting and using the negative estimates) is more important than low sampling variance (re-solving

the system with negative estimates set to zero).

A primary disadvantage of REML and individual observations is that they are both

computationally expensive (computer memory and time). HM3 estimation could replace REML

on many data sets and plot means could replace individual observations on some data sets; but

general application of these without regard to the data at hand does result in a loss in desirable

properties of the estimates in many instances.

The computational expense of REML and individual observations is the price paid to ensure that estimates

have desirable properties for a broad scope of applications. With the advent of bigger and faster

computers and the evolution of better REML algorithms, what was not feasible in the past on

most mainframe computers can now be accomplished on personal computers.












CHAPTER 5
GAREML: A COMPUTER ALGORITHM FOR
ESTIMATING VARIANCE COMPONENTS AND
PREDICTING GENETIC VALUES


Introduction


The computer program described in this chapter, called GAREML for Giesbrecht's

algorithm of restricted maximum likelihood estimation (REML), is useful for both estimating

variance components and predicting genetic values. GAREML applies the methodology of

Giesbrecht (1983) to the problems of REML estimation (Patterson and Thompson 1971) and best

linear unbiased prediction (BLUP, Henderson 1973) for univariate (single trait) genetics models.

GAREML can be applied to half-sib (open-pollinated or polymix) and full-sib (partial diallels,

factorials, half-diallels [no selfs] or disconnected sets of half-diallels) mating designs when planted

in single or multiple locations with single or multiple replications per location. When used for

variance component estimation, this program has been shown to provide estimates with desirable

properties across types of imbalance commonly encountered in forest genetics field tests (Huber

et al. in press) and with varying underlying distributions (Banks et al. 1985, Westfall 1987).

GAREML is also useful for determining efficiencies of alternative field and mating designs for

the estimation of variance components.

Utilizing the power of mixed-model methodology (Henderson 1984), GAREML provides

BLUP of parental general (gca) and specific combining abilities (sca) as well as generalized least

squares (GLS) solutions for fixed effects. The application of BLUP to forest genetics problems

has been addressed by White and Hodge (1988, 1989). With certain assumptions, the desirable







properties of BLUP predictions include maximizing the probability of obtaining correct parental

rankings from the data and minimizing the error associated with using the parental values

obtained in future applications. GLS fixed effect estimation weights the observations comprising

the estimates by their associated variances approximating best linear unbiased estimation (BLUE)

for fixed effects (Searle 1987, p 489-490).

The purpose of this chapter is to describe the theory and use of GAREML in enough

detail to facilitate use by other investigators. The program is written in FORTRAN and is not

dependent on other analysis programs. An interactive version of this program can be obtained

as a stand-alone executable file from the senior author; this file will run on any IBM compatible

PC under DOS or WINDOWS2 operating systems. The size of the problem an investigator can

solve will be dependent on the amount of extended memory and hard disk space (for swap files)

available for program use. In addition, the FORTRAN source code can be obtained for analysts

wishing to compile the program for use on alternate systems (e.g. mainframe computers).


Algorithm


GAREML proceeds by reading the data and forming a design matrix based on the number

of levels of factors in the model. Any portions of the design matrix for nested factors or

interactions are formed by horizontal direct product. Columns of zeroes in the design matrix (the

result of imbalance) are then deleted. The design matrix columns are in an order specified by

Giesbrecht's algorithm: columns for fixed effects are first, followed by the data vector, and the

last section of the matrix is for random effects. The design matrix is the only fully formed

matrix in the program. All other matrices are symmetric; therefore, to save computational space


2Windows is the trademark of the Microsoft Corporation, Redmond, WA.







and time, only the diagonal and the above diagonal portions of matrices are formed and utilized

(i.e., half-stored).

A half-stored matrix of the dot products of the design columns is formed and either kept

in common memory or stored in temporary disk space so that the matrix is available for recall

in the iterative solution process. The algorithm proceeds by modifying the matrix of dot products

such that the inverse of the covariance matrix for the observations (V) is enclosed by the column

specifiers in the dot products as X'X becoming X'V^{-1}X. This transfer is completed without

inversion of the total V matrix. The identity used to accomplish this transfer is

if V_h = a_h Z_h Z_h' + V_{h+1}, where V_h is nonsingular,

then V_h^{-1} = V_{h+1}^{-1} - a_h V_{h+1}^{-1} Z_h (I_h + a_h Z_h'V_{h+1}^{-1}Z_h)^{-1} Z_h'V_{h+1}^{-1}.    5-1
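Equation 5-1 is a Woodbury-type inversion identity and can be verified numerically for a small random example; in the Python sketch below the dimensions and values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, q = 8, 3
    Zh = rng.standard_normal((n, q))
    a_h = 0.7
    Vnext = 2.0 * np.eye(n)                      # stands in for V_(h+1), assumed nonsingular

    Vh = a_h * Zh @ Zh.T + Vnext
    Vnext_inv = np.linalg.inv(Vnext)
    # right-hand side of equation 5-1
    rhs = Vnext_inv - a_h * Vnext_inv @ Zh @ np.linalg.inv(
        np.eye(q) + a_h * Zh.T @ Vnext_inv @ Zh) @ Zh.T @ Vnext_inv

    assert np.allclose(np.linalg.inv(Vh), rhs)   # the identity holds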

A compact form of equation 5-1 is obtained by pre-multiplying by Z_i' and post-multiplying by

Z_j, where h = 1, ..., k-1 (k = the total number of random factors), a_h is the prior associated with

random variable h, V_k = a_kI, V_1 = V and Z_i is the portion of the design matrix for random

variable i (Giesbrecht 1983). A partitioned matrix is formed in order to update V_{h+1}^{-1} until V_1^{-1},

that is V^{-1}, is obtained. This matrix is of the form:

[ I_h + a_h Z_h'V_{h+1}^{-1}Z_h                  a_h Z_h'V_{h+1}^{-1}(X | y | Z_1 | ... | Z_{k-1}) ]    5-2
[ (X | y | Z_1 | ... | Z_{k-1})'V_{h+1}^{-1}Z_h   T_{h+1}                                          ]

where T_{h+1} = (X | y | Z_1 | ... | Z_{k-1})'V_{h+1}^{-1}(X | y | Z_1 | ... | Z_{k-1}).

The sweep operator of Goodnight (1979) is applied to the upper left partition of the

matrix (equation 5-2) and the result of equation 5-1 is obtained. The matrix is sequentially

updated and swept until T_1 = (X | y | Z_1 | ... | Z_{k-1})'V^{-1}(X | y | Z_1 | ... | Z_{k-1}) is obtained. T_1 is then

swept on the columns for fixed effects (X'V^{-1}X). This sweep operation produces generalized least

squares estimates for fixed effects, results which can be scaled into predictions of random

variables, the residual sum of squares and all the necessary ingredients for assembling the







equation to solve for the variance components. The equation to be solved for the variance

components is

{tr(QV_iQV_j)} σ̂² = {y'QV_iQy}

then σ̂² = {tr(QV_iQV_j)}^{-1} {y'QV_iQy};    5-3

where the bracketed matrix is r×r and the two bracketed vectors are r×1;

{tr(QV_iQV_j)} is a matrix whose elements are tr(QV_iQV_j), where i = 1 to r and

j = 1 to r, i.e., there is a row and column for every random variable in

the linear model;

tr is the trace operator, that is, the sum of the diagonal elements of a matrix;

Q = V^{-1} - V^{-1}X(X'V^{-1}X)^{-}X'V^{-1} for V as the covariance matrix of y and X as

the design matrix for fixed effects;

V_i = Z_iZ_i', where the i's are the random variables;

σ̂² is the vector of variance component estimates; and

r is the number of random variables in the model (k-1).

The entire procedure from forming T, to solving for the variance components continues

until the variance component estimates from the last iteration are no more different from the

estimates of the previous iteration than the convergence criterion specifies. The fixed effect

estimates and predictions of random variables are then those of the final iteration. The

asymptotic covariance matrix for the variance components is obtained as

Var(σ̂²) = 2{tr(QV_iQV_j)}^{-1}    5-4

by utilizing intermediate results from the solution for the variance components.

The coefficient matrix of Henderson's mixed model equations is formed in order to

calculate the covariance matrix for fixed and random effects. The covariance matrix for







observations is constructed using the variance component estimates from Giesbrecht's algorithm.

The coefficient matrix is

[ X'R^{-1}X        X'R^{-1}Z          ]    5-5
[ Z'R^{-1}X        Z'R^{-1}Z + D^{-1} ]

where R is the error covariance matrix, which in this application is Iσ²_w,

where σ²_w is the variance of the within-plot random variable w (equations 5-6 and 5-7);

X is the fixed effects design matrix;

Z is the random effects design matrix; and

D is the covariance matrix for the random variables which, in this

application, has variance components on the diagonal and zeroes on the

off-diagonal (no covariance among random variables).

The generalized inverse of the matrix (equation 5-5) is the error covariance matrix of the fixed

effect estimates and random predictions, assuming the covariance matrix for the observations is known

without error.
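A dense-matrix sketch of how equation 5-5 and the associated solves fit together (illustrative only; GAREML works with half-stored matrices and Giesbrecht's updating rather than the naive inverses used here, and the function name is an invention):

    import numpy as np

    def mixed_model_solve(y, X, Z, D, sigma2_w):
        """Assemble and solve Henderson's mixed model equations (5-5) with R = I*sigma2_w;
        returns GLS fixed-effect estimates, predictions of the random variables and the
        generalized inverse of the coefficient matrix (their error covariance matrix)."""
        Rinv = np.eye(len(y)) / sigma2_w
        Dinv = np.linalg.inv(D)
        C = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                      [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Dinv]])
        rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
        Cg = np.linalg.pinv(C)                       # generalized inverse of the coefficient matrix
        sol = Cg @ rhs
        p = X.shape[1]
        beta_hat, u_hat = sol[:p], sol[p:]           # fixed-effect estimates, random predictions
        return beta_hat, u_hat, Cg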


Operating GAREML


While GAREML will run in either batch or interactive mode, we focus on the interactive

PC-version which begins by prompting the analyst to answer questions determining the factors

to be read from the data. Specifically, the analyst answers yes or no to these questions: 1) are

there multiple locations? 2) are there multiple blocks? 3) are there disconnected sets of full-sibs?

i.e., usually referring to disconnected half-diallels and 4) is the mating design half-sib or full-sib?

The program then determines the proper variables to read from the data as well as the most

complicated (number of main factors plus interactions) scalar linear model allowed.

The most complicated linear model allowed for full-sib observations is







y_ijklm = μ + t_i + b_ij + set_h + g_k + g_l + s_kl + tg_ik + tg_il + ts_ikl + p_ijkl + w_ijklm    5-6

where y_ijklm is the mth observation of the klth cross in the jth block of the ith test;

μ is the population mean;

t_i is the random or fixed variable test environment;

b_ij is the random or fixed variable block;

set_h is the random or fixed variable set, i.e., a variable is created so that

disconnected sets of half-diallels planted in the same experiment can be

analyzed in the same run or to analyze provenances and families within

provenance where provenance equals set; sets are assumed to be across test

environments and blocks with families nested within sets, and interactions with

set are assumed unimportant;

g_k is the random variable female general combining ability (gca);

g_l is the random variable male gca;

s_kl is the random variable specific combining ability (sca);

tg_ik is the random variable test by female gca interaction;

tg_il is the random variable test by male gca interaction;

ts_ikl is the random variable test by sca interaction;

p_ijkl is the random variable plot;

w_ijklm is the random variable within-plot; and

there is no covariance between random variables in the model.

The assumptions utilized are that the variances for the female and male random variables are equal (σ²_gf

= σ²_gm = σ²_g) and that the female and male environmental interaction variances are equal (σ²_tgf = σ²_tgm = σ²_tg).

The most complicated scalar linear model allowed for half-sib observations is

y_ijkm = μ + t_i + b_ij + set_h + g_k + tg_ik + p_ijk + w_ijkm    5-7

where y_ijkm is the mth observation of the kth half-sib family in the jth block of the ith test;

μ, t_i, b_ij, set_h, g_k, and tg_ik retain the definitions in the full-sib equation;

p_ijk is the random variable plot, containing different genotype by environment

components than the full-sib model;

w_ijkm is the random variable within-plot, containing different levels of

genotypic and genotype by environment components than the full-sib model;

and there is no covariance between random variables in the model.

The analyst builds the linear model by answering further prompts. If test, block and/or

set are in the model, they must be declared as fixed or random effects. When any of the three

effects is declared random, the analyst must furnish prior values for the variance. If no prior

value is known, 1.0's may be used as priors. Using 1.0's as priors will not affect the values for

resulting variance component estimates within the constraints of the convergence criterion; but

there may be a time penalty due to increasing the number of iterations required for convergence.

All remaining factors in the model are treated as random variables.

To complete the definition of the model, the analyst chooses to include or exclude each

possible factor by answering yes or no when prompted. After each yes answer, the program asks

for a prior value for the variance. Again, if no known priors exist, 1.0's may be substituted.

After the model has been specified, the program counts the number of fixed effects and the

number of random effects and asks if the number fits the model expected. A "yes" answer

proceeds through the program while a "no" returns the program to the beginning.

GAREML is now ready to read the data file (which must be an ASCII data file) in this

order: test, block, set, female, male, and the response variable. The analyst is prompted to

furnish a proper FORTRAN format statement for the data. Test, block, set, female and male are

read as character variables (A fields) with as many as eight characters per field, while the data







vector (response variable) is read as a double precision variable (F field). An example of a

format statement for a full-sib mating design across locations and blocks is "(4A8,F10.5)" which

reads four character variables sequentially occupying 8 columns each and the response variable

beginning in column 33 and ending in column 42 having five decimal places.
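For example, a record matching that format might look like the following made-up line (the identifiers and trait value are hypothetical; the four 8-column character fields are followed by the response value ending in column 42):

    TEST01  BLOCK02 FEM05   MAL09     23.45678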

After reading the data, GAREML begins to furnish information to the analyst. This

information should be scanned to make sure the data read are correct. This information includes

the number of parents, the number of full-sib crosses, the number of observations, the maximum

number of fixed effect design matrix columns, and the maximum number of random effect design

matrix columns. If there is an error at this point, use CTRL-BRK to exit the program. Probable

causes of errors are the data are not in the format specified, missing values are included, blank

lines or other similar errors are in the data file, or the model was not correctly specified.

At this point, there are three other prompts concerning the data analysis (number of

iterations, convergence criterion and treatment of negative variance components). The number

of iterations is arbitrarily set to 30 and can be changed at the analyst's discretion. No warning

is issued that the maximum number of iterations has been reached; however, the current iteration

number and variance component estimates are output to the screen at the beginning of each

iteration. The convergence criterion used is the sum of the absolute values of the difference

between variance component estimates for consecutive iterations. The criterion has been set to

lx104 meaning that convergence is required to the fourth decimal place for all variance

components. The convergence criterion should be modified to suit the magnitude of the variances

under consideration as well as the practical need for enhanced resolution. Enhanced resolution

is obtained at the cost of increasing the number of iterations to convergence.

The analyst must decide whether to accept and use negative estimates or to set negative

estimates to zero and re-solve the system. The latter solution results in variance component







estimates with lower sampling variance and slight bias. If one is interested in unbiased estimates

of variance components that have a high probability of negative estimates, then accepting and

using the negative estimates may be the proper course to take.



Interpreting GAREML Output


Analysis is now underway. The priors for each iteration and the iteration number are

printed out to the screen. GAREML continues to iterate until the convergence criterion is met

or the maximum number of iterations is reached. The next time that analyst intervention is

required is to provide a name for the output file for variance component estimates. The file name

follows normal DOS file naming protocol; however, alternative directories may not be specified,

i.e., all outputs will be found in the same directory as the data file. The program will now quiz

the analyst to determine if additional outputs are desired. These additional outputs are gca

predictions, sca predictions (if applicable), the asymptotic covariance matrix for the variance

components, generalized least squares fixed effect estimates, error covariance matrix of the gca

predictions and error covariance matrix for fixed effects. An answer of yes to the inclusion of

an output will result in a prompting for a file name. In addition, for gca and sca predictions the

analyst may input a different value for σ̂²_g or σ̂²_s with which to scale the predictions. The

discussion which follows furnishes more detailed information concerning GAREML outputs.


Variance Component Estimates


Ignoring concerns about convergence to a global maximum and negative values, variance

component estimates are restricted maximum likelihood estimates of Patterson and Thompson

(1971). The estimates are robust against starting values (priors), i.e., the same estimates, within

the limits of the convergence criterion, can be obtained from diverse priors. However, priors







close to the true values will, in general, reduce the number of iterations required to reach

convergence. The value of the convergence criterion must be less than or equal to the desired

precision for the variance components. REML variance component estimates from this program

have been shown to have more desirable properties (variance and bias) than other commonly used

estimation techniques (maximum likelihood, minimum norm quadratic unbiased estimation and

Henderson's Method 3) over a wide range of data imbalance. The properties of the estimates are

further enhanced by using individual observations as data rather than plot means. The output is

labelled by the variance component estimated.


Predictions of Random Variables


The predictions output are for general and specific combining abilities and are approximate

best linear unbiased predictions (BLUP) of the random variables. BLUP predictions have several

optimal properties: 1) the correlation between the predicted and true values is maximized; 2) if

the distribution is multivariate normal then BLUP maximizes the probability of obtaining the

correct rankings (Henderson 1973) and so maximizes the probability of selecting the best

candidate from any pair of candidates (Henderson 1977).

Predictions are of the form:

û = DZ'V^{-1}(y - Xβ̂)    5-8

where û is the vector of predictions;

D is the estimated covariance matrix for the random variables, constructed from the REML

variance component estimates (see equation 5-5);

Z' is the transpose of the design matrix for random variables;

y is the data vector;

X is the design matrix for fixed effects;