AN EXPLORATION OF THE RELIABILITY ON BEHAVIOR RATING INVENTORY OF EXECUTIVE FUNCTION (BRIEF) USING MULTIVARIATE GENERALIZABILITY THEORY

By

XIAOZHEN SHEN

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2011
© 2011 Xiaozhen Shen
To my Parents
ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. David Miller, my supervisory committee chair, whose understanding and guidance have enriched my graduate training. I also thank my supervisory committee member, Dr. Cynthia Garvan, for providing me with the data source and valuable advice throughout the process. Their constructive suggestions and encouragement have expedited the completion of this thesis.

Appreciation also goes to the faculty members in the Research and Evaluation Methodology Program, Dr. James Algina and Dr. Walter Leite, for sharing their knowledge and wisdom during my graduate studies. I also wish to thank Angela Rowe, Karen Ledee, and Elaine Green for their kind assistance. In addition, I am grateful for the support of my friends during these years of life overseas; I thank them for being there for me like family, sharing the good and the bad.

No words can express my love for my parents. Their endless love and caring are my forever inspiration. This thesis is dedicated to my beloved parents.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Statement of the Problem
        Classical Test Theory Model
        Reliability
        [...]
        Generalizability Model
        Multivariate Generalizability Model
        Behavior Rating Inventory of Executive Function (BRIEF)

2 LITERATURE REVIEW
    Reliability
        [...]
        [...]
        [...] If Item Deleted
        Correction for Attenuation
    Generalizability Theory Overview
        Generalizability Model with One Facet p*R Random Design
        Variance Components
        Universe of Admissible Observations and Facets
        Generalizability (G) Studies and Decision (D) Studies
        Generalizability Coefficient and Dependability Index
        Generalizability Model with Two Facet p*I*R Random Design
        Random and Fixed Facets
        Generalizability Model with Two Facet p*I*R Design with Rater Facet Fixed
        Crossed and Nested Facets
        Generalizability Model with Two Facet p*I:R Design with Rater Facet Fixed
    Multivariate Generalizability Theory Overview
    Purpose of the Study

3 METHODOLOGY
    Research Questions
    [...]
        BRIEF Data
        Software: ALPHA: SPSS
        Missing Data
        [...]
            Assumptions
            Correlation and normality
            Descriptive statistics
            [...]
            Subscales inter-correlation
            Item-total statistics for subscale
    Multivariate Generalizability Analysis
        BRIEF Data
        Multivariate Generalizability Model
            G study design
            Software: MGT: mGENOVA
            Variance and covariance matrix
            Disattenuated correlation
            D study design
            D coefficients and SEMs for subscale variables
            D coefficients and SEMs for composite variables
            Eight configurations

4 RESULTS AND CONCLUSIONS

5 DISCUSSION AND FUTURE RESEARCH
    [...]
    Multivariate Generalizability Model
    Item Response Theory
    Questionnaire Brief Suggestions

APPENDIX

A BRIEF Questionnaire
B [...]
C Multivariate Generalizability D Study Configuration Results

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table
3-1 Inter-Item Correlation Matrix Under Scale Inhibit
3-2 Summary Item Statistics Under Subscale Inhibit
3-3 Item Statistics Under Subscale Inhibit
3-4 [...]
3-5 Inter-Correlation among the Eight Subscales
3-6 Item-Total Statistics for Subscale Inhibit
3-7 Estimated G Study Variance and Covariance Components
3-8 D Study Results for Individual Variables
3-9 D Study Results for Composite Variables
3-10 D Study Composite G Coefficient Change on the Variation of Facet Sample Size
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

AN EXPLORATION OF THE RELIABILITY ON BEHAVIOR RATING INVENTORY OF EXECUTIVE FUNCTION (BRIEF) USING MULTIVARIATE GENERALIZABILITY THEORY

By

Xiaozhen Shen

December 2011

Chair: M. David Miller
Major: Research and Evaluation Methodology

Cronbach's alpha, which investigates test internal consistency, is widely used as a measure of reliability. However, alpha is not always the best choice for estimating reliability. One significant limitation of Classical Test Theory is that its reliability estimation addresses only one source of error across an entire population. Therefore, Generalizability Theory is presented as its counterpart to examine the reliability of the instrument Behavior Rating Inventory of Executive Function (BRIEF), accounting for multiple sources of error existing in the measurement. Cronbach's alpha and Multivariate Generalizability Theory are illustrated and compared in this paper. These two reliability methods share similar language and notation, yet each has its own uniqueness and value. The high value of coefficient alpha indicates that the BRIEF questionnaire has strong internal consistency, but also signals that the items are redundant to a certain degree. The generalizability coefficient from Multivariate Generalizability Theory yields moderate estimates, suggesting that it is acceptable to generalize scores to other conditions under the same construct.
CHAPTER 1
INTRODUCTION

Statement of the Problem

Analyzing latent constructs such as job satisfaction, intelligence, attitude, or customer satisfaction requires instruments that measure the constructs accurately. A construct is defined as the hypothetical variable that is being measured (Hatcher, 1994). In psychometrics, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. In all social sciences, unreliable measurements create obstacles to successfully predicting behavior.

Cronbach's alpha, as a measure of internal consistency, has been widely used in test construction and test use. However, alpha is not always the best choice for estimating reliability. One significant limitation of Classical Test Theory (CTT) is that its reliability estimation addresses only one source of error across an entire population. Therefore, Generalizability Theory (GT) is presented as its counterpart to examine the reliability of instruments with multiple sources of error, such as the Behavior Rating Inventory of Executive Function (BRIEF). The term executive functioning is defined as the higher-order psychological abilities involved in task-oriented behavior under conscious control (Zelazo, 2003). The BRIEF has been useful in identifying differences in disorders such as Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorder (ASD), reading disabilities, and Traumatic Brain Injury (TBI).

Cronbach's alpha and the Multivariate Generalizability Theory (MGT) model are presented and compared in this paper to check the consistency of the instrument. These two reliability methods share similar language and notation, yet each has its own uniqueness and value.
Classical Test Theory Model

Reliability analysis is a common method used to construct reliable measurement scales. The statistics it generates enable researchers to build and evaluate scales within the classical test theory model. What does precise measurement mean? We can hypothesize that there is a theoretical construct, such as working memory, the ability to hold information in mind to complete tasks. All the items in the instrument measure this concept to a certain degree. Thus, a response to a particular item reflects two aspects: first, the participant's working memory ability, which is called the true score in this case; and second, some other aspect of the respective question or of the participant, which we call the error score. The true score and the error score together make up the measured ability to remember, which we call the observed total score. In general, every response to an item reflects partly the true score for the construct and partly some random error.

In Classical Test Theory (Spearman, 1907), the observed value (X) is divided into two components, a true score (T) and measurement error (E), as in the equation:

X = T + E, (1-1)

where the true score T, which serves as our ultimate goal, could be obtained by taking the mean of all the observed scores if the number of measurements N were infinite. This approach assumes that there is no substantial change in the true score for a participant. In reality, the true score serves as a hypothetical idea in classical test theory (Kline, 2005). Our aim is to find the observed score closest to the true score and to minimize the error score. Two sets of measurements on the same variable for the same individual will yield
different scores. Repeated measurements over more occasions would, however, show some consistency. Reliability thus measures the consistency from one set of measurements to another.

Reliability

A measurement is considered reliable if it captures mostly true score variance relative to error score variance. Test developers and users all expect that the result of an assessment can be replicated in order to support the upcoming decision. Reliability by definition refers to the consistent replication of a measurement procedure across conditions. That is, reliability is the level of accuracy or consistency of data from an instrument over a period of time, showing test stability when similar results are reported for similar data (Norusis, 1994; Tuckman, 1999). One thing to note is that a test itself is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees (Feldt & Brennan, 1989). Therefore, it is not a fixed parameter of the instrument itself.

To some extent, however, all measurements are unreliable. It is unlikely that each student will earn the same score on repeated occasions or maintain the same rank order within the group. The underlying reason is that when a student responds to a set of test items, his or her score is subject to errors of measurement, since the score represents only a limited sample of items from the given domain obtained on one of many possible occasions. The problem is that we cannot directly compute the reliability index, since we cannot obtain the value of the true score. However, we can estimate the reliability coefficient as the correlation between parallel tests, which equals the squared correlation between the observed score (X) and the true score (T). It is mathematically equivalent to the standard coefficient of determination in the Analysis of Variance (ANOVA) framework.
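As a minimal sketch of the decomposition X = T + E, the following Python snippet uses small hypothetical score vectors (not BRIEF data) in which the errors are constructed to have mean zero and zero covariance with the true scores. Under that assumption, the ratio of true score variance to observed score variance and the squared correlation between X and T give the same value. The helper names are illustrative only.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance of two equal-length score lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Hypothetical true scores and errors for four examinees; the errors have
# mean zero and zero covariance with the true scores by construction.
true_scores = [-2.0, -2.0, 2.0, 2.0]
errors = [-1.0, 1.0, -1.0, 1.0]
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = covariance(true_scores, true_scores)
var_observed = covariance(observed, observed)

# Reliability as the share of observed variance that is true score variance.
reliability_ratio = var_true / var_observed

# Reliability as the squared correlation between observed and true scores.
corr_xt = covariance(observed, true_scores) / sqrt(var_observed * var_true)
reliability_corr = corr_xt ** 2

print(reliability_ratio, reliability_corr)
```

In practice the true scores are never observed, which is why the text turns to parallel tests and, later, to internal consistency estimates.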
The standard error of measurement (SEM) is an average of the individual standard errors around the mean of the distribution of their true scores. The smaller the standard error of
measurement, the closer the random errors cluster around the true score. The reason we prefer the variance to the standard deviation for computing reliability is that reliability is independent of the unit of measurement. The higher the reliability of a measure, the higher the proportion of the observed score that is attributable to the true score.

Because reliability estimation in classical test theory addresses only one source of error across an entire population, its sensitivity to different sources of error gives rise to different reliability coefficients for different designs, such as test-retest, alternate forms, inter-rater, and internal consistency (Crocker & Algina, 1986). In educational settings, a test is most likely to be administered only once. A special case of the Spearman-Brown correction was the first solution used to estimate reliability in that situation; it was based on the correlation between two half tests rather than two full tests (Brown, 1910; Spearman, 1910). Among the later solutions based on the domain sampling model (Lord, 1955), such as the six coefficients considered by Guttman (1945), coefficient alpha (Cronbach, 1951) was the easiest to compute and to understand. The beauty of alpha was that it was the average of all such random splits (Cronbach, 1951). Considering that reliability estimates were better suited to test-retest measurements, Guttman (1945) contributed a series of lower-bound estimates for the reliability of a single test. As the most widely accepted of these estimates, Cronbach's alpha estimates the reliability of a scale by determining the internal consistency of the test, or the average correlation of items within the test (Cronbach, 1951). It assumes that the covariances between items represent true covariance, while the variances of the items contain both true and unique variance.
Therefore, the variance of a test can be viewed as the sum of the true covariances, true variances, and error variances. As we add each item to the test, the total test variance grows by that item's variance and by its covariances with the other items.
Thus, as items are added, a larger proportion of true score variance is reflected in the sum scale.

A high value of alpha indicates strong internal consistency among the test items, showing that all of them contribute to a reliable scale. Generally, it implies that respondents who tend to score high on one item are also likely to score high on other items under the same construct. Had alpha been low, this predictive ability would be low. The reliability coefficient ranges between zero and one. Nunnally (1978) suggested 0.70 as an acceptable reliability coefficient; values below this threshold are seen as inadequate. The most common "rule of thumb" is that alpha should exceed 0.80. Very high reliabilities (0.95 or higher) are not necessarily desirable, as they indicate that the items may be redundant. In practice, scales with lower reliabilities are often used; this varies by discipline.

Generalizability Model

Classical test theory is traditionally used to investigate reliability. However, one limitation of CTT is that it allows only a single residual term. Generalizability theory was later developed to overcome this limitation, accounting for multiple sources of error in the measurement. Study designs for estimating reliability coefficients using classical test theory can also be described using the language and notation of generalizability theory (Brennan, 2006). The reasons for favoring generalizability theory over CTT are summarized as follows (Shavelson, 1991). First, generalizability theory allows researchers to estimate multiple independent sources of error variance in a measure simultaneously. Second, the estimated variance components can serve as guidance for decision studies (D studies); to be precise, the degree of error variation can be adjusted in accordance with the desired accuracy of measurement.
Third, generalizability theory allows the estimation of test score reliability based
on whether the scores will be used to make relative (norm-referenced) or absolute (criterion-referenced) decisions. In this sense, generalizability theory expands classical test theory in that the reliability of scores depends on how we use them. Under generalizability theory, one can manipulate the number of conditions of any facet or combination of facets and observe the effect of the changes on the error variances and the resulting coefficients (the generalizability coefficient and the dependability index). Comparing D study designs enables the selection of a design with optimal precision.

Multivariate Generalizability Model

Generalizability theory with only one universe score associated with the object of measurement is called Univariate Generalizability Theory (UGT), while generalizability theory with two or more universe scores associated with the object of measurement is called Multivariate Generalizability Theory (MGT). A personality test in which each participant receives only one total score is an example for UGT. Tests like the Scholastic Aptitude Test (SAT) and the Graduate Record Examinations (GRE) are examples for MGT: they include verbal and math sections as well as writing or other sections, so two or more subset scores and one total score are obtained for each student. Under univariate generalizability theory, the universe score and error score estimates are based on variance components, while under multivariate generalizability theory the estimates are built on covariance components in addition to variance components. Without doubt, the univariate approach is much simpler. However, if a univariate approach is used to analyze a measurement with multiple content sections, as some studies have done, information about the covariance among the content sections is left out. It matters when it comes to test use.
Therefore, the choice between using univariate generalizability theory and multivariate generalizability theory depends on the nature of the measurement and the considerations for the universe of generalization.
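To illustrate why the covariance components matter, here is a small sketch of how the variance of a weighted composite is assembled from subscore variances and covariances. The two-section matrix and the equal weights are hypothetical values chosen for illustration; they are not estimates from the BRIEF, and the function name is an assumption of this sketch.

```python
def composite_variance(weights, cov_matrix):
    """Variance of a weighted composite: sum over i, j of w_i * w_j * cov_ij."""
    n = len(weights)
    return sum(
        weights[i] * weights[j] * cov_matrix[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical universe score variances (diagonal) and covariance
# (off-diagonal) for two content sections.
universe_cov = [
    [4.0, 3.0],
    [3.0, 9.0],
]
weights = [0.5, 0.5]

with_cov = composite_variance(weights, universe_cov)

# Treating each section separately carries no covariance information;
# zeroing the off-diagonal entries shows how much is left out.
variances_only = [
    [4.0, 0.0],
    [0.0, 9.0],
]
without_cov = composite_variance(weights, variances_only)

print(with_cov, without_cov)
```

With these made-up numbers, the covariance term accounts for the whole difference between the two results, which is the information a purely univariate treatment of the sections would not capture.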
Multivariate generalizability theory also has distinctive functions of its own. It is common for instruments to have a different number of items in each content section to measure the abilities. This unbalanced data problem increases the difficulty of variance component estimation. In addition, some measurements do not have a composite score; only profiles of subscores are provided. In these cases, the multivariate generalizability approach is better suited than the univariate approach. In sum, under certain circumstances, the univariate and multivariate approaches can produce similar results on a given data set. However, multivariate analysis provides more information that test developers and users can use when there are covariances between subscores.

Behavior Rating Inventory of Executive Function (BRIEF)

Executive functioning has been defined as a set of complex cognitive abilities associated with goal-oriented behavior. One test that aims to evaluate children's executive functioning behaviors is the Behavior Rating Inventory of Executive Function (BRIEF). The BRIEF has been useful in identifying differences in disorders such as Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorder (ASD), Reading Disorder, and Traumatic Brain Injury (TBI). The BRIEF questionnaire is attached in the Appendix. Unlike traditional cognitive measurements, which are administered under regulated laboratory conditions, the BRIEF is a parent and teacher rating scale of children's executive functioning behaviors in everyday situations and in home and school environments, measured across eight domains, or subscales.
Three scales (Inhibit, Shift, and Emotional Control) comprise the Behavioral Regulation Index (BRI), and the other five scales (Initiate, Working Memory, Plan/Organize, Organization of Materials, and Monitor) comprise the Metacognition Index (MI).
The eight clinical subscales in the BRIEF were based on expert judgments, not on statistical considerations, since in practice test specifications are based on expert judgments. The first scale, Inhibit, measures prohibitory behavior. The second scale, Shift, assesses the ability to move freely from one task or situation to another. The third scale, Emotional Control, measures the ability to control emotional responses, whereas the fourth scale, Initiate, measures the ability to start a task and independently generate ideas, responses, or problem-solving strategies. The fifth scale, Working Memory, involves holding information in mind to complete a task, and the sixth scale, Plan/Organize, measures the ability to manage current and future task demands. The seventh scale, Organization of Materials, assesses the orderliness of work, storage, and play areas, and the last scale, Monitor, assesses work-checking habits as well as behavioral monitoring ability. The two indexes, BRI and MI, combine to form the Global Executive Composite (GEC) score.

The goal of the BRIEF is to determine whether children display a distinctive pattern of strengths and weaknesses on the eight scales. This is crucial for determining whether there are areas of weakness within the domain of executive functioning. Follow-up instruction and remediation can then be developed to target specific areas of weakness or build on areas of strength.

In this paper, two reliability methods were used to analyze the BRIEF. Alpha analysis is the common method used to report the reliability of test scores in most published papers and reports. Here, the multivariate generalizability model was used as a comparison model to explore the reliability of the instrument, together with an introduction to the model and its interpretation of the underlying construct.
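The D study reasoning described under the Generalizability Model heading, varying the number of conditions of a facet and watching the coefficients respond, can be sketched for the simplest case of a one-facet crossed random design. The variance components below are hypothetical rather than BRIEF estimates, and the function name is an assumption of this sketch. The formulas are the standard ones for relative and absolute error in that design; the multivariate design analyzed in this thesis is more involved.

```python
def d_study(var_person, var_facet, var_residual, n_conditions):
    """Return (generalizability coefficient, dependability index) for a
    one-facet crossed random design with n_conditions sampled conditions."""
    # Relative error: only the person-by-facet residual affects rank order.
    relative_error = var_residual / n_conditions
    # Absolute error: the facet main effect also counts.
    absolute_error = (var_facet + var_residual) / n_conditions
    g_coefficient = var_person / (var_person + relative_error)
    dependability = var_person / (var_person + absolute_error)
    return g_coefficient, dependability


# Hypothetical G study variance components (not BRIEF estimates).
VAR_PERSON, VAR_FACET, VAR_RESIDUAL = 0.50, 0.10, 0.40

results = {
    n: d_study(VAR_PERSON, VAR_FACET, VAR_RESIDUAL, n) for n in (1, 4, 8)
}
for n, (g, phi) in results.items():
    print(f"n = {n}: G = {g:.3f}, Phi = {phi:.3f}")
```

Both coefficients rise as more conditions are sampled, and the dependability index never exceeds the generalizability coefficient because absolute error includes the facet main effect.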
CHAPTER 2
LITERATURE REVIEW

Reliability

Reliability is the second most important issue in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), which require test developers and users to obtain and report evidence concerning reliability and errors of measurement as a standard practice. Reliability by definition is the consistency of the measurement within a context (Crocker & Algina, 1986).

Reliability under classical test theory rests on several assumptions about the random errors. First, the random errors are assumed to have a mean of zero. Second, the random errors are uncorrelated with each other and with the true score, which means there is no underlying connection between them. Given these assumptions, the errors of measurement can be assigned to two categories. One is systematic measurement error, which is not easy to detect. For example, one teacher may grade much more strictly than other teachers; this has nothing to do with the construct being measured, but it consistently affects the students' scores and results in different mean scores for different teachers. The other is random measurement error. It does not have any consistent effect and is sometimes considered noise in the measurement. It contributes variability to the data but has less influence on the average performance over repeated measurements.
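A small sketch may help separate the two kinds of error. All scores below are hypothetical: one teacher applies a constant penalty (systematic error), while the other adds zero-mean noise that differs from student to student (random error). The noise values are chosen so that they have no covariance with the true scores.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical true scores for six students.
true_scores = [70.0, 75.0, 80.0, 85.0, 90.0, 95.0]

# Systematic error: a strict teacher takes 5 points off every student.
strict_teacher = [t - 5.0 for t in true_scores]

# Random error: zero-mean noise that differs from student to student.
noise = [2.0, -1.0, -1.0, -1.0, -1.0, 2.0]
noisy_teacher = [t + e for t, e in zip(true_scores, noise)]

# The systematic error moves the mean but leaves the spread untouched.
mean_shift_strict = mean(strict_teacher) - mean(true_scores)
var_change_strict = variance(strict_teacher) - variance(true_scores)

# The random error leaves the mean alone but adds variability.
mean_shift_noisy = mean(noisy_teacher) - mean(true_scores)
var_change_noisy = variance(noisy_teacher) - variance(true_scores)

print(mean_shift_strict, var_change_strict)
print(mean_shift_noisy, var_change_noisy)
```

The constant penalty would go unnoticed in any variance-based reliability coefficient, which is one reason systematic error is hard to detect.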
Because reliability is sensitive to different sources of error in different situations, multiple types of reliability estimates have been developed. Four types are briefly discussed as follows. First, in the test-retest method, reliability is estimated as the Pearson product-moment correlation coefficient between two administrations of the same measure. Second, in the parallel forms method, reliability is estimated as the consistency between two tests constructed from the same content domain. Third, the inter-rater method assesses the degree to which different raters give consistent estimates of the same measurement. Lastly, the internal consistency method assesses the consistency of results across items within a test (Crocker & Algina, 1986). Details on the internal consistency method are discussed in later sections.

Reliability can be obtained as the ratio of the true score variance to the total score variance, using the correlation of two parallel tests or of some less restrictive forms. Parallel tests have equal means and equal variances. Less restrictive forms such as tau-equivalent forms have equal means but not necessarily equal variances. Essentially tau-equivalent forms have no observed equalities; their true scores differ by a constant for each pair of forms (Equation 2-1), so their true score variances are equal while their error variances need not be. Congeneric forms have no observed equalities either, but their true scores are perfectly correlated in a linear sense (Crocker & Algina, 1986).

$$T_{1} = T_{2} + c_{12} \tag{2-1}$$

Given the correlation between two tests, but not the true scores or their variances (we would not bother with reliability if the true scores were attainable), we are faced with ten unknown quantities (four variances and six covariances). However, if the two tests happen to be parallel with uncorrelated errors, the ten unknowns drop to three and the reliability of each test can be obtained (Revelle & Zinbarg, 2009):

$$\rho_{XX'} = \frac{\sigma_{TX}^{2}}{\sigma_{T}^{2}\,\sigma_{X}^{2}} = \frac{\sigma_{T}^{2}}{\sigma_{X}^{2}} \tag{2-2}$$

where $\rho_{XX'}$ stands for the reliability coefficient, $\sigma_{TX}$ is the covariance between the true score and the total score, $\sigma_{T}^{2}$ is the variance of the true score, and $\sigma_{X}^{2}$ is the variance of the total score.

On most occasions, however, only one test administration is available. Thus the issue of how much information is available from a single testing session is very important. The Kuder-Richardson (KR) formulas, split-half reliability, and Cronbach's alpha coefficient are common methods used to examine the internal consistency within a single test. The Kuder-Richardson method is limited in that it is applicable to dichotomous data only. The split-half method, instead of measuring consistency across two measurements, separates one test into two halves to explore its internal consistency. The Spearman-Brown prediction formula comes into play when the reliability of one of the subtests or halves is known, assuming that the halves are parallel, with identical means and variances:

$$\rho_{SB} = \frac{k\,\rho_{XX'}}{1 + (k - 1)\,\rho_{XX'}} \tag{2-3}$$

where $\rho_{SB}$ is the Spearman-Brown corrected coefficient, $k$ is the number of subtests combined, and $\rho_{XX'}$ is the known reliability of the subtest. Psychometricians also use this prediction formula to estimate the effect of shortening a test or doubling its length.

Compared with the above two methods, Cronbach's alpha is favored because it can handle both dichotomous data and large-scale data. More importantly, whereas the split-half method uses a single pair of half tests, Cronbach's alpha is the mean of all possible split-half estimates computed using the Rulon method (Crocker & Algina, 1986). It assumes
essentially tau-equivalent items and rests on Norm-Referenced Test (NRT) and random-error-only assumptions. Unlike the test-retest coefficient, coefficient alpha is a lower bound of (less than or equal to) the reliability coefficient, and it can be computed for both the raw variables and the standardized variables. On the raw scale, test items have different weights and contribute their own variability to the resulting scale, while items in standardized form all carry equal weight. People may favor the standardized alpha over the raw alpha in the belief that standardization normalizes skewed data. In fact, standardization is simply a linear transformation; it does not normalize the data any better than the raw scale does. Standardized alpha is normally used when scales are not comparable and variances are heterogeneous, as in cases where a mixture of dichotomous and polytomous items is included.

Cronbach (1951) described alpha as a lower bound of reliability and discussed the relationships between alpha and several other correlations, especially the test-retest correlation and the split-half correlation. Later, alpha was popularly interpreted as a measure of a test's internal consistency. By now the interpretation of alpha as a measure of internal consistency has gained more ground than the lower-bound interpretation in practical test construction and test use (Sijtsma, 2008). However, both interpretations have their own issues. The internal-structure interpretation treats alpha as evidence that the items measure a single trait, but alpha does not support that reading. When an instrument delivers a high alpha, the trait validity of the instrument is often taken for granted, so no further investigation of the dimensions of the instrument is carried out. As to the other interpretation of alpha, some researchers treat
alpha as the reliability itself instead of a lower bound to the reliability. In some cases it is a gross underestimate and a poor estimate of internal consistency, and in other cases a gross overestimate (Cortina, 1993; Cronbach, 1951; Green, Lissitz & Mulaik, 1977; Revelle, 1979; Schmitt, 1996; Zinbarg, Yovel, Revelle & McDonald, 2006). Despite years of critiques and warnings about the usage of coefficient alpha, it continues to be used. More appropriate estimates have remained unpopular among researchers because they are neither explained in language that is easy to comprehend nor available in user-friendly programs like SPSS (Revelle, Zinbarg, 2008). Borsboom (2006) noted that the availability of statistical methods in SPSS greatly increases their exposure to empirical researchers. Alpha is included among the statistical methods in SPSS, and so are the other five lower bounds proposed by Guttman (1945).

In Revelle (2009), a comparison of thirteen estimates of reliability was presented, drawing on the Guttman (1945) bounds, the coefficients of McDonald (1999), and those of Ten Berge. Alpha was much lower than some other reliability estimates (for example, the glb). Alpha tends to lead to an undercorrection when only the general factor in a multidimensional test accounts for most of the underlying causes, in which case one of the omega coefficients produces a more accurate correction. In other cases, when more than one dimension accounts for the causes of the multidimensional test, alpha could lead to an overcorrection, while another omega estimate yields a more precise correction. Revelle (2009) further supported the argument that alpha is a poor index of both multidimensionality and unidimensionality. People are likely to stop investigating test dimensionality once a high alpha value is obtained. The difference between test internal consistency and test dimensionality occasionally confuses people (Gardner, 1995; 1996).
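The claim that a high alpha does not establish unidimensionality can be checked numerically. The following is a minimal sketch, assuming a hypothetical 20-item test made of two mutually uncorrelated clusters of 10 items with a within-cluster correlation of 0.6; the matrix and function names are illustrative and are not taken from the BRIEF data.

```python
import numpy as np

def cronbach_alpha_from_cov(cov):
    """Coefficient alpha from an item covariance (or correlation) matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    # trace = sum of item variances; cov.sum() = variance of the sum score
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())

def two_cluster_corr(items_per_cluster, within_r):
    """Hypothetical correlation matrix: two clusters that do not correlate."""
    block = np.full((items_per_cluster, items_per_cluster), within_r)
    np.fill_diagonal(block, 1.0)
    zeros = np.zeros_like(block)
    return np.block([[block, zeros], [zeros, block]])

corr = two_cluster_corr(items_per_cluster=10, within_r=0.6)
alpha = cronbach_alpha_from_cov(corr)
print(round(alpha, 3))  # about 0.888 even though the test has two dimensions
```

Under these assumed values alpha is close to 0.89 although half of the item pairs are uncorrelated, which is why a high alpha alone should not end the investigation of dimensionality.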
In fact, unidimensionality has been used as a synonym for homogeneity, while internal consistency is more often
interpreted as interrelatedness of items. More specifically, interrelatedness may indicate that some test items are correlated with certain groups of items, or that each item is correlated with every other item in the test. In other words, a high alpha can arise either when each test item shares variance with all other items or when each item shares variance with only some of the other items. To summarize the relationship between these similar definitions in one sentence: a unidimensional test entails high internal consistency, but not vice versa. If no common variance is detected among the items, then they are neither internally consistent nor unidimensional, and a low alpha will result from the lack of item interrelatedness. On the other hand, if all of the items share common variance with each other, they will yield a high alpha as an indication of both internal consistency and unidimensionality. Between the two situations, when one set of items shares common variance with some other sets of items but not all, alpha can still be high. Thus the instrument is considered internally consistent. However, unidimensionality no longer holds because of the presence of the subscales. In this case, as with our instrument, the BRIEF, the data should be interpreted in terms of internal consistency together with multidimensionality. The reliability can be estimated when the variances and covariances of the subscales are available.

As mentioned, the more items are included in a test, the more true score variance will be reflected in the total score variance. According to the definition of the variance of a composite score, the variance of the sum of two items equals the sum of the two item variances plus two times the covariance:

\[ \sigma^2_{x+y} = \sigma^2_x + \sigma^2_y + 2\sigma_{xy} \tag{2-4} \]
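where $\sigma^2_{x+y}$ is the variance of the sum of the two items, $\sigma^2_x$ is the variance of item x, $\sigma^2_y$ is the variance of item y, and $\sigma_{xy}$ is the covariance, that is, the true score variance common to items x and y. Cronbach's alpha is estimated by comparing the sum of item variances with the variance of the sum scale (Crocker & Algina, 1986):

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right) \tag{2-5} \]

where k is the number of items, $\sum\sigma_i^2$ is the sum of the variances of the items i, and $\sigma_X^2$ is the variance of the sum of all items. In the first situation, in which all items measure different things, the variance of the composite will be the same as the sum of the variances of the individual items, since the common variance is zero; the ratio is then 1 and alpha is zero. On the other hand, if all items measure the same thing (perfectly correlated items with equal variances), $1 - \sum\sigma_i^2/\sigma_X^2$ becomes equal to $(k-1)/k$, and multiplying this by $k/(k-1)$ gives a coefficient alpha of 1.

When some items carry mostly error variance, the alpha-if-item-deleted analysis enables us to have them deleted, so that a greater proportion of the true score variance is reflected in the scale. Alpha for the remaining k - 1 items after deleting item j is given by:

\[ \alpha_{(-j)} = \frac{k-1}{k-2}\left(1 - \frac{\sum_{i \neq j}\sigma_i^2}{\sigma^2_{X_{(-j)}}}\right) \tag{2-6} \]

If the resulting alpha increases after an item is deleted from the scale, we can infer that substantial error variance has been discarded. On the contrary, if the alpha coefficient decreases,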
then good common variance has been deleted, and we strongly recommend that the item be retained.

Correction for Attenuation

Reliability is always considered along with validity. Both are very important concepts in quantitative research. Generally, reliability estimates the consistency of the measurement, whereas validity assesses its accuracy. However reliable a measurement such as the SAT may be, it will not be a valid test for admission to graduate school, since it does not evaluate the specific concept that is intended to be measured. Obtaining the less-than-perfect reliability of the measurement scale is one of our primary interests in this study. But if we want to look into the validity of the measure and use one of our scales (Working Memory) to predict some other criterion, such as participants' IQ scores, correction for attenuation is recommended. If our scale correlates with the criterion IQ, confidence in the validity of the scale will increase. During test construction, the complex process of scale validation requires linking the scale to all relevant external criteria that purport to measure the construct. The validity of a scale will always be limited by its reliability. When the estimated reliability of a measure is 0.80, only 80% of the variance in the scale is true score variance that can correlate with the criterion variable; in other words, the observed correlation between the two scales will be attenuated. To estimate the actual correlation of true scores in both measures, we can correct the attenuated correlation by dividing the correlation by the square root of the product of the reliabilities of the two scales:

\[ \rho_{T_x T_y} = \frac{r_{xy}}{\sqrt{r_{xx'}\, r_{yy'}}} \tag{2-7} \]
where $\rho_{T_x T_y}$ is the corrected correlation of true scores between the two measures x and y, $r_{xy}$ is the uncorrected correlation, and $r_{xx'}$ and $r_{yy'}$ are the reliabilities of measures x and y, respectively. The procedure of correction for attenuation makes clear that how well the scales are measured affects not only the correlation of the scales but also the validity of the scale.

Generalizability Theory

Overview

To overcome the restriction that classical test theory addresses only one true score and one single error term, generalizability theory was introduced by Cronbach and his colleagues (Cronbach, Rajaratnam, & Gleser, 1963; Cronbach, Gleser, Nanda, & Rajaratnam, 1972), extending the classical framework that dates back to Spearman (1904). The subsequent Generalizability Theory: A Primer (1991) by Shavelson and Webb gave a clear description of the logic underlying the major concepts. Furthermore, Brennan made a significant contribution with his book Generalizability Theory (2001a) by presenting a comprehensive and up-to-date treatment of the theory.

In classical test theory, validity is seldom addressed within studies of reliability. However, validity plays a more important role than reliability, since there is no reason to use an inaccurate test even if the test measures consistently. Generalizability theory, on the other hand, combines reliability and validity, providing grounds for investigating and designing reliable and accurate measurement. Generalizability theory and reliability analysis under classical test theory share a matching definition of reliability as the ratio of the true score variance to the observed score variance:

\[ \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \tag{2-8} \]

When raters are taken into account in the assessment as well, the total score variance grows into three components:
\[ \sigma_X^2 = \sigma_T^2 + \sigma_R^2 + \sigma_E^2 \tag{2-9} \]

where $\sigma_X^2$ is the variance of the observed score, $\sigma_T^2$ is the variance of the true score, $\sigma_R^2$ is the variance of the rater, and $\sigma_E^2$ is the variance of the random error. The new reliability remains unchanged when all subjects are assessed by only one rater. Yet if raters are randomly assigned to subjects, the new reliability becomes:

\[ \rho^{*}_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_R^2 + \sigma_E^2} \tag{2-10} \]

Coefficient alpha produces a higher value than this coefficient, so if coefficient alpha were used for this design, the reliability would be overestimated.

Generalizability Model with One Facet p*R Random Design

Cronbach et al. (1972) defined the generalizability coefficient ($E\rho^2$), similar to the reliability coefficient in classical test theory, as the ratio of the universe score variance to the expected value of the observed score variance relative to the subjects. For the one-facet p*R design,

\[ E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pR}^2} \tag{2-11} \]

where $E\rho^2$ is the generalizability coefficient, $\sigma_p^2$ is the true score variance, and $\sigma_{pR}^2$ is the variance of the student-by-rater interaction. The $E\rho^2$ coefficient can be interpreted and used like a reliability coefficient; in fact, it is equal to the coefficient alpha reliability if $n'_r = n_r$ and is the same as alpha adjusted for a change in test length using the Spearman-Brown formula if $n'_r \neq n_r$ (Brennan, 2006).

Additionally, generalizability theory offers a dependability index ($\Phi$) in the current design, which has no analog in classical test theory. As the absolute reliability coefficient, the dependability index measures participants' performance against established standards by dividing the universe score variance by the expected value of the overall observed score variance:
\[ \Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_R^2 + \sigma_{pR}^2} \tag{2-12} \]

The theory and application of reliability coefficients under classical test theory and of coefficients under generalizability theory all depend on an assumption of random sampling. Persons, the objects of measurement, are assumed to be a random sample from the population. The estimation of variance components also depends on the assumption that the levels of each facet are randomly sampled from among all possible conditions. The coefficients in different generalizability designs will be described with illustrations in later sections.

Beyond being a simple combination of classical test theory and analysis of variance, generalizability theory excels in distinguishing different errors, elaborating variance components, and making appropriate estimations under its unique conceptual framework, which comprises a universe of admissible observations and generalizability (G) studies, as well as a universe of generalization and decision (D) studies.

In sum, generalizability theory serves as a flexible and powerful tool for assessing measurement reliability. It allows us not only to investigate the instrument but also to design reliable observations, using the decomposed sources of variation in the measurement and then minimizing the measurement error(s) to reach an optimal design. One of the challenges in generalizability theory is to distinguish among different measurement designs. Designs from Generalizability Theory are used to illustrate the conceptual and statistical issues in the following sections.

Variance Components

Variance components represent different sources of variability among responses to a measure. In other words, different observed scores for examinees can be caused by various
reasons. For example, an individual's score on a particular item can be affected by a person effect (p); an item effect (i), a source of variability caused by well or poorly written items; and a residual variance component that includes the person-by-item interaction (pi) as well as other unspecified effects. Therefore, an observed score for one individual on one item can be stated as:

\[ X_{pi} = \mu + (\mu_p - \mu) + (\mu_i - \mu) + (X_{pi} - \mu_p - \mu_i + \mu) \tag{2-13} \]

where $X_{pi}$ is the observed score for any person in the population on any item in the universe, $\mu$ is the grand mean score over all persons and items, $\mu_p$ is the average score for person (p) over all items, and $\mu_i$ is the average score for item (i) over all persons. The term $(\mu_p - \mu)$ represents the person effect, $(\mu_i - \mu)$ is the item effect, and $(X_{pi} - \mu_p - \mu_i + \mu)$ is the residual effect, which includes the person-by-item interaction and all other sources of error not specified in the design. Hence, the variance of the scores can be presented as

\[ \sigma^2(X_{pi}) = \sigma_p^2 + \sigma_i^2 + \sigma_{pi,e}^2 \tag{2-14} \]

where $\sigma^2(X_{pi})$ is the variance of the observed scores, $\sigma_p^2$ is the variance of persons, $\sigma_i^2$ is the variance of items, and $\sigma_{pi,e}^2$ is the variance due to the residuals. In this manner, the variance of the observed scores is partitioned into three independent sources of variation according to differences among persons, items, and the residual term. Unlike the one general error term in classical test theory, generalizability theory allocates specific variance components appropriate to the research design.

Universe of Admissible Observations and Facets

In generalizability theory, reliability and validity are closely associated: concepts ranging from the universe, to its facets, to the conditions for admissible observations, and finally to the individual observations are all defined to measure the latent construct they purport to interpret.
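The partition of observed score variance into person, item, and residual components can be sketched with the usual ANOVA (expected mean squares) estimators for a crossed persons-by-items table with one score per cell. This is a minimal illustration and not the software used in the thesis; the function name and the small score matrix are hypothetical.

```python
import numpy as np

def estimate_variance_components(scores):
    """ANOVA estimates for a crossed p x i design (rows = persons, columns = items)."""
    scores = np.asarray(scores, dtype=float)
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Mean squares for persons, items, and the residual (pi interaction plus error)
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    residual = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(residual ** 2) / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations; estimates can come out negative
    # in small samples, in which case they are commonly set to zero.
    return {
        "p": (ms_p - ms_res) / n_i,
        "i": (ms_i - ms_res) / n_p,
        "pi_e": ms_res,
    }

# Hypothetical ratings of 5 persons on 4 items (1 = Never, 3 = Often)
example = np.array([
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 2, 3],
    [1, 1, 1, 2],
    [2, 3, 3, 3],
])
print(estimate_variance_components(example))
```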
A detailed illustration of these terms and their relationships is presented in this part. Taking a math test as an example, its items are considered randomly sampled from a universe of admissible observations. A universe of generalization may consist of homogeneous items, trained raters, students of similar ability, and so on, which makes the universe a large pool for admissible substitution. The algebra, geometry and calculus sections can be seen as a section facet along with the rater facet and the student facet; each facet represents one source of variation in the measurement. One or more facets can be selected, depending on the test construction and use. Each section, each rater and each student can be considered a condition of the corresponding facet. A single score on an item is called an observation, which potentially reflects the variance components of all facets in the measurement design.

Generalizability (G) Studies and Decision (D) Studies

The G study and the D study perform different functions in generalizability theory, yet the two procedures remain connected in measurement development. The G study first classifies and allocates the different sources of variation in the observed scores. The D study then uses the obtained variance components to reduce error by gathering new samples from the universe of generalization and to design a measurement that meets the needs. The D study usually focuses on expected means rather than on a single observation as in the G study; thus the D study variance components are the G study variance components divided by the number of levels of the corresponding facet. The larger the number of levels of an error facet, the smaller the error variance. Generally, crossed designs are recommended for G studies for simplicity and clear classification of the effects; D studies, however, may use all kinds of designs as needed (Shavelson & Webb,
1991). For the purpose of presenting various generalizability models, D studies are applied to all of the following designs.

Generalizability Coefficient and Dependability Index

As presented for the one-facet p*R random design, a generalizability coefficient is the ratio of the universe score variance to the sum of the universe score variance and the relative error variance, mathematically equivalent to an intraclass correlation:

\[ E\rho^2 = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_\delta^2} \tag{2-15} \]

where $E\rho^2$ is the generalizability coefficient, $\sigma_\tau^2$ is the universe score variance, and $\sigma_\delta^2$ is the relative error variance. The G coefficient ranges between zero and one. In the one-facet p*R random design, the universe score variance is the variance component for the object of measurement ($\sigma_p^2$). It does not create error variance and thus is not considered a facet. Among the remaining variance components, only $\sigma_{pR}^2$ is related to the object of measurement (people), and thus it is classified as relative error variance. When the G coefficient is high, the scores from the G study are considered reliable enough to be generalized to other conditions.

The dependability index (phi coefficient) is similar to the G coefficient in that it is also a ratio of the universe score variance to the observed score variance, except that the observed score variance here is the combination of the universe score variance and the absolute error variance:

\[ \Phi = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_\Delta^2} \tag{2-16} \]

where $\Phi$ is the dependability index, $\sigma_\tau^2$ is the universe score variance, and $\sigma_\Delta^2$ is the absolute error variance. Absolute error variance includes all sources of error variance, but not the universe score variance.
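As a worked illustration of the two ratios, the sketch below computes both coefficients for a one-facet persons-by-raters D study from G study variance components. The function name and the variance component values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def one_facet_coefficients(var_p, var_r, var_pr_e, n_raters):
    """Generalizability coefficient and dependability index for a p x R design.

    var_p, var_r, var_pr_e are G study components for a single rater;
    n_raters is the number of raters averaged over in the D study.
    """
    relative_error = var_pr_e / n_raters             # only the person-by-rater term
    absolute_error = (var_r + var_pr_e) / n_raters   # every non-person term
    g_coefficient = var_p / (var_p + relative_error)
    phi = var_p / (var_p + absolute_error)
    return g_coefficient, phi

# Hypothetical components: persons 0.50, raters 0.10, residual 0.40, four raters
g, phi = one_facet_coefficients(var_p=0.50, var_r=0.10, var_pr_e=0.40, n_raters=4)
print(round(g, 3), round(phi, 3))  # 0.833 0.8
```

Because the absolute error adds the rater main effect to the relative error, the dependability index cannot exceed the generalizability coefficient in this design.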
The G coefficient ($E\rho^2$) is associated with a relative decision, which concerns the relative ranking of individuals. Filling a position with the three top-ranking applicants is a decision that relies on the relative interpretation of scores; norm-referenced measurements are usually used for such decisions. The dependability index ($\Phi$) is concerned with an absolute decision, which focuses on the absolute level of an individual's performance, independent of other participants' performance. It always involves fixed cut scores, such as passing or failing the bar examination for a law student; criterion-referenced measurements are used for this kind of decision.

The correct calculation of the generalizability coefficient and the dependability index relies on the accurate classification and assignment of the relative and absolute error in the measurement. Generally, the absolute error variance is larger than or equal to the relative error variance, and thus the corresponding dependability index is less than or equal to the generalizability coefficient.

Generalizability Model with Two Facet p*I*R Random Design

In a measurement design involving persons, items and raters, a p*I*R two-facet balanced design is formed. Person is not considered a facet because it is the object of measurement. The variance of scores (X) of the design is as follows:

\[ \sigma^2(X_{pIR}) = \sigma_p^2 + \sigma_I^2 + \sigma_R^2 + \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{IR}^2 + \sigma_{pIR,e}^2 \tag{2-17} \]

where the universe score variance is $\sigma_\tau^2 = \sigma_p^2$; the relative error variance is the sum of the components related to person, other than the universe score variance: $\sigma_\delta^2 = \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{pIR,e}^2$; and the absolute error variance is the sum of all the variance components other than the universe score variance:

\[ \sigma_\Delta^2 = \sigma_I^2 + \sigma_R^2 + \sigma_{IR}^2 + \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{pIR,e}^2 \]
Therefore, the generalizability coefficient in the p*I*R two-facet balanced design is:

\[ E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{pIR,e}^2} \tag{2-18} \]

And the dependability index phi in the p*I*R two-facet balanced design is:

\[ \Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_I^2 + \sigma_R^2 + \sigma_{IR}^2 + \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{pIR,e}^2} \tag{2-19} \]

Note that in a G study facets are represented by lowercase letters (p*i*r), while in a D study uppercase letters (p*I*R) are used. The variance components in G studies are for a single p*i*r combination, while D studies consider average scores; thus the G study components should be divided by the numbers of conditions of items and raters, for example $\sigma_{pI}^2 = \sigma_{pi}^2 / n_i$.

Random and Fixed Facets

A facet is random when its conditions are randomly selected from the universe of admissible observations and can be generalized to other conditions of the same kind. A fixed facet is created when the decision maker selects certain conditions on purpose and is not interested in generalizing them to other conditions, when it is unreasonable to generalize beyond the current conditions, or when the entire universe of conditions is small and all conditions are already included in the measurement design (Shavelson & Webb, 1991).

Generalizability theory analyzes random facets using variance components, as presented in the p*I*R design. However, the variance of a fixed facet is not considered a source of error variance; like the variance of the object of measurement (people), it belongs to the universe score variance, not to the relative or absolute error variance. Generalizability theory deals with fixed facets through mean differences, by averaging over the conditions of the fixed facet (Cronbach et al., 1972). Yet sometimes it does not make conceptual sense to average over the conditions of a fixed facet. Therefore, a separate G study conducted within each
condition of the fixed facet is recommended (Shavelson & Webb, 1991), or a full multivariate generalizability analysis could be performed (Brennan, 2001).

Generalizability Model with Two Facet p*I*R Design with Rater Facet Fixed

Taking the above p*I*R design, the rater is now treated as a fixed facet, so only the current fixed raters are included in the measurement. The variance of scores (X) in a p*I*R design with the rater facet fixed becomes:

\[ \sigma^2(X_{pIR}) = \sigma_p^2 + \sigma_I^2 + \sigma_{pI}^2 + \sigma_{pR}^2 + \sigma_{IR}^2 + \sigma_{pIR,e}^2 \tag{2-20} \]

where the term $\sigma_R^2$ is no longer listed in the equation, for it becomes a mean difference and is no longer considered a source of error. The universe score variance becomes the variance component for the object of measurement (people) plus the interaction between people and raters: $\sigma_\tau^2 = \sigma_p^2 + \sigma_{pR}^2$. The relative error variance is the sum of the remaining variances related to person, other than the universe score variance: $\sigma_\delta^2 = \sigma_{pI}^2 + \sigma_{pIR,e}^2$. The absolute error variance is the sum of all the variances other than the universe score variance: $\sigma_\Delta^2 = \sigma_I^2 + \sigma_{IR}^2 + \sigma_{pI}^2 + \sigma_{pIR,e}^2$.

Therefore, the generalizability coefficient in the balanced p*I*R two-facet design with the rater facet fixed is:

\[ E\rho^2 = \frac{\sigma_p^2 + \sigma_{pR}^2}{\sigma_p^2 + \sigma_{pR}^2 + \sigma_{pI}^2 + \sigma_{pIR,e}^2} \tag{2-21} \]

And the dependability coefficient phi in the balanced p*I*R two-facet design with the rater facet fixed is:

\[ \Phi = \frac{\sigma_p^2 + \sigma_{pR}^2}{\sigma_p^2 + \sigma_{pR}^2 + \sigma_I^2 + \sigma_{IR}^2 + \sigma_{pI}^2 + \sigma_{pIR,e}^2} \tag{2-22} \]
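The effect of treating the rater facet as fixed rather than random can be seen by computing both sets of coefficients from the same G study components. The sketch below follows the classification of components described above; the dictionary keys, function name, and numeric values are hypothetical.

```python
def two_facet_coefficients(vc, n_items, n_raters, rater_fixed=False):
    """G coefficient and phi for a balanced p x I x R D study.

    vc holds G study components for a single item and a single rater under the
    keys 'p', 'i', 'r', 'pi', 'pr', 'ir', 'pir_e' (hypothetical naming).
    """
    # D study components: divide by the number of conditions averaged over
    var_I = vc["i"] / n_items
    var_R = vc["r"] / n_raters
    var_pI = vc["pi"] / n_items
    var_pR = vc["pr"] / n_raters
    var_IR = vc["ir"] / (n_items * n_raters)
    var_pIR = vc["pir_e"] / (n_items * n_raters)

    if rater_fixed:
        # The rater main effect drops out; person-by-rater joins the universe score
        universe = vc["p"] + var_pR
        relative = var_pI + var_pIR
        absolute = relative + var_I + var_IR
    else:
        universe = vc["p"]
        relative = var_pI + var_pR + var_pIR
        absolute = relative + var_I + var_R + var_IR

    return universe / (universe + relative), universe / (universe + absolute)

# Hypothetical G study components
components = {"p": 0.40, "i": 0.05, "r": 0.02, "pi": 0.20,
              "pr": 0.06, "ir": 0.01, "pir_e": 0.30}
g_random, phi_random = two_facet_coefficients(components, n_items=10, n_raters=2)
g_fixed, phi_fixed = two_facet_coefficients(components, n_items=10, n_raters=2,
                                            rater_fixed=True)
print(round(g_random, 3), round(phi_random, 3))
print(round(g_fixed, 3), round(phi_fixed, 3))
```

With these assumed values both coefficients are higher when raters are fixed, because the rater-related variance is no longer counted as error.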
Note that for any D study, the variance components from the G studies should be divided by the numbers of conditions of items and raters.

Crossed and Nested Facets

A crossed design occurs when all the conditions of a facet are combined with all the conditions of the other facets. If each participant takes both the pretest and the posttest, we say that the person facet is crossed with the occasion facet, denoted p × o, where p is the person and o is the occasion. A nested design is created when each set of conditions of one facet appears with one and only one condition of another facet. Take the GRE as an example: each subtest (verbal or math) has its own verbal items or math items. The verbal section includes only verbal items and the math section only math items; verbal items do not appear under the math section. In this case we say that the item facet is nested within the subtest facet, denoted i:s, where i is the item and s is the subtest. Crossed and nested facets can both appear in one measurement. For example, each student has three assignments, and each assignment has its own tasks that differ from the tasks in the other two assignments. Thus we say that persons are crossed with tasks, which are nested within assignments, denoted p × (t:a), where p represents person, t represents task and a represents assignment.

Generalizability Model with Two Facet p*I:R Design with Rater Facet Fixed

We set the rater facet in the crossed design p*I*R to be fixed in the previous design. In this design, we further nest the item facet within the rater facet. The design is now denoted p*I:R. Hence, its variance of scores (X) becomes:

\[ \sigma^2(X_{pI:R}) = \sigma_p^2 + \sigma_{pR}^2 + \sigma_{I:R}^2 + \sigma_{pI:R,e}^2 \tag{2-23} \]
where the terms $\sigma_I^2$ and $\sigma_{IR}^2$ are no longer listed in the equation, but a new term $\sigma_{I:R}^2$ appears, which corresponds to the sum of $\sigma_i^2$ and $\sigma_{ir}^2$ in the G studies. The universe score variance in this design is the variance component for the object of measurement (people) plus the interaction between people and raters, the same as in the fixed design above: $\sigma_\tau^2 = \sigma_p^2 + \sigma_{pR}^2$. The relative error variance becomes the sum of the remaining components related to person, other than the universe score variance: $\sigma_\delta^2 = \sigma_{pI:R,e}^2$. The absolute error variance becomes the sum of all the variances other than the universe score variance: $\sigma_\Delta^2 = \sigma_{I:R}^2 + \sigma_{pI:R,e}^2$.

Therefore, the generalizability coefficient in the balanced p*I:R two-facet design with the rater facet fixed and the item facet nested within the rater facet becomes:

\[ E\rho^2 = \frac{\sigma_p^2 + \sigma_{pR}^2}{\sigma_p^2 + \sigma_{pR}^2 + \sigma_{pI:R,e}^2} \tag{2-24} \]

And the dependability index phi in the same design becomes:

\[ \Phi = \frac{\sigma_p^2 + \sigma_{pR}^2}{\sigma_p^2 + \sigma_{pR}^2 + \sigma_{I:R}^2 + \sigma_{pI:R,e}^2} \tag{2-25} \]

Again, note that for this D study the variance components from the G studies should be divided by the numbers of conditions of items and raters.

Different G and D study designs are tailored to meet specific requirements and needs. Setting an ideal combination of conditions in each facet of a D study involves practical and budget considerations as well as matters of generalization. To obtain a certain level of the G coefficient, usually a higher one, we could use D studies to simulate more levels of facets
(items or raters) to minimize error and satisfy the objective. Different combinations of numbers of items and raters, and different designs, can be tried in order to obtain the best result.

Multivariate Generalizability Theory

Overview

For the last few decades, test developers and users have attempted to investigate the reliability of measurements that contain multiple subtests. Such data have the following characteristics: first, each examinee (object of measurement) has two or more universe scores representing subtests or profiles; second, the conditions of the subtests are fixed, that is, the selected conditions are the only ones of interest and will not be generalized to other conditions; third, the number of items in each condition of the subtests (or profiles) might not be equal, which means the data are unbalanced; fourth, the analysis decomposes both variances and covariances into components, as opposed to univariate analysis, which accounts only for variance components. Furthermore, researchers are concerned not only with each universe score but also with the composite universe score for the entire test. Multivariate generalizability theory, in contrast to univariate generalizability theory, was developed to meet this challenge (Rajaratnam, 1965; Shavelson & Webb, 1991; Brennan, 2001).

Separate univariate analyses and a multivariate analysis will produce similar results under two conditions. The first condition holds when the scores are uncorrelated. The multivariate generalizability analysis of a specific dataset can then be interpreted as multiple univariate analyses. The second condition holds when the scores are highly correlated. The univariate analyses and the multivariate analysis then produce nearly identical generalizability coefficients when the scores in the multivariate analysis have equal weights.
For all other cases, when the correlations among scores have intermediate values, the multivariate analysis is preferable, because its results depend on the patterns of both the variance and covariance components as well as on the magnitudes of the disattenuated correlations.
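How the variance and covariance components combine into a composite coefficient can be sketched as follows. This is a simplified illustration, assuming that items are nested within subscales so that the error terms of different subscales are uncorrelated, and that the composite is a weighted sum of subscale scores; the function names, weights, and matrices are hypothetical and are not results from the BRIEF data.

```python
import numpy as np

def composite_coefficient(universe_cov, error_var, weights):
    """Composite coefficient from person (universe score) variance-covariance
    components and per-subscale error variances (relative or absolute)."""
    w = np.asarray(weights, dtype=float)
    universe_cov = np.asarray(universe_cov, dtype=float)
    error_var = np.asarray(error_var, dtype=float)
    composite_universe = w @ universe_cov @ w
    # Assumption: errors are uncorrelated across subscales, so only variances enter
    composite_error = np.sum(w ** 2 * error_var)
    return composite_universe / (composite_universe + composite_error)

def disattenuated_correlations(universe_cov):
    """Correlations between universe scores of the subscales."""
    universe_cov = np.asarray(universe_cov, dtype=float)
    sd = np.sqrt(np.diag(universe_cov))
    return universe_cov / np.outer(sd, sd)

# Hypothetical two-subscale example with equal weights
universe_cov = np.array([[0.30, 0.12],
                         [0.12, 0.20]])
error_var = np.array([0.05, 0.08])
coef = composite_coefficient(universe_cov, error_var, weights=[0.5, 0.5])
corr = disattenuated_correlations(universe_cov)
print(round(coef, 3), round(corr[0, 1], 3))
```

Passing relative error variances yields a composite generalizability coefficient, while passing absolute error variances yields a composite dependability index.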
In the present design, each item belongs to one and only one level of the fixed facet (subscales). Rather than all the facets being crossed with each other, one facet is nested within the other. Brennan (2001) recommended performing a full multivariate analysis when a fixed facet exists in the research design. Cronbach (2004) mentioned that generalizability theory was developed to be a random effects theory. The limitation of having fixed effects was overcome in the multivariate generalizability theory model, in which the levels of the fixed facet are modeled as separate dependent variables in a multivariate design. Although the ANOVA method for estimating variance components is straightforward when applied to balanced data, issues arise with unbalanced data sets. The multivariate design avoids the problem of unbalanced data by analyzing corresponding parallel univariate designs. In the end, each univariate design has balanced data under each level of the corresponding fixed facet.

Purpose of the Study

Designing a reliable instrument means having items under the same construct relate to each other while each contributes adequately unique information. Reliability measures whether items that purport to measure the same general construct produce consistent scores. The BRIEF questionnaire was designed with items hypothesized to be sensitive to developmental and acquired neurological conditions. Since important decisions are made based on the test score, score reliability plays a very important role. It is crucial that scores are not easily biased by the particular sample of items or by the particular teacher who rates the examinee.
Reliability is commonly measured with Cronbach's alpha, a statistic calculated from the pairwise correlations between items. A detailed Cronbach's alpha analysis is presented in the next section. However, this lower-bound reliability coefficient has been shown many times to be either an overestimate or a poor estimate. The hierarchical coefficient omega may be a more appropriate index of the extent to which all of the items in a test measure the same latent variable (McDonald, 1999; Zinbarg, Revelle, Yovel & Li, 2005).

A more extensive model, multivariate generalizability theory, is applied as a comparison in a later section. The phi coefficient in multivariate generalizability theory, which is used to make absolute decisions, serves as the reliability coefficient of the BRIEF. The results from a G study are then used to inform a decision. D studies are typically performed to gain insight into how the precision of test scores would be affected by manipulating the various facets of the assessment. By determining which sources contribute most to measurement error, steps can be taken in test design to reduce those sources of error. It is therefore possible to examine how the generalizability coefficients would change under different circumstances and consequently to determine the ideal conditions under which the BRIEF questionnaire would achieve the desired reliability. The multivariate generalizability analysis is also used to investigate the relationships among the different subscales.

The purpose of this study is threefold: first, to further explore the effect of the two models on variance components, reliability and standard error of measurement; second, to investigate the differences and relationships among the eight subscales; and third, to illustrate the benefits of using a multivariate generalizability theory framework.
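The kind of D study exploration described above, varying the number of items to see when a desired dependability is reached, can be sketched for a single subscale with items as the only error facet. The variance components, the target value, and the function names below are hypothetical and serve only to show the mechanics.

```python
def phi_for_items(var_p, var_i, var_pi_e, n_items):
    """Dependability index for a p x I D study with n_items items."""
    absolute_error = (var_i + var_pi_e) / n_items
    return var_p / (var_p + absolute_error)

def min_items_for_target(var_p, var_i, var_pi_e, target, max_items=500):
    """Smallest number of items whose phi reaches the target, or None."""
    for n_items in range(1, max_items + 1):
        if phi_for_items(var_p, var_i, var_pi_e, n_items) >= target:
            return n_items
    return None

# Hypothetical G study components for one subscale
var_p, var_i, var_pi_e = 0.10, 0.07, 0.45
for n in (5, 10, 20, 40):
    print(n, round(phi_for_items(var_p, var_i, var_pi_e, n), 3))
needed = min_items_for_target(var_p, var_i, var_pi_e, target=0.80)
print(needed)  # 21 items under these assumed components
```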
CHAPTER 3
METHODOLOGY

Research Questions

In this chapter, two methods of reliability analysis, Cronbach's alpha analysis and multivariate generalizability theory, are applied to the BRIEF questionnaire. Four specific research questions were developed before the comparison was initiated. First, a brief summary of the reliability will be investigated. Second, the relationship between the eight subscales in the BRIEF will be studied. Third, the generalizability and dependability indices and the SEMs from the multivariate generalizability model will be examined for the eight subscales and the composite scale of the BRIEF. Fourth, the Cronbach's alpha model and the multivariate generalizability model will be discussed. With these four questions borne in mind, the next section begins with the Cronbach's alpha analysis, followed by an analysis of the consistency of the BRIEF questionnaire using the multivariate generalizability approach.

BRIEF Data

Software: ALPHA: SPSS

Seventy-three questions using Likert-type scales (1 = Never; 3 = Often) from the BRIEF questionnaire were administered. Usable survey forms totaling 1,089 of 1,318 forms (82.63%) were received and analyzed with the Cronbach's alpha reliability approach using the software SPSS (2009). Eight common factors were extracted during factor analysis and were interpreted to represent the factors Inhibit, Shift, Emotional Control, Initiate, Working Memory, Plan/Organize, Organization of Materials, and Monitor. According to the questionnaire, items Q9, Q38, Q42, Q43, Q45, Q47, Q57, Q58, Q59, and Q69 form the construct of Inhibit; Q4, Q5, Q6, Q13, Q14,
Q24, Q30, Q40, Q53, and Q62 form the construct of Shift; Q1, Q7, Q26, Q27, Q48, Q51, Q64, Q66, and Q72 form the construct of Emotional Control; Q3, Q10, Q19, Q34, Q50, Q63, and Q70 form the construct of Initiate; Q2, Q8, Q18, Q21, Q25, Q28, Q31, Q32, Q39, and Q60 form the construct of Working Memory; Q12, Q17, Q23, Q29, Q35, Q37, Q41, Q49, Q52, and Q56 form the construct of Plan/Organize; Q11, Q16, Q20, Q67, Q68, Q71, and Q73 form the construct of Organization of Materials; and Q15, Q22, Q33, Q36, Q44, Q46, Q54, Q55, Q61, and Q65 form the construct of Monitor. In theory, the items should be sampled from the domain defined by the concept. The sum of all the items under each subscale is computed as a composite score for the factor.

Missing Data

In surveys, respondents tend to skip questions that they do not want to answer. In addition, scanning devices tend to miss slight pencil marks. In both cases, we will have missing data in our data, and the Cronbach's alpha procedure will be stopped. To prevent this issue, listwise deletion in SPSS ignores cases that have missing values.

Summated scales are widely used in survey instruments to investigate the underlying constructs that researchers intend to measure. The scales are to be used later in the objective models. Experienced researchers would first run a Cronbach's alpha test before using them in subsequent analyses (Reynaldo, 1999). The higher the alpha, the lower the error/unique components of the items will be. The more items are included in the test, the greater the likelihood that errors will cancel each other out, suggesting that they are all measuring the same construct.
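The scoring steps described above (summing the items of a subscale after dropping forms with missing ratings) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout (one dictionary per form, `None` for a skipped item), the function names, and the three example forms are hypothetical, and deletion is applied per subscale here, which may differ from how SPSS applies listwise deletion across all analyzed items.

```python
# Item numbers per subscale as listed in the text (two of the eight shown).
SUBSCALE_ITEMS = {
    "Inhibit": [9, 38, 42, 43, 45, 47, 57, 58, 59, 69],
    "Initiate": [3, 10, 19, 34, 50, 63, 70],
}


def listwise_complete(forms, items):
    """Drop every form that has a missing (None) rating on any listed item."""
    return [f for f in forms if all(f.get(q) is not None for q in items)]


def subscale_score(form, items):
    """Summated scale: the plain sum of the item ratings."""
    return sum(form[q] for q in items)


# Hypothetical forms: item number -> rating on the 1-3 scale, None if skipped.
forms = [
    {q: 1 for q in range(1, 74)},
    {q: 3 for q in range(1, 74)},
    {**{q: 2 for q in range(1, 74)}, 45: None},  # respondent skipped Q45
]

kept = listwise_complete(forms, SUBSCALE_ITEMS["Inhibit"])
inhibit_scores = [subscale_score(f, SUBSCALE_ITEMS["Inhibit"]) for f in kept]
print(inhibit_scores)  # [10, 30]; the form with the missing Q45 is dropped
```

The third form is excluded from the Inhibit score because Q45 belongs to that subscale; it would still be usable for Initiate, which does not contain Q45.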
Assumptions

Correlation and normality

Table 3-1 Inter-Item Correlation Matrix Under Scale Inhibit

       Q9     Q38    Q42    Q43    Q45    Q47    Q57    Q58    Q59    Q69
Q9     1.000  .739   .717   .752   .715   .714   .718   .754   .733   .754
Q38    .739   1.000  .746   .761   .658   .726   .698   .748   .740   .809
Q42    .717   .746   1.000  .824   .728   .714   .711   .742   .691   .719
Q43    .752   .761   .824   1.000  .731   .774   .763   .796   .748   .776
Q45    .715   .658   .728   .731   1.000  .717   .693   .713   .679   .701
Q47    .714   .726   .714   .774   .717   1.000  .807   .806   .758   .761
Q57    .718   .698   .711   .763   .693   .807   1.000  .842   .752   .749
Q58    .754   .748   .742   .796   .713   .806   .842   1.000  .805   .798
Q59    .733   .740   .691   .748   .679   .758   .752   .805   1.000  .784
Q69    .754   .809   .719   .776   .701   .761   .749   .798   .784   1.000

Since alpha depends on the correlation coefficients, it is essential to make sure that the correlations are a valid measure of the strength of inter-item association. It is a very good idea to scatter plot each pair of variables and, if necessary, to test for non-linearity. If the item statements are expressed in different directions (that is, some are positive statements and some are negative), we should reverse the scales so that all statements run in the same direction; otherwise, when we sum the scores of all items, the ratings of positive statements and those of negative statements cancel each other out. From Table 3-1, we see fairly strong pairwise correlations between all items (.658 to .842), indicating that they go together in measuring the same construct, Inhibit. For small problems, we can often see patterns in the correlation matrix and try to improve reliability by dropping certain items; for larger problems, factor analysis is a better approach. Table 3-1 presents the pairwise correlations under the scale Inhibit; the pairwise correlations under the other seven scales are presented in Appendix B. All the corresponding items under each specific factor showed medium to high correlations, indicating that the items have high internal consistency for each factor.
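The reverse-scoring step mentioned above is a one-line transformation. The sketch below is a generic illustration for a 1 to 3 rating scale; the function name is hypothetical, and nothing in the text states that any particular BRIEF item needs to be reversed.

```python
def reverse_score(rating, low=1, high=3):
    """Mirror a rating around the midpoint of the scale (1 <-> 3, 2 stays 2)."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside the {low}-{high} scale")
    return low + high - rating


# Hypothetical ratings on a negatively worded item.
ratings = [1, 2, 3, 3]
reversed_ratings = [reverse_score(r) for r in ratings]
print(reversed_ratings)  # [3, 2, 1, 1]
```

After reversal, a high rating means the same thing on every item, so summing the items no longer lets positive and negative statements offset each other.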
Descriptive statistics

Table 3-2 Summary Item Statistics Under Subscale Inhibit

                          Mean    Minimum  Maximum  Range   Max/Min  Variance  N of Items
Item Means                1.570   1.443    1.671    .228    1.158    .005      10
Inter-Item Correlations   .746    .658     .842     .184    1.280    .002      10

In examining the descriptive statistics, we found that the item means cluster around 1.57 on the 1 to 3 scale (1 stands for Never, and 3 stands for Often); for subscale Inhibit the lowest item mean is 1.443 and the highest is 1.671. For the other factors (see Appendix), similar patterns were found, with item means for the different factors ranging from 1.420 to 1.627. Items that show extreme means (nearly 1 or 3) need further examination before being eliminated.

Table 3-3 Item Statistics Under Subscale Inhibit

Item   Mean   Std. Deviation   N
Q9     1.67   .732             1857
Q38    1.66   .753             1857
Q42    1.58   .748             1857
Q43    1.60   .759             1857
Q45    1.59   .738             1857
Q47    1.48   .715             1857
Q57    1.44   .687             1857
Q58    1.52   .724             1857
Q59    1.56   .729             1857
Q69    1.59   .739             1857

The spread of the data distribution can have an effect on the alpha estimates. The larger the spread of the variances of the items, the higher the resulting alpha would be. Note that the standard deviations (SD) of the items in subscale Inhibit range from 0.687 to 0.759; similar ranges are found for the other seven factors in Appendix B. There are differences in variability among the items, but all of them have enough variation to be useful; in other words, the variations are low, but not so low that the items should be thrown away.
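As a quick check on how the mean inter-item correlation in Table 3-2 relates to the standardized alpha, the standard formula alpha = k * r / (1 + (k - 1) * r) can be evaluated directly. The function name below is hypothetical; the inputs (10 items, mean correlation .746) are the values reported for subscale Inhibit.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha from the mean inter-item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


alpha_inhibit = standardized_alpha(10, 0.746)
print(round(alpha_inhibit, 3))  # 0.967
```

With ten items and a mean inter-item correlation of .746, the formula gives about .967, the standardized alpha reported for Inhibit. The same function shows why a long scale with highly correlated items pushes alpha toward its ceiling.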
Cronbach coefficient alpha

Table 3-4 Cronbach coefficient alpha

Scale                       Raw    Standardized
Inhibit                     .967   .967
Shift                       .944   .945
Emotional Control           .954   .955
Initiate                    .932   .932
Working Memory              .946   .947
Plan/Organize               .941   .941
Organization of Materials   .936   .937
Monitor                     .935   .936
Overall                     .991   .991

Taking the subscale Inhibit as an example, Inhibit is the sum of the observed variables Q9, Q38, Q42, Q43, Q45, Q47, Q57, Q58, Q59, and Q69, which measure the same construct. With 1089 subjects responding to the items, we can estimate the variance of each item as well as the variance of the sum scale Inhibit. The variance of the sum scale Inhibit will be larger than the sum of the ten item variances, since the variance of the sum scale contains an additional component, the covariances between the items, which are positive here. In our case, the BRIEF output has an overall raw alpha of .991, which is very high considering that .70 is the cutoff value for being acceptable. In fact, very high reliabilities (0.95 or higher) can be problematic, because they may imply that the items are entirely redundant. The raw alphas for the eight scales are .967 for Inhibit, .944 for Shift, .954 for Emotional Control, .932 for Initiate, .946 for Working Memory, .941 for Plan/Organize, .936 for Organization of Materials, and .935 for Monitor. All the alphas indicate strong internal consistency among the items under their respective constructs. It is a good sign that our instrument is considered internally consistent; however, alpha can yield overestimates in multidimensional measurement.

Subscales inter-correlation

One thing worth noting is that a negative correlation is not a problem in this part of the analysis. For example, if the instrument has a subscale measuring "positive attitude" and another subscale assessing "depression," it makes perfect sense that they should yield opposite estimates to each
other. On the other hand, a positive relationship could be a problem, because the concepts measured by the subscales are not distinct if the correlations are very high.

Table 3-5 Inter-correlation among the eight subscales

              p_inhibit  p_shift  p_emo_cntrl  p_initiate  p_wk_mem  p_plan  p_organi  p_monitor
p_inhibit     1.000      .711     .814         .745        .731      .731    .685      .896
p_shift       .711       1.000    .851         .733        .760      .756    .671      .769
p_emo_cntrl   .814       .851     1.000        .667        .656      .666    .623      .784
p_initiate    .745       .733     .667         1.000       .899      .917    .742      .869
p_wk_mem      .731       .760     .656         .899        1.000     .906    .778      .850
p_plan        .731       .756     .666         .917        .906      1.000   .808      .879
p_org         .685       .671     .623         .742        .778      .808    1.000     .788
p_monitor     .896       .769     .784         .869        .850      .879    .788      1.000

In Table 3-5, we can see that the correlations among the eight subscales are all relatively strong, ranging from .623 to .917, indicating that all subscales are dependent on each other. There are possibly overlapping portions of the concepts being measured in our instrument, although each of the subscales sustains a very high Cronbach coefficient alpha.

Item-total statistics for subscale

No item should be dropped simply by looking at its own mean or correlation; a good analysis of test items should take the whole test into consideration.

Table 3-6 Item-Total Statistics for Subscale Inhibit

Item   Scale Mean if   Scale Variance    Corrected Item-     Squared Multiple   Alpha if
       Item Deleted    if Item Deleted   Total Correlation   Correlation        Item Deleted
Q9     14.03           33.765            .832                .698               .964
Q38    14.04           33.521            .836                .734               .964
Q42    14.12           33.601            .832                .738               .964
Q43    14.10           33.131            .878                .790               .962
Q45    14.11           33.987            .795                .652               .965
Q47    14.23           33.751            .856                .753               .963
Q57    14.26           34.085            .850                .767               .963
Q58    14.18           33.418            .888                .815               .962
Q59    14.14           33.694            .844                .731               .963
Q69    14.11           33.418            .867                .774               .963
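Two of the columns in Table 3-6, the corrected item-total correlation and the alpha if an item is deleted, can be computed along the following lines. This is a minimal sketch using the textbook definitions rather than the SPSS implementation; the function names and the small ratings matrix are hypothetical and are not BRIEF data.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Raw alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_total_statistics(rows):
    """For each item: correlation with the sum of the other items, and the
    alpha of the scale formed by the remaining items."""
    k = len(rows[0])
    stats = []
    for j in range(k):
        rest = [[v for c, v in enumerate(r) if c != j] for r in rows]
        stats.append({
            "item": j,
            "corrected_item_total_r": pearson([r[j] for r in rows],
                                              [sum(r) for r in rest]),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return stats


# Hypothetical ratings: 6 respondents (rows) by 4 items (columns), 1-3 scale.
ratings = [
    [1, 1, 1, 1],
    [1, 2, 1, 1],
    [2, 2, 2, 1],
    [2, 2, 3, 2],
    [3, 3, 3, 2],
    [3, 3, 3, 3],
]
alpha = cronbach_alpha(ratings)
item_stats = item_total_statistics(ratings)
print(round(alpha, 3))
for s in item_stats:
    print(s["item"], round(s["corrected_item_total_r"], 3),
          round(s["alpha_if_deleted"], 3))
```

An item whose removal raises alpha, or whose corrected item-total correlation is low, would be a candidate for closer inspection, which is the reading applied to Table 3-6.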
The results in Table 3-6 are helpful for identifying individual items that might be troublesome. First, we look for the weakest correlation between an item and a scale composed of all of the other items. Q45 appears to have the weakest association with the remaining items, although its correlation is still strong, at 0.795. It is suggested that a correlation coefficient of less than .30 is a weak correlation, and that such an item should be removed and not used to form a composite score for the construct. Thus, we are inclined to keep Q45 in the measurement. Second, we look for inflation of the reliability alpha when a certain item is deleted. If Q45 in Inhibit were deleted because it has the lowest item-total correlation, the value of raw alpha would decrease from the current .967 to .965. This indicates that Q45 is measuring the same construct as the rest of the items in the scale, although it contributes the least to the score variance, and that removing it would make the construct slightly less reliable. Thus, we conclude once more that it is an item we should retain. Therefore, all of the items were retained in order to form a reliable measurement. In sum, we would lose more power by shortening our test than we would gain from a higher average correlation.

[...] outcome, or to say, exceeded our expectation. However, we have to remember that alpha is a poor index for multidimensional measurement. There are other scaling models available; many analysts, however, apply alpha to all scaling problems.

Multivariate Generalizability Analysis

Generally, a reliability coefficient can be obtained by replicating the design process multiple times or on multiple occasions. However, a true replication of the measurement would involve [...] randomly equivalent, but also with different teachers, items, and so on. Obtaining the reliability
coefficient by simply correl[...]. To overcome this complexity, we incorporate multivariate generalizability theory to estimate the internal consistency of the test. Its generalizability study results will be used in the D study to estimate how the reliability coefficients are affected by varying the number of teachers and items, with the purpose of determining the optimal number needed for an efficient decision (Kreiter, 2004).

BRIEF Data

We use the same BRIEF data as above; there are 85 teachers, and each teacher evaluated around 13 students. Usable survey forms totaling 1,089 of 1,318 forms (82.63%) were received. The teacher survey has 73 polytomously scored items split into eight content categories (Inhibit, Shift, Emotional Control, Initiate, Working Memory, Plan/Organize, Organization of Materials, and Monitor). Eight scores are produced, one for each subscale. The first three scales comprise the subtotal scale Behavioral Regulation Index (BRI), and the other five scales comprise the second subtotal scale, the Metacognition Index (MI). The BRI and MI subtotal scales are combined to form the total score, the Global Executive Composite (GEC).

Multivariate Generalizability Model

G study design

In the framework of multivariate generalizability theory, the BRIEF data follow a specific design. The data for this analysis consist of polytomous responses (1 = Never; 2 = Sometimes; 3 = Often) to a 73-item survey. The 1,089 students were nested within the 85 teachers and were crossed with the items. Furthermore, each survey consisted of eight content categories; in other words, the items were nested within the
eight subscales. As mentioned before, the numbers of items in the subscales are 10, 10, 9, 7, 10, 10, 7, and 10. The conditions within the facets person/student (p), teacher (t), and item (i), but not the facet subscale (s), are treated as random samples from all possible conditions. These three facets are therefore random facets that can be generalized to similar situations, while subscale is a fixed facet, as required under multivariate generalizability theory. We denote persons/students by p, teachers by t, items by i, and subscales by s. The BRIEF data follow a multivariate generalizability model that is equivalent to the (p:t) × (i:s) design in the univariate generalizability model, with the following observed sample sizes: n_p = 1089, n_t = 85, n_i = 10, 10, 9, 7, 10, 10, 7, 10, and n_s = 8. For the fixed subscale facet, there is a univariate (p:t) × i design associated with each of the eight subscales. Because teachers (t) are the object of measurement, a covariance component and a disattenuated correlation for teacher (t) between the eight subscales can be estimated.

Univariate generalizability theory:
Linear Model: (3-1)
Variance Components: (3-2)

Multivariate generalizability theory:
Linear Model: (3-3)
Variance Components: (3-4)
Covariance Components: (3-5)
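For orientation, a (p:t) × i design with the five effects t, p:t, i, ti, and pi:t is conventionally written as follows in the notation of Brennan (2001). This is a sketch under that assumption, with the subscale index v and the symbols chosen here for illustration; it is not a transcription of equations (3-1) to (3-5).

```latex
% Univariate (p:t) x i design for a single subscale
X_{pit} = \mu + \nu_t + \nu_{p:t} + \nu_i + \nu_{ti} + \nu_{pi:t}

\sigma^2(X_{pit}) = \sigma^2(t) + \sigma^2(p{:}t) + \sigma^2(i)
                  + \sigma^2(ti) + \sigma^2(pi{:}t)

% Multivariate version: one such model per fixed subscale v = 1, \dots, 8
X_{pitv} = \mu_v + \nu_{tv} + \nu_{(p:t)v} + \nu_{iv} + \nu_{(ti)v} + \nu_{(pi:t)v}

\sigma_v^2(X_{pit}) = \sigma_v^2(t) + \sigma_v^2(p{:}t) + \sigma_v^2(i)
                    + \sigma_v^2(ti) + \sigma_v^2(pi{:}t)

% Covariance components exist only for effects shared across subscales
% (t and p:t); each subscale has its own items, so effects involving i
% carry no covariance component between subscales v and v'.
\sigma_{vv'}(t), \qquad \sigma_{vv'}(p{:}t), \qquad v \neq v'

% Disattenuated (universe score) correlation between subscales v and v'
\rho_{vv'}(t) = \frac{\sigma_{vv'}(t)}{\sqrt{\sigma_v^2(t)\,\sigma_{v'}^2(t)}}
```

Under this reading, teachers and students are common to all eight subscales and therefore have covariance components, while the item-related effects have variance components only.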
Software: MGT: mGENOVA

The software mGENOVA (Brennan, 2001) was used to conduct the multivariate generalizability analysis. To meet the requirement of mGENOVA, which only accepts .txt files, the data were reorganized before being saved from the .csv file into a .txt file. First, all the data were set to the same length: one digit for every response and six characters for the subscale values (in the form 000.00). Then all the data were sorted by teacher name in order to put the students evaluated by the same teacher together. Next, the items were arranged by subscale, so that the items are listed in the order of the subscales rather than by item number. Finally, the student ID and teacher names were moved to the rightmost columns of the file, and the first row of the file, which contained the variable names, was deleted. The data were then ready to be read into mGENOVA. Statistics including the means of the variables, mean squares, and mean products were printed; the mean squares were used to estimate variance components, and the mean products were used to estimate covariance components. The other results are presented as follows:

Table 3-7 Estimated G Study Variance and Covariance Components

Effect t (diagonal: variance components; below the diagonal: covariance components; above the diagonal: disattenuated correlations)

           inhibit  shift    emocntrl initiate workmem  plan     organize monitor
inhibit    0.07363  0.82852  0.91854  0.95049  0.92010  0.91520  0.89280  0.95403
shift      0.06271  0.07780  0.91974  0.85705  0.94719  0.86801  0.86059  0.88985
emocntrl   0.05970  0.06145  0.05738  0.86750  0.90345  0.83855  0.89014  0.90046
initiate   0.07362  0.06824  0.05932  0.08148  0.95995  0.99093  0.91242  0.97974
workmem    0.07054  0.07465  0.06115  0.07742  0.07983  0.96674  0.95152  0.96870
plan       0.07351  0.07167  0.05946  0.08373  0.08086  0.08763  0.93388  0.97380
organize   0.06541  0.06481  0.05757  0.07032  0.07259  0.07464  0.07290  0.93360
monitor    0.06838  0.06557  0.05698  0.07388  0.07230  0.07615  0.06659  0.06978

Effect p:t (diagonal: variance components; below the diagonal: covariance components)

           inhibit  shift    emocntrl initiate workmem  plan     organize monitor
inhibit    0.32267
shift      0.17785  0.17291
emocntrl   0.24117  0.19149  0.25606
initiate   0.21312  0.16330  0.17005  0.25138
workmem    0.20612  0.16006  0.16230  0.23831  0.24438
plan       0.19651  0.15027  0.15585  0.23000  0.22160  0.21975
organize   0.18946  0.13330  0.14707  0.17992  0.19065  0.18687  0.24381
monitor    0.26049  0.16694  0.20286  0.22420  0.21455  0.21194  0.18950  0.23687

Effects i, ti, and pi:t (variance components only)

Effect     inhibit  shift    emocntrl initiate workmem  plan     organize monitor
i          0.00561  0.00561  0.01054  0.00058  0.01827  0.00069  0.00620  0.01334
ti         0.01129  0.01232  0.00964  0.01570  0.01724  0.01431  0.00942  0.01927
pi:t       0.12420  0.13320  0.12563  0.16326  0.14440  0.17395  0.13817  0.18960

Variance and covariance matrix

The estimated variance components and covariance components for each subscale are reported in Table 3-7. The estimated variance components are the elements on the diagonal of each matrix, and the elements below the diagonal are the unbiased estimates of the covariances. The elements above the diagonal would otherwise repeat the same covariances; the disattenuated correlations are printed there instead.

The variance components of the eight subscales were compared to determine how much the variability of the effects differed. As seen from Table 3-7, the teacher (t) effect was quite small, with estimated variance components below 0.09 for all subscales; the smallest was 0.057 for subscale Emotional Control and the largest was 0.088 for subscale Plan, indicating that there was greater
variability among teachers for subscale Plan than for subscale Emotional Control, though the difference was not very pronounced. The item (i) effect and the teacher-by-item interaction (ti) effect have variance components below 0.02 for all subscales, the smallest among all the effects. This indicates that item difficulty and the teacher-by-item interaction do not contribute much to the total variance. Looking into the item (i) effect, subscale Working Memory (0.02) has relatively larger variability than subscale Initiate (0.0006); this suggests that we could place more items in the subscale Working Memory in our future D study to improve the test score reliability, although the item (i) effect accounts for little error variance. With the low variance components for both teacher (t) and item (i), it is understandable that the variability attributable to the teacher-by-item interaction (ti) is relatively small, which indicates that the [...].

The effect of students nested in teachers (p:t) was the largest source of variability. The variance of students nested in teachers (p:t) is the sum of the student (p) variance and the student-by-teacher interaction (pt) variance. Given the known low variance component for the teacher (t) effect, we can conclude that there are substantial differences in students' universe scores, and this facet could therefore be considered a priority facet to be manipulated in the D study.

The residual (pi:t) has a medium variance component, which is expected: because both item (i) and teacher (t) have relatively small variances, we have assumed that person/student (p) has a relatively large variance, and there is not much error left unexplained by the model.

The covariance and correlation matrices provide more information on the performance of the different subscales. The covariances among the eight subscales under the teacher (t) effect are all around 0.06 to 0.08, suggesting that the teachers' evaluations of the students were consistent within each subscale. Given
that the variances for the effect of students nested in teachers (p:t) were larger than those for the teacher (t) effect, its covariances are higher as well (0.13 to 0.26). The higher covariances for the effect of students nested in teachers (p:t) show that it may account for greater weight in [...].

Disattenuated correlation

The observed disattenuated correlations are provided only for the first effect (t), since the first effect is treated as the object of measurement facet. The correlations of the universe scores between subscales were examined to determine whether or not the items within the different subscales represented different skills. All these universe score correlations were high, between 0.82 and 0.99, suggesting that the eight subscales in this study could be considered as well [...].

D study design

D coefficients and SEMs for subscale variables

D studies demonstrate the impact of different measurement conditions related to the teacher, student, and item facets on the absolute and relative coefficients and the standard errors of measurement. The sample size for each facet, and the harmonic means of the [...], which are also called the divisors, were calculated to compute the variance and covariance components. The variance and covariance components in the D study are therefore the G study variance and covariance components divided by the corresponding divisors. The D study variance-covariance matrix for the first effect, the teacher effect (t), is the variance-covariance matrix for universe scores. All effects, except the teacher effect (t), that interact with the universe score effect (t) contribute to relative error (δ); and all
effects, except the universe score effect (t), contribute to absolute error (Δ). Every effect contributes to the error for the mean. Therefore, the generalizability coefficient for the design with the subscale facet fixed becomes:

$$E\rho^2 = \frac{\sigma^2(t)}{\sigma^2(t) + \sigma^2(\delta)} \qquad (3\text{-}6)$$

And the dependability index phi for the design with the subscale facet fixed becomes:

$$\Phi = \frac{\sigma^2(t)}{\sigma^2(t) + \sigma^2(\Delta)} \qquad (3\text{-}7)$$

Table 3-8 D Study Results for Individual Variables

                  Inhibit  Shift    Emocntrl Initiate Workmem  Plan     Organize Monitor
Univ Score Var    0.07363  0.07780  0.05738  0.08148  0.07983  0.08763  0.07290  0.06978
Rel Error Var     0.03427  0.01965  0.02778  0.02941  0.02732  0.02489  0.02741  0.02723
Abs Error Var     0.03483  0.02021  0.02895  0.02950  0.02915  0.02495  0.02830  0.02856
Err Var for Mean  0.00183  0.00171  0.00217  0.00139  0.00309  0.00139  0.00207  0.00247
Univ Score SD     0.27134  0.27893  0.23955  0.28545  0.28255  0.29602  0.27000  0.26416
Rel Error SD      0.18513  0.14018  0.16667  0.17150  0.16530  0.15775  0.16557  0.16502
Abs Error SD      0.18664  0.14217  0.17014  0.17174  0.17073  0.15797  0.16822  0.16901
Err SD for Mean   0.04278  0.04132  0.04661  0.03725  0.05557  0.03732  0.04545  0.04975
Gen Coefficient   0.68237  0.79836  0.67382  0.73476  0.74502  0.77882  0.72673  0.71930
Phi               0.67884  0.79379  0.66468  0.73421  0.73253  0.77834  0.72037  0.70955

Table 3-8 presents, for each of the fixed multivariate variables (subscales), the universe score variance, the relative and absolute error variances, the standard errors of measurement, and the generalizability and phi coefficients. The largest universe score variance was for subscale Plan (0.088), and the smallest was for subscale Emotional Control (0.057). However, because the error variances differ across subscales, the highest generalizability coefficient is for subscale Shift (0.80), while the lowest remains that of subscale Emotional Control (0.67). Similarly, the highest phi coefficient is for the subscale Shift (0.79), and the lowest phi
coefficient was for the subscale Emotional Control (0.66). All the reliability coefficients are at least 0.66, suggesting that the measurement precision for all eight subscales was medium-high, especially for subscales Shift and Plan, while the measurement precision for subscales Inhibit (0.68) and Emotional Control (0.67) still needs improvement.

D coefficients and SEMs for composite variables

Table 3-9 D Study Results for Composite Variables

Composite Universe Score Variance                 0.06956
Composite Relative Error Variance                 0.02009
Composite Absolute Error Variance                 0.02020
Composite Error Variance for Mean                 0.00116
Composite Universe Score Standard Deviation       0.26374
Composite Relative Error Standard Deviation       0.14174
Composite Absolute Error Standard Deviation       0.14212
Composite Error Standard Deviation for Mean       0.03410
Composite Generalizability Coefficient            0.77589
Composite Phi                                     0.77496
Composite S/N Rel                                 3.46215
Composite S/N Abs                                 3.44358

The composite universe score is based on applying the default w weights (a priori weights) to the universe scores for each of the eight subscales. In all the configurations in mGENOVA, the default w weights were used (0.14, 0.14, 0.12, 0.10, 0.14, 0.14, 0.10, 0.14), which are proportional to the item sample sizes (10, 10, 9, 7, 10, 10, 7, 10). Under the w weights, the eight subscales contributed 13.48%, 13.51%, 10.49%, 10.13%, 14.52%, 14.95%, 9.37%, and 13.54%, respectively, to the composite universe score variance. Since the BRIEF is a criterion-referenced test, the phi coefficient is the more appropriate measure of reliability in this case. As seen from Table 3-9, the composite phi coefficient was medium-high at 0.775, and the composite absolute error standard error of
measurement was 0.142, suggesting that the composite measurement reliability is acceptable but still has room for improvement.

Eight configurations

In Configuration A, the same sample sizes as in the G study were used (the default), and we obtain an acceptable composite phi coefficient of 0.775; it serves as the baseline for the later design comparisons. The result of Configuration A is presented in Appendix C. Starting from Configuration B, the sample sizes of certain facets were varied in order to observe their effect on the G and phi coefficients, along with the associated SEMs. Table 3-10 provides the absolute SEMs and phi coefficients for the eight D study configurations.

Table 3-10 D Study Composite G Coefficient Change on the Variation of Facet Sample Size

Configuration  Items per subscale          P:t Divisor  Weight  Absolute SEMs  Composite G Estimates
G Study        10 10  9  7 10 10  7 10     10.111
D Study A      10 10  9  7 10 10  7 10     10.111       w       0.14212        0.77496
D Study B       1  1  1  1  1  1  1  1     10.111       w       0.08666        0.90246
D Study C       5  5  5  5  5  5  5  5     10.111       w       0.06253        0.94672
D Study D      10 10 10 10 10 10 10 10     10.111       w       0.05882        0.95256
D Study E      20 20 20 20 20 20 20 20     10.111       w       0.05688        0.95551
D Study F       8  8 11  7 12  7  8 12     10.111       w       0.06068        0.94915
D Study G      10 10  9  7 10 10  7 10     15           w       0.04994        0.96539
D Study H      10 10  9  7 10 10  7 10     20           w       0.04410        0.97281

In Configuration B, the sample sizes for teacher (t) and student nested in teacher (p:t) remained the same, while the sample size for item (i) was set to 1 for each of the eight subscales. The composite phi coefficient increased from 0.775 to 0.902, and the SEM dropped from 0.142 to 0.087. Apparently, changing the number of items (i) in each subscale from unbalanced to balanced resulted in better score reliability, even though the number of items was decreased from roughly ten items to only one item per subscale. It reinforced [...] BRIEF. We could infer from this run of data that the sample sizes of teacher (t) and student
nested in teacher (p:t) were so large that even one item per subscale would give a satisfactory composite generalizability coefficient. The output of Configuration B is presented in the Appendix.

In Configuration C, the sample sizes for teacher (t) and student nested in teacher (p:t) remained the same, while the sample size for item (i) was increased to five for each of the eight subscales. The composite phi coefficient increased by 0.172, from 0.775 to 0.947, compared with Configuration A, and the SEM dropped to 0.063. Compared with Configuration B, the composite phi coefficient increased by 0.045 (from 0.902 to 0.947). Therefore, we could infer from these three runs of data that balanced multivariate designs yield quite high composite phi coefficients, and that increasing the number of items (i) from one to five for each subscale gives a nice boost to the phi coefficient, although not as remarkable as the boost in the previous configuration. The detailed result of Configuration C is presented in the Appendix.

In Configuration D, the sample sizes for teacher (t) and student nested in teacher (p:t) again remained the same, and the sample size for item (i) was increased to ten for each of the eight subscales. The composite phi coefficient increased by merely 0.006 (from 0.947 to 0.953) compared with Configuration C, and the SEM dropped by only 0.004, from 0.063 to 0.059. Apparently, increasing the number of items (i) from five to ten for each subscale has little effect on the phi coefficient, yet the test length is doubled. The detailed result of Configuration D is presented in the Appendix.

The same situation applied to Configuration E: the sample sizes for teacher (t) and student nested in teacher (p:t) remained the same, while the sample size for item (i) was increased to twenty for each of the eight subscales. The composite phi coefficient increased by only 0.003 (from 0.953 to 0.956) compared with Configuration D, and the SEM dropped to
0.057. Although the composite phi coefficient reached a satisfactory level compared with the first few configurations, the gain was not sufficient, especially considering that the design included such a large number of items (160 items in total). Considering the practical issues of test length and time, we would not recommend this configuration for test construction. The detailed result of Configuration E is presented in the Appendix.

In Configuration F, the sample sizes for teacher (t) and student nested in teacher (p:t) remained the same; however, the sample sizes for item (i) were altered according to the magnitude of the item (i) variance components of the eight subscales, while keeping the current test length by reallocating the items (8, 8, 11, 7, 12, 7, 8, 12) across the subscales. The number of items in subscale Working Memory was increased from 10 to 12, as its item variance was the largest among the eight subscales; the number of items in subscale Plan was decreased to 7 and that in subscale Initiate was kept at 7, since little variability among the items was detected in these subscales. Though the composite phi coefficient in this configuration decreased by about 0.006 (from 0.956 to 0.949) compared with Configuration E, and the SEM increased by 0.004, from 0.057 to 0.061, the test length shrank from 160 items back to 73 items. Among the balanced designs, the Configuration F composite phi coefficient (0.949) fell between those of Configuration C (0.947) and Configuration D (0.953). In this case, we would leave it to the BRIEF experts to weigh test length against score reliability, as well as the content specifications.

In Configuration G, the sample sizes for teacher (t) and item (i) remained the same as in the G study, while the sample size for students nested in teacher (p:t) was increased to fifteen from an average of 10.111 for each teacher.
The composite phi coefficient increased from 0.775 to 0.965 compared with the default D study, and the SEM dropped from 0.142 to 0.050. Apparently, increasing the number of students nested in teacher (p:t) would also result in
an increase in the composite phi coefficient, which actually had a more substantial effect than changing the unbalanced item numbers to balanced ones (0.902), and was even a little higher than increasing the balanced item number from one to twenty (0.956) for each subscale. It is apparently practical to reduce the error by modifying the facets that are responsible for the majority of the error variance in the design. The Configuration G output is presented in the Appendix.

In Configuration H, the sample sizes for teacher (t) and item (i) remained the same, while the sample size for students nested in teacher (p:t) was increased to twenty for every teacher. The composite phi coefficient increased to 0.973, and the SEM dropped to 0.044. Thus we could infer that increasing the number of students nested in teacher (p:t) results in an increase in the composite phi coefficient; however, going from fifteen students to twenty for each teacher increased the phi coefficient by only 0.007. The result of Configuration H is presented in the Appendix.

According to the changes in the composite phi coefficients and SEMs after varying the sample sizes in different facets, we could conclude that when we increased the D study sample size on the item (i) facet, on the students nested in teacher (p:t) facet, or on both, the composite phi coefficient would increase from an acceptable level to a satisfactory level; however, the rate of increase slowed down as we further increased the sample size on either facet.
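The D study arithmetic for a single subscale, and the a priori weighting behind the composite universe score variance, can be sketched as follows. The sketch assumes the usual (p:t) × i formulas with teachers as the object of measurement: relative error is σ²(p:t)/n_p + σ²(ti)/n_i + σ²(pi:t)/(n_p n_i), and absolute error adds σ²(i)/n_i. The function and variable names are hypothetical; the numbers are copied from Table 3-7, and the divisor 10.111 is the P:t divisor in Table 3-10. It is meant to mirror the subscale-level results in Table 3-8, not to reproduce the mGENOVA configurations in Table 3-10.

```python
# G study estimates for subscale Inhibit, copied from Table 3-7.
INHIBIT = {"t": 0.07363, "p:t": 0.32267, "i": 0.00561,
           "ti": 0.01129, "pi:t": 0.12420}


def d_study(vc, n_p, n_i):
    """D study for one subscale of a (p:t) x i design, teachers as the
    object of measurement. n_p: students-per-teacher divisor; n_i: items."""
    rel_error = vc["p:t"] / n_p + vc["ti"] / n_i + vc["pi:t"] / (n_p * n_i)
    abs_error = rel_error + vc["i"] / n_i
    universe = vc["t"]
    return {
        "rel_error_var": rel_error,
        "abs_error_var": abs_error,
        "gen_coefficient": universe / (universe + rel_error),
        "phi": universe / (universe + abs_error),
        "abs_sem": abs_error ** 0.5,
    }


# Lower triangle (with diagonal) of the teacher (t) matrix in Table 3-7.
T_LOWER = [
    [0.07363],
    [0.06271, 0.07780],
    [0.05970, 0.06145, 0.05738],
    [0.07362, 0.06824, 0.05932, 0.08148],
    [0.07054, 0.07465, 0.06115, 0.07742, 0.07983],
    [0.07351, 0.07167, 0.05946, 0.08373, 0.08086, 0.08763],
    [0.06541, 0.06481, 0.05757, 0.07032, 0.07259, 0.07464, 0.07290],
    [0.06838, 0.06557, 0.05698, 0.07388, 0.07230, 0.07615, 0.06659, 0.06978],
]
ITEMS_PER_SUBSCALE = [10, 10, 9, 7, 10, 10, 7, 10]


def composite_universe_variance(lower, n_items):
    """Sum of w_v * w_v' * sigma_vv'(t), weights proportional to item counts."""
    total = sum(n_items)
    w = [n / total for n in n_items]
    k = len(w)
    value = 0.0
    for a in range(k):
        for b in range(k):
            cov = lower[a][b] if b <= a else lower[b][a]
            value += w[a] * w[b] * cov
    return value


baseline = d_study(INHIBIT, n_p=10.111, n_i=10)
composite = composite_universe_variance(T_LOWER, ITEMS_PER_SUBSCALE)
print({name: round(value, 5) for name, value in baseline.items()})
print(round(composite, 5))
```

With the baseline sample sizes, the Inhibit results agree with the corresponding column of Table 3-8 to about four decimal places, and the weighted sum agrees with the composite universe score variance in Table 3-9. Calling `d_study` with other values of `n_p` or `n_i` shows how each facet enters the error terms for one subscale.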
CHAPTER 4
RESULTS AND CONCLUSIONS

1) According to Table 3-3, our output has an overall raw alpha of .991, which is very high considering that .70 is the usual cutoff value for acceptability. In fact, very high reliabilities (0.95 or higher) can be a problem because they may indicate that the items are redundant. Looking back at the survey, we found that Q38 (Does not think before doing), Q43 (Is impulsive), and Q69 (Does not think of consequences before acting) ask similar questions; we suggest that such items be revised. The raw alphas for the eight scales are .967 for Inhibit, .944 for Shift, .954 for Emotional Control, .932 for Initiate, .946 for Working Memory, .941 for Plan/Organize, .936 for Organization of Materials, and .935 for Monitor. All the alphas indicate strong internal consistency among the items under their respective constructs. It is a good sign that our instrument is reliable; however, we need to remember that alpha can be overestimated in multidimensional cases.

2) The Relationship Between the Eight Subscales in BRIEF

In the Cronbach's alpha analysis, the correlations among the eight subscales in Table 3-4 are all relatively strong, ranging from .623 to .917. The same holds in the multivariate generalizability analysis in Table 3-6, where all the universe score correlations were between 0.82 and 0.99. However, very high positive correlations are not desirable in a study because they imply that the concepts measured by the subscales are not distinct. The eight subscales in our instrument are therefore dependent on one another, meaning that the concepts being measured overlap. This agrees with the cutoff criterion above, under which a very high Cronbach's alpha indicates redundant items.
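As a rough check on the subscale alphas reported in item 1), the sketch below computes the standardized alpha from the number of items k and the mean inter-item correlation given in Appendix B, using alpha = k * r / (1 + (k - 1) * r). The function name is illustrative. Standardized alpha is not identical to the raw alpha reported by the reliability procedure, so the values should be close to, but need not equal, the reported figures.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha from k items and mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# (number of items, mean inter-item correlation) taken from Appendix B.
SUBSCALES = {
    "Shift": (10, 0.633),
    "Emotional Control": (9, 0.700),
    "Initiate": (7, 0.661),
    "Working Memory": (10, 0.641),
    "Plan": (10, 0.616),
    "Organize": (7, 0.679),
    "Monitor": (10, 0.593),
}

if __name__ == "__main__":
    for name, (k, mean_r) in SUBSCALES.items():
        print(f"{name:18s} k={k:2d}  standardized alpha={standardized_alpha(k, mean_r):.3f}")
```

The formula also shows why alpha climbs so easily here: with mean inter-item correlations above .59, even seven items are enough to push alpha above .93.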
3) Multivariate Generalizability and Dependability Coefficients for Each Subscale and the Composite in BRIEF, as Well as the Standard Error of Measurement

The highest generalizability coefficient is for the subscale Shift (0.80) and the lowest is for the subscale Emotional Control (0.67). In between, the generalizability coefficient is 0.68 for Inhibit, 0.73 for Initiate, 0.75 for Working Memory, 0.78 for Plan, 0.73 for Organize, and 0.72 for Monitor. Likewise, the highest phi coefficient is for the subscale Shift (0.79) and the lowest is for the subscale Emotional Control (0.66). In between, the phi coefficient is 0.68 for Inhibit, 0.73 for Initiate, 0.73 for Working Memory, 0.78 for Plan, 0.72 for Organize, and 0.71 for Monitor. All the reliability coefficients are at least 0.66, suggesting that measurement precision for all eight subscales was medium-high. The composite generalizability coefficient for all eight subscales was 0.776 and the composite phi coefficient was 0.775; both are medium-high and above the acceptable level. Table 3-10 provides the phi coefficients and the respective absolute SEMs for the various Configurations. Although reliability coefficients are the primary interest of our study, the standard error of measurement can be a more useful metric, because a higher generalizability coefficient does not guarantee a smaller error component. The differences in SEM were not dramatic in our case, but small differences were observed from Configuration A (0.14) to Configuration I (0.04).

4) Referring back to questions 1) and 3), we can see that Cronbach's alpha for the eight subscales is 0.932 or above, with an overall alpha of 0.991, while the phi coefficients for the eight subscales are medium-high, from about 0.66 to 0.79, with a composite phi coefficient of 0.77. One reason the two models give different results could be that the Cronbach's alpha model puts all the errors into one single error score, while the multivariate generalizability model not only separates and accounts for multiple sources of error but also allows more than one universe score to be associated with the object of measurement. Cronbach's alpha reliability is highly overestimated in our case and, as mentioned earlier, it should be used with caution. However, the major cause of the difference is that the object of measurement was changed from students to teachers, so different error variances were included in the estimation of the coefficients.

As to the internal structure of the test, multivariate generalizability theory contributed further valuable information. The disattenuated correlations between the universe scores of the eight subscales were between 0.82 and 0.99, indicating a high degree of commonality among the scores on the eight subscales. Even though the subscales are highly correlated, the variance-covariance matrix provided additional input. The variance components of the eight subscales were compared to determine how the variability of teachers/raters and items differed within each of the subscales. The Chapter 3 table of variance components showed that scores on some subscales were more variable than on others within an effect, and that the subscales showed different degrees of score variability across effects. For instance, the effect of students nested in teacher (p:t) contributed the most to the variability of the students' performance, and within the (p:t) effect the subscale Inhibit (0.32) accounted for more variance than the subscale Shift (0.17). With this information, we could manipulate the number of students evaluated by each teacher in the D studies to see the impact on reliability of reducing the largest error variance. Furthermore, although the item (i) effect accounts for little score variance, the subscale Working Memory (0.02) accounted for more error variance than the subscale Initiate (0.0006), indicating that additional items are better allotted to the subscale Working Memory than to the subscale Initiate to improve the content accuracy of BRIEF. Multivariate generalizability theory also avoids the complexity of unbalanced designs and allows us to investigate and decrease the measurement error at both the subscale and composite levels.

Beyond assessing the reliability of the scores more accurately, the multivariate approach has benefits over Cronbach's alpha from a test development perspective, as was shown in the D studies, which allow the conditions of any facet or combination of facets to be varied to find the design with optimum precision. With the eight D study Configurations in the software mGENOVA, we found that the default w weights, which are proportional to the item sample sizes, were preferable to the a weights, under which each subscale gets the same weight. Generalizability and phi coefficients did not converge under the a weights but did under the w weights.

The D study allows us to use the information provided by the G study to visualize the effects of modifications on reliability using simulated data. We found that the composite phi coefficient increased substantially when we changed the number of items (i) in each subscale from unbalanced (0.78) to balanced (0.90), even though the number of items was reduced from an average of ten to only one per subscale, on the condition that the sample sizes of teacher (t) and student nested in teacher (p:t) remain as large as ten.
If the sample sizes for teacher (t) and item (i) are fixed while the sample size for students nested in teacher (p:t) is increased from an average of ten to twenty per teacher, the composite phi coefficient increases from the acceptable level (0.78) to a satisfactory level (0.97). This is larger than the value obtained by changing the item numbers from unbalanced to balanced (0.90) and slightly higher than the value obtained by increasing the balanced item number from one to twenty (0.96). Therefore, if the budget allows, it is preferable to increase the number of students nested in teachers (p:t) rather than the number of items, since the students nested in teachers (p:t) effect accounts for the largest variance in the design. Further increasing the sample size on either facet would yield a higher composite phi coefficient, but the gain would be small.
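The composite coefficients and w weights discussed in this chapter can be recovered from the Configuration A output in Appendix C. The sketch below assumes only the standard definition of a coefficient as universe score variance divided by universe score variance plus error variance, and w weights proportional to the number of items in each subscale. The function and variable names are illustrative.

```python
def coefficient(universe_sd, error_sd):
    """Universe score variance over universe score variance plus error variance."""
    universe_var = universe_sd ** 2
    return universe_var / (universe_var + error_sd ** 2)


# Composite standard deviations reported for Configuration A in Appendix C.
UNIVERSE_SD = 0.26374
RELATIVE_ERROR_SD = 0.14174
ABSOLUTE_ERROR_SD = 0.14212

composite_g = coefficient(UNIVERSE_SD, RELATIVE_ERROR_SD)    # generalizability coefficient
composite_phi = coefficient(UNIVERSE_SD, ABSOLUTE_ERROR_SD)  # phi (dependability) coefficient

# Items per subscale in the G study (EFFECT i card in Appendix C).
ITEMS = {
    "inhibit": 10, "shift": 10, "emocntrl": 9, "initiate": 7,
    "workmem": 10, "plan": 10, "organize": 7, "monitor": 10,
}
total_items = sum(ITEMS.values())
# w weights: each subscale weighted by its share of the items.
w_weights = {name: n / total_items for name, n in ITEMS.items()}

if __name__ == "__main__":
    print(f"Composite G   = {composite_g:.5f}")
    print(f"Composite phi = {composite_phi:.5f}")
    for name, w in w_weights.items():
        print(f"{name:9s} w = {w:.5f}")
```

Up to rounding of the printed standard deviations, these values agree with the composite generalizability coefficient (0.77589), composite phi (0.77496), and w weights (0.13699, 0.12329, 0.09589) listed in Appendix C.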
CHAPTER 5
DISCUSSION AND FUTURE RESEARCH

Cronbach's Alpha Model

Reliability tests are preliminary analyses that are especially important when derived variables are intended to be used in subsequent analyses. If the results show poor reliability, some items should be re-examined and modified or replaced as needed. Before coefficient alpha is computed, the correlations should be checked against the assumptions of the model. The reliability test also assumes unidimensionality. Both are strong assumptions that deserve attention before and after alpha is calculated. Exploratory factor analysis is another good method for detecting poor items so that only efficient items are kept. Cronbach's alpha if item deleted can flag such items, which can then be deleted to obtain an improved reliability coefficient. Because the BRIEF questionnaire has the teacher effect as the object of measurement, inter-rater reliability should also be considered. Much as these two methods contribute to the reliability of the test score, the multivariate generalizability model goes further in multidimensional cases.

Multivariate Generalizability Model

Although the multivariate generalizability analysis is a powerful tool for assessing the reliability of a questionnaire with multiple dependent subscales, some points still need to be addressed.

First, unbalanced data always create complexity when the variance components are decomposed. There are many ways to decompose the total sums of squares, and the generalizability model allows its mean squares to be adjusted in various ways for other effects.
Second, mixed models yield biased estimates because they contain both fixed effects and random effects. Generalizability theory handles the problem by averaging over the fixed facets in the mixed model and estimating only the variances of the random effects. Yet averaging over the levels of a fixed effect does not necessarily make conceptual sense. Brennan recommended applying multivariate generalizability theory to handle the fixed effects in mixed models.

On top of these two potential difficulties in our case, another issue is the complex computation required to derive the expected values of mean squares. Searle (1987) reviewed several alternative methods of estimating variance components that do not have the limitations of ANOVA methods: maximum likelihood (ML) and restricted maximum likelihood (REML), MINQUE (minimum norm quadratic unbiased estimation) and MIVQUE (minimum variance quadratic unbiased estimation), and the bootstrap and jackknife (Brennan, 2001). A comparison of the six methods is left for future studies. Other research topics of interest include negative variance component estimates, which arise because of sampling error or model misspecification (Shavelson & Webb, 1981), and the sampling variability of estimated variance and covariance components in unbalanced designs (Calkins, Erlich, Marston, & Malitz, 1978; Leone & Nelson, 1966; cf. Lindquist, 1953; Shavelson & Webb, in press; Smith, 1978; Woodward & Joe, 1973; Joe & Woodward, 1976; Webb & Shavelson, 1981).

Item Response Theory

Under the framework of classical test theory, the standard error of measurement is assumed to be the same for all students. Longer tests are favored because they have higher reliability and a smaller standard error than shorter tests. However, it is widely known that measurement precision is not uniform across the entire range of test scores. Under item response theory (IRT), it is possible to obtain the standard error of measurement for each individual, given ability. Standard errors depend on the relationship between the item properties (discrimination and difficulty) and the trait level of each respondent. For non-adaptive tests, longer tests are more reliable than shorter tests, but the standard errors are larger for extreme scores. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the reciprocal of the squared conditional standard error of the ability estimate at any given ability level. This offers a useful direction for exploring BRIEF under the framework of item response theory.

Questionnaire Brief Suggestions

The multivariate generalizability analysis provided eight additional configurations for test construction, with the purpose of maximizing score reliability. Before looking into this issue, we need to note that restructuring the test would change the interpretation of the total score, since the percentage of items within each subscale would change. Adding more items to a measure would increase score reliability, yet it is not always feasible to do so. Items are expensive to develop, deliver, and score, and there may be limits on the number of items and the length of testing time in the assessment. These practical concerns might prevent us from lengthening the test to reach a desired reliability. Besides examining the change in score reliability from increasing the total number of items, we can also investigate the effect of changing the distribution of items among the subscales while retaining the original test length. By reallocating the items based on the magnitude of the variance components, score reliability can be increased without increasing testing time.
Before the actual test construction, the D studies of multivariate generalizability theory allowed us to vary the conditions of the different effects, and thus enabled us to select the best design while avoiding waste in resource allocation.
No matter which configuration is chosen for test construction, it is necessary to make sure that the test is not restructured in a way that raises concerns about content balance; in other words, the content representation in each subscale should be consistent with the intended score interpretations.
APPENDIX A
BRIEF QUESTIONNAIRE

Brief Home
Pre   Post   Follow-up Year 1   Follow-up Year 2   Follow-up Year 3   Follow-up Year 4
User ID:
Teacher ID: (only for follow-up data)
Child ID:
Enter values: 1 = Never; 2 = Sometimes; 3 = Often; 8 = Missing; 9 = Double Entry

1. Overreacts to small problems
2. When given three things to do, remembers only the first or last
3. Is not a self-starter
4. Cannot get a disappointment, scolding, or insult off his/her mind
5. Resists or has trouble accepting a different way to solve a problem with schoolwork, friends, chores, etc.
6. Becomes upset with new situations
7. Has explosive, angry outbursts
8. Has a short attention span
9. Needs to be told "no" or "stop that"
10. Needs to be told to begin a task even when willing
11. Loses lunch box, lunch money, permission slips, homework, etc.
12. Does not bring home homework assignment sheets, materials, etc.
13. Acts upset by a change in plans
14. Is disturbed by change of teacher or class
15. Does not check work for mistakes
16. Cannot find clothes, glasses, shoes, toys, books, pencils, etc.
17. Has good ideas but cannot get them on paper
18. Has trouble concentrating on chores, schoolwork, etc.
19. Does not show creativity in solving a problem
20. Backpack is disorganized
21. Is easily distracted by noises, activity, sights, etc.
22. Makes careless errors
23. Forgets to hand in homework, even when complete
24. Resists change of routine, foods, places, etc.
25. Has trouble with chores or tasks that have more than one step
26. Has outbursts for little reason
27. Mood changes frequently
28. Needs help from adult to stay on task
29. Gets caught up in details and misses the big picture
30. Has trouble getting used to new situations (classes, groups, friends)
31. Forgets what he/she was doing
32. When sent to get something, forgets what he/she is supposed to get
33. Is unaware of how his/her behavior affects or bothers others
34. Has problems coming up with different ways of solving a problem
35. Has good ideas but does not get job done (lacks follow-through)
36. Leaves work incomplete
37. Becomes overwhelmed by large assignments
38. Does not think before doing
39. Has trouble finishing tasks (chores, homework)
40. Thinks too much about the same topic
41. Underestimates time needed to finish tasks
42. Interrupts others
43. Is impulsive
44. Does not notice when his/her behavior causes negative reactions
45. Gets out of seat at the wrong times
46. Is unaware of own behavior when in a group
47. Gets out of control more than friends
48. Reacts more strongly to situations than other children
49. Starts assignments or chores at the last minute
50. Has trouble getting started on homework or chores
51. Mood is easily influenced by the situation
52. Does not plan ahead for school assignments
53. Gets stuck on one topic or activity
54. Has poor understanding of own strengths and weaknesses
55. Talks or plays too loudly
56. Written work is poorly organized
57. Acts too wild or "out of control"
58. Has trouble putting the brakes on his/her actions
59. Gets in trouble if not supervised by an adult
60. Has trouble remembering things, even for a few minutes
61. Work is sloppy
62. After having a problem, will stay disappointed for a long time
63. Does not take initiative
64. Angry or tearful outbursts are intense but end suddenly
65. Does not realize that certain actions bother others
66. Small events trigger big reactions
67. Cannot find things in room or school desk
68. Leaves a trail of belongings wherever he/she goes
69. Does not think of consequences before acting
70. Has trouble thinking of a different way to solve a problem when stuck
71. Leaves messes that others have to clean up
72. Becomes upset too easily
73. Has a messy desk
74. Has trouble waiting for turn
75. Does not connect doing tonight's homework with grades
76. Tests poorly even when knows correct answers
77. Does not finish long-term projects
78. Has poor handwriting
79. Has to be closely supervised
80. Has trouble moving from one activity to another
81. Is fidgety
82. Cannot stay on the same topic when talking
83. Blurts things out
84. Says the same things over and over
85. Talks at the wrong time
86. Does not come prepared for classes
APPENDIX B
ALPHA ANALYSIS RESULTS

Scale: Shift

Inter-Item Correlation Matrix
          Q4     Q5     Q6    Q13    Q14    Q24    Q30    Q40    Q53    Q62
Q4     1.000   .680   .658   .632   .563   .563   .585   .551   .536   .720
Q5      .680  1.000   .730   .664   .592   .634   .667   .609   .618   .657
Q6      .658   .730  1.000   .760   .720   .701   .724   .577   .557   .668
Q13     .632   .664   .760  1.000   .799   .721   .729   .579   .554   .617
Q14     .563   .592   .720   .799  1.000   .717   .702   .550   .493   .557
Q24     .563   .634   .701   .721   .717  1.000   .734   .588   .583   .574
Q30     .585   .667   .724   .729   .702   .734  1.000   .634   .622   .600
Q40     .551   .609   .577   .579   .550   .588   .634  1.000   .660   .539
Q53     .536   .618   .557   .554   .493   .583   .622   .660  1.000   .562
Q62     .720   .657   .668   .617   .557   .574   .600   .539   .562  1.000

Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.420   1.308    1.582    .274    1.210            .007      10
Inter-Item Correlations  .633    .493     .799     .307    1.622            .005      10
Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q4    12.62              20.700             .744               .618               .940
Q5    12.72              20.636             .797               .658               .937
Q6    12.80              20.717             .830               .722               .935
Q13   12.84              21.130             .822               .743               .936
Q14   12.86              21.423             .765               .701               .938
Q24   12.89              21.511             .785               .659               .938
Q30   12.84              21.042             .813               .690               .936
Q40   12.79              21.396             .711               .551               .941
Q53   12.73              21.287             .696               .551               .942
Q62   12.72              20.815             .745               .611               .939

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
14.20   25.846    5.084           10

Scale: Emotional control

Inter-Item Correlation Matrix
          Q1     Q7    Q26    Q27    Q48    Q51    Q64    Q66    Q72
Q1     1.000   .668   .675   .681   .698   .670   .587   .726   .705
Q7      .668  1.000   .789   .726   .691   .664   .617   .723   .708
Q26     .675   .789  1.000   .752   .685   .666   .660   .733   .707
Q27     .681   .726   .752  1.000   .716   .747   .623   .747   .755
Q48     .698   .691   .685   .716  1.000   .750   .608   .763   .739
Q51     .670   .664   .666   .747   .750  1.000   .594   .749   .729
Q64     .587   .617   .660   .623   .608   .594  1.000   .680   .673
Q66     .726   .723   .733   .747   .763   .749   .680  1.000   .796
Q72     .705   .708   .707   .755   .739   .729   .673   .796  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.467   1.315    1.609    .294    1.223            .009      9
Inter-Item Correlations  .700    .587     .796     .209    1.356            .003      9

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q1    11.60              21.360             .785               .622               .951
Q7    11.80              21.720             .812               .695               .949
Q26   11.84              21.802             .824               .722               .949
Q27   11.76              21.500             .840               .719               .948
Q48   11.68              21.160             .825               .695               .948
Q51   11.63              21.292             .812               .685               .949
Q64   11.89              22.700             .725               .545               .953
Q66   11.72              21.137             .868               .759               .946
Q72   11.71              21.123             .850               .732               .947

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
13.20   27.098    5.206           9

Scale: Initiate

Inter-Item Correlation Matrix
          Q3    Q10    Q19    Q34    Q50    Q63    Q70
Q3     1.000   .728   .628   .619   .712   .705   .596
Q10     .728  1.000   .612   .634   .707   .649   .637
Q19     .628   .612  1.000   .721   .608   .633   .686
Q34     .619   .634   .721  1.000   .632   .624   .755
Q50     .712   .707   .608   .632  1.000   .696   .655
Q63     .705   .649   .633   .624   .696  1.000   .638
Q70     .596   .637   .686   .755   .655   .638  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.626   1.597    1.699    .102    1.064            .001      7
Inter-Item Correlations  .661    .596     .755     .159    1.266            .002      7

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q3    9.69               12.939             .787               .660               .920
Q10   9.78               13.151             .782               .636               .921
Q19   9.75               13.325             .762               .611               .923
Q34   9.78               13.330             .783               .668               .921
Q50   9.79               13.094             .791               .646               .920
Q63   9.75               13.168             .777               .618               .921
Q70   9.77               13.295             .779               .655               .921

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
11.38   17.745    4.213           7

Scale: Working memory

Inter-Item Correlation Matrix
          Q2     Q8    Q18    Q21    Q25    Q28    Q31    Q32    Q39    Q60
Q2     1.000   .663   .592   .581   .676   .562   .630   .605   .554   .632
Q8      .663  1.000   .749   .730   .676   .740   .616   .549   .669   .626
Q18     .592   .749  1.000   .737   .664   .752   .629   .538   .721   .623
Q21     .581   .730   .737  1.000   .638   .734   .615   .526   .664   .587
Q25     .676   .676   .664   .638  1.000   .648   .659   .615   .638   .687
Q28     .562   .740   .752   .734   .648  1.000   .610   .527   .723   .592
Q31     .630   .616   .629   .615   .659   .610  1.000   .754   .621   .699
Q32     .605   .549   .538   .526   .615   .527   .754  1.000   .534   .661
Q39     .554   .669   .721   .664   .638   .723   .621   .534  1.000   .594
Q60     .632   .626   .623   .587   .687   .592   .699   .661   .594  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.551   1.325    1.725    .400    1.302            .019      10
Inter-Item Correlations  .641    .526     .754     .228    1.432            .004      10

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q2    13.97              26.657             .732               .581               .942
Q8    13.86              25.524             .819               .703               .939
Q18   13.85              25.636             .819               .709               .939
Q21   13.79              25.554             .789               .661               .940
Q25   14.00              26.209             .795               .647               .940
Q28   13.85              25.597             .802               .697               .939
Q31   14.10              26.920             .779               .688               .941
Q32   14.19              27.773             .701               .618               .944
Q39   13.90              25.835             .774               .630               .941
Q60   14.11              26.958             .761               .622               .941

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
15.51   32.235    5.678           10

Scale: Plan

Inter-Item Correlation Matrix
         Q12    Q17    Q23    Q29    Q35    Q37    Q41    Q49    Q52    Q56
Q12    1.000   .488   .679   .497   .564   .569   .589   .648   .681   .606
Q17     .488  1.000   .517   .580   .670   .592   .602   .552   .583   .548
Q23     .679   .517  1.000   .525   .576   .589   .619   .613   .639   .593
Q29     .497   .580   .525  1.000   .649   .634   .596   .585   .594   .505
Q35     .564   .670   .576   .649  1.000   .683   .665   .673   .694   .603
Q37     .569   .592   .589   .634   .683  1.000   .716   .668   .670   .589
Q41     .589   .602   .619   .596   .665   .716  1.000   .706   .701   .605
Q49     .648   .552   .613   .585   .673   .668   .706  1.000   .785   .632
Q52     .681   .583   .639   .594   .694   .670   .701   .785  1.000   .666
Q56     .606   .548   .593   .505   .603   .589   .605   .632   .666  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.586   1.545    1.639    .094    1.061            .001      10
Inter-Item Correlations  .616    .488     .785     .297    1.609            .004      10

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q12   14.22              26.420             .727               .587               .937
Q17   14.28              27.217             .696               .526               .938
Q23   14.29              26.517             .731               .570               .937
Q29   14.30              27.035             .701               .524               .938
Q35   14.31              26.512             .793               .657               .934
Q37   14.22              25.979             .783               .637               .934
Q41   14.27              25.992             .797               .655               .933
Q49   14.29              26.135             .809               .694               .933
Q52   14.26              25.805             .831               .723               .932
Q56   14.26              26.204             .730               .545               .937

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
15.86   32.359    5.688           10

Scale: Organize

Inter-Item Correlation Matrix
         Q11    Q16    Q20    Q67    Q68    Q71    Q73
Q11    1.000   .737   .642   .643   .610   .610   .601
Q16     .737  1.000   .666   .669   .648   .637   .595
Q20     .642   .666  1.000   .709   .671   .666   .725
Q67     .643   .669   .709  1.000   .728   .712   .728
Q68     .610   .648   .671   .728  1.000   .826   .715
Q71     .610   .637   .666   .712   .826  1.000   .722
Q73     .601   .595   .725   .728   .715   .722  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.484   1.399    1.583    .184    1.131            .005      7
Inter-Item Correlations  .679    .595     .826     .231    1.389            .003      7

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q11   8.94               12.677             .740               .605               .931
Q16   8.93               12.639             .764               .640               .929
Q20   8.83               12.012             .796               .646               .926
Q67   8.86               12.034             .820               .676               .924
Q68   8.99               12.220             .821               .738               .924
Q71   8.98               12.331             .816               .732               .924
Q73   8.81               11.768             .798               .668               .926

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
10.39   16.485    4.060           7

Scale: Monitor

Inter-Item Correlation Matrix
         Q15    Q22    Q33    Q36    Q44    Q46    Q54    Q55    Q61    Q65
Q15    1.000   .743   .521   .660   .502   .496   .590   .450   .546   .513
Q22     .743  1.000   .548   .651   .509   .524   .627   .470   .559   .535
Q33     .521   .548  1.000   .558   .783   .768   .590   .642   .537   .798
Q36     .660   .651   .558  1.000   .517   .538   .603   .469   .600   .556
Q44     .502   .509   .783   .517  1.000   .813   .572   .659   .504   .814
Q46     .496   .524   .768   .538   .813  1.000   .585   .705   .521   .789
Q54     .590   .627   .590   .603   .572   .585  1.000   .502   .537   .607
Q55     .450   .470   .642   .469   .659   .705   .502  1.000   .471   .647
Q61     .546   .559   .537   .600   .504   .521   .537   .471  1.000   .548
Q65     .513   .535   .798   .556   .814   .789   .607   .647   .548  1.000
Summary Item Statistics
                         Mean    Minimum  Maximum  Range   Maximum/Minimum  Variance  N of Items
Item Means               1.627   1.538    1.851    .314    1.204            .013      10
Inter-Item Correlations  .593    .450     .814     .364    1.809            .010      10

Item-Total Statistics
Item  Scale Mean if      Scale Variance if  Corrected Item-    Squared Multiple   Cronbach's Alpha
      Item Deleted       Item Deleted       Total Correlation  Correlation        if Item Deleted
Q15   14.42              26.992             .695               .620               .931
Q22   14.46              27.106             .719               .636               .930
Q33   14.71              26.553             .804               .722               .926
Q36   14.65              27.050             .716               .576               .930
Q44   14.67              26.574             .792               .759               .926
Q46   14.73              26.713             .803               .752               .926
Q54   14.62              27.037             .724               .542               .929
Q55   14.73              27.200             .692               .541               .931
Q61   14.71              27.309             .665               .468               .932
Q65   14.73              26.648             .814               .759               .925

Scale Statistics
Mean    Variance  Std. Deviation  N of Items
16.27   32.995    5.744           10
APPENDIX C
MULTIVARIATE GENERALIZABILITY D STUDY CONFIGURATION RESULTS

Multivariate Generalizability D Study Configuration A

COLUMNS
         1111111111222222222233333333334444444444555555555566666666667777777777
12345678901234567890123456789012345678901234567890123456789012345678901234567890

GSTUDY    (p:t) x i Design A with Covariance Components Design = p:t
COMMENT   Variance components design: (p:t) i
COMMENT   Covariance components design: p:t
COMMENT   Univariate counterpart: (p:t) (i:s)
COMMENT   Example: Brief Data consists of 73 items in 8 content Subscales. The
COMMENT   survey is evaluated by 85 Teachers to 1089 students. This is the class
COMMENT   means version of the table of specifications model with fixed content
COMMENT   categories.
COMMENT   SUM: 8 Subscales, 85 teachers, 1089 students, 73 items
OPTIONS   NREC 5 "*.out" VALID 1 3 DEFAULT_DSTUDY
MULT      8 inhibit shift emocntrl initiate workmem plan organize monitor
EFFECT    # t 85 85 85 85 85 85 85 85
EFFECT    p:t
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
          2 13 13 16 17 4 10 13 15 7 17 13 18 12 16 9 16
          4 8 20 10 17 9 17 14 14 7 13 7 18 13 12 13 14
          13 9 16 9 15 13 16 12 14 13 15 16 12 25 7 18 14
          2 18 17 14 13 17 13 11 5 16 18 12 13 19 14 9 17
          3 12 18 15 17 8 10 14 18 5 10 11 10 11 16 8 17
EFFECT    i 10 10 9 7 10 10 7 10
FORMAT    0 0
PROCESS   "Brief MGT data.txt"
D STUDY RESULTS FOR INDIVIDUAL VARIABLES
                     inhibit     shift  emocntrl  initiate   workmem      plan  organize   monitor
Univ Score SD        0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD         0.18513   0.14018   0.16667   0.17150   0.16530   0.15775   0.16557   0.16502
Abs Error SD         0.18664   0.14217   0.17014   0.17174   0.17073   0.15797   0.16822   0.16901
Gen Coefficient      0.68237   0.79836   0.67382   0.73476   0.74502   0.77882   0.72673   0.71930
Phi                  0.67884   0.79379   0.66468   0.73421   0.73253   0.77834   0.72037   0.70955

D STUDY RESULTS FOR COMPOSITE
Variable Wts         inhibit     shift  emocntrl  initiate   workmem      plan  organize   monitor
w weights            0.13699   0.13699   0.12329   0.09589   0.13699   0.13699   0.09589   0.13699

Composite Universe Score Standard Deviation    0.26374
Composite Relative Error Standard Deviation    0.14174
Composite Absolute Error Standard Deviation    0.14212
Composite Error Standard Deviation for Mean    0.03410
Composite Generalizability Coefficient         0.77589
Composite Phi                                  0.77496

Multivariate Generalizability D Study Configuration B

D STUDY RESULTS FOR INDIVIDUAL VARIABLES
                     inhibit     shift  emocntrl  initiate   workmem      plan  organize   monitor
Univ Score SD        0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD         0.18513   0.14018   0.16667   0.17150   0.16530   0.15775   0.16557   0.16502
Abs Error SD         0.18664   0.14217   0.17014   0.17174   0.17073   0.15797   0.16822   0.16901
Gen Coefficient      0.68237   0.79836   0.67382   0.73476   0.74502   0.77882   0.72673   0.71930
Phi                  0.67884   0.79379   0.66468   0.73421   0.73253   0.77834   0.72037   0.70955

D STUDY RESULTS FOR COMPOSITE
Variable Wts         inhibit     shift  emocntrl  initiate   workmem      plan  organize   monitor
w weights            0.13699   0.13699   0.12329   0.09589   0.13699   0.13699   0.09589   0.13699
a weights            0.12500   0.12500   0.12500   0.12500   0.12500   0.12500   0.12500   0.12500

Composite Universe Score Variance       0.06956
Composite Relative Mean Square Error    0.00341
Composite Absolute Mean Square Error    0.00352
Multivariate Generalizability D Study Configuration C

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.23557   0.20638   0.21769   0.23815   0.23599   0.23075   0.21725   0.24789
Abs Error SD      0.24719   0.21954   0.24068   0.23936   0.27196   0.23225   0.23108   0.27347
Gen Coefficient   0.57023   0.64622   0.54768   0.58961   0.58906   0.62203   0.60700   0.53173
Phi               0.54648   0.61747   0.49764   0.58714   0.51909   0.61899   0.57720   0.48268

Composite Universe Score Standard Deviation    0.26360
Composite Relative Error Standard Deviation    0.08099
Composite Absolute Error Standard Deviation    0.08666
Composite Generalizability Coefficient         0.91374
Composite Phi                                  0.90246

Multivariate Generalizability D Study Configuration D

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.18192   0.13556   0.16257   0.16265   0.16046   0.15268   0.15896   0.15915
Abs Error SD      0.18269   0.13659   0.16418   0.16274   0.16328   0.15279   0.15993   0.16123
Err SD for Mean   0.03919   0.03757   0.03889   0.03604   0.04643   0.03660   0.03827   0.04226
Gen Coefficient   0.68991   0.80894   0.68466   0.75490   0.75614   0.78988   0.74259   0.73368
Phi               0.68810   0.80659   0.68038   0.75469   0.74965   0.78964   0.74026   0.72857

Composite Universe Score Standard Deviation    0.26360
Composite Relative Error Standard Deviation    0.05646
Composite Absolute Error Standard Deviation    0.05688
Composite Generalizability Coefficient         0.95614
Composite Phi                                  0.95551

Multivariate Generalizability D Study Configuration E

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.15320   0.11682   0.13810   0.14338   0.13776   0.13130   0.13753   0.13778
Abs Error SD      0.15502   0.11920   0.14228   0.14366   0.14424   0.13157   0.14072   0.14253
Gen Coefficient   0.75829   0.85077   0.75054   0.79854   0.80793   0.83560   0.79398   0.78614
Phi               0.75394   0.84559   0.73922   0.79789   0.79327   0.83505   0.78640   0.77451

Composite Universe Score Standard Deviation    0.26374
Composite Relative Error Standard Deviation    0.04884
Composite Absolute Error Standard Deviation    0.04994
Composite Generalizability Coefficient         0.96685
Composite Phi                                  0.96539
Multivariate Generalizability D Study Configuration F

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.13373   0.10268   0.12071   0.12641   0.12110   0.11527   0.12051   0.12132
Abs Error SD      0.13581   0.10537   0.12547   0.12673   0.12842   0.11557   0.12413   0.12670
Gen Coefficient   0.80457   0.88066   0.79748   0.83605   0.84481   0.86833   0.83388   0.82581
Phi               0.79967   0.87511   0.78472   0.83534   0.82879   0.86773   0.82551   0.81298

Composite Universe Score Standard Deviation    0.26374
Composite Relative Error Standard Deviation    0.04285
Composite Absolute Error Standard Deviation    0.04410
Composite Generalizability Coefficient         0.97428
Composite Phi                                  0.97281

Multivariate Generalizability D Study Configuration G

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.13042   0.09795   0.11661   0.11731   0.11594   0.11017   0.11405   0.11524
Abs Error SD      0.13149   0.09937   0.11885   0.11744   0.11981   0.11033   0.11540   0.11810
Gen Coefficient   0.81234   0.89022   0.80841   0.85550   0.85589   0.87834   0.84859   0.84010
Phi               0.80983   0.88738   0.80246   0.85524   0.84759   0.87804   0.84554   0.83341

Composite Universe Score Standard Deviation    0.26360
Composite Relative Error Standard Deviation    0.04066
Composite Absolute Error Standard Deviation    0.04124
Composite Generalizability Coefficient         0.97675
Composite Phi                                  0.97610

Multivariate Generalizability D Study Configuration H

D STUDY RESULTS FOR INDIVIDUAL VARIABLES

                  inhibit   shift     emocntrl  initiate  workmem   plan      organize  monitor
Univ Score SD     0.27134   0.27893   0.23955   0.28545   0.28255   0.29602   0.27000   0.26416
Rel Error SD      0.13373   0.10268   0.12071   0.12641   0.12110   0.11527   0.12051   0.12132
Abs Error SD      0.13581   0.10537   0.12547   0.12673   0.12842   0.11557   0.12413   0.12670
Gen Coefficient   0.80457   0.88066   0.79748   0.83605   0.84481   0.86833   0.83388   0.82581
Phi               0.79967   0.87511   0.78472   0.83534   0.82879   0.86773   0.82551   0.81298

Composite Universe Score Standard Deviation    0.26374
Composite Relative Error Standard Deviation    0.04285
Composite Absolute Error Standard Deviation    0.04410
Composite Generalizability Coefficient         0.97428
Composite Phi                                  0.97281
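The coefficients in the D study output can be recovered from the reported standard deviations. In generalizability theory the generalizability coefficient is the universe score variance divided by the sum of universe score variance and relative error variance, and Phi uses absolute error variance in place of relative error variance. The short Python sketch below applies these ratios to the Configuration H values; the function name is illustrative, and small discrepancies in the fifth decimal are to be expected because the inputs are already rounded.

```python
def dependability(universe_sd, error_sd):
    """Universe score variance over universe score plus error variance.

    With a relative error SD this is the generalizability coefficient;
    with an absolute error SD it is Phi.
    """
    universe_var = universe_sd ** 2
    return universe_var / (universe_var + error_sd ** 2)


# Configuration H, inhibit subscale (values copied from the output).
inhibit_g = dependability(0.27134, 0.13373)
inhibit_phi = dependability(0.27134, 0.13581)

# Configuration H, composite (values copied from the output).
composite_g = dependability(0.26374, 0.04285)
composite_phi = dependability(0.26374, 0.04410)

print(f"inhibit   G = {inhibit_g:.5f}   Phi = {inhibit_phi:.5f}")
print(f"composite G = {composite_g:.5f}   Phi = {composite_phi:.5f}")
```

The same function can be applied to any column of any configuration, which gives a quick consistency check on the transcribed tables.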
LIST OF REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. American Educational Research Association.
Borsboom, D. (2006). Can we bring about a velvet revolution in psychological measurement? A rejoinder to commentaries. Psychometrika, 71, 463-467.
Brennan, R. L. (2001a). Generalizability theory. New York: Springer-Verlag.
Brennan, R. L. (2001b). Manual for mGENOVA. Iowa City: Iowa Testing Programs, University of Iowa.
Brennan, R. L. (2006). Educational Measurement (4th Edition). National Council on Measurement in Education. Praeger Publishers.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Calkins, D. S., Erlich, O., Marston, P. T., & Malitz, P. (1978). An empirical investigation of the distributions of generalizability coefficients and variance estimates for an application of generalizability theory. Paper presented at the annual meeting of the American Educational Research Association, Toronto.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137-163.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
Crocker, L., & Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education and Macmillan.
Gardner, P. L. (1995). Measuring attitudes to science: Unidimensionality and internal consistency revisited. Research in Science Education, 25, 283-9.
Gardner, P. L. (1996). The dimensionality of attitude scales: A widely misunderstood idea. International Journal of Science Education, 18, 913-9.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37, 827-838.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255-282.
Hatcher, L. (1994). A step-by-step approach to using the SAS(R) system for factor analysis and structural equation modeling. Cary, NC: SAS Institute.
Hill, T., & Lewicki, P. (2007). STATISTICS: Methods and Applications. StatSoft, Tulsa, OK.
Joe, G. W., & Woodward, J. A. (1976). Some developments in multivariate generalizability. Psychometrika, 41, 205-217.
Kline, T. J. B. (2005). Psychological testing: A practical approach to design and evaluation. Thousand Oaks, CA: Sage.
Kreiter, C. D., Yin, P., Solow, C., & Brennan, R. L. (2004). Investigating the reliability of the medical school admissions interview. Advances in Health Sciences Education, 9(2), 147-159.
Leone, F. C., & Nelson, L. S. (1966). Sampling distributions of variance components I. Empirical studies of balanced nested designs. Technometrics, 8, 457-568.
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston: Houghton Mifflin.
Lord, F. M. (1955). Sampling fluctuations resulting from the sampling of test items. Psychometrika, 20(1), 1-22.
McDonald, R. P. (1999). Test theory: A unified treatment. Hillsdale: Erlbaum.
Norusis, M. J. (1994). SPSS professional statistics 6.1. Chicago: SPSS, Inc.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Rasmussen, C., McAuley, R., & Andrew, G. (2007). Parental ratings of children with Fetal Alcohol Spectrum Disorder on the Behavior Rating Inventory of Executive Function (BRIEF). Journal of FAS International, 5, 1-8.
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74.
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the GLB: Comments on Sijtsma. Psychometrika, 74, 145-154.
scales, Journal of Extension, 37(2).
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
Searle, S. R. (1987). Linear Models for Unbalanced Data. New York, NY: Wiley.
Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973-1980. British Journal of Mathematical and Statistical Psychology, 34, 133-166.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Psychometrika
Smith, P. L. (1978). Sampling errors of variance components in small sample multifacet generalizability studies. Journal of Educational Statistics, 3(4), 319-346.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 160-169.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Ten Berge, J. M. F., & Zegers, F. E. (1978). A series of lower bounds to the reliability of a test. Psychometrika, 43(4), 575-579.
Tuckman, B. W. (1999). Conducting educational research (5th ed.). Belmont: Wadsworth Group.
Webb, N. M., & Shavelson, R. J. (1981). Multivariate generalizability of general educational development ratings. Journal of Educational Measurement, 18, 13-22.
Wiley, E. (2001). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Unpublished doctoral dissertation, Stanford University.
Woodward, J. A., & Joe, G. W. (1973). Maximizing the coefficient of generalizability in multifacet decision studies. Psychometrika, 38, 173-181.
Zelazo, P. D., & Muller, U. (2003). Executive function in typical and atypical development. In: Blackwell handbook of childhood cognitive development. Malden, MA: Blackwell Publishers Ltd, pp. 445-470.
and h: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123-133.
Zinbarg, R., Yovel, I., Revelle, W., & McDonald, R. (2006). Estimating generalizability to a universe of indicators that all have an attribute in common: A comparison of estimators for Applied Psychological Measurement, 30, 121-144.
BIOGRAPHICAL SKETCH

Xiaozhen Shen was born in 1984 in Zhejiang, China. She earned her Bachelor of Arts degree in 2007 from the English Education program at Shanghai International Studies University. In the same year, she was admitted to the Research and Evaluation Methodology Program at the University of Florida. She wanted to continue on to earn a Ph.D. degree in Research and Evaluation Methodology after completing the