GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS: EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND AND TEST ANXIETY USING MULTIPLE METHODS OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

By

THOMAS LANGENFELD

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

ACKNOWLEDGEMENTS

I would like to express my sincerest appreciation to the individuals who have assisted me in completing this study. I am extremely indebted to Dr. Linda Crocker, chairperson of my doctoral committee, for helping in the conceptualization, development, and writing of this dissertation. Her assistance and encouragement were extremely important in enabling me to achieve my doctorate. I also want to thank the other members of my committee, James Algina, Jinwin Hsu, Marc Mahlios, and Rodman Webb, for patiently reading the manuscript, offering constructive comments, providing editorial assistance, and giving continuous support. I further wish to thank David Miller, John Hall, and Scott Behrens for their assistance related to different aspects of this study.

I want to express my deepest gratitude to my family for providing the emotional support that was so vital during the graduate experience. I want to thank my wife, Ann; in many ways this degree is as much hers as mine. Space limitations do not allow me to express the many personal sacrifices made by my wife so that I could complete this study. I also want to thank my daughter, Kathryn Loui, who was born in the early stages of this study and has come to provide a special type of support.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

INTRODUCTION
    Statement of the Problem
    The Measurement Context of the Study
    The Research Problem
    Theoretical Rationale
    Limitations of the Study

REVIEW OF LITERATURE
    DIF Methodology
    Gender and Quantitative Aptitude
    Potential Explanations of DIF
    Summary

METHODOLOGY
    Examinees
    Instruments
    Analysis
    Summary

RESULTS AND DISCUSSION
    Descriptive Statistics
    Research Findings

SUMMARY AND CONCLUSIONS

APPENDICES

SUMMARY STATISTICAL TABLES
DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE INCLUDING THE REVISED TEST ANXIETY SCALE
THE CHIPMAN, MARSHALL, AND SCOTT (1991) INSTRUMENT FOR ESTIMATING MATHEMATICS BACKGROUND

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Multitrait-Multimethod Correlation Matrix: Proposed Uniform DIF Indices
Multitrait-Multimethod Correlation Matrix: Proposed Alternate DIF Indices
Item Data for Groups and Item Scores by Ability Group
Descriptive Statistics for the Items of the RTA Scale
Correlations Among GRE, SAT Mathematics, Calculus Completion, and College Mathematics Credits
Frequencies and Percentages of Gender and Mathematics Background
Mean Scores on the Released GRE-Q and the Revised Test Anxiety (RTA) Scale for the Total Sample, Gender, and Mathematics Background
Intercorrelations of the Released GRE-Q, RTA, and Mathematics Background for the Total Sample, Women, and Men
Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices
Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 30-Item GRE-Q
Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices
Tetrachoric Correlations and Standardized Estimates of the Four Problematic Items: Exploratory Sample
Multitrait-Multimethod Correlation Matrix of the Valid Test Items: Uniform DIF Indices
Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 26-Item Valid Test
Multitrait-Multimethod Correlation Matrix of the Valid Test Items: Alternate DIF Indices

LIST OF FIGURES

The Four Problematic Test Questions
LRCs for Women and Men on Item
LRCs for Women and Men on Item
LRCs for Women and Men on Item
LRCs for Examinees with Substantial and Little Mathematics Background on Item
LRCs for Women and Men on Item
LRCs for Examinees with Substantial and Little Mathematics Background on Item
LRCs for Women and Men on Item Illustrating the Symmetrical Nonuniform DIF Condition
LRCs for Women and Men on Item Illustrating the More Typical Nonuniform DIF Condition

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS: EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND AND TEST ANXIETY USING MULTIPLE METHODS OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

By

Thomas Langenfeld

August, 1995

Chairperson: Linda Crocker
Major Department: Foundations of Education

The purpose of this study was to discover whether defining examinee subpopulations by a relevant educational or psychological variable, rather than by gender, would yield item statistics that were more consistent across five methods for the detection of differential item functioning (DIF). A subsidiary purpose of this study was to assess how the consistency of DIF estimates was affected when structural validation findings were incorporated into the analysis. The study was conducted in the context of college admission quantitative examinations and gender issues. Participants consisted of 1263 university students. For purposes of this study, data were analyzed by categorizing examinees by their gender, mathematics backgrounds, and levels of test anxiety.
The hypothesis that defining subpopulations by mathematics background or test anxiety would yield higher consistency of DIF estimation than defining subpopulations by gender was not substantiated. Results indicated that using mathematics background to define subpopulations and explain gender DIF had potential usefulness; however, in this study, the use of test anxiety to define subpopulations and explain DIF was ineffectual. The findings confirmed the importance of structural validation. Results from analyses using the entire test revealed that nonuniform DIF methods had low intermethod consistency and variance related to methods. When structural validation findings were used to define a valid subset of items, highly consistent DIF indices resulted across methods, with minimal variance related to methods. Results further suggested the need to use a nonuniform method and the importance of jointly interpreting both DIF indices and significance tests. Implications and recommendations for research and practice are included.

CHAPTER 1
INTRODUCTION

Statement of the Problem

Differential item functioning (DIF), a statistical indication of item bias, occurs when equally proficient individuals from different subpopulations have different probabilities of answering an item correctly (Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard, Camilli, & Williams, 1984). Historically, researchers studying DIF have addressed two principal concerns. The first concern has been the development and evaluation of statistical methods for detecting "biased" items. The second concern has been to identify plausible explanations of item bias. In this study, both methodological and substantive educational issues concerning item bias and DIF were addressed.

During the past four decades, a plethora of DIF detection methods has been developed (for a comprehensive review of advances in item bias detection methods over the past ten years, see Millsap & Everson, 1993).
DIF methods differ in whether the conditioning variable is formed from an observed conditional score or an unobserved conditional estimate of latent ability, and in whether they can detect nonuniform as well as uniform DIF. Researchers applying methods that use an observed conditional score commonly sum the number of correct responses on the test, or on a subsection of the test, to estimate the ability of each examinee. Researchers using unobserved conditional estimates most frequently apply a unidimensional item response theory (IRT) model to estimate the latent ability of each examinee.

Uniform DIF occurs when there is no interaction between ability level and group membership. That is, the probability of answering an item correctly is greater for one group than for the other group uniformly over all ability levels. Nonuniform DIF, detectable by only some methods, occurs when there is an interaction between ability level and group membership. That is, the difference in the probabilities of a correct response for the two groups is not the same at all ability levels. In IRT terms, nonuniform DIF is indicated by nonparallel item characteristic curves.

The Mantel-Haenszel (MH) procedure has emerged as the most widely used procedure (more because of Educational Testing Service's usage than as a result of theoretical consensus), and it is frequently the method to which others are compared (Hambleton & Rogers, 1989; Raju, 1990; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The appeal of the MH procedure is its simple conceptualization, relative ease of use, chi-square test of significance, and desirable statistical properties (Dorans & Holland, 1993; Millsap & Everson, 1993). Researchers applying MH employ an observed score as the conditioning variable and recognize that MH is sensitive to only uniform DIF.
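The distinction between uniform and nonuniform DIF drawn above can be illustrated with a minimal sketch. It assumes a two-parameter logistic item characteristic curve; the function names and the item parameter values are hypothetical and chosen only for illustration, and nothing here comes from the study's data.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def group_gaps(ref, foc, thetas):
    """Reference-minus-focal probability of a correct response at each theta."""
    return [icc(t, *ref) - icc(t, *foc) for t in thetas]

thetas = [x / 2.0 for x in range(-6, 7)]  # ability grid from -3 to 3

# Uniform DIF: equal discrimination, item is harder for the focal group,
# so the curves are parallel and the gap keeps one sign.
uniform = group_gaps((1.0, 0.0), (1.0, 0.5), thetas)

# Nonuniform DIF: unequal discrimination, so the curves cross at theta = 0
# and the gap changes sign across the ability range.
nonuniform = group_gaps((1.5, 0.0), (0.7, 0.0), thetas)

uniform_one_sided = all(g > 0 for g in uniform)
nonuniform_changes_sign = min(nonuniform) < 0 < max(nonuniform)
```

In the second case the positive and negative gaps largely cancel when summed over the ability range, which is why a method that aggregates a signed difference can miss nonuniform DIF.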
Other methods compared with the MH procedure in this study included logistic regression (Swaminathan & Rogers, 1990), IRT-Signed Area (IRT-SA) and IRT-Unsigned Area (IRT-UA) (Raju, 1988, 1990), and the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993a, 1993b). Logistic regression was designed to condition on observed scores and analyze item responses. With logistic regression, the user can detect both uniform and nonuniform DIF. IRT-SA and IRT-UA were devised to condition on latent ability estimates and assess the area between an item's characteristic curves estimated for the two groups. IRT-SA was developed to detect only uniform DIF, whereas IRT-UA was developed to detect both uniform and nonuniform DIF. SIBTEST was designed to conceptualize DIF as a multidimensional phenomenon in which nuisance determinants adversely influence item responses (Shealy & Stout, 1993a, 1993b). Researchers using SIBTEST apply factor analysis to define a valid subtest and a regression correction procedure to estimate the criterion variable. SIBTEST was developed to detect only uniform DIF.

Assessing different DIF indices with data from a curriculum-based, eighth grade mathematics test, Skaggs and Lissitz (1992) found that the consistency between methods was low, and no reasonable explanation for items manifesting DIF could be hypothesized. They posited that categorizing subpopulations by demographic characteristics such as gender or ethnicity in DIF studies was "not very helpful" in conceptualizing cognitive issues and indicated nothing about the reasons for the differences (p. 239). A number of researchers have suggested the need to explore using subpopulations categorized by psychologically and educationally significant variables that correlate with gender or ethnicity and potentially influence item responses (Skaggs & Lissitz, 1992; Tatsuoka, Linn, M. M. Tatsuoka, & Yamamoto, 1988).
Thus, a major concern of the study was the consistency of results from different DIF estimation procedures when subpopulations are conceptualized by psychological or educational variables. Three methods of conceptualizing subpopulations were combined with five fundamentally different state-of-the-art procedures for assessing DIF.

The Measurement Context of the Study

The substantive issue was the investigation of gender differences on a sample test containing items similar to those found on advanced college admission quantitative examinations. Generally, men tend to outperform women on the Scholastic Aptitude Test-Math (SAT-M), the American College Testing Assessment Mathematics Usage Test (ACT-M), and the Graduate Record Examination-Quantitative (GRE-Q). However, from a predictive validity perspective, these differences are problematic. For example, men tend to score higher than women on the SAT-M (National Center for Education Statistics, 1993), although women tend to perform at nearly the same level as men, and tend to outperform men in general college courses (Young, 1991, 1994).

A possible explanation of quantitative test score differences between men and women is background experience. Men tend to enroll in more years of mathematics (National Center for Education Statistics, 1993). A second factor that could potentially explain the differential validity of such tests is test anxiety. Test anxiety relates to examinees' fear of negative evaluation and defensiveness (Hembree, 1988). Women generally report higher levels of test anxiety than men (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988; Wigfield & Eccles, 1989). Thus, for high-stakes tests of mathematical aptitude, mathematics background and test anxiety could influence item responses differentially for each gender.

The Research Problem

In this study, I explored the feasibility of conceptualizing subpopulations by relevant educational or psychological variables in contrast to the use of traditional demographic variables. The vehicle to achieve this purpose was a released form of the GRE-Q.
In this study, DIF was assessed between women and men, between examinees with substantial and little mathematics background, and between examinees high and low in test anxiety. DIF was assessed using five different measures. The DIF measures were MH, logistic regression, IRT-SA, IRT-UA, and SIBTEST. The DIF methods were classified into two groups: methods measuring uniform DIF and alternate methods. The uniform methods were MH, IRT-SA, and SIBTEST. Alternate methods included logistic regression and IRT-UA, along with MH. Logistic regression and IRT-UA were designed to measure both uniform and nonuniform DIF. Mantel-Haenszel was placed into both analysis groups because of its widespread use by testing practitioners.

Regarding the study's methodological issues, the results of five methods of estimating DIF will be contrasted within each of three modes of defining subpopulation groups. The observation of interest was the DIF index estimated for each item under a particular combination of subpopulation definition and DIF method. Replications were the items on a released form of the GRE-Q test. For the research questions that follow, trait effects refer to the three subpopulation conceptualizations and method effects refer to the DIF estimation procedures.

The first four research questions address the consistency of DIF indices between methods when subpopulations are conceptualized using different traits. The uniform methods of MH, IRT-SA, and SIBTEST were combined with the traits of gender, mathematics background, and test anxiety to yield a multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 1 for an illustration of a MTMM matrix with uniform measures.) Similarly, the alternate estimation methods of MH, IRT-UA, and logistic regression were combined with the traits of gender, mathematics background, and test anxiety to yield a second MTMM matrix of correlation coefficients. (See Table 2 for an illustration of a MTMM matrix with alternate measures.) Each of the following research questions was addressed twice; each question was answered for the uniform methods and the alternate methods, respectively:
1. Among the three sets of convergent coefficients, often termed monotrait-heteromethod coefficients (e.g., the correlation between the DIF indices obtained from the MH and IRT-SA methods when subpopulations are defined by the trait gender), will the coefficients based on subpopulations defined by mathematics background and test anxiety be higher than the coefficients when subpopulations are defined by gender?

2. Will monotrait-heteromethod coefficients be higher than coefficients of different traits measured by the same method (i.e., heterotrait-monomethod coefficients)?

3. Will convergent correlation coefficients be higher than discriminant coefficients measuring different traits by different methods (i.e., heterotrait-heteromethod coefficients)?

4. Will the pattern of correlations among the three traits be similar over the three methods of DIF estimation?

Table 1
Multitrait-Multimethod Correlation Matrix: Proposed Uniform DIF Indices

                      I. MH-D          II. IRT-SA       III. SIBTEST-b
                      A    B    C      A    B    C      A    B    C
I. MH-D
   A. Gender
   B. MathBkd         HM
   C. TA              HM   HM
II. IRT-SA
   A. Gender          MH*  HH   HH
   B. MathBkd         HH   MH*  HH     HM
   C. TA              HH   HH   MH*    HM   HM
III. SIBTEST-b
   A. Gender          MH*  HH   HH     MH*  HH   HH
   B. MathBkd         HH   MH*  HH     HH   MH*  HH     HM
   C. TA              HH   HH   MH*    HH   HH   MH*    HM   HM

Note. Diagonal entries are reliability coefficients. HM = heterotrait-monomethod coefficients. MH* = monotrait-heteromethod, or convergent validity, coefficients. HH = heterotrait-heteromethod coefficients.

Table 2
Multitrait-Multimethod Correlation Matrix: Proposed Alternate DIF Indices

                      I. MH-D          II. IRT-UA       III. Log Reg
                      A    B    C      A    B    C      A    B    C
I. MH-D
   A. Gender
   B. MathBkd         HM
   C. TA              HM   HM
II. IRT-UA
   A. Gender          MH*  HH   HH
   B. MathBkd         HH   MH*  HH     HM
   C. TA              HH   HH   MH*    HM   HM
III. Log Reg
   A. Gender          MH*  HH   HH     MH*  HH   HH
   B. MathBkd         HH   MH*  HH     HH   MH*  HH     HM
   C. TA              HH   HH   MH*    HH   HH   MH*    HM   HM

Note. Diagonal entries are reliability coefficients. HM = heterotrait-monomethod coefficients. MH* = monotrait-heteromethod, or convergent validity, coefficients. HH = heterotrait-heteromethod coefficients.

The final research question addressed the consistency of DIF procedures in identifying aberrant items when subpopulations are conceptualized in different ways. The question was applied twice; it was answered for uniform methods and alternate methods.
It was as follows:

5. For each DIF detection method, using standard decision rules, what is the percent of agreement about aberrant items when subgroups are based on gender and when subgroups are based on mathematics background and test anxiety, respectively?

Following the analysis of uniform and alternate methods, I conducted a structural analysis of the 30-item quantitative test. Shealy and Stout (1993a, 1993b) stressed that practitioners must carefully identify a valid subset of items prior to conducting DIF analyses. They argued that DIF occurs as a consequence of multidimensionality. The potential for DIF occurs when one or more nuisance dimensions interact with the valid dimension of a test (Ackerman, 1992; Camilli, 1992). Messick (1988) stressed the structural component of construct validation. The structural component concerns the extent to which items are combined into scores that reflect the structure of the underlying latent construct. Loevinger (1957) termed the purity of the internal relationships structural fidelity, which is appraised by analyzing the interitem structure of a test.

I employed factor analytic procedures to define a structurally valid subset of unidimensional items and to identify problematic multidimensional items. I hoped to identify items measuring both the intended dimension and nuisance dimensions. After identification of a structurally valid subset of items, DIF was reassessed using the five methods with subpopulations defined by gender, mathematics background, and test anxiety. Using DIF indices as the unit of analysis, two MTMM matrices of correlation coefficients were generated, one matrix for uniform methods and one matrix for alternate methods. I applied the five research questions to the MTMM matrices and inferential statistics using the structurally valid items. I contrasted findings of the analysis for the entire test with findings of the analysis of the subset of test items.

Theoretical Rationale

The process of ensuring that high-stakes tests contain no items that function differentially for specific subpopulations is a fundamental concern of construct validation.
Items that contain nuisance determinants correlated with an examinee's subpopulation membership threaten the construct interpretations derived from test scores for that subpopulation. Psychometric researchers continue to examine the merits of numerous DIF detection procedures and to explore theoretical explanations of DIF. However, to date, they have failed to reach consensus on methodological issues or to develop explanations that account for DIF identification with actual test data (Linn, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). These concerns were investigated from both a practical and a theoretical perspective that has been suggested (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988) but rarely tested.

Two significant premises underlie the study. The first premise is that there is nothing inherent in being female or a member of a specific ethnic group that predisposes an individual examinee to find a particular item troublesome. Educational and psychological phenomena function in unique ways to disadvantage an individual on a specific item. Traditional DIF occurs when the phenomena correlate with a demographic group of interest. Consequently, gender or ethnicity can be interpreted as a surrogate for educational or psychological variables that potentially explain the causes of DIF. Skaggs and Lissitz (1992) posited that educational and psychological variables that influence item performance and correlate with ethnic or gender groups would be useful in conceptualizing subpopulations. Millsap and Everson (1993) also commented on modeling such variables.

The educational and psychological variables in the study that were hypothesized as potentially explaining gender DIF on quantitative test items were mathematics background and test anxiety. Mathematics background was selected because it influences quantitative reasoning and problem solving. Further, in high school and college, men tend to enroll in more mathematics courses and study mathematics at a more abstract level than women (National Center for Educational Statistics, 1993).
Center Researchers assessing overall SATM performance have found that gender difference decrease subs tantially when differences high school mathematics background are taken into account, although differences background does (Ethington entirely Wolfle, 1984; explain Fennema score Sherman, 1977; Pallas Alexander, 1983). Quantitative aptitude test scores can contaminated familiarity with item context, the application of novel solutions, and the use partial knowledge to solve complex problems (Kimball, 1989). These type kills frequently are developed through background experien ces Test anxiety was selected because well * m I r 1 1 CI Liebert possessing & Morris, high 1967; level Tryon, test 1980) For anxiety, test individual scores frequently are depressed and construct interpretations become problematic (Everson et al., 1991; Hembree , 1988; Sarason, 1980). Cons equently, test anxiety exemplifies psychological variabi that potentially contaminates construct interpre stations scores For examinees with high level test anxiety, tests of mathematical ability tend to induce Woolfolk, 1980) extreme SFemal eve students anxiet tend y. (Richard to report h son higher level levels test including anxie coll than ege mal (Everson student et al. at all , 1991; grade Hembree, 1988; Wigfield Eccles 1989) Over the st 2 years several selfreported measures test anxiety have been developed that demonstrate high reliability and well defined th Schwarzer, Leoretical Seipp, properties Zahhar, enson, 1991; Moulin Sarason, Julian, 1984; Spielberger, Gon , Taylor, Algaze, Anton, 1978) Researchers have used the selfreported instruments measure test anxiety and assess efficacy treatment programs 1980) (Sa For reason, 1 studying 980; Spielberger gender et al on colle ., 1978; admi Wine, ssions S" A . S C C a m b I * *l 1 * . * * LL l_ _ threat to valid score interpretation, negative influence tests of mathematics ability, and gender effects. 
threat to valid score interpretation, negative influence on tests of mathematics ability, and gender effects.

The second fundamental premise of the study is a tenet underlying educational measurement. Item responses are products of complex interactions between examinees and a set of items. In part because of this complex interaction, examinees of approximately equivalent abilities who belong to different subpopulations occasionally have different likelihoods of answering a question correctly. This fascinating finding currently is understood only crudely. Before it can be better understood, the effects of different means of conceptualizing subpopulations and of different DIF detection methods on item responses must be examined.

Limitations of the Study

A salient limitation of the study was the nature of the performance task. Participants in the study were administered a sample GRE-Q and were told the time limit for completing the test. They were told to perform to the best of their ability and that they would be able to learn their results following testing. Although every effort was made to simulate the conditions of an actual admissions test, examinees' performance might not accurately reflect their performance on a high-stakes college admissions test. Further, if the participants believed that the examination had low stakes, the level of test anxiety felt by examinees while answering the sample GRE-Q would not be equivalent to the level of test anxiety experienced by examinees while answering a college admissions test.

Finally, examinees in the study were predominantly undergraduate students taking classes in the colleges of education and business at a large, southern state university. For this reason, although the design, methodology, and analysis were conceived and executed to maximize the generalizability of findings, a degree of caution is recommended in generalizing to other populations or settings.

CHAPTER 2
REVIEW OF LITERATURE

The four central aspects of this study were differential item functioning (DIF) methodology, gender differences in college-level mathematical aptitude testing, gender differences in mathematics background, and test anxiety. These four topics constitute the major themes in the organization of the literature review presented in this chapter.
DIF Methodology

A Conceptual Framework of DIF

Tests used for placement in education and selection in employment require scores that are fair and representative for individuals. Since the mid 1960s, measurement specialists have been concerned explicitly with the fairness of their instruments and the possibility that some tests may be biased (Cole & Moss, 1989). Bias studies initially were designed to investigate the assertions that disparities between various subpopulations on cognitive ability test scores were a product of cultural bias inherent in the measure (Angoff, 1993). Test critics charged that a test was biased unless subpopulations had equivalent score distributions on the construct measured, and they dismissed the possibility that actual differences may exist. Measurement specialists, however, have resolved that mean differences do not necessarily reflect bias but indicate test impact (Dorans & Holland, 1993).

Concerns about measurement bias are inherent to validity theory (Cole & Moss, 1989). A test score inference is considered sufficiently valid when various types of evidence justify its usage and eliminate other counterinterpretations (Messick, 1989; Moss, 1992). Bias has been characterized as a source of invalidity that prevents some examinees with the trait or knowledge being measured from demonstrating that ability (Shepard, Camilli, & Williams, 1985). If score-based inferences are not equally valid for relevant subgroups, decisions derived from score inferences will not be fair for individuals. Therefore, measurement bias occurs when score interpretations are differentially valid for subgroups of test takers (Cole & Moss, 1989).

To investigate the potential for measurement bias, researchers have examined test items as a source of explanation. The supposition is that biased items contribute to biased measurement. The purposes of item bias research are to identify and remove test items detected as biased (Angoff, 1993) and to provide test developers with guidelines for making future construction of biased items less likely (Scheuneman, 1987; Schmitt, Holland, & Dorans, 1993).
Measurement specialists have defined item bias as occurring when individuals from different subpopulations who are equally proficient on the construct measured have different probabilities of successfully answering the item (Angoff, 1993; Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard et al., 1985). Researchers apply statistical methods to equate individuals on the construct, utilizing either observed scores or latent ability scores, and estimate for examinees of each group the probability of a correct response. These methods provide statistical evidence of bias. When a statistically biased item is identified, it might be interpreted as unfairly disadvantageous to a minority group for cultural and social reasons. On the other hand, the item might be interpreted as unrelated to cultural and social factors but related to an important educational outcome that is not equally known or understood by the groups. In this latter case, deleting the item for strictly statistical reasons may reduce validity.

Researchers discovered that statistical analyses of item bias raised expectations and created confusion in an already obscure and volatile topic. The term differential item functioning (DIF) gradually replaced item bias as the preferred technical term in research because of its more neutral connotations (Angoff, 1993; Dorans & Kulick, 1986). Holland and Wainer (1993) distinguished between item bias and DIF, stating that item bias refers to "an informed judgment about an item that takes into account the purpose of the test, the relevant experiences of certain subgroups of examinees taking it, and statistical information about the item" (p. xiv). DIF is a "relative term" (p. xiv) and a statistical indication of a differential response pattern. Shealy and Stout (1993a) proposed that the difference between item bias and DIF is the degree to which the user or researcher has embraced "a construct validity argument" (p. 197). Shealy and Stout (1993a, 1993b) conceptualized DIF as a violation of the unidimensional nature of test items.
They classified the intended dimension as the target ability and unintended dimensions as nuisance determinants. DIF occurred because of nuisance determinants existing in differing degrees among subgroups. Crocker and Algina noted that, beyond the intended construct, the distributions of irrelevant sources of variation are different for subgroups. Therefore, DIF can be conceptualized as a consequence of multidimensionality, with differing sources of variation influencing subgroups' item responses.

A Formal Definition of DIF

All DIF detection methods rely on assessment of the response patterns of subgroups to test items. The subgroups, conceptualized in most studies on the basis of demographic characteristics (i.e., blacks and whites, women and men), form a categorical variable. When two groups are contrasted, the group of interest (e.g., blacks or women) is designated the focal group, and the group serving as the comparison group (e.g., whites or men) is designated the reference group. Examinees are matched on a criterion variable, assumed to be a valid representation of the purported construct, and DIF methods assess differential group response patterns for individuals of equal ability.

Denote the item score as $Y$, frequently scored as a dichotomous variable; denote $X$ as the conditioning criterion; and denote $G$ as the categorical variable of group membership, with reference group $R$ and focal group $F$. Lack of measurement bias or DIF for an item is defined as

\[ P_R(Y = 1 \mid X) = P_F(Y = 1 \mid X) \]

for all values of $X$ for the reference and focal groups. In this definition, $P_g(Y = 1 \mid X)$ is the conditional probability function for $Y$ in group $g$ at all levels of $X$ (Millsap & Everson, 1993).

Although DIF procedures operate from this definition, they differ on the basis of statistical models and possess various advantages. DIF procedures can be characterized as models using observed conditional invariance or as models utilizing unobserved conditional invariance (Millsap & Everson, 1993). When observed conditional invariance is used, the criterion variable is the sum of the total number of correct responses on the test or a subset of the test.
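As a rough illustration of observed conditional invariance, the sketch below tabulates the proportion of examinees answering a studied item correctly at each total-score level, separately for the reference and focal groups. The function name, the tuple layout, and the data are hypothetical and serve only to make the definition concrete; this is not the procedure used in the study.

```python
from collections import defaultdict

def conditional_p(responses):
    """Proportion correct on the studied item at each total-score level, by group.

    `responses` is an iterable of (group, total_score, item_score) tuples,
    with group in {"R", "F"} and item_score in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])  # (group, score) -> [n_correct, n]
    for group, score, y in responses:
        cell = counts[(group, score)]
        cell[0] += y
        cell[1] += 1
    return {key: correct / n for key, (correct, n) in counts.items()}

# Hypothetical data: (group, total score X, item score Y).
data = [
    ("R", 10, 1), ("R", 10, 1), ("R", 10, 0), ("R", 10, 1),
    ("F", 10, 1), ("F", 10, 0), ("F", 10, 0), ("F", 10, 1),
    ("R", 20, 1), ("R", 20, 1), ("F", 20, 1), ("F", 20, 1),
]
p = conditional_p(data)
gap_at_10 = p[("R", 10)] - p[("F", 10)]  # groups differ at this score level
gap_at_20 = p[("R", 20)] - p[("F", 20)]  # groups agree at this score level
```

Under the definition above, an item free of DIF would show gaps near zero at every score level, apart from sampling error; the detection methods reviewed in this chapter differ mainly in how they aggregate and test such gaps.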
When unobserved conditional invariance is used, a unidimensional item response theory (IRT) model estimates a $\theta$ parameter for each examinee that functions as the criterion variable.

Other differences between detection procedures are their capacity to detect nonuniform DIF, to test statistical significance, and to conceptualize DIF as a consequence of multidimensionality. Uniform DIF occurs when there is no interaction between group membership and the conditioning criterion regarding the probability of answering an item correctly. In other words, DIF functions in a uniform fashion across the ability spectrum. Nonuniform DIF refers to an interaction between group membership and the conditioning criterion; an item may favor a subgroup at one end of the ability spectrum and disfavor the subgroup at the other end of the spectrum. All DIF procedures are used to estimate an index describing the magnitude of the differential response pattern of the groups on the item. Some procedures provide statistical tests to detect whether the DIF index differs significantly from zero. Finally, although DIF is perceived as a consequence of multidimensionality, every procedure except Shealy and Stout's Simultaneous Item Bias Test (SIBTEST) functions within a unidimensional framework.

Many DIF detection methods have been developed during the past three decades. In this review, they are categorized as based upon observed conditional invariance or unobserved latent conditional invariance. Related issues, problems, and potential usage are evaluated. Following the review of DIF detection methods, research efforts to explain the underlying causes of DIF are presented.

Methods Based Upon Observed Scores

Angoff and Ford (1973) offered the first widely used DIF detection method, called the delta plot.
The delta-plot procedure was problematic under conditions of differing ability score distributions. A subsequent chi-square technique was not based upon a chi-square sampling distribution and was not, in effect, a chi-square procedure at all (Baker, 1981). The full chi-square procedure (Bishop, Fienberg, & Holland, 1975) was a valid technique for testing DIF but required large sample sizes at each ability level to sustain statistical power. Holland and Thayer (1988) built upon these chi-square techniques when they applied the Mantel and Haenszel (1959) statistic, originally developed for medical research, to the detection of DIF.

Mantel-Haenszel procedure. The Mantel-Haenszel (MH) statistic has become the most widely used method of DIF detection (Millsap & Everson, 1993). The MH procedure assesses the item data in a J-by-2-by-2 contingency table. At each score level $j$, individual item data are presented for the two groups and the two levels of item response, right or wrong (see the table below). The null hypothesis of the MH procedure can be expressed as follows: the odds of answering an item correctly at a given ability level are the same for both groups across all ability levels. The alternative hypothesis is that the two groups do not have equal probability of answering the item correctly at some level of $j$.

Table
Item Data for Groups and Item Scores at Ability Level j

                     Score on Studied Item
Group                1          0          Total
Reference Group      A_j        B_j        n_Rj
Focal Group          C_j        D_j        n_Fj
Total Group          m_1j       m_0j       T_j

The MH statistic uses a constant odds ratio ($\alpha_{MH}$) as an index of DIF. The estimate of the constant odds ratio is

$$\hat{\alpha}_{MH} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}.$$

The constant odds ratio ranges in value from zero to infinity. The estimated value is 1 under the null condition. It is interpreted as the average factor by which the odds that a reference group examinee will answer the item correctly exceeds that of a focal group examinee. The estimated value of $\alpha_{MH}$ is frequently transformed to the more easily interpreted $\Delta$ metric via

$$\text{MH D-DIF} = -2.35 \ln[\hat{\alpha}_{MH}].$$

Positive values of MH D-DIF favor the focal group, whereas negative values favor the reference group. The chi-square test of significance is

$$\text{MH } \chi^2 = \frac{\left[\, \left| \sum_j A_j - \sum_j E(A_j) \right| - .5 \,\right]^2}{\sum_j \mathrm{Var}(A_j)},$$

where

$$E(A_j) = \frac{n_{Rj}\, m_{1j}}{T_j} \quad \text{and} \quad \mathrm{Var}(A_j) = \frac{n_{Rj}\, n_{Fj}\, m_{1j}\, m_{0j}}{T_j^2 (T_j - 1)}.$$

The MH chi-square is distributed approximately as a chi-square with one degree of freedom. The advantages of the MH procedure are computational simplicity (Holland & Thayer, 1988), a statistical test of significance, and lack of sensitivity to subgroup differences in the distribution of ability (Donoghue, Holland, & Thayer, 1993; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The most frequently cited disadvantage is its lack of power to detect nonuniform DIF (Swaminathan & Rogers, 1990). It is further limited by its unidimensional conception and the assumption that total test score provides a meaningful measure of the construct purported to be estimated.

The standardization procedure. The standardization procedure (Dorans & Kulick, 1986) is based upon the nonparametric item-test regressions for the two groups. Let $E_R(Y \mid X)$ define the expected item-test nonparametric regression for the reference group, and $E_F(Y \mid X)$ define the expected item-test nonparametric regression for the focal group, where $Y$ is the item score and $X$ is the test score. The DIF analysis at the individual score level is

$$D_j = E_{Fj} - E_{Rj}.$$

The statistic $D_j$ is the fundamental measure of DIF; it represents differences in item performance that cannot be explained by differences in the attribute tested. The standardization procedure derives its name from the standardization group that functions to supply a set of weights, one at each ability level, that will be used to weight each individual $D_j$. The standardized p-difference (STD P-DIF) is

$$\text{STD P-DIF} = \frac{\sum_j W_j (E_{Fj} - E_{Rj})}{\sum_j W_j}.$$

The essence of standardization is the weighting function. The specific weight implemented for standardization depends upon the nature of the study (Dorans & Kulick, 1986). Plausible options of weighting include the number of examinees in the total group at each level of $j$ or the number of examinees in the focal group at each level of $j$. When focal group weighting is used, STD P-DIF is defined as the difference between the observed performance of the focal group on an item and its expected performance (Dorans & Kulick, 1986). The standardization procedure contains a significance test.
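The two observed-score indices described above can be computed from the same stratified counts. The following is a minimal sketch rather than a definitive implementation: the function names and the example counts are hypothetical, each stratum is assumed to be a tuple (A, B, C, D) laid out as in the contingency table for score level j, and focal group weighting is assumed for STD P-DIF (one of the weighting options the text mentions).

```python
import math

def mantel_haenszel(strata):
    """Return (alpha_MH, MH D-DIF, MH chi-square) from stratified 2x2 counts.

    strata: list of (A, B, C, D) per score level j, where
      A, B = reference group right / wrong
      C, D = focal group right / wrong
    """
    num = den = sum_a = sum_e = sum_var = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        if t < 2:                      # variance term undefined for T_j < 2
            continue
        n_r, n_f = a + b, c + d        # group totals at level j
        m1, m0 = a + c, b + d          # right / wrong totals at level j
        num += a * d / t
        den += b * c / t
        sum_a += a
        sum_e += n_r * m1 / t                              # E(A_j)
        sum_var += n_r * n_f * m1 * m0 / (t * t * (t - 1))  # Var(A_j)
    alpha = num / den
    d_dif = -2.35 * math.log(alpha)
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
    return alpha, d_dif, chi2

def std_p_dif(strata):
    """Focal-weighted STD P-DIF: sum W_j (E_Fj - E_Rj) / sum W_j, W_j = n_Fj."""
    num = den = 0.0
    for a, b, c, d in strata:
        n_r, n_f = a + b, c + d
        if n_r == 0 or n_f == 0:       # no comparison possible at this level
            continue
        num += n_f * (c / n_f - a / n_r)
        den += n_f
    return num / den

# Hypothetical counts for two score levels (illustration only).
example = [(30, 10, 20, 20), (10, 10, 5, 5)]
alpha, d_dif, chi2 = mantel_haenszel(example)
print(alpha, d_dif, chi2, std_p_dif(example))
```

In this invented example the reference group has the higher odds of success at the first score level, so the odds ratio exceeds 1 and both MH D-DIF and STD P-DIF come out negative, which matches the sign convention stated in the text (negative values favor the reference group).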
The standard error using focal group weighting is

$$SE(\text{STD P-DIF}) = \left[ \frac{P_F (1 - P_F)}{N_F} + \mathrm{VAR}(P_F^*) \right]^{1/2},$$

where $P_F$ is the proportion of focal group members correctly answering the item, and where $P_F^*$ can be thought of as the performance of focal group members predicted from the reference group item-test regression curve, and

$$\mathrm{VAR}(P_F^*) = \sum_j \frac{N_{Fj}^2\, P_{Rj} (1 - P_{Rj})}{N_{Rj}\, N_F^2}.$$

The standardization procedure is a flexible method of investigating DIF (Dorans & Holland, 1993), and has been applied to assessing differential functioning of distractors (Dorans, Schmitt, & Bleistein, 1992) and the differential effect of speededness (Schmitt & Dorans, 1990). DIF findings from the standardization procedure will be in close agreement with the MH procedure (Millsap & Everson, 1993). The advantages and limitations of the standardization method are much the same as for the MH procedure. The most commonly cited deficiency of both methods is their inability to detect nonuniform DIF. Donoghue et al. (1993) determined that both methods require a minimum number of items in the conditioning score, that the studied item should be included in determining the conditioning score, and that extreme ranges of item difficulty can adversely influence DIF estimation. Linn (1993) observed that DIF estimates using these procedures appear to be confounded with item discrimination.

Logistic regression model. Swaminathan and Rogers (1990) applied logistic regression to DIF analysis. Logistic regression models, unlike least squares regression, permit categorical variables as dependent variables. Thus, logistic regression permits analysis of dichotomously scored item data. It has additional flexibility, including the analysis of the interaction between group and ability, as well as allowing the inclusion of other categorical and continuous independent variables in the model. A fundamental concept in analysis with linear models is the assessment of consistency between a model and a set of data (Darlington, 1990). Consistency between the model and the data is measured by the likelihood.
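Each examinee will have a probability between 0 and 1 of answering an item correctly. By the multiplicative law of independent probabilities, an overall probability of a group of examinees answering in a specific pattern can be estimated. For example, if the probability of four individuals each answering an item correctly is 0.9, and three of the subjects answer correctly, the overall probability of this pattern occurring is (0.1 × 0.9 × 0.9 × 0.9), or 0.0729. Therefore, for item $i$, the likelihood function of a set of examinee responses, each with ability level $\theta_n$, is determined by

$$L(\text{Data} \mid \theta) = \prod_{n=1}^{N} P(u_{in} \mid \theta_n)^{u_{in}} \left[ 1 - P(u_{in} \mid \theta_n) \right]^{1 - u_{in}},$$

where $u_{in}$ has a value of 1 for a correct response and a value of 0 for an incorrect response. The logistic regression model for predicting the probability of a correct answer is

$$P(u_i = 1 \mid \theta) = \frac{\exp(\beta_0 + \beta_1 \theta)}{1 + \exp(\beta_0 + \beta_1 \theta)},$$

where $u_i$ is the response to item $i$ given ability level $\theta$, $\beta_0$ is the intercept parameter, and $\beta_1$ is the slope parameter. For DIF analysis, the model is extended with a group variable $g$:

$$P(u_i = 1 \mid \theta, g) = \frac{\exp(\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 \theta g)}{1 + \exp(\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 \theta g)},$$

where $\beta_2$ is the estimate of the uniform difference between groups, and $\beta_3$ is the estimated interaction between group and ability. If only $\beta_0$ and $\beta_1$ deviate from zero, the item is interpreted as containing no DIF. If $\beta_2$ does not equal zero and $\beta_3$ equals zero, uniform DIF is indicated. If $\beta_3$ does not equal zero, nonuniform DIF is inferred. Estimation of the parameters is carried out for each item using a maximum likelihood procedure. The two null hypotheses can be tested jointly by

$$\chi^2 = \hat{\beta}' C' (C \hat{\Sigma} C')^{-1} C \hat{\beta},$$

where $\hat{\beta}$ is the vector of parameter estimates, $\hat{\Sigma}$ is its estimated covariance matrix, and $C$ is a matrix of contrasts selecting $\beta_2$ and $\beta_3$. The test has a chi-square distribution with 2 degrees of freedom.

The logistic regression procedure offers a powerful approach for testing the presence of both uniform and nonuniform DIF. Across the sample sizes per group and the numbers of test items serving as the criterion that they examined, Swaminathan and Rogers (1990) concluded, under conditions of uniform DIF, that the logistic regression procedure had power similar to the MH procedure and controlled Type I errors almost as well. The logistic regression procedure had effective power in identifying nonuniform DIF, whereas the MH procedure was virtually powerless to do so.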
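The likelihood example and the logistic DIF model above can be checked numerically. This is a small sketch under stated assumptions: the function names and coefficient values are hypothetical, the group variable g is assumed to be coded 0 and 1, and the code evaluates the model for given coefficients rather than estimating them by maximum likelihood.

```python
import math

def pattern_likelihood(probs, responses):
    """Likelihood of a response pattern under independence.

    probs[n] is examinee n's probability of a correct answer;
    responses[n] is 1 (correct) or 0 (incorrect).
    """
    like = 1.0
    for p, u in zip(probs, responses):
        like *= p if u == 1 else (1.0 - p)
    return like

def logistic_dif_prob(theta, g, b0, b1, b2=0.0, b3=0.0):
    """P(u = 1 | theta, g) with group (b2) and group-by-ability (b3) terms."""
    z = b0 + b1 * theta + b2 * g + b3 * theta * g
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# The worked example from the text: four examinees, each with p = 0.9,
# three correct and one incorrect.
print(pattern_likelihood([0.9] * 4, [0, 1, 1, 1]))   # about 0.0729

# Hypothetical coefficients: the log-odds gap between groups is b2 + b3*theta.
for theta in (-1.0, 0.0, 1.0):
    uniform = (logit(logistic_dif_prob(theta, 1, 0.2, 1.1, b2=-0.5))
               - logit(logistic_dif_prob(theta, 0, 0.2, 1.1, b2=-0.5)))
    nonuniform = (logit(logistic_dif_prob(theta, 1, 0.2, 1.1, b3=0.4))
                  - logit(logistic_dif_prob(theta, 0, 0.2, 1.1, b3=0.4)))
    print(theta, round(uniform, 3), round(nonuniform, 3))
```

With only the group term nonzero, the gap in log-odds between the groups is the same at every ability level (uniform DIF); with only the interaction term nonzero, the gap changes sign across the ability range, which is the symmetrical nonuniform case.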
In demonstrating the ineffectiveness of the MH procedure to detect nonuniform DIF, Swaminathan and Rogers (1990) simulated data keeping item difficulties equal and varying the discrimination parameter. In effect, they simulated nonuniform symmetrical DIF. Their simulation created a set of conditions where theoretically the MH procedure has no power. Researchers must ask whether such symmetrical interactions occur with actual test data. Millsap and Everson (1993) commented that Swaminathan and Rogers (1990) utilized large numbers of items, and they conjectured that in cases with a small number of homogeneous items forming the criterion variable, false positive rates would increase unacceptably above nominal levels.

The logistic procedure, although developed from a unidimensional perspective, provides a flexible model that can incorporate a diversity of independent categorical and continuous variables. Millsap and Everson (1993) observed that the procedure "allows inclusion of curvilinear terms and other factors, such as examinee characteristics like test anxiety or instructional opportunity, that may be relevant factors in exploring possible causes of DIF" (p. 306).

Methods Based Upon Latent Ability Estimation

DIF detection methods conditioning on latent ability are developed through various IRT model approaches. IRT models describe the relationship between individual item responses and the construct measured by the test or subtest. When applied to DIF analyses, IRT permits the use of estimates of true ability as the criterion variable, as opposed to the more subjective measure of observed scores. Despite their theoretical appeal, IRT approaches possess the inherent disadvantages of requiring large sample sizes, being computationally complex and costly, and including the stringent assumption of unidimensionality (Oshima, 1989). The most widely used IRT models are the Rasch model or one-parameter model, the two-parameter logistic model (2PL), and the three-parameter logistic model (3PL).
When none of the items forming the ability score, except possibly the studied item, contain DIF, the MH procedure provides a DIF index proportional to the index estimated by the Rasch model. Therefore, methods based upon the Rasch model will not be reviewed, and the more complex 2PL and 3PL models will be reviewed regarding their potential.

The central components of the IRT model are an unobserved latent trait estimate, termed $\theta$, and a trace line for each item response, often termed the item characteristic curve (ICC). The ICC will take a specified monotonically increasing function. In the 2PL model, the probability of a correct response to Item $i$ as a function of $\theta$ is

$$P_i(\theta) = \frac{\exp[D a_i (\theta - b_i)]}{1 + \exp[D a_i (\theta - b_i)]},$$

where the item parameters $a_i$ and $b_i$ are item discrimination and difficulty, respectively, and $D$ is a constant used in order to convert the logistic scale into an approximate probit scale (Hambleton & Swaminathan, 1985). In the 3PL model, the probability of a correct response is

$$P(u_i = 1 \mid \theta) = c_i + (1 - c_i) \frac{\exp[D a_i (\theta - b_i)]}{1 + \exp[D a_i (\theta - b_i)]},$$

where $c_i$ is the pseudo-chance parameter.

The general IRT procedure for estimating DIF using a 3PL model includes combining both groups and estimating item parameters utilizing either a maximum likelihood or Bayesian procedure; fixing the $c_i$ parameter for all items; after dividing examinees into reference and focal group members, estimating the $a_i$ and $b_i$ parameters; equating parameters from the focal group scale to the reference group scale or vice versa; calculating the DIF index and significance test; and utilizing a purification procedure (Lord, 1980; Park & Lautenschlager, 1990) to further examine and enhance the analysis. Purification procedures, which will be elaborated upon, extract potential DIF items and reestimate ability level without the potential DIF items included. DIF indices and statistical tests based upon latent ability proceed by either analyzing the difference between the item parameters ($a_i$, $b_i$) or analyzing the area between the groups' ICCs.

Lord's chi-square and IRT-LR. Lord's (1980) chi-square and the IRT Likelihood Ratio (IRT-LR) simultaneously test the dual hypothesis $a_F = a_R$ and $b_F = b_R$.
Because pseudo-chance parameter standard errors are not accurately estimated in separate groups (Kim, Cohen, & Kim, 1994), $c_i$ is usually not tested with either procedure. When only item difficulty is tested and the sample is large enough to effectively assume an infinite number of degrees of freedom, the test becomes

$$z = \frac{\hat{b}_F - \hat{b}_R}{\sqrt{\mathrm{var}(\hat{b}_F) + \mathrm{var}(\hat{b}_R)}}.$$

Alternately, $z^2$ will distribute as a chi-square statistic with one degree of freedom (Thissen, Steinberg, & Wainer, 1988). The simultaneous test of the discrimination and difficulty parameters is based upon the Mahalanobis distance between the parameter vectors of the groups. The test statistic becomes

$$\chi^2 = v' \Sigma^{-1} v,$$

in which $v$ is the vector of differences between the parameter estimates $(\hat{a}_F - \hat{a}_R,\ \hat{b}_F - \hat{b}_R)$ and $\Sigma$ is the estimated covariance matrix. The test is distributed as a chi-square with two degrees of freedom.

The same hypothesis tested by Lord's chi-square can be tested with IRT-LR (Thissen, Steinberg, & Wainer, 1993). The null hypothesis with IRT-LR is tested through three steps. First, the IRT model is fitted simultaneously to both groups' data. A set of valid "anchor" items, containing no DIF, is used, and the studied item has no equality constraints placed on its b's or a's. The model is assessed by maximum likelihood statistics and yields a value of -2(log-likelihood). Second, the model is refitted under the constraint that the b and a parameters are equal for both groups, yielding a second value of -2(log-likelihood). Third, the likelihood ratio test of significance is the difference between the values for the two models. The likelihood ratio test assesses the significance of the improvement in model fit as a consequence of allowing the two parameters to fluctuate. If the likelihood ratio is significant, either the b parameter or the a parameter is different for the two groups, and DIF is detected. In this example, simultaneously testing differences in both parameters, the test statistic is distributed as a chi-square with two degrees of freedom. In a situation testing significance of only item difficulty, the statistic would distribute as a chi-square with one degree of freedom. Lord's chi-square depends on second-derivative approximations of the standard errors of the estimated item parameters, obtained as a part of the maximum likelihood estimation. The IRT-LR procedure does not require estimated error variances and covariances.
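The parameter-difference tests described above reduce to a few lines of arithmetic once the estimates are in hand. The sketch below is illustrative only: the function names and all numeric inputs are hypothetical, the covariance matrix of the parameter differences is assumed to be supplied by the caller, and the upper-tail chi-square probabilities use the closed forms that hold for one and two degrees of freedom.

```python
import math

def difficulty_z(b_f, b_r, var_f, var_r):
    """z test for a difference in item difficulty between the groups."""
    return (b_f - b_r) / math.sqrt(var_f + var_r)

def lord_chi2(v, cov):
    """Quadratic form v' Sigma^{-1} v for a two-element difference vector.

    v   = (a_F - a_R, b_F - b_R)
    cov = 2x2 estimated covariance matrix of those differences
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    return (v[0] * (inv[0][0] * v[0] + inv[0][1] * v[1])
            + v[1] * (inv[1][0] * v[0] + inv[1][1] * v[1]))

def likelihood_ratio(neg2ll_constrained, neg2ll_free):
    """IRT-LR statistic: difference in -2(log-likelihood) between the fits."""
    return neg2ll_constrained - neg2ll_free

def chi2_upper_tail(x, df):
    """P(chi-square > x); closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

# Hypothetical estimates for one studied item.
z = difficulty_z(b_f=0.50, b_r=0.00, var_f=0.02, var_r=0.02)
print(z, chi2_upper_tail(z * z, df=1))

chi2 = lord_chi2((0.30, 0.50), ((0.04, 0.01), (0.01, 0.04)))
print(chi2, chi2_upper_tail(chi2, df=2))

g2 = likelihood_ratio(neg2ll_constrained=10234.6, neg2ll_free=10226.1)
print(g2, chi2_upper_tail(g2, df=2))
```

The difficulty-only test is referred to one degree of freedom and the joint tests to two, mirroring the degrees of freedom given in the text.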
It results from computing the likelihood of the overall model under the equality constraints placed upon the data and then estimating the probability under the null hypothesis (Thissen et al., 1988). Lord's chi-square and IRT-LR are capable of detecting nonuniform DIF and possess good statistical power (Cohen & Kim, 1993). Because of the requisite large sample sizes, they tend to be expensive, and they yield false positive rates above the nominal levels (Kim et al., 1994). Linn (1981), with simulated data, and Shepard, Camilli, and Williams (1984), with actual data, demonstrated that significant differences detected by Lord's chi-square occurred even when plotted ICCs were nearly identical. An additional problem when employing IRT-LR is the need for a set of truly unbiased anchor items (Millsap & Everson, 1993).

Procedures estimating area between ICCs. Eight different DIF procedures have been developed to estimate the area between the reference group's and focal group's ICCs. These procedures differ regarding (a) signed or unsigned area, (b) bounded or unbounded interval, (c) continuous integration or discrete approximation, and (d) weighting (Millsap & Everson, 1993). The first area procedures utilized a bounded interval with discrete approximations. Rudner (1977) suggested an unsigned index,

$$A = \sum_j \left| P_R(\theta_j) - P_F(\theta_j) \right| \Delta\theta,$$

with discrete intervals from $\theta_j = -3$ to $\theta_j = 3$. Rudner (1977) used small interval distances (e.g., .005) summed across the interval. The estimate is converted to a signed index by removing the absolute value operator. Shepard et al. (1984) extended area procedures by introducing four signed and unsigned techniques that included sums of squared values, weights based upon the number of examinees in each interval along the $\theta$ scale, and weighting initial differences by the inverse of the estimated standard error of the difference. They determined that distinctively different interpretations occurred when signed area indices were estimated as compared to unsigned indices. They further found that various weighting procedures influenced interpretations only slightly. All the area indices proposed by Shepard et al.
(1984) utilized discrete approximations and lacked sampling standard errors to permit significance tests. Raju (1988, 1990) augmented these procedures by devising an index to measure continuous integration over an unbounded interval and derived standard errors permitting significance tests. Raju (1988) proposed setting the c parameter equal for both groups and estimating the signed area (SA) from the difference between the groups' difficulty parameters and the unsigned area (UA) from a closed-form expression in the discrimination and difficulty parameters of both groups. Raju (1990) derived asymptotic standard error formulas for the signed and unsigned area measures that can be used to generate z tests to determine the significance level of DIF under conditions of normality. Theoretically, Raju's procedure for measuring and testing the significance of the area between the ICCs of two groups is a significant advancement over the earlier procedures utilizing discrete approximations over a bounded interval. Raju et al. (1993), analyzing data from a 45-item vocabulary trial test contrasting girls and boys and black and white students, found that significance tests of the area measures identified the identical aberrant items as Lord's chi-square. Raju et al. (1993) set the alpha rate at 0.001 to control for Type I errors. Cohen and Kim (1993), comparing Lord's chi-square to Raju's SA and UA, found that the two procedures produced similar results, although Lord's chi-square appeared slightly more powerful in identifying simulated DIF.

DIF as a Consequence of Multidimensionality

In all procedures thus far reviewed, researchers have either conditioned an item response on an observed test score or a latent ability estimate. Procedures using observed scores assumed that the total score has valid meaning in terms of the purported construct measured. IRT procedures assumed that responses to a set of items are unidimensional even though examinees' scores may reflect a composite of abilities. The potential for DIF can be conceptualized as occurring when a test consists of a targeted ability, $\theta$, and item responses are influenced by one or more nuisance determinants, $\eta$ (Shealy & Stout, 1993a, 1993b).
Under this circumstance, an item may be misinterpreted when, between the groups, the means of the target ability ($\theta$) are not equal, the means of the nuisance determinant ($\eta$) are not equal, the ratios $\sigma_\eta / \sigma_\theta$ are not equal, or the correlations between the valid and nuisance dimensions are not equal (Ackerman, 1992). The presence of multidimensionality in a set of items does not necessarily lead to DIF. For example, a quantitative achievement test used to predict future college ability may contain mathematical word problems requiring proficiency in reading skills. The test contains one primary dimension, quantitative ability; however, a second requisite skill, reading ability, is validly measured for this specific usage. A unidimensional analysis applied to such multidimensional data would weight the relative discrimination of the multiple traits to form a reference composite (Ackerman, 1992; Camilli, 1992). If the focal and reference groups share a common reference composite, DIF is not possible.

Since any test containing two or more items will to some degree be multidimensional, practitioners should define a validity sector to identify test items measuring approximately the same composite of abilities (Ackerman, 1992). In DIF studies, the conditioning variable should consist only of items measuring the same composite of abilities. When the conditioning variable mixes items that measure different composites of ability, this creates, in essence, the problem of trying to compare apples and oranges. The potential effect of this is to confound DIF with impact, resulting in spurious interpretations (Camilli, 1992). The effect of multidimensionality on DIF analyses has resulted in limited consistency across methods (Skaggs & Lissitz, 1992) and across differing definitions of the conditioning variable (Clauser, Mazor, & Hambleton, 1991). Further, Linn (1993) observed that rigorous implementation to identify a proper set of test items may restrict validity.
For example, on the SAT-Verbal (SAT-V), items with large biserial correlations with total score were more likely to be flagged than items with average or below-average biserial correlations. This finding suggested that, using traditional unidimensional DIF analyses, DIF might be, in part, a statistical artifact confounding group ability differences and item discrimination. Differential item functioning procedures based upon a multidimensional perspective and conditioning on items clearly defined from a validity sector have the potential to reduce these problems (Ackerman, 1992). Further, a multidimensional approach should also facilitate careful evaluation and explanation of DIF (Camilli, 1992).

SIBTEST. Shealy and Stout (1993a, 1993b) have formulated a DIF detection procedure within a multidimensional conceptualization. They conceptualize a test as measuring a unidimensional trait, the target ability or reference composite, that is influenced periodically by nuisance determinants. DIF is interpreted as the consequence of the differential effect of nuisance determinants functioning on an item or set of items. The SIBTEST procedure employs factor analysis to identify a set of items that adheres to a defined validity sector. These items constitute the valid subtest, and the remaining items become the studied items. Examinees are divided into strata based upon the valid subtest score, and the DIF index, $\hat{\beta}$, is estimated as a weighted sum across strata of the differences between the groups' mean scores on the studied item, where the weight at each stratum is the pooled weighting of focal and reference group examinees who achieve that valid subtest score. The value of $\hat{\beta}$ is identical to the value of STD P-DIF when the total number of examinees is the weighting group. Shealy and Stout (1993a) have referred to the standardization procedure as a "progenitor" (p. 161) of SIBTEST. They present a standard error, $SE(\hat{\beta})$, for the index.

With SIBTEST, the total score on the valid subtest serves as the conditioning criterion. The SIBTEST procedure resembles methods in which an observed test score is the criterion, although it incorporates an adjustment of item means prior to comparing groups on these means.
This adjustment is an attempt to remove that portion of the group mean difference attributable to group mean differences in the valid targeted ability. When the matching criterion is an observed score and the studied item is not included in the criterion score, group differences in target ability will tend to statistically inflate DIF. Consequently, SIBTEST employs a correctional procedure based upon regression and true score theory. In effect, the purpose is to transform each observed group mean score at an ability level into a transformed mean score, so that the transformed score is a valid estimate of the mean at that ability level. This adjustment attempts to remove that portion of group mean differences attributable to group differences in the underlying targeted ability, yielding an estimate of the difference in subtest true scores for the reference and focal groups with examinees matched on ability levels. For this transformation to yield an unbiased estimate, the valid subtest must contain a minimum of 20 items (Shealy & Stout, 1993a).

SIBTEST is the only procedure based on conceptualizing DIF as a result of multidimensionality. Although it resembles the procedures that condition on observed scores, it offers a regression correction procedure that allows conditioning on estimated true scores. Under simulated conditions, it demonstrates good adherence to nominal error rates even when group target ability distribution differences are extreme, and it has been shown to be as powerful as MH in the detection of uniform DIF (Shealy & Stout, 1993a). A multidimensional conceptualization can potentially lead to the identification of different nuisance determinants and a greater understanding of DIF causes (Camilli, 1992). The major weaknesses of SIBTEST are its inability to assist the user in detecting nonuniform DIF and the need for 20 or more items to fit a unidimensional validity sector. With a relatively short test or subtest, this latter weakness would be problematic under some practical testing conditions.

Methods Summary

After years of development, a plethora of sophisticated DIF procedures have been devised.
Each method approaches DIF identification from a fundamentally different perspective, and each method contains advantages and limitations. Currently, no consensus among DIF researchers exists regarding a single theoretical or practical best method. The design of this study reflected this lack of consensus. Five different procedures, each possessing theoretical or practical appeal, were selected to assess the item responses of examinees. The design of the study was not to compare the reliability and validity of the methods themselves, but to assess the similarity of results obtained from the methods when subpopulations were defined in conceptually different ways.

Uncovering the Underlying Causes of DIF

The overwhelming majority of DIF researchers have focused on designing statistical procedures and evaluating their efficacy in detecting aberrant items. Few researchers have attempted to move beyond methodological issues to examine DIF's causes. The researchers broaching this topic have experienced few successes and many frustrations. Schmitt et al. (1993) proposed that explanatory DIF studies can be classified as (a) post hoc speculation, (b) hypothesis testing of item categories, and (c) hypothesis testing using item manipulations and manipulation of other variables.

DIF can be attributed to a complex interaction between the item and the examinee (Scheuneman & Gerritz, 1990). Researchers are unlikely to find a single identifiable cause of DIF since it stems from both differences within examinees and item characteristics (Scheuneman, 1987). Researchers examining DIF from the perspective of examinee differences may uncover significant findings with implications for test takers, educators, and policy makers. Scheuneman and Gerritz (1990) suggested that "prior learning, experience, and interest patterns between males and females and between Black and White examinees may be linked with DIF" (p. 129). Researchers examining DIF from the perspective of item characteristics may discover findings with strong implications for test developers and item writers. Test developers may need to balance content and item format to ensure fairness.
Post hoc evaluations, despite their limitations, dominate the literature (Freedle & Kostin, 1990; Linn & Harnisch, 1981; O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). Speculation regarding causes appears in several of these studies (O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992).

Hypothesis testing of item categories is a second, more sophisticated, means of uncovering explanations of DIF. Doolittle and Cleary (1987) and Harris and Carlton (1993) evaluated several DIF hypotheses on math test items. Doolittle and Cleary (1987) employed ACT Assessment Mathematics Usage Test (ACT-M) items and a pseudo-IRT detection procedure to analyze differences across item categories and test forms. Male examinees performed better on geometry and mathematical reasoning items, whereas female examinees performed better on computation items. Harris and Carlton (1993), using SAT-Mathematics (SAT-M) items and the MH procedure, concluded that male examinees did better on application problems and female examinees did better on more textbook-type problems.

Scheuneman (1987) analyzed separate hypotheses concerning potential causes of DIF for black and white examinees by manipulating test items on the experimental portion of the general test. The hypotheses, analyzed through linear models, included examinee characteristics, such as test wiseness, and item characteristics, such as format. Complex interactions were found. Schmitt (1988) employed the STD P-DIF index with ANOVA and found that Hispanic examinees were favored on antonym items that included a true cognate, a word with a common root in English and Spanish, and on reading passages containing material of interest to Hispanics. False cognates, words spelled similarly in both languages but containing different meanings, and homographs, words spelled alike in English but containing different meanings, tended to be more difficult for Hispanics. The differences were greater for Puerto Rican examinees, a group generally more dependent on Spanish, as compared to Mexican American examinees.

Tatsuoka, Linn, Tatsuoka, and Yamamoto (1988) studied DIF on a 40-item fractions test. They initially analyzed examinees by dividing them into groups based upon instructional methods. This procedure failed to provide an effective means of detecting DIF. However, upon subsequent review and analysis, they divided examinees into groups based upon the solution strategies they used in solving problems. With this grouping variable, they found DIF indices consistent with their a priori hypotheses. They concluded in favor of using cognitive and instructional subgroup categories, although such categories run counter to traditional DIF research.

Miller and Linn (1988) considered the invariance of item parameters on the Second International Mathematics Study (SIMS) examination across different levels of mathematical instructional coverage. Although their principal concern was the multidimensionality of achievement test data as related to instructional differences and IRT model usefulness, they found that instructional differences could explain a significant portion of observed DIF. Using cluster analysis, they divided students into three instructional groups based upon teacher responses to an opportunity-to-learn questionnaire. The size of the differences in the ICCs for groups based upon instruction was much greater than differences observed in previously reported comparisons of black and white examinees. They interpreted these findings as supportive of Linn and Harnisch's (1981) postulation that what appears to be item bias may in reality be "'instructional bias'" (p. 216).

Despite Miller and Linn's (1988) straightforward interpretation of instructional experiences, Doolittle (1984, 1985) found that instructional differences did not account for or parallel gender DIF on ACT-M items. He dichotomized high school math background into strong and weak. Items that tended to favor female examinees did not favor low background examinees, or vice versa. Correlations of DIF indices were negative, suggesting that gender DIF was unrelated to math background DIF.
Muthen, Kao, and Burstein (1991), analyzing core items of the SIMS test, found several items to be sensitive to instructional effects. In approaching DIF from an alternative methodological perspective, they employed linear structural modeling to assess the effects of instruction on latent mathematics ability and item performance. They found that instructional effects had negligible effects on math ability, but had significant influence on specific test items. Several items appeared particularly sensitive to instructional influences. They interpreted the identified items as less an indicator of general mathematics ability and more an indicator of exposure to a specified math content area. In using linear structural modeling, Muthen et al. (1991) avoided arbitrariness in defining group categories in a situation where group membership varied across items. The SIMS data permitted estimation of instructional background for each of the core items. Under most testing conditions, such item-level estimates of examinee background are not available, although nuisance dimensions can be estimated. Analyzing the relationship of theoretical causes to nuisance dimensions combines the approach of Muthen et al. (1991) with that of Shealy and Stout (1993a, 1993b).

Summary

Researchers investigating the underlying causes of DIF have produced few significant results. After many years of DIF studies, conclusions regarding test wiseness (Scheuneman, 1987) or Hispanic tendencies on true cognates (Schmitt, 1988) must be interpreted as meager guidance for test developers and educators. These limited results can be explained by problems inherent in traditional DIF procedures (Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Indices derived using observed total scores as the conditioning variable have been observed to be confounded with item difficulty (Freedle & Kostin, 1990) and item discrimination (Linn, 1993; Masters, 1988).
Indices derived from IRT models are conceptualized from a unidimensional perspective, yet DIF is a product of multidimensionality (Ackerman, 1992; Camilli, 1992). Consequently, DIF detection procedures have been criticized for a lack of reliability between methods and across samples (Hoover & Kolen, 1984; Skaggs & Lissitz, 1992).

The uninterpretability of findings may be because group membership is only a weak surrogate for variables of greater psychological or educational significance. For example, demographic categories (e.g., women or blacks) lack any psychological or educational explanatory meaning. Moving beyond demographic subgroups to more meaningful categories would expedite understanding of DIF causes (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Although this conceptualization has been advocated, it has been used sparingly. Doolittle (1984, 1985), Miller and R. L. Linn (1988), Muthen et al. (1991), and K. K. Tatsuoka et al. (1988) used this conception and appeared to have reached promising, if incompatible, interpretations. Future researchers need to apply alternative approaches to DIF analyses to achieve explanatory power. Approaches advocated by Muthen et al. (1991) and Shealy and Stout (1993a, 1993b) provide sound methods that potentially permit modeling of differing influences on item responses.

Gender and Quantitative Aptitude

Educational and psychological researchers have been concerned with gender differences in scores on quantitative aptitude tests (Benbow, 1988; Benbow & Stanley, 1980). Mathematics has been characterized as a "filter" that prohibits many women from having access to high-paying and prestigious occupations (Sells, 1978). Although gender differences in quantitative ability interact with development, with elementary children demonstrating no difference or a difference slightly favoring girls, by late adolescence and early adulthood, when college entrance examinations are taken and critical career decisions are made, slight differences appear favoring boys (Fennema & Sherman, 1977; Hyde, Fennema, & Lamon, 1990). In studies linking gender differences in quantitative test scores with underrepresentation of women in prestigious technical careers, analyses should be limited to tests taken in late adolescence or early adulthood that significantly influence career decisions and opportunities.

Significant and Important Test Score Differences

Standardized achievement tests utilizing representative samples (e.g., National Assessment of Educational Progress, High School and Beyond) and college admissions tests utilizing self-selected samples (e.g., SAT, ACT, GRE) have been analyzed to ascertain gender differences. Gender differences found in representative samples are systematically different from those found in self-selected samples (Feingold, 1992). Women appear less proficient in the self-selected samples, and students matriculate through a process that relies heavily upon admissions test scores. Therefore, in studying quantitative differences with primary concern related to career decisions and opportunities, self-selected admissions test scores are the most germane measures for analysis.

M. C. Linn and Hyde (1989) concluded from meta-analytic studies (Friedman, 1989; Hyde et al., 1990) that "average quantitative gender differences have declined to essentially zero" (p. 19), and that differences in quantitative aptitude can no longer be used to justify underrepresentation of women in technical professions. Feingold (1988) assessed gender differences on several cognitive measures on the Differential Aptitude Test (DAT) and the SAT and concluded that gender differences are rapidly diminishing in all areas. The one exception to this finding was the SAT-M (Feingold, 1988). Although mean differences had either substantially diminished or vanished on DAT measures of numerical ability, abstract reasoning, space relations, and mechanical reasoning during the past years, SAT-M differences have remained relatively constant.
Despite the finding that gender differences are disappearing on many mathematical ability tests, on the major college entrance examinations gender differences [...] higher on the SAT-M than women (National Center for Education Statistics, 1993). This difference can also be stated in units of an effect size of 0.39 (which represents the difference between the means divided by the pooled standard deviation). The trends regarding gender differences on the ACT-M are similar. The ACT-M scale ranges from [...] to 39 points; the mean difference favoring male examinees from 1978 to 1987 was 2.33 points (d = 0.33) (National Center for Education Statistics, 1993). This score differential has been relatively consistent and provides no indication of disappearing.

The greatest disparity between men's and women's mean scores occurs on the GRE-Quantitative (GRE-Q). For the 1986 and 1987 testing years, U.S. male examinees averaged [...] and [...] points higher than U.S. female examinees (Educational Testing Service, 1991). Transformed into effect sizes, these differences are d = [...] and 0.62, respectively.

Gender mean score differences on the GRE-Q, in large part, reflect gender differences in choice of major field. Particularly in the case of graduate admissions tests, mean scores are confounded with gender differences in choice of undergraduate major. Analyzing GRE-Q data [...] differences favoring men were [...] points, respectively (d = [...] and .19). For examinees intending to major in the humanities and education in the same testing year, mean score differences favoring men were [...] and 37 points, respectively (d = [...] and .31). Averaging across the [...] identified fields of intended study, mean score differences favoring men were [...] points (d = .35) (Educational Testing Service, 1991). Although data were available only for the 1986-87 testing years, mean score differences and effect sizes appear to indicate that U.S. male examinees tend to score higher than U.S. female examinees on the GRE-Q, a pattern consistent with the SAT-M and ACT-M.
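The effect sizes cited above are standardized mean differences: the difference between two group means divided by the pooled standard deviation. As a minimal sketch of that computation, the Python function below uses hypothetical means, standard deviations, and sample sizes; none of the input values are taken from the studies cited, and the function names are illustrative only.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups (sample SDs, sample sizes)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def effect_size(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference d = (mean1 - mean2) / pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

# Hypothetical example: two groups of 100 with equal SDs of 5 points
# and means that differ by 2 points.
d = effect_size(21.0, 5.0, 100, 19.0, 5.0, 100)
print(round(d, 2))  # 0.4
```

Under this definition, a fixed raw-score difference corresponds to a smaller d on a test with a larger score spread, which is why the text reports both raw differences and effect sizes.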
Despite changes in curriculum and text materials that depict both genders in less stereotypic manners (Sherman, 1983) and reductions in gender differences on many mathematics tests (Feingold, 1988), on college admissions quantitative tests gender differences are significant and appear not to be diminishing. Due to the importance of these tests regarding college admission decisions and the awarding of financial aid, disparity in scores tends to reduce opportunities for women (Rosser, 1989).

Predictive Validity Evidence

Although mean scores on quantitative admissions [...] scores [...] is evidence that admissions tests are biased against women (Rosser, 1989). Defenders of the use of college admission tests argued that other relevant factors explain the phenomenon (McCornack & McLeod, 1988; Pallas & Alexander, 1983). They postulated that women tend to enroll in major fields where faculty tend to grade less rigorously: women are more likely to major in the humanities, whereas men are more likely to major in the sciences. Investigators analyzing the differential predictive validity of college admissions exams, therefore, must consider gender differences in course enrollment patterns.

McCornack and McLeod (1988) and Elliot and Strenta (1988) generally found that, when differential course taking patterns were considered, SAT-V and SAT-M coupled with high school grades were not biased in predicting the achievement of men and women. McCornack and McLeod (1988) considered performance in introductory level college courses at a state university and used SAT composites with high school grade point average. They found no predictive bias when analyzing data at the course level. Elliot and Strenta (1988) considered performance in various college-level courses at a private university, utilizing SAT composites with scores from a college placement examination and high school rank. They also found no bias. These studies were flawed in that they combined various predictors. Had they separately studied SAT-M and high school grades, they might have arrived at a different interpretation.
Bridgeman and Wendler (1991) and Wainer and Steinberg (1992) conducted more extensive studies and concluded that, in equivalent mathematics courses, the SAT-M tends to underpredict the college performance of women. Bridgeman and Wendler (1991) studied the SAT-M as a predictor of college mathematics course performance at nine colleges and universities. They divided mathematics courses into three categories and found that, in algebra and precalculus courses, women's achievement was underpredicted and, in calculus courses, no underprediction occurred. The most extensive study to date concerning the predictive validity of the SAT was conducted by Wainer and Steinberg (1992). Analyzing nearly 47,000 students at 51 colleges and universities, they concluded that, for students in the same relative course receiving the same letter grade, the SAT-M underpredicted women's achievement. Using a backward regression model, they estimated that women earning the same grades in similar courses tended to score roughly 25-30 points less on the SAT-M.

Men tend to score higher than women on quantitative admission exams, although women generally outperform men in high school and college courses. The principal explanation offered for this paradox is gender differences in course taking. Researchers investigating the relationship of quantitative admission tests and subsequent achievement, controlling for course taking patterns and course performance, have concluded that, in equivalent mathematics courses, the tests underpredict women's achievement. Although the underprediction is not as large as the mean score differences, quantitative admission tests appear to be biased in underpredicting women's college achievement.
It is recognized that predictive bias and DIF are fundamentally distinct; however, the determination of predictive bias in quantitative admission tests makes them an evocative instrument for DIF analysis.

Potential Explanations of DIF

This study will approach DIF from the perspective of examinee characteristics. When analyzing DIF explanations from this perspective, theoretical explanations of predictive bias offer a reasonable point of departure. Kimball (1989) presented three theoretical explanations for the paradoxical relationship of gender differences in admissions test scores and college grades: men have a stronger mathematics background, men and women have different learning styles, and men tend to prefer novel tasks whereas women tend to prefer familiar tasks. To these three theoretical explanations, I would submit a fourth explanation related to test taking behavior: differences between men and women in test anxiety.

Differences in Mathematics Background

It is well documented that as students enter high school and proceed toward graduation, boys tend to take more mathematics courses than girls (Fennema & Sherman, 1977; Pallas & Alexander, 1983). During the 1980s, high school boys averaged 2.92 Carnegie units of mathematics whereas high school girls averaged [...] Carnegie units (National Center for Education Statistics, 1993). Although high school girls entered the upper-track ninth grade mathematics curriculum in slightly greater numbers than boys, by graduation boys outnumbered girls in advanced courses such as calculus and trigonometry. High school boys were more likely to study computer science and physics than girls (National Center for Education Statistics, 1993). These trends continue as students enter college. During the 1980s, men slightly outnumbered women in achieving undergraduate mathematics degrees, and overwhelmingly outnumbered women in attaining undergraduate degrees in [...] science and physics (National Center for Education Statistics, 1993).
Researchers investigating the relationship between mathematics background and test scores have found that, when enrollment differences are controlled, gender differences on mathematical reasoning tests are reduced (Fennema & Sherman, 1977; Pallas & Alexander, 1983; Ethington & Wolfle, 1984). Gender score differences on the SAT-M, when high school course taking was controlled, were reduced by approximately two-thirds (Pallas & Alexander, 1983) and one-third (Ethington & Wolfle, 1984). These studies analyzed total score differences controlling for course background. Miller and Linn (1988) and Doolittle (1984, 1985) analyzed item differences controlling for instructional differences, but their results were contradictory. Background differences offer a plausible explanation that calls for additional investigation.

Rote Versus Autonomous Learning Styles

Boys tend to develop a more autonomous learning style, which facilitates performance on mathematics reasoning problems, and girls tend to develop a rote learning style, which facilitates classroom performance (Fennema & Petersen, [...]). Students displaying independent learning behavior [...] better, are more motivated, and are more likely to persevere on difficult tasks presented in a novel format. Students displaying rote learning behavior tend to do well applying memorized algorithms learned in class and are heavily dependent upon teacher direction. Often, these students tend to choose less challenging tasks when given an option. This dichotomy is congruent with the finding that girls tend to perform better on computational problems and boys tend to perform better on application and reasoning problems (Doolittle & Cleary, 1988; Harris & Carlton, 1992).

The autonomous versus rote learning style theory is consistent with the literature addressing gender socialization patterns and standardized test performances. Before it can be further applied, however, it must be more completely operationalized (Kimball, 1989).
To validate this theory, researchers must demonstrate that boys and girls approach the study of mathematics differently, and then relate learning styles to achievement on classroom assessments and standardized tests (Kimball, 1989).

Novelty Versus Familiarity

Kimball (1989) hypothesized that girls tend to be more motivated to do well and are more confident when working on familiar classroom assessments, and that boys tend to demonstrate higher achievement on novel standardized tests. This theory is based on the work of Dweck and her colleagues (Dweck, 1986; Elliot & Dweck, 1987; Licht & Dweck, 1983), who related attributions to learning and achievement. Students with a performance orientation and low confidence tend to avoid difficult and threatening tasks. They prefer familiar, nonthreatening tasks and seek to avoid failure. Students with a performance orientation and high confidence are more likely to select moderately challenging tasks. Consistent findings demonstrate that girls tend to have less confidence in their mathematical abilities than boys (Eccles, Adler, & Meece, 1984; Licht & Dweck, 1983). Girls are also more likely on standardized tests to leave items unanswered or mark "I don't know" when given this option (Linn, DeBenedictis, Delucchi, Harris, & Stage, 1987). Girls, more than boys, attribute their success in mathematics to effort rather than ability and their failures to lack of ability (Fennema, 1985; Ryckman & Peckham, 1987). Therefore, due to less confidence in their abilities, girls generally are less motivated on novel mathematical tasks, find them more threatening, and perform less well.

[...] achievement tests. High test anxiety individuals tend to score lower than low test anxiety individuals of comparable ability (Hembree, 1988; Sarason, 1980).
Because aptitude and achievement tests are not intended to include test anxiety as a component of total score, and because an estimated [...] million elementary and secondary pupils have substantial test anxiety (Hill & Wigfield, 1984), it exemplifies a nuisance factor influencing item responses.

Test anxiety has been theorized in both cognitive and behavioral terms (Hembree, 1988; Sarason, 1984; Spielberger, Gonzales, Taylor, Algaze, & Anton, 1978; Wine, 1980). Liebert and Morris (1967) proposed a two-dimensional theory of test anxiety, consisting of worry and emotionality. Worry includes expressions of concern about one's performance and the consequences stemming from inadequate performance. Emotionality refers to the autonomic reactions to test situations (e.g., increased heart rate, stomach pains, and perspiration). Hembree (1988) used meta-analysis of [...] test anxiety studies and found that, although both dimensions related significantly to performance, worry was more strongly correlated with test scores. The mean correlations of worry and emotionality with aptitude/achievement tests were [...] and 0.15.

Wine (1980) proposed a cognitive-attentional interpretation of test anxiety in which examinees who are high or low on test anxiety experience different thoughts when confronted by test situations. The low test anxious individual experiences relevant thoughts and attends to the task. The high test anxious individual experiences self-preoccupation and is absorbed in thoughts of failure. These task-irrelevant cognitions not only create unpleasant experiences, but act as major distractions.

Sarason (1984) proposed the Reactions to Tests (RTT) scale based upon a cognitive, emotional, and behavioral model. The 40-item Likert-scaled questionnaire operationalized a four-dimensional test anxiety model of worry, tension, bodily symptoms, and test-irrelevant thinking. Benson and Bandalos (199[...]), in a confirmatory cross validation, found the four-factor structure of the RTT problematic. They speculated that misfit resulted from the large number of similarly worded items. Through a process of item deletion, they found substantial support for a 20-item four-factor model. To further validate the structure of test anxiety, Benson, Moulin-Julian, Schwarzer, Seipp, and El-Zahhar (1991) combined the TAI and RTT to formulate a new scale, the Revised Test Anxiety scale (RTA).

The cognitive and emotional structure of math anxiety is closely related to test anxiety. Richardson and Woolfolk (1980) demonstrated that math anxiety and test anxiety were highly related, and that mathematics testing provided a superb context for studying test anxiety. They reported correlations between inventories of test anxiety and math anxiety ranging near 0.65. They commented that taking a mathematics test with a time limit, under instructions to do as well as possible, appears to be nearly as threatening as a real life test to most mathematics-anxious individuals (p. 271).

Children in first and second grade indicate inconsequential test anxiety levels, but in third grade test anxiety emerges and increases in severity until sixth grade. Female students tend to possess higher test anxiety levels than male students at all grade levels (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988). Some behavioral and cognitive-behavioral treatments have been demonstrated to effectively reduce test anxiety and lead to increases in performance (Hembree, 1988). This finding supports the causal direction of test anxiety producing lower performance and test anxiety's multidimensional structure. In cases of model misfit, forces such as test anxiety might unduly influence performance at the item level. High test anxiety individuals may find some items differentially more difficult than other test items.

Summary

I have reviewed several different methods of identifying DIF. In large part because of its computational efficiency, MH has emerged as the most widely used method. It is limited in terms of flexibility, and as researchers continue to search for underlying explanations of DIF, its limitations will become more apparent. Logistic regression models (Swaminathan & Rogers, 1990) provide an efficient method that has greater flexibility than MH and potentially models theoretical causes of DIF. Raju's (1988) signed and unsigned area measures supply a theoretically sound method of contrasting item response patterns. Shealy and Stout's SIBTEST (1993a, 1993b) conceptualizes DIF as a multidimensional phenomenon and defines a validity sector as the conditioning variable. Its sound theoretical foundation, coupled with computational efficiency and explanatory potential, makes it perhaps the most comprehensive DIF procedure. These five approaches were employed in the study. Linear structural [...] findings [...] the validation study. Thus, the significance of validation for the consistency of DIF estimation was considered.

Gender DIF on quantitative test items will serve as the context of this study. This context was taken because of the paradoxical finding that men tend to score higher on standardized tests of math reasoning, although women tend to achieve equivalent or higher course grades. Gender, a common categorical variable in DIF studies, will be supplemented by dichotomizing examinees into substantial and weak mathematics background and high and low test anxiety. This study is based on the premise that gender differences serve as a surrogate for differences in background and test anxiety. The two variables were selected in an effort to explain DIF in terms consistent with theoretical explanations of gender differences in mathematics test scores and course achievement. Mathematics background has been applied in other DIF studies with inconsistent interpretations. Test anxiety is of interest to both educators and cognitive psychologists and is highly related to performance.
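The signed and unsigned area measures mentioned in the summary above contrast the item response curves of two groups by the area between them. The sketch below is only a numerical illustration of that idea, not Raju's closed-form expressions: it assumes a two-parameter logistic item response function, a scaling constant D = 1.7, a finite integration range, and a reference-minus-focal sign convention, all of which are assumptions made for this example, as are the item parameters.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def area_measures(a_ref, b_ref, a_foc, b_foc, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule approximation of the signed and unsigned areas between
    the reference-group and focal-group curves over [lo, hi]."""
    h = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        diff = icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc)
        signed += diff * h
        unsigned += abs(diff) * h
    return signed, unsigned

# Hypothetical uniform DIF: equal discriminations, focal group finds the item harder.
sa_uniform, ua_uniform = area_measures(1.0, 0.0, 1.0, 0.5)

# Hypothetical nonuniform DIF: equal difficulties, unequal discriminations (curves cross).
sa_cross, ua_cross = area_measures(1.0, 0.0, 0.5, 0.0)
```

In the uniform case the two measures agree (both are close to the difference in difficulty parameters), whereas in the crossing case the signed area cancels toward zero while the unsigned area does not. This is the sense in which an unsigned measure can register nonuniform DIF that a signed measure misses.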
This study is an attempt to determine the consistency of DIF indices and detection methods, and whether the use of these variables serves to improve [...].

CHAPTER METHODOLOGY

The present study was designed to investigate the intermethod consistency of five separate differential item functioning (DIF) indices and associated statistical tests when defining subpopulations by educationally significant variables as well as by the commonly used demographic variable of gender. The study was conducted in the context of college admission quantitative examinations and gender issues. The study was designed to evaluate the effect on DIF indices of defining subpopulations by gender, mathematics background, and test anxiety. Factor analytic procedures were used to define a structurally valid subset of items. Following the identification of a valid subtest, the DIF analysis was repeated. The findings of the DIF analysis before validation were contrasted with the DIF analysis based on the valid subset. A description of the examinees, instruments, and data analysis methods is presented in this chapter.

Examinees

The data pool to be analyzed consisted of test scores and item responses from 1263 undergraduate college students. The sample consisted of [...] women and [...] men. I solicited the help of various instructors in the colleges of education and business, and in most cases students participated in the study during their class time. Of the total sample of examinees, [...] individuals were tested in classes in the college of education, [...] individuals were tested in classes in the college of business, and [...] individuals were tested at other sites on campus. Women examinees with little mathematics background were the largest group in the college of education classes, and men examinees with substantial mathematics background were the largest group in the college of business classes (see Table [...] of Appendix [...] for examinee frequencies by test setting, gender, and mathematics background). The majority of students received class credit for participating. No remuneration was provided to any participant.
All students had previously taken a college admission examination, and some of the students (approximately [...] percent) had taken the Graduate Record Examination-Quantitative Test (GRE-Q).

Instruments

The operational definition of a collegiate-level quantitative aptitude test was a released form of the GRE-Q. Test anxiety was operationally defined by a widely used, standardized measure, the Revised Test Anxiety Scale (RTA). The mathematics background variable was measured using the dichotomous response to an item concerning whether or not the student had completed a particular advanced mathematics class at the college level (i.e., calculus). In the following sections, a more detailed description of each of these instruments is presented, accompanied by technical information that supports the use of the particular instruments for the purpose of the study.

Released GRE-Q

Each examinee completed a released form of the GRE-Q. The 30-item test, supplied by Educational Testing Service, was a 30-minute timed examination. The sample test contained "many kinds of questions that are included in currently used forms" (ETS, 1993, p. [...]) of the GRE-Q. The test was designed to measure basic mathematical skills and concepts required to solve problems in quantitative settings. It was divided into two sections. The first section assessed the ability to reason accurately in comparing the relative size of two quantities or to recognize when insufficient information had been provided to make such a comparison. The format of the second section, employing multiple choice items, assessed the ability to perform computations and manipulations of quantitative symbols and to solve word problems in applied or abstract contexts. The instructional background required to answer the items was described as "arithmetic, algebra, geometry, and data analysis" and "content areas usually studied in high school" (ETS, 1993, p. 18).
The internal consistency of the test for the 1263 participants was relatively good, KR-20 = 0[...]. In a pilot study, sample test correlations with the GRE-Q for [...] examinees and with the Scholastic Aptitude Test-Mathematics (SAT-M) for [...] examinees were 0.67 and .79, respectively; thus, the scores on the released GRE-Q were similar to scores examinees earned on other college admission quantitative examinations.

Revised Test Anxiety Scale (RTA)

The RTA scale (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991) was formed by combining the theoretical frameworks of two recognized measures of test anxiety: the Test Anxiety Inventory (TAI) and the Reactions to Tests (RTT) (Sarason, 1984). The TAI, based upon a two-factor theoretical conception of test anxiety, worry and emotionality (Liebert & Morris, 1967), contained [...] items. Sarason (1984) augmented this conceptualization with a four-factor model of test anxiety: worry, tension, bodily symptoms, and test-irrelevant thinking. To capture the best qualities of both scales, Benson et al. (1991) combined the instruments to form the RTA scale. They intended that the combined scale would capture Sarason's proposed four factors. From the original combined [...] items, using a sample of more than [...] college students from three countries, they eliminated items on the basis of items not loading on a single factor, having a low item/factor correlation, or having low reliability. They retained [...] items, each loading on the intended factor and containing high item reliability. The bodily symptoms subscale, containing only [...] items, was problematic due to low internal reliability.
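The internal consistency coefficient reported above for the released GRE-Q form is the Kuder-Richardson formula 20 (KR-20), which applies to dichotomously scored items. As a brief illustration of how the coefficient is obtained, the sketch below computes KR-20 from a small hypothetical matrix of 0/1 item scores; the data and the function name are invented for the example, and population variances (dividing by the number of examinees) are used throughout as one common convention.

```python
def kr20(responses):
    """KR-20 for a list of examinee response vectors scored 0/1.

    KR-20 = (k / (k - 1)) * (1 - sum(p_j * q_j) / variance of total scores),
    where k is the number of items and p_j is the proportion correct on item j.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)

# Hypothetical scores for four examinees on three items.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(toy_scores))  # 0.75
```

The function assumes the total scores are not all identical, since a zero total-score variance leaves the coefficient undefined.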
Consequently, Benson and El-Zahhar (1994) further refined the RTA scale and developed a 20-item scale with four factors and relatively high internal reliability (see Table [...]). With a sample of 562 college students from two countries, randomly split into [...] correlations, and item uniquenesses [...]. Descriptive statistics for each subscale of the RTA are reported in Table [...] for the Benson and El-Zahhar (1994) American sample and this study's sample. The instrument was selected because evidence of its reliability and construct validity compared favorably with that of other leading test anxiety scales used with college students.

Table [...]
Descriptive Statistics for the 20-Item RTA Scale

Scale                       Benson & El-Zahhar American Sample (N = 202); Study Sample (N = 1263)
Total Scale                 38.31   10.40   39.17   9.37
Worry                       11.61   12.03   3.50
Tension                     3.85
Test Irrelevant Thinking    6.79
Bodily Symptoms             7.54    2.79    7.35

Note. First entry in each column is the mean, second entry [...]. Number of items per subscale in parentheses [...].

Mathematics Background

Researchers have experienced problems selecting the best approach to measure subjects' mathematics background (Doolittle, 1984). Typically, methods of classifying subjects' background include asking subjects to report the number of mathematics credits earned or semesters studied (Doolittle, 1984, 1985; Hackett & Betz, 1989; Pajares & Miller, 1994) or asking subjects a series of questions related to specific courses studied (Chipman, Marshall, & Scott, 1991). Asking subjects questions concerning their course background implies that one or two "watershed" mathematics courses qualitatively capture subjects' instructional background. To decide which of these two options to employ in this study, I conducted a pilot study to ascertain whether measuring examinees' mathematics background quantitatively, by counting mathematics credits earned, or qualitatively, by identifying a watershed mathematics course, was more useful. In a pilot study, [...] undergraduates were asked to answer the five questions posed by Chipman et al.
(1991) and to report the number of college credits earned in mathematics (see Appendix [...] for the questions and the scoring scheme used with Chipman et al., 1991). Subjects were [...]. The subjects were then divided using their responses to the single question about successful completion of a college calculus course. The two methods of dividing subjects into two background groups had an 84% agreement rate; however, correlations of these two predictors with performance on the GRE-Q and SAT-M indicated that the dichotomous calculus completion question was more valid for students in this study. The pattern of relationships between these tests, the calculus question, and the number of mathematics credits earned indicated that for these college students calculus completion had a stronger relationship with the test scores (r = .51) than the number of mathematics credits earned (r = .40) (see Table [...]).

Table [...]
Correlations of Calculus Completion, SAT-M, GRE-Q, and College Mathematics Credits

                       SAT-M      GRE-Q      Credits
Calculus Completion    .51(58)    .50(55)    .49(141)
Total Credits          .08(58)    .40(55)

In a continuation of the pilot study, [...] examinees reported they had successfully taken a college calculus course, and [...] examinees reported they had not successfully taken a college calculus course. The examinees reporting successful completion of a college calculus course had earned an average of 13.3 college mathematics credits. The students reporting they had not successfully completed a college calculus course had earned an average of 5.7 college mathematics credits. Therefore, in this sample there was substantial evidence that calculus courses serve as a watershed to other more advanced mathematics courses, and that completion of a calculus course could be used to differentiate students in terms of mathematics background.

Subsequently, mathematics background was operationalized by having each examinee answer the following question: "Have you successfully completed a college-level calculus course?" Examinees responding yes were classified as having a substantial background, and examinees responding no were classified as having little background. Utilizing examinee responses to the question of calculus completion was justified because of the high degree of agreement between calculus completion and students' college course backgrounds, the higher correlation of calculus completion with students' SAT-M and [...] sample [...] mathematics background in applying DIF procedures.

Analysis

Testing Procedures and Subpopulation Definitions

Prior to taking the released GRE-Q, examinees answered the Differential Item Functioning Questionnaire (see Appendix [...]). It contained demographic questions and the RTA scale. Examinees provided information regarding their gender, mathematics background, and test anxiety.

Examinees were classified as having substantial or little mathematics background by answering the question concerning completion of a college calculus course. Of the 1263 participants, [...] reported that they had completed a college calculus course and [...] reported that they had not completed a college calculus course. Frequency counts and percentages by mathematics background and gender are presented in Table [...]. Men and women did not possess similar mathematics backgrounds. In the sample, [...] of the men reported completing a college calculus class, whereas [...] of the women reported completing a college calculus class.

Table [...]
Frequencies and Percentages of Gender and Mathematics Background

           Mathematics Background (Substantial, Little); Total
Women      [...]
  Pct.     [...]
Men        [...]
  Pct.     14.6   40.3
Total      [...]
  Pct.     50.4

High test anxious groups were formed in the following manner. Examinees scoring in approximately [...]. Examinees scoring in the middle [...] percent of the distribution were defined as possessing moderate levels of test anxiety. Examinees scoring in approximately the lowest [...] percent of the distribution were defined as possessing low levels of test anxiety. For the analysis, examinees classified as possessing moderate levels of test [...] test anxiety examinees. Women tended to be classified as having high test anxiety at greater rates than men.

Following completion of the questionnaire, examinees answered the 30-item GRE-Q. Examinees received standard instructions and were told they had 30 minutes to complete the test.
Examinees were requested to do their best; following the test, if they desired, they could learn their results.

DIF Estimation

The five different methods of estimating DIF were Mantel-Haenszel (MH) (Holland & Thayer, 1988), Item Response Theory-Signed Area (IRT-SA) and Item Response Theory-Unsigned Area (IRT-UA) (Raju, 1988, 1990), the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993b), and logistic regression (Swaminathan & Rogers, 1990). A distinction was made between uniform and alternate measures. Uniform and nonuniform methods estimate DIF in fundamentally different ways. When nonuniform DIF exists, the two approaches produce unique findings (Shepard, Camilli, & Williams, 1984). Consequently, the five methods were divided into two groups. Mantel-Haenszel, IRT-SA, and SIBTEST formed the uniform measures of DIF. Logistic regression and IRT-UA, designed to measure nonuniform DIF, have not been used extensively in actual testing circumstances, indicating that test practitioners assume nonuniform DIF is either trivial or a statistical artifact. By examining the relationship between the DIF indices estimated by MH and those estimated by IRT-UA and logistic regression, researchers will be able to determine whether important information is lost when only uniform methods are used.

Mantel-Haenszel indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Item Response Theory signed and unsigned indices and tests of significance were estimated using PC-BILOG (Mislevy & Bock, 1990) in combination with SAS 6.03 (SAS Institute, Inc., 1988). SIBTEST indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Logistic regression indices and tests of significance were estimated through SAS 6.03 (SAS Institute, Inc., 1988).
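To make the first of the uniform methods listed above concrete, the sketch below computes the Mantel-Haenszel common odds ratio for a single item, matching examinees on total score, and re-expresses it on the delta scale as -2.35 times its natural logarithm, a convention associated with Holland and Thayer (1988). It is an illustration only and is not the SIBTEST program's implementation; the response data, the group labels "ref" and "focal", and the function name are hypothetical.

```python
import math
from collections import defaultdict

def mantel_haenszel(item, total, group):
    """Return (alpha_MH, MH D-DIF) for one dichotomous item.

    item  : 0/1 responses to the studied item
    total : matching variable (e.g., total test score)
    group : "ref" or "focal" for each examinee
    """
    # One 2x2 table per score level:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for u, s, g in zip(item, total, group):
        offset = 0 if g == "ref" else 2
        tables[s][offset + (0 if u == 1 else 1)] += 1
    num = 0.0
    den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    alpha = num / den  # undefined if den == 0 (not handled in this sketch)
    return alpha, -2.35 * math.log(alpha)

# Hypothetical data at two score levels.
item = [1, 1, 1, 0] + [1, 1, 0, 0] + [1, 1, 1, 1, 0] + [1, 1, 1, 0, 0]
total = [1] * 8 + [2] * 10
group = ["ref"] * 4 + ["focal"] * 4 + ["ref"] * 5 + ["focal"] * 5

alpha, delta = mantel_haenszel(item, total, group)
```

An odds ratio above 1 (a negative delta value) indicates that, among examinees matched on total score, the reference group had higher odds of answering the item correctly than the focal group.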
Thus, each of the 30 test items was analyzed with three different subpopulation definitions and five different procedures, producing for each item 15 distinct indices and tests of significance.

Structural Validation

The structural component of construct validation (Messick, 1988) [...]. The structural component is appraised by analyzing the interrelationships of test items. The released GRE-Q was structurally validated through factor analysis of the matrix of tetrachoric coefficients for the 30-item test for a subsample of examinees. Initially, the sample of 1263 examinees was randomly split into two subsamples. The first subsample was used for the exploratory study, and the second subsample was used to cross validate the findings derived from the exploratory analysis. The tetrachoric coefficient matrix was generated with PRELIS, and factor analytic models using an unweighted least squares solution through LISREL were used to assess item dimensionality and potential nuisance determinants (Joreskog & Sorbom, 1989a, 1989b).

Research Design

Prior to validation, I assessed the consistency of each combination of the five DIF methods and three subpopulation definitions. The intermethod consistency of DIF indices was assessed through a multitrait-multimethod (MTMM) matrix. The intermethod consistency of DIF significance tests was assessed by comparing percent-of-agreement rates between DIF methods when [...].

A subset of unidimensional items was identified by applying factor analytic procedures. Problematic items and items contaminated by nuisance determinants were identified. Following structural validation, the DIF analysis was repeated. Utilizing the combination of DIF methods and subpopulation definitions, DIF indices and significance tests were generated for the subset of items. The consistency of the indices and associated inferential statistics was assessed. The findings incorporating validation were compared to the preceding findings to appraise the effect of structural validation on DIF analyses.
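The intermethod consistency described in the research design rests on correlating, across items, the DIF indices that two different methods produce for the same subpopulation definition. A minimal sketch of that step is given below: it computes Pearson correlations for every pair of methods within each subpopulation definition. The dictionary layout, the function names, and the index values are all hypothetical and do not reproduce the study's data.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def convergent_coefficients(dif):
    """dif[trait][method] is a list of DIF indices, one per item.

    Returns {(trait, method_1, method_2): r} for every pair of methods
    within the same trait (same trait, different methods).
    """
    out = {}
    for trait, by_method in dif.items():
        methods = sorted(by_method)
        for i, m1 in enumerate(methods):
            for m2 in methods[i + 1:]:
                out[(trait, m1, m2)] = pearson(by_method[m1], by_method[m2])
    return out

# Hypothetical DIF indices for four items under one subpopulation definition.
toy_dif = {
    "gender": {
        "MH": [0.1, -0.2, 0.3, 0.0],
        "SIBTEST": [0.2, -0.4, 0.6, 0.0],
    }
}
coefficients = convergent_coefficients(toy_dif)
```

Comparing such coefficients across subpopulation definitions (for example, gender versus mathematics background) is the kind of contrast that the research questions go on to examine.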
Questions Research question one through four addressed the consi stency DIF indices through two MTMM matrices correlation coefficients Research questions one through four were first applied to the analy uniform DIF procedures and the MTMM matrix derived from these coefficients (see Table on page The same set questions were then applied the alternate DIF procedures and the MTMM matrix derived from these coefficients (see Table on page 10). The first que stion applied uniform DIF the correlation indices when the subgroup trait gender methods are and SA). Were the convergent coeffi clients based upon the subpopulations mathematics background and test anxiety greater than convergent coefficients based upon gender subpopulations? Specific stati stical hypotheses were formulated provide criteria addressing research questions Let represent correlation between MH and IRT SA DIF indices items when examinee subpopulations are defined gender. Let represent correlation between SIBTEST indices items when examine ees are defined gender Let PIS(G) represent the correlation between the IRTSA and SIBTEST indices gender. Comparable items when notation examinees will are represent defined examinee subpopulation defined mathematics background and test anxiety (TA). Three families stati tical tests each with two a priori hypotheses were defined answer first research question the uniform methods They were follows Hla: PI(M) Hib: P(TA) PMI(G) PMI(G) t H3a : Pzs(M) PIS(G)' H3b: The PIS(TA) first PIS(G) question applied the alternate DIF procedures also addre sse convergent or monotrait heteromethod coefficients Were convergent coefficients based upon subgroup mathematics background test anxiety greater than the convergent coefficients based upon gender subpopulations? Similarly, the alternate procedures represent corre lation between the and UA DIF indices items when examine subpopulations are defined gender. 
Let $\rho_{ML(G)}$ represent the correlation between the MH and logistic regression indices for the items when examinees are defined by gender. Let $\rho_{IL(G)}$ represent the correlation between the IRT-UA and logistic regression indices for the items when examinees are defined by gender. Comparable notation will represent examinee subpopulations defined by mathematics background (M) and test anxiety (TA). In a similar manner, three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the alternate methods. They were as follows:

H1a: $\rho_{MI(M)} > \rho_{MI(G)}$,