Citation
Gender differences on college admission test items

Material Information

Title:
Gender differences on college admission test items: exploring the role of mathematical background and test anxiety using multiple methods of differential item functioning detection
Creator:
Langenfeld, Thomas E
Publication Date:
Language:
English
Physical Description:
x, 213 leaves ; 29 cm.

Subjects

Subjects / Keywords:
Anxiety ( jstor )
College mathematics ( jstor )
Correlation coefficients ( jstor )
Mathematics ( jstor )
Mathematics anxiety ( jstor )
Matrices ( jstor )
Men ( jstor )
Research methods ( jstor )
Standardized tests ( jstor )
Test anxiety ( jstor )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1995.
Bibliography:
Includes bibliographical references (leaves 201-212).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Thomas E. Langenfeld.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Thomas E. Langenfeld. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
002056850 ( ALEPH )
AKP4870 ( NOTIS )
33815423 ( OCLC )















GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS:
EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND
AND TEST ANXIETY USING MULTIPLE METHODS
OF DIFFERENTIAL ITEM FUNCTIONING DETECTION







By


THOMAS E. LANGENFELD


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

















ACKNOWLEDGEMENTS


I would like to express my sincerest appreciation to the individuals who have assisted me in completing this study. I am extremely indebted to Dr. Linda Crocker, chairperson of my doctoral committee, for helping in the conceptualization, development, and writing of this dissertation. Her assistance and encouragement were extremely important in enabling me to achieve my doctorate. I also want to thank the other members of my committee, James Algina, Jin-win Hsu, Marc Mahlios, and Rodman Webb, for patiently reading the manuscript, offering constructive comments, providing editorial assistance, and giving continuous support. I further wish to thank David Miller, John Hall, and Scott Behrens for their assistance related to different aspects of this study.

I want to express my deepest gratitude to my family for providing the emotional support that was so vital during the graduate experience. I want to thank my wife, Ann--in many ways this degree is as much hers as mine--for her support throughout my graduate studies. Space limitations do not allow me to express the many personal sacrifices made by my wife so that I could complete this study. I also want to thank my daughter, Kathryn Louise, who was born during the early stages of this study and has come to provide a special type of support.
















TABLE OF CONTENTS


ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1   INTRODUCTION
        Statement of the Problem
        The Measurement Context of the Study
        The Research Problem
        Theoretical Rationale
        Limitations of the Study

2   REVIEW OF LITERATURE
        DIF Methodology
        Gender and Quantitative Aptitude
        Potential Explanations of DIF
        Summary

3   METHODOLOGY
        Examinees
        Instruments
        Analysis
        Summary

4   RESULTS AND DISCUSSION
        Descriptive Statistics
        Research Findings

5   SUMMARY AND CONCLUSIONS

APPENDICES

    SUMMARY STATISTICAL TABLES
    DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE INCLUDING THE REVISED TEST ANXIETY SCALE
    THE CHIPMAN, MARSHALL, AND SCOTT (1991) INSTRUMENT FOR ESTIMATING MATHEMATICS BACKGROUND

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF TABLES

Table

Proposed Multitrait-Multimethod Correlation Matrix: Uniform Indices

Proposed Multitrait-Multimethod Correlation Matrix: Alternate Indices

Item Data for Groups and Item Scores at One Ability Group

Descriptive Statistics for the RTA Scale Items

Correlations of the GRE-Q with SAT-M, Calculus Completion, and College Mathematics Credits

Frequencies and Percentages of Mathematics Background by Gender

Mean Scores on the Revised Test Anxiety (RTA) Scale and the Released GRE for the Total Sample, by Gender, and by Mathematics Background

Intercorrelations of the Released GRE-Q, RTA, and Mathematics Background for the Total Sample, Women, and Men

Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices

Percent-of-Agreement Rates and Inferential Tests Between DIF Methods by Gender, Mathematics Background, and TA: 30-Item GRE-Q

Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

Tetrachoric Correlation Estimates for the Four Problematic Standardized Items: Exploratory Sample

Multitrait-Multimethod Correlation Matrix for the Valid Test Items: Uniform DIF Indices

Percent-of-Agreement Rates and Inferential Tests Between Methods by Gender, Mathematics Background, and TA: 26-Item Valid Test

Multitrait-Multimethod Correlation Matrix for the Valid Test Items: Alternate DIF Indices
















LIST OF FIGURES

Figure

The Four Problematic Test Questions

LRCs for Women and Men on Individual Items

LRCs for Examinees with Substantial and Little Mathematics Background on Individual Items

LRCs for Women and Men on an Item Illustrating the Symmetrical Nonuniform DIF Condition

LRCs for Women and Men on an Item Illustrating the More Typical Nonuniform DIF Condition
















Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS:
EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND
AND TEST ANXIETY USING MULTIPLE METHODS
OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

By

Thomas E. Langenfeld

August, 1995

Chairperson: Linda Crocker
Major Department: Foundations of Education

The purpose of this study was to discover whether defining examinee subpopulations by a relevant educational or psychological variable, rather than by gender, would yield item statistics that were more consistent across five methods of differential item functioning (DIF) detection. A subsidiary purpose of this study was to assess how the consistency of DIF estimates was affected when structural validation findings were incorporated into the analyses. The study was conducted in the context of college admission quantitative examinations and gender issues. Participants consisted of 1263 university students. For purposes of this study, the data were analyzed by categorizing examinees by their gender, their mathematics backgrounds, and their level of test anxiety.

The hypothesis that defining subpopulations by mathematics background or test anxiety would yield higher consistency of estimation than defining subpopulations by gender was not substantiated. Results indicated that using mathematics background to define subpopulations to explain gender DIF had potential usefulness; however, in this study, use of test anxiety to define subpopulations to explain DIF was ineffectual. The findings confirmed the importance of structural validation analyses. Results from using the entire test revealed that the nonuniform DIF methods had low inter-method consistency and variance related to methods. When structural validation findings were used to define a valid subset of items, highly consistent DIF indices resulted across methods, with minimal variance related to methods. Results further suggested the need for, and the importance of, jointly interpreting both DIF indices and significance tests. Implications and recommendations for research and practice are included.

















CHAPTER 1
INTRODUCTION

Statement of the Problem

Differential item functioning (DIF), a statistical indication of item bias, occurs when equally proficient individuals from different subpopulations have different probabilities of answering an item correctly (Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard, Camilli, & Williams, 1984). Historically, researchers studying DIF have addressed two principal concerns. The first concern of researchers has been the development and evaluation of statistical methods for detecting "biased" items. The second concern has been to identify plausible explanations of item bias. In this study, both methodological and substantive educational issues concerning item bias and DIF were addressed.

During the past four decades, a plethora of detection methods has been developed (for a comprehensive review of advances in item bias detection methods over the past ten years, see Millsap & Everson, 1993). DIF methods differ according to whether the conditioning variable is formed from an observed score or an unobserved estimate of latent ability, and according to whether they can detect nonuniform as well as uniform DIF. Researchers applying methods that use an observed conditioning score most commonly sum the number of correct responses on the test or a test subsection to estimate the ability of each examinee. Researchers using unobserved conditioning estimates most frequently apply a unidimensional item response theory (IRT) model for estimating the latent ability of each examinee.

Uniform DIF occurs when there is no interaction between ability level and group membership. That is, the probability of answering an item correctly is greater for one group than for the other group uniformly over all ability levels. Nonuniform DIF, detectable by only some methods, occurs when there is an interaction between ability level and group membership. That is, the difference in the probabilities of a correct response for the two groups is not the same at all ability levels. In IRT terms, nonuniform DIF is indicated by "nonparallel" item characteristic curves.
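The distinction between uniform and nonuniform DIF can be illustrated with two-parameter logistic item characteristic curves. The following minimal sketch is not drawn from the dissertation itself; the parameter values are hypothetical. It contrasts a pair of curves that differ only in difficulty (a uniform group difference at every ability level) with a pair that differ in discrimination (a difference whose size and direction change across ability levels).

```python
import numpy as np

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: curves differ only in difficulty, so one group is
# disadvantaged by a similar amount at every ability level.
p_ref_uniform = icc(theta, a=1.0, b=0.0)
p_foc_uniform = icc(theta, a=1.0, b=0.5)

# Nonuniform DIF: curves differ in discrimination, so the group
# difference changes (and can reverse) across ability levels.
p_ref_nonuniform = icc(theta, a=1.2, b=0.0)
p_foc_nonuniform = icc(theta, a=0.6, b=0.0)

for t, du, dn in zip(theta,
                     p_ref_uniform - p_foc_uniform,
                     p_ref_nonuniform - p_foc_nonuniform):
    print(f"theta={t:+.1f}  uniform diff={du:+.3f}  nonuniform diff={dn:+.3f}")
```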














The Mantel-Haenszel (MH) procedure has emerged as the most widely used procedure (more because of Educational Testing Service's usage than as a result of theoretical consensus), and it is frequently the method to which others are compared (Hambleton & Rogers, 1989; Raju, 1990; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The appeal of the procedure lies in its simple conceptualization, chi-square test of significance, relative ease of use, and desirable statistical properties (Dorans & Holland, 1993; Millsap & Everson, 1993). Researchers applying MH employ an observed score as the conditioning variable and recognize that MH is sensitive to only uniform DIF. Other methods compared with the MH procedure in this study included logistic regression (Swaminathan & Rogers, 1990), IRT-Signed Area (IRT-SA), IRT-Unsigned Area (IRT-UA) (Raju, 1988, 1990), and the Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993a, 1993b). Logistic regression was designed to condition on observed scores and analyze item responses. With logistic regression, users can detect both uniform and nonuniform DIF. IRT-SA and IRT-UA were devised to condition on latent ability estimates and to assess the area between an item's characteristic curves estimated separately for the two groups.












IRT-SA was developed to detect only uniform DIF, whereas IRT-UA was developed to detect both uniform and nonuniform DIF. SIBTEST was designed to conceptualize DIF as a multidimensional phenomenon in which nuisance determinants adversely influence item responses (Shealy & Stout, 1993a, 1993b). Researchers using SIBTEST apply factor analysis to define a valid subtest and a regression correction procedure to estimate the criterion variable. SIBTEST was developed to detect only uniform DIF.

In assessing different indices with data from a curriculum-based, eighth grade mathematics test, Skaggs and Lissitz (1992) found that the consistency between methods was low, and no reasonable explanation for the items manifesting DIF could be hypothesized. They posited that categorizing subpopulations by demographic characteristics such as gender or ethnicity in DIF studies was "not very helpful [in] conceptualizing cognitive issues and indicated nothing [about] the reasons [for] the differences" (p. 239). A number of researchers have suggested the need to explore using subpopulations categorized by psychologically and educationally significant variables that potentially correlate with gender or ethnicity and influence item responses












(e.g., Skaggs & Lissitz, 1992; Tatsuoka, Linn, M. M. Tatsuoka, & Yamamoto, 1988). Thus, a major concern of the study was the consistency of results from different DIF estimation procedures when subpopulations are conceptualized by psychological or educational variables. Three methods of conceptualizing subpopulations were combined with five fundamentally different state-of-the-art procedures to assess DIF.

The Measurement Context of the Study

The substantive issue of the investigation was gender differences on a sample test containing items similar to those found on advanced college admission quantitative examinations. Generally, men tend to outperform women on the Scholastic Aptitude Test-Math (SAT-M), the American College Testing Assessment Mathematics Usage Test (ACT-M), and the Graduate Record Examination-Quantitative (GRE-Q). However, from a predictive validity perspective, these differences are problematic. For example, men tend to score approximately a fraction of a standard deviation higher on the SAT-M (National Center for Education Statistics, 1993), although women tend to perform at nearly the same level in college mathematics courses and tend to outperform men in general college courses (Young, 1991, 1994).












A possible explanation for quantitative test score differences between men and women is background experience. Men tend to enroll in more years of mathematics (National Center for Education Statistics, 1993). A second explanation that could potentially explain the differential validity of such tests is test anxiety. Test anxiety relates to examinees' fear of negative evaluation and defensiveness (Hembree, 1988). Women generally report higher levels of test anxiety than men (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988; Wigfield & Eccles, 1989). Thus, for high-stakes tests of mathematical aptitude, mathematics background and test anxiety could influence item responses differentially for each gender.

The Research Problem

In this study, I explored the feasibility of conceptualizing subpopulations by relevant psychological and educational variables, in contrast to the use of traditional demographic variables. The vehicle to achieve this purpose was a released form of the GRE-Q.











DIF was compared for subpopulations defined by gender, by examinees with substantial and little mathematics background, and by examinees high and low in test anxiety. DIF was assessed using five different measures. The DIF measures were MH, logistic regression, IRT-SA, IRT-UA, and SIBTEST. The DIF methods were classified into two groups--methods measuring uniform DIF and alternate methods. The uniform methods were MH, IRT-SA, and SIBTEST. Alternate methods included logistic regression and IRT-UA, along with MH; logistic regression and IRT-UA were designed to measure both uniform and nonuniform DIF. Mantel-Haenszel was placed into both analysis groups because of its widespread use by testing practitioners.

Regarding the study's methodological issues, the results of the five methods of estimating DIF were contrasted within each of three modes of defining subpopulation groups. The observation of interest was the DIF index estimated for each item under a particular combination of subpopulation definition and DIF method. Replications were the items on a released form of the GRE test. For the research questions that follow, trait effects refer to the three subpopulation conceptualizations, and method effects refer to the DIF estimation procedures.












The first four research questions address the consistency of DIF indices between methods when subpopulations are conceptualized using different traits. The uniform methods of MH, IRT-SA, and SIBTEST were combined with the traits of gender, mathematics background, and test anxiety to yield a multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 1 for an illustration of a MTMM matrix with uniform measures.) Similarly, the alternate estimation methods of MH, IRT-UA, and logistic regression were combined with the traits of gender, mathematics background, and test anxiety to yield a second multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 2 for an illustration of a MTMM matrix with alternate measures.)

Table 1
Proposed Multitrait-Multimethod Correlation Matrix: Uniform Indices

                          MH-D               IRT-SA             SIBTEST-b
                       A     B     C      A     B     C      A     B     C
I.   MH-D
     A. Gender         --
     B. MathBkd       H-M    --
     C. TA            H-M   H-M    --
II.  IRT-SA
     A. Gender        M-H*  H-H   H-H    --
     B. MathBkd       H-H   M-H*  H-H   H-M    --
     C. TA            H-H   H-H   M-H*  H-M   H-M    --
III. SIBTEST-b
     A. Gender        M-H*  H-H   H-H   M-H*  H-H   H-H    --
     B. MathBkd       H-H   M-H*  H-H   H-H   M-H*  H-H   H-M    --
     C. TA            H-H   H-H   M-H*  H-H   H-H   M-H*  H-M   H-M    --

Note. -- = reliability coefficients. H-M = heterotrait-monomethod coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-H = heterotrait-heteromethod coefficients.

Table 2
Proposed Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

                          MH-D               IRT-UA             Log Reg
                       A     B     C      A     B     C      A     B     C
I.   MH-D
     A. Gender         --
     B. MathBkd       H-M    --
     C. TA            H-M   H-M    --
II.  IRT-UA
     A. Gender        M-H*  H-H   H-H    --
     B. MathBkd       H-H   M-H*  H-H   H-M    --
     C. TA            H-H   H-H   M-H*  H-M   H-M    --
III. Log Reg
     A. Gender        M-H*  H-H   H-H   M-H*  H-H   H-H    --
     B. MathBkd       H-H   M-H*  H-H   H-H   M-H*  H-H   H-M    --
     C. TA            H-H   H-H   M-H*  H-H   H-H   M-H*  H-M   H-M    --

Note. -- = reliability coefficients. H-M = heterotrait-monomethod coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-H = heterotrait-heteromethod coefficients.

Each of the following research questions was addressed twice; each question was answered for the uniform methods and for the alternate methods, respectively:

1. Among the three sets of convergent coefficients, often termed monotrait-heteromethod coefficients (e.g., the correlation between the indices obtained from the MH and IRT-SA methods when subpopulations are defined by the trait of gender), will the coefficients based on mathematics background and test anxiety be higher than the coefficients obtained when subpopulations are defined by gender?

2. Will monotrait-heteromethod coefficients be higher than coefficients for different traits measured by the same method (i.e., heterotrait-monomethod coefficients)?

3. Will convergent correlation coefficients be higher than discriminant coefficients measuring different traits by different methods (i.e., heterotrait-heteromethod coefficients)?

4. Will the pattern of correlations among the three traits be similar over the three methods of DIF estimation?

The final research question addressed the consistency of the procedures in identifying aberrant items when subpopulations are conceptualized in different ways. The question was applied twice; it was answered for the uniform methods and for the alternate methods. It was as follows:

5. For standard DIF detection decision rules, what is the percent of agreement among methods about aberrant items when subgroups are based on gender and when subgroups are based on mathematics background?
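To make the MTMM design concrete, the following minimal sketch is offered; it is not taken from the dissertation, and the arrays and labels are hypothetical placeholders. It shows how a matrix of correlations among per-item DIF indices could be assembled once each method-trait combination has produced an index for every item, and how each cell is classified in MTMM terms.

```python
import numpy as np

# Hypothetical per-item DIF indices: one array of length n_items for each
# combination of DIF method and trait used to define the subpopulations.
rng = np.random.default_rng(0)
n_items = 30
dif_indices = {
    ("MH",      "gender"):  rng.normal(size=n_items),
    ("MH",      "mathbkd"): rng.normal(size=n_items),
    ("MH",      "anxiety"): rng.normal(size=n_items),
    ("IRT-SA",  "gender"):  rng.normal(size=n_items),
    ("IRT-SA",  "mathbkd"): rng.normal(size=n_items),
    ("IRT-SA",  "anxiety"): rng.normal(size=n_items),
    ("SIBTEST", "gender"):  rng.normal(size=n_items),
    ("SIBTEST", "mathbkd"): rng.normal(size=n_items),
    ("SIBTEST", "anxiety"): rng.normal(size=n_items),
}

labels = list(dif_indices.keys())
data = np.vstack([dif_indices[k] for k in labels])

# Each cell is the correlation, over items, between two method-trait
# combinations; monotrait-heteromethod cells are the convergent validities.
mtmm = np.corrcoef(data)

for i, (method_i, trait_i) in enumerate(labels):
    for j in range(i):
        method_j, trait_j = labels[j]
        kind = ("monotrait-heteromethod" if trait_i == trait_j
                else "heterotrait-monomethod" if method_i == method_j
                else "heterotrait-heteromethod")
        print(f"{method_i}/{trait_i} x {method_j}/{trait_j}: "
              f"r = {mtmm[i, j]:+.2f} ({kind})")
```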












Following the analysis of the uniform and alternate methods, I conducted a structural analysis of the 30-item quantitative test. Shealy and Stout (1993a, 1993b) stressed that practitioners must carefully identify a valid subset of items prior to conducting DIF analyses. They argued that DIF occurs as a consequence of multidimensionality. The potential for DIF occurs when one or more nuisance dimensions interact with the valid dimension of a test (Ackerman, 1992; Camilli, 1992). Messick (1988) stressed the structural component of construct validation. The structural component is concerned with the extent to which items are combined into scores that reflect the structure underlying the latent construct. Loevinger (1957) termed the purity of these internal relationships structural fidelity, and it is appraised by analyzing the interitem structure of a test. I employed factor analytic procedures to define a structurally valid subset of unidimensional items and to identify problematic multidimensional items measuring both the intended dimension and nuisance dimensions.

After identification of a structurally valid subset of items, DIF indices were estimated again using the five methods with subpopulations defined by gender, mathematics background, and test anxiety. Using the indices as the unit of analysis, two MTMM matrices of correlation coefficients were generated--one matrix for the uniform methods and one matrix for the alternate methods. I applied the five research questions to the MTMM matrices and inferential statistics using the structurally valid items. I contrasted the findings from the analyses of the entire test with the findings from the analysis of the valid subset of test items.


Theoretical Rationale

The process of ensuring that high-stakes tests contain no items that function differentially for specific subpopulations is a fundamental concern of construct validation. Items that contain nuisance determinants correlated with an examinee's subpopulation membership threaten construct interpretations derived from test scores for that subpopulation. Psychometric researchers continue to examine the merits of numerous DIF detection procedures and to explore theoretical explanations of DIF. However, to date, they have failed to reach consensus on methodological issues or to develop explanations of DIF that hold up when applied to its identification with actual test data (Linn, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). These concerns were investigated from both a practical and a theoretical perspective that has been suggested (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988) but rarely tested.

Two significant premises underlie the study. The first premise is that there is nothing inherent in being female or a member of a specific ethnic group that predisposes an individual to find a particular item troublesome. Educational and psychological phenomena function in unique ways to disadvantage an individual on a specific item. Traditional DIF occurs when such phenomena correlate with the demographic group of interest. Consequently, gender or ethnicity can be interpreted as a surrogate for educational or psychological variables that potentially explain DIF's causes. Skaggs and Lissitz (1992) posited that educational and psychological variables that influence item performance and correlate with ethnic or gender groups would be useful in conceptualizing subpopulations.












Millsap and Everson (1993) commented that modeling DIF directly in terms of such educational and psychological variables is a promising direction for research. The educational and psychological variables in this study that were hypothesized as potentially explaining gender DIF on quantitative test items were mathematics background and test anxiety. Mathematics background was selected because it influences quantitative reasoning and problem solving. Further, in high school and college, men tend to enroll in more mathematics courses and to study mathematics at a more abstract level than women (National Center for Education Statistics, 1993). Researchers assessing overall SAT-M performance have found that gender differences decrease substantially when differences in high school mathematics background are taken into account, although background does not entirely explain the score differences (Ethington & Wolfle, 1984; Fennema & Sherman, 1977; Pallas & Alexander, 1983). Quantitative aptitude test scores can be contaminated by familiarity with item context, the application of novel solutions, and the use of partial knowledge to solve complex problems (Kimball, 1989). These types of skills frequently are developed through background experiences.












Test anxiety was selected because it is well documented as a negative influence on test performance (Liebert & Morris, 1967; Tryon, 1980). For individuals possessing high levels of test anxiety, test scores frequently are depressed and construct interpretations become problematic (Everson et al., 1991; Hembree, 1988; Sarason, 1980). Consequently, test anxiety exemplifies a psychological variable that potentially contaminates construct interpretations of scores. For examinees with high levels of test anxiety, tests of mathematical ability tend to induce extreme levels of anxiety (Richardson & Woolfolk, 1980). Female students tend to report higher levels of test anxiety than male students at all grade levels, including college (Everson et al., 1991; Hembree, 1988; Wigfield & Eccles, 1989). Over the past two decades, several self-report measures of test anxiety have been developed that demonstrate high reliability and well-defined theoretical properties (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991; Sarason, 1984; Spielberger, Gonzalez, Taylor, Algaze, & Anton, 1978). Researchers have used these self-report instruments to measure test anxiety and to assess the efficacy of treatment programs (Sarason, 1980; Spielberger et al., 1978; Wine, 1980).













For these reasons, in studying gender differences on college admission quantitative tests, test anxiety was considered a threat to valid score interpretation, a negative influence on tests of mathematics ability, and a potential source of gender effects.


The second fundamental premise is a tenet underlying educational measurement. Item responses of examinees are products of a set of complex interactions between examinees and items. In part because of this complex interaction, examinees of approximately equivalent abilities who belong to different subpopulations occasionally have different likelihoods of answering a question correctly. This fascinating finding is currently understood only crudely. Before it can be better understood, the effects of different means of DIF detection and of different methods of conceptualizing subpopulations on item responses must be examined.

Limitations of the Study

A salient limitation of the study was the nature of the performance task. Participants in the study were administered a sample GRE and were told how many minutes they would have to complete the test. They were told to perform to the best of their ability and that they would be able to learn their results following testing.












Although every effort was made to simulate the conditions of an operational administration, it is uncertain whether participants' performance would accurately reflect their performance on a high-stakes college admissions test. Further, if the participants believed that the examination had low stakes, the level of test anxiety the examinees felt while answering the sample GRE-Q would not be equivalent to the level of test anxiety experienced by examinees while answering a college admissions test. Finally, the examinees in the study were predominantly undergraduate students taking classes in the colleges of education and business at a large, southern state university. For these reasons, although the design, methodology, and analysis were conceived and executed to maximize the generalizability of the findings, a degree of caution is recommended in generalizing to other populations or settings.















CHAPTER 2
REVIEW OF LITERATURE

The four central aspects of this study were Differential Item Functioning (DIF) methodology, gender differences in mathematical college-level aptitude testing, gender differences in mathematics background, and test anxiety. These four topics constitute the major themes of the organization of the literature review presented in this chapter.


DIF Methodology

A Conceptual Framework for DIF

Tests used for placement in education and selection in employment require scores that are fair and representative for individuals. Since the mid-1960s, measurement specialists have been concerned explicitly with the fairness of their instruments and the possibility that some tests may be biased (Cole & Moss, 1989). Bias studies initially were designed to investigate the assertions that disparities between various subpopulations on cognitive ability test scores were a product of cultural bias inherent in the measures (Angoff, 1993).









Test critics charged that bias existed because they assumed that subpopulations had equivalent score distributions on the construct being measured, and they dismissed the possibility that actual differences may exist. Measurement specialists, however, have resolved that mean differences do not necessarily reflect bias but may indicate test impact (Dorans & Holland, 1993).

Concerns about measurement bias are inherent in validity theory (Cole & Moss, 1989). A test score inference is considered sufficiently valid when various types of evidence justify its usage and eliminate other counterinterpretations (Messick, 1989; Moss, 1992). Bias has been characterized as "a source of invalidity that [prevents] some examinees with the trait or knowledge being measured from demonstrating that ability" (Shepard, Camilli, & Williams, 1985). If score-based inferences are not equally valid for relevant subgroups, decisions derived from score inferences will not be fair to individuals. Therefore, measurement bias occurs when score interpretations are differentially valid for subgroups of test takers (Cole & Moss, 1989).

To investigate the potential for measurement bias, researchers have examined test items as a source of explanation. The supposition is that biased items can be detected statistically. The purposes of item bias research are to identify and remove biased test items (Angoff, 1993) and to provide test developers with guidelines that make future construction of biased items less likely (Scheuneman, 1987; Schmitt, Holland, & Dorans, 1993).


Measurement specialists have defined item bias as occurring when individuals from different subpopulations who are equally proficient on the construct being measured have different probabilities of successfully answering the item (Angoff, 1993; Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard et al., 1985). Researchers apply statistical methods to equate individuals on the construct, utilizing either observed scores or latent ability scores, to estimate for examinees in each group the probability of a correct response. These methods provide statistical evidence of bias. When a statistically biased item is identified, it might be interpreted as unfairly disadvantageous to a minority group for cultural and social reasons. On the other hand, the item might be interpreted as unrelated to cultural and social factors but as reflecting an important and understood educational outcome that the groups have not attained equally. In this latter case, deleting the item for strictly statistical reasons may reduce validity.










Researchers discovered that statistical analyses of item bias raised expectations and created confusion about an already obscure and volatile topic. The term differential item functioning (DIF) gradually replaced item bias as the preferred technical term among researchers because of its more neutral connotations (Angoff, 1993; Dorans & Kulick, 1986). Holland and Wainer (1993) distinguished between item bias and DIF by stating that item bias refers to "an informed judgment about an item that takes into account [the] purpose [of] the test, the relevant experiences of certain subgroups [of] examinees taking [it,] and statistical information about [the] item" (p. xiv). DIF is a "relative term" (p. xiv), a statistical indication of a differential response pattern. Shealy and Stout (1993a) proposed that the difference between item bias and DIF is "the degree [to which] the user or researcher [has] embraced a construct validity argument" (p. 197). Shealy and Stout (1993a, 1993b) conceptualized DIF as a violation of the unidimensional nature of test items. They classified the intended dimension as the target ability and unintended dimensions as nuisance determinants. DIF occurred because of nuisance determinants existing in differing degrees among subgroups.










Crocker and Algina observed that item bias arises when, in addition to the intended construct, the distributions of irrelevant sources of variation are different across subgroups. Therefore, DIF can be conceptualized as a consequence of multidimensionality, with differing sources of variation influencing subgroups' item responses.


A Formal Definition of DIF

All DIF detection methods rely on assessment of the response patterns of subgroups to test items. The subgroups, conceptualized in most studies on the basis of demographic characteristics (e.g., blacks and whites, women and men), form a categorical variable. When two groups are contrasted, the group of interest (e.g., blacks or women) is designated the focal group, and the group serving as the comparison group (e.g., whites or men) is designated the reference group. Examinees are matched on a criterion variable, assumed to be a valid representation of the purported construct, and DIF methods assess differential response patterns for individuals of equal ability. Denote Y as the item score, frequently scored as a dichotomous variable; denote X as the conditioning criterion; and denote G as the categorical variable for group membership.










Lack of measurement bias, or no DIF, for an item is defined as

$$P_R(Y = 1 \mid X) = P_F(Y = 1 \mid X)$$

for all values of X for the reference and focal groups. In this definition, $P_g(Y = 1 \mid X)$ is the conditional probability of a correct response for group g at all levels of X (Millsap & Everson, 1993).


Although all DIF procedures operate from this definition, they differ on the basis of their statistical models and possess various advantages. DIF procedures can be characterized as models using observed conditional invariance or models utilizing unobserved conditional invariance (Millsap & Everson, 1993). When observed conditional invariance is used, the criterion variable is the sum of the total number of correct responses on the test or on a subset of the test. When unobserved conditional invariance is used, a unidimensional item response theory (IRT) model estimates a θ parameter for each examinee that functions as the criterion variable. Other differences among detection procedures are the capacity to detect nonuniform DIF, to test statistical significance, and to conceptualize DIF as a consequence of multidimensionality.

Uniform DIF occurs when there is no interaction between group membership and the conditioning criterion regarding the probability of answering an item correctly. In other words, DIF functions in a uniform fashion across the ability spectrum. Nonuniform DIF refers to a differential response pattern that favors a subgroup at one end of the ability spectrum and disfavors the same subgroup at the other end of the spectrum.

All DIF procedures are used to estimate an index describing the magnitude of the differential response pattern between the groups on an item. Some procedures provide statistical tests to detect whether the DIF index differs significantly from zero. Finally, although DIF can be perceived as a consequence of multidimensionality, every procedure except Stout's Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993a) functions within a unidimensional framework.

Many DIF detection methods have been developed during the past three decades. In this review, they are categorized as based upon observed conditional invariance or unobserved latent conditional invariance. Related issues, research problems, and the usage of DIF detection methods are evaluated. Following the review, efforts to explain the potential underlying causes of DIF are presented.


Methods Based Upon Observed Scores

Angoff and Ford (1973) offered the first widely used detection method, called the delta-plot. The delta-plot procedure was problematic due to its tendency, under conditions of differing ability score distributions, to mistake differences in item discrimination for DIF. An early chi-square index (Scheuneman, 1979) depended on sample size and was not based upon a chi-square sampling distribution; in effect, it was not a chi-square procedure at all (Baker, 1981). The full chi-square procedure (Bishop, Fienberg, & Holland, 1975) was a valid technique for significance testing but required large sample sizes at each ability level to sustain statistical power. Holland and Thayer (1988) built upon these chi-square techniques when they applied the Mantel and Haenszel (1959) statistic, originally developed for medical research, to the detection of DIF.

Mantel-Haenszel procedure. The Mantel-Haenszel (MH) statistic has become the most widely used method of DIF detection (Millsap & Everson, 1993). The MH procedure assesses the item data in a J-by-2-by-2 contingency table. At each score level j, individual item data are presented for the two groups at the two levels of item response, right or wrong (see the following table). The null hypothesis of the MH procedure can be expressed as follows: the odds of answering an item correctly, given ability level, are the same in both groups across ability levels. The alternative hypothesis is that the two groups do not have equal probability of answering the item correctly at some ability level.


Table
Item Data for Groups and Item Scores at One Ability Group

                         Score on Studied Item
                         1          0          Total
Reference group          A_j        B_j        n_Rj
Focal group              C_j        D_j        n_Fj
Total                    m_1j       m_0j       T_j

The MH statistic uses a constant odds ratio (α) as an index of DIF. The estimate of the constant odds ratio is

$$\hat{\alpha}_{MH} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}.$$

The constant odds ratio ranges in value from zero to infinity, and the estimated value is 1.0 under the null condition. It is interpreted as the average factor by which the odds that a reference group examinee will answer the item correctly exceed the odds that a focal group examinee will do so. The estimated value of $\hat{\alpha}_{MH}$ is frequently transformed to a more easily interpreted metric via

$$\text{MH D-DIF} = -2.35\,\ln(\hat{\alpha}_{MH}).$$

Positive values of MH D-DIF favor the focal group, whereas negative values favor the reference group. The chi-square test of significance is

$$\chi^2_{MH} = \frac{\left[\left|\sum_j A_j - \sum_j E(A_j)\right| - 0.5\right]^2}{\sum_j \mathrm{Var}(A_j)},$$

where

$$E(A_j) = \frac{n_{Rj}\, m_{1j}}{T_j} \quad\text{and}\quad \mathrm{Var}(A_j) = \frac{n_{Rj}\, n_{Fj}\, m_{1j}\, m_{0j}}{T_j^2\,(T_j - 1)}.$$

The MH chi-square is distributed approximately as a chi-square with one degree of freedom.
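As a concrete illustration of these formulas, the following minimal sketch is offered; it is not part of the original dissertation, and the data array is hypothetical. It computes the MH constant odds ratio, the MH D-DIF transformation, and the continuity-corrected MH chi-square from a stack of J 2-by-2 tables.

```python
import numpy as np

def mantel_haenszel(tables):
    """tables: array of shape (J, 2, 2) with rows (reference, focal)
    and columns (correct, incorrect) at each matched score level j."""
    A, B = tables[:, 0, 0], tables[:, 0, 1]   # reference correct / incorrect
    C, D = tables[:, 1, 0], tables[:, 1, 1]   # focal correct / incorrect
    T = A + B + C + D
    n_r, n_f = A + B, C + D
    m1, m0 = A + C, B + D

    alpha = np.sum(A * D / T) / np.sum(B * C / T)        # constant odds ratio
    d_dif = -2.35 * np.log(alpha)                        # ETS delta metric
    expected_A = n_r * m1 / T
    var_A = n_r * n_f * m1 * m0 / (T**2 * (T - 1))
    chi2 = (abs(A.sum() - expected_A.sum()) - 0.5) ** 2 / var_A.sum()
    return alpha, d_dif, chi2

# Hypothetical counts for three score levels.
tables = np.array([[[20, 10], [15, 15]],
                   [[30,  8], [22, 14]],
                   [[25,  5], [20,  9]]], dtype=float)
print(mantel_haenszel(tables))
```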










The advantages of the MH procedure are its computational simplicity (Holland & Thayer, 1988), its statistical test of significance, and its lack of sensitivity to subgroup differences in the distribution of ability (Donoghue, Holland, & Thayer, 1993; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The most frequently cited disadvantage is its lack of power to detect nonuniform DIF (Swaminathan & Rogers, 1990). The procedure is further limited by its unidimensional conception and by the assumption that the total test score provides a meaningful measure of the construct purported to be estimated.


The standardization procedure. The standardization procedure (Dorans & Kulick, 1986) is based upon the nonparametric regressions of item scores on test scores for the two groups. Let $E_R(Y \mid X = j)$ define the expected item score at test score j from the nonparametric regression for the reference group, and let $E_F(Y \mid X = j)$ define the expected item score at test score j from the nonparametric regression for the focal group, where Y is the item score and X is the test score. The DIF analysis occurs at the individual score level. The statistic

$$D_j = E_{Fj} - E_{Rj}$$

is the fundamental measure of the group differences that cannot be explained by differences in the attribute being tested. The standardization procedure derives its name from the standardization group that functions to supply a set of weights, one at each ability level, that will be used to weight each individual $D_j$. The standardized p-difference (STD P-DIF) is

$$\text{STD P-DIF} = \frac{\sum_j W_j\,(E_{Fj} - E_{Rj})}{\sum_j W_j}.$$

The essence of standardization is the specific weight implemented in the weighting function. The choice of standardization group depends upon the nature of the study (Dorans & Kulick, 1986). Plausible options for the weights include the number of examinees in the total group at each level of j or the number of focal group examinees at each level of j. Focal group weighting is the option most often used; thus, STD P-DIF is the difference between the observed focal group performance on an item and the expected focal group performance (Dorans & Kulick, 1986). The standardization procedure contains a significance test. The standard error of STD P-DIF using focal group weighting is

$$SE(\text{STD P-DIF}) = \sqrt{\frac{P_F\,(1 - P_F)}{N_F} + \mathrm{VAR}(\hat{P}_F^{*})},$$

where $P_F$ is the proportion of focal group members correctly answering the item, $\hat{P}_F^{*}$ can be thought of as the focal group's performance on the item predicted from the reference group's item-test regression curve, and $\mathrm{VAR}(\hat{P}_F^{*})$ is the sampling variance of that predicted value, accumulated over score levels from the reference group data.
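The following minimal sketch is illustrative only and not from the dissertation; the response vectors are hypothetical. It computes the focal-group-weighted STD P-DIF for one item from dichotomous item responses and matched total scores.

```python
import numpy as np

def std_p_dif(item_ref, total_ref, item_foc, total_foc):
    """Focal-group-weighted standardized p-difference for one item.

    item_*  : 0/1 responses to the studied item
    total_* : matching (total test) scores for the same examinees
    """
    levels = np.union1d(total_ref, total_foc)
    num = 0.0
    den = 0.0
    for j in levels:
        foc_at_j = item_foc[total_foc == j]
        ref_at_j = item_ref[total_ref == j]
        if len(foc_at_j) == 0 or len(ref_at_j) == 0:
            continue                       # level not represented in both groups
        w_j = len(foc_at_j)                # focal group weight at this level
        num += w_j * (foc_at_j.mean() - ref_at_j.mean())
        den += w_j
    return num / den

rng = np.random.default_rng(1)
total_ref = rng.integers(0, 11, size=400)
total_foc = rng.integers(0, 11, size=300)
item_ref = rng.binomial(1, 0.6, size=400)
item_foc = rng.binomial(1, 0.5, size=300)
print(round(std_p_dif(item_ref, total_ref, item_foc, total_foc), 3))
```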


The standardization procedure is a flexible method for investigating DIF (Dorans & Holland, 1993), and it has been applied to assessing the differential functioning of distractors (Dorans, Schmitt, & Bleistein, 1992) and the differential effect of speededness (Schmitt & Dorans, 1990). DIF findings from the standardization procedure will be in close agreement with those from the MH procedure (Millsap & Everson, 1993).











The limitations of the standardization method are much the same as for the MH procedure. The most commonly cited deficiency of both methods is their inability to detect nonuniform DIF. Donoghue et al. (1993) determined that both methods require a sufficient number of items in the conditioning score, that the studied item should be included in determining the conditioning score, and that extreme ranges of item difficulty can adversely influence DIF estimation. Linn (1993) observed that DIF estimates using these procedures appear to be confounded with item discrimination.


Logistic regression model. Swaminathan and Rogers (1990) applied logistic regression to DIF analysis. Logistic regression models, unlike least squares regression models, permit categorical variables as dependent variables. Thus, logistic regression permits the analysis of dichotomously scored item data. It has additional flexibility, including the analysis of the interaction between group and ability, as well as allowing the inclusion of other categorical and continuous independent variables in the model.

A fundamental concept of analysis with linear models is the assessment of consistency between a model and a set of data (Darlington, 1990). Consistency between the model and the data is measured by the likelihood.








Each examinee will have a probability between 0 and 1 of answering an item correctly. By the multiplicative law of independent probabilities, an overall probability of a group of examinees answering in a specific pattern can be estimated. For example, if the probability of four individuals each answering an item correctly is 0.9, and three of the subjects answer correctly, the overall probability of this pattern occurring is 0.9 X 0.9 X 0.9 X (1 - 0.9), or 0.0729. Therefore, for an item, the likelihood function for a set of examinee responses, each with ability level θ, is determined by

$$L(\text{Data} \mid \theta) = \prod_{n=1}^{N} P(u_n \mid \theta)^{\,u_n}\,\big[1 - P(u_n \mid \theta)\big]^{\,1-u_n},$$

where $u_n$ has a value of 1 for a correct response and a value of 0 for an incorrect response.

The logistic regression model for predicting the probability of a correct answer is

$$P(u = 1 \mid \theta) = \frac{\exp(\beta_0 + \beta_1 \theta)}{1 + \exp(\beta_0 + \beta_1 \theta)},$$

where u is the response to the item given ability level θ, $\beta_0$ is the intercept parameter, and $\beta_1$ is the slope parameter. To detect DIF, the model is expanded to

$$P(u = 1 \mid \theta, g) = \frac{\exp(\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 \theta g)}{1 + \exp(\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 \theta g)},$$

where $\beta_2$ is an estimate of the uniform difference between the groups and $\beta_3$ is the estimated interaction between group and ability. If only $\beta_0$ and $\beta_1$ deviate from zero, the item is interpreted as containing no DIF. If $\beta_2$ does not equal zero and $\beta_3$ is equal to zero, uniform DIF is indicated. If $\beta_3$ does not equal zero, nonuniform DIF is inferred. Estimation of the parameters is carried out for each item using a maximum likelihood procedure. The two null hypotheses can be tested jointly by

$$\chi^2 = \hat{\boldsymbol{\beta}}'\,\mathbf{C}'\,\big[\mathbf{C}\,\hat{\boldsymbol{\Sigma}}\,\mathbf{C}'\big]^{-1}\,\mathbf{C}\,\hat{\boldsymbol{\beta}},$$

where $\mathbf{C}$ is the contrast matrix selecting the group and interaction parameters and $\hat{\boldsymbol{\Sigma}}$ is the estimated covariance matrix of the parameter estimates. The test statistic has a chi-square distribution with two degrees of freedom.
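A minimal sketch of this logistic regression DIF test follows. It is not from the dissertation; it assumes the statsmodels package is available, uses simulated (hypothetical) data, lets the ability proxy stand in for θ, and applies a likelihood ratio version of the two-degree-of-freedom joint test, which is asymptotically equivalent to the Wald form given in the text.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 600
group = rng.integers(0, 2, size=n)            # 0 = reference, 1 = focal
theta = rng.normal(size=n)                    # ability proxy (e.g., total score)
# Simulate an item with some uniform DIF against the focal group.
logit = 0.2 + 1.0 * theta - 0.5 * group
u = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Reduced model: ability only.  Full model: ability, group, interaction.
X_reduced = sm.add_constant(np.column_stack([theta]))
X_full = sm.add_constant(np.column_stack([theta, group, theta * group]))
fit_reduced = sm.Logit(u, X_reduced).fit(disp=0)
fit_full = sm.Logit(u, X_full).fit(disp=0)

# Joint test of the group and interaction terms (uniform + nonuniform DIF).
g2 = 2 * (fit_full.llf - fit_reduced.llf)
print("G^2 =", round(g2, 2), " p =", round(chi2.sf(g2, df=2), 4))
```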










The logistic regression procedure offers a powerful approach for testing the presence of both uniform and nonuniform DIF. In simulations with varying sample sizes of examinees per group and with varying numbers of test items serving as the criterion, Swaminathan and Rogers (1990) concluded that, under conditions of uniform DIF, the logistic regression procedure had power similar to the MH procedure and controlled Type I errors almost as well. The logistic regression procedure had effective power in identifying nonuniform DIF, whereas the MH procedure was virtually powerless to do so. In demonstrating the ineffectiveness of the MH procedure for detecting nonuniform DIF, Swaminathan and Rogers (1990) simulated data keeping item difficulties equal while varying the discrimination parameter. In effect, they simulated nonuniform symmetrical DIF. Their simulation created a set of conditions where theoretically the MH procedure has no power. Researchers must ask whether such symmetrical interactions occur with actual test data. Millsap and Everson (1993) commented that Swaminathan and Rogers (1990) utilized large numbers of items, and they conjectured that in cases with a small number of homogeneous items forming the criterion variable, false positive rates would increase unacceptably above nominal levels.










In the logistic regression model, the observed test score is the metric by which ability level is observed. The logistic procedure, although developed from a unidimensional perspective, provides a flexible model that can incorporate a diversity of independent categorical and continuous variables. Millsap and Everson (1993) observed that the procedure "allows inclusion of curvilinear terms [and] other factors--such as examinee characteristics like test anxiety [or] instructional opportunity--that may [be] relevant factors [in] exploring possible causes of DIF" (p. 306).

Methods Based Upon Latent Ability Estimation

DIF detection methods developed through conditioning on latent ability use various model-based approaches to describe the relationship between individual item responses and the construct measured by the test or subtest. When applied to DIF analyses, item response theory permits the use of estimates of true ability as the criterion variable, as opposed to observed scores. Despite their theoretical appeal, IRT approaches possess the inherent disadvantages of requiring large sample sizes, being computationally complex and costly, and including the stringent assumption of unidimensionality (Oshima, 1989). The most widely used models are the Rasch model or one-parameter model, the two-parameter logistic model (2PL), and the three-parameter logistic model (3PL).










When the items contributing to the ability score, except possibly the studied item, contain no DIF, MH provides a DIF index proportional to the index estimated by the Rasch model. Therefore, methods based upon the Rasch model will not be reviewed; the more complex 2PL and 3PL models will be reviewed regarding their potential.


The central components of IRT models are an unobserved latent trait estimate, termed θ, and a trace line for each item response, often termed the item characteristic curve (ICC). The ICC will take a specified monotonically increasing function. In the 2PL model, the probability of a correct response to Item i as a function of θ is

$$P_i(u_i = 1 \mid \theta) = \frac{\exp[D a_i(\theta - b_i)]}{1 + \exp[D a_i(\theta - b_i)]},$$

where the item parameters $a_i$ and $b_i$ are the item discrimination and difficulty, respectively, and D is a constant used in order to convert the logistic scale into an approximate probit scale (Hambleton & Swaminathan, 1985). In the 3PL model, the probability of a correct response is

$$P_i(u_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{\exp[D a_i(\theta - b_i)]}{1 + \exp[D a_i(\theta - b_i)]},$$

where $c_i$ is the item's pseudo-chance (lower asymptote) parameter.



The general procedure for estimating DIF with the 3PL model for both groups combined includes estimating item parameters utilizing either a maximum likelihood or a Bayesian procedure; fixing the $c_i$ parameters of the items; after dividing examinees into reference and focal group members, estimating the $a_i$ and $b_i$ parameters; equating parameters from the focal group scale to the reference group scale, or vice versa; calculating the DIF index and significance test; and utilizing a purification procedure (Lord, 1980; Park & Lautenschlager, 1990) to further examine and enhance the analysis. Purification procedures, which extract potential DIF items and reestimate ability level without the potential DIF items included, will be elaborated upon.

DIF indices and statistical tests based upon latent ability proceed by either analyzing the difference between the groups' item parameters $(a_i, b_i)$ or analyzing the area between the groups' ICCs. Lord's (1980) chi-square and the IRT-Likelihood Ratio (IRT-LR) analyze differences between the item parameters. Lord's chi-square simultaneously tests the dual hypothesis $a_{Ri} = a_{Fi}$ and $b_{Ri} = b_{Fi}$. Because pseudo-chance parameter standard errors are not accurately estimated in separate groups (Kim, Cohen, & Kim, 1994), the $c_i$ parameter is not usually tested with either procedure.








large


to effectively


assume


an infinite


number


of degrees


freedom,


test


becomes


-4 )


var(b, )+ var(br)


Alternately,


z2 will


tribute


as a chi-square


stati


stic


with


one


degree


freedom


(Thissen,


Steinberg,


Wainer,


1988).


simultaneous


test


the


discrimination


difficulty


parameters


based


upon


Mahalanobi


stance


between


parameter


vectors


the


groups


The


test


states


becomes


= v'z


which


V i


the


vector


of differences


between


the


parameter


estimations


- b,)


and


the


estimated


covariance


matrix.


The


test


distributed


chi-square


with


degrees


freedom.
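A minimal numerical sketch of the two statistics just described (the parameter estimates and variances below are invented placeholders, not values from this study):

```python
import numpy as np

# Hypothetical item-parameter estimates and sampling (co)variances.
b_ref, b_foc = -0.20, 0.15
var_b_ref, var_b_foc = 0.010, 0.012

# Standardized difficulty difference; its square is chi-square with 1 df.
z = (b_ref - b_foc) / np.sqrt(var_b_ref + var_b_foc)
print("z =", round(z, 3), " z^2 =", round(z**2, 3))

# Joint test on (a, b): Mahalanobis distance between the two groups' parameter vectors.
v = np.array([1.10 - 0.95,           # a_ref - a_foc
              b_ref - b_foc])        # b_ref - b_foc
sigma = np.array([[0.020, 0.002],    # combined covariance matrix of the estimates
                  [0.002, 0.022]])
chi2 = float(v @ np.linalg.inv(sigma) @ v)   # referred to chi-square with 2 df
print("chi-square (2 df) =", round(chi2, 3))
```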


The


same


hypothesis


tested


Lord


s chi-square


can


with


IRT-LR


(Thissen,


Steinberg,


Wainer


1993)


null


hypothes


with


-LR


tested


through


three


steps


model


fitte


simultaneou


both


groups


data.


A set


of valid


"anchor"


items,


containing


no DIF,


- a










equality


s or a's


The


model


assessed


maximum


likelihood


statistics


and


-2(0loglikelihood).


The


model


refitted


under


the


constraint


that


and


a parameter


are


equal


both


groups


-2(loglikelihood).


The


likelihood


ratio


test


significance


s the


difference


between


the


two


models


and


likelihood


ratio


test


assesses


significant


improvement


model


fit


as a consequence


allowing


two


parameters


to fluctuate.


likelihood


ratio


significant,


either


two


the


b parameter


groups,


or the


DIF


a parameter


detected.


is different


this


example,


simultaneously


testing


differences


both


parameters,


test


statistic


stributed


as a chi


-square


with


two


degree


freedom.


situation


ting


significance


only


item


difficulty,


stati


stic


would


tribute


as a chi


-square


with


one


degree


freedom.
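A sketch of the likelihood-ratio comparison described in the three steps above, assuming the two fits have already produced -2(loglikelihood) values (the numbers are hypothetical, not results from this study):

```python
import math

# -2(loglikelihood) from the constrained fit (a and b equal across groups)
neg2ll_constrained = 10452.8   # hypothetical value
# -2(loglikelihood) from the fit allowing a and b of the studied item to differ
neg2ll_free = 10444.1          # hypothetical value

g2 = neg2ll_constrained - neg2ll_free       # likelihood-ratio statistic
# With two parameters freed, G^2 is referred to a chi-square with 2 df;
# for 2 df the upper-tail probability has the closed form exp(-x/2).
p_value = math.exp(-g2 / 2.0)
print(f"G^2 = {g2:.1f}, df = 2, p = {p_value:.4f}")
```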










Lord's chi-square requires second-derivative approximations of the standard errors estimated as a part of the maximum likelihood item parameter estimation. The IRT-LR procedure does not require the estimated error variances and covariances; its significance test results from computing the likelihood of the overall model under the equality constraints placed upon the data and then estimating the probability under the null hypothesis (Thissen et al., 1988).

Lord's chi-square and the IRT-LR are capable of detecting nonuniform DIF and possess good statistical power (Cohen & Kim, 1993), but they tend to be expensive, require large sample sizes, and yield false-positive rates above the nominal levels (Kim et al., 1994). Linn (1981), with simulated data, and Shepard, Camilli, and Williams (1984), with actual data, demonstrated that significant differences detected by Lord's chi-square occurred even when the plotted ICCs were nearly identical. An additional problem when employing the IRT-LR is the need for a set of truly unbiased anchor items (Millsap & Everson, 1993).
Procedures for estimating the area between ICCs. Eight different DIF procedures have been developed to estimate the area between the reference group's and the focal group's ICCs. The procedures differ with respect to whether the interval is bounded or unbounded, whether continuous integration or a discrete approximation is used, and the weighting employed (Millsap & Everson, 1993). The first area procedures utilized a bounded interval with discrete approximations. Rudner (1977) suggested an unsigned index,

    UA = Σ_θ | P_R(θ) - P_F(θ) | Δθ,

with discrete intervals from θ = -3 to θ = 3. Rudner (1977) used small interval distances (e.g., .005) summed across the intervals. The estimated index is converted to a signed index by removing the absolute value operator.
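A sketch of the discrete approximation just described: the signed and unsigned area between the reference- and focal-group ICCs, summed over small intervals of θ from -3 to 3 (the item parameters are illustrative):

```python
import math

D = 1.7

def icc(theta, a, b):
    """Two-parameter logistic ICC."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

def area_indices(a_ref, b_ref, a_foc, b_foc, step=0.005, lo=-3.0, hi=3.0):
    """Discrete signed and unsigned area between two 2PL ICCs (Rudner-style)."""
    signed = unsigned = 0.0
    theta = lo
    while theta <= hi:
        diff = icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc)
        signed += diff * step
        unsigned += abs(diff) * step
        theta += step
    return signed, unsigned

print(area_indices(a_ref=1.0, b_ref=0.0, a_foc=1.0, b_foc=0.4))
```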


Shepard et al. (1984) extended the area procedures by introducing four signed and unsigned techniques that included sums of squared values, weights based upon the number of examinees in each interval along the θ scale, and weighting of the initial differences by the inverse of the estimated standard error of the difference. They determined that distinctively different interpretations occurred when signed area indices were estimated as compared to unsigned indices. They further found that the various weighting procedures influenced interpretations only slightly.
All of the area indices proposed by Shepard et al. (1984) utilized discrete approximations and lacked standard errors to permit significance tests for sample estimates. Raju (1988, 1990) augmented these procedures by devising an index measured by continuous integration over an unbounded interval and by deriving standard errors permitting significance tests. Raju (1988) proposed setting the c parameter equal for both groups and estimating the signed area as

    SA = (b_F - b_R)

and the unsigned area as

    UA = | [2(a_F - a_R) / (D a_F a_R)] ln{1 + exp[D a_F a_R (b_F - b_R) / (a_F - a_R)]} - (b_F - b_R) |.

Raju (1990) derived asymptotic standard error formulas for the signed and unsigned area measures that can be used to generate tests to determine the significance level of DIF under conditions of normality. Theoretically, Raju's procedure for measuring and testing the significance of the area between the ICCs of two groups over the entire θ score interval is a significant advancement over the earlier area procedures.
Raju et al. (1993), analyzing data from a 45-item vocabulary trial test contrasting girls and boys and black and white students, found that the significance tests of the area measures identified the identical aberrant items as Lord's chi-square. Raju et al. (1993) set the alpha rate at 0.001 to control Type I errors. Cohen and Kim (1993), comparing the two procedures, found that Lord's chi-square produced similar results to Raju's SA and UA, although Lord's chi-square appeared slightly more powerful in identifying simulated DIF.
DIF as a Consequence of Multidimensionality

In all procedures thus far reviewed, researchers have either conditioned an item response on an observed test score or on a latent ability estimate. Procedures using observed scores assumed that the total score has valid meaning in terms of the purported construct being measured. IRT procedures assumed that responses to a set of items are unidimensional even though examinees' scores may reflect composites of abilities. The potential for DIF can be conceptualized as occurring when a test consists of a targeted ability but item responses are also influenced by one or more nuisance determinants (Shealy & Stout, 1993a, 1993b). Under this circumstance, an item may be misinterpreted when the group means on the dimensions are not equal, when the ratio σ./σ. is not equal across groups, or when the correlations between the valid and nuisance dimensions are not equal (Ackerman, 1992).

The presence of multidimensionality in a set of items does not necessarily lead to DIF. For example, a quantitative achievement test used to predict future college ability may contain mathematical word problems requiring proficiency in reading skills. The test contains one primary dimension--quantitative ability; however, a second requisite skill--reading ability--is also measured and is valid for the specific usage. A unidimensional analysis applied to such multidimensional data would weight the relative discriminations of the multiple traits to form a reference composite (Ackerman, 1992; Camilli, 1992). Unless the focal and reference groups share a common reference composite, valid comparisons are not possible.
Since any test containing two or more items will be to some degree multidimensional, practitioners should define a validity sector to identify the set of test items measuring approximately the same composite of abilities (Ackerman, 1992). In DIF studies, the conditioning variable should consist only of items measuring the same composite of abilities. Conditioning on different composites of ability creates, in essence, the problem of trying to compare apples and oranges. The potential effect of this is to confound DIF with impact, resulting in spurious interpretations (Camilli, 1992).

The effect of multidimensionality on DIF analyses has resulted in limited consistency across methods (Skaggs & Lissitz, 1992) and across differing definitions of the conditioning variable (Clauser, Mazor, & Hambleton, 1991). Further, Linn (1993) observed that rigorous implementation to identify a proper set of test items may restrict validity. For example, on the SAT-Verbal (SAT-V), items with large biserial correlations with total score were more likely to be flagged than items with average or below-average biserial correlations. This finding suggested that traditional DIF analyses, using unidimensional conditioning, might in part be statistical artifacts confounding group ability differences and item discrimination. Differential item functioning procedures based upon a multidimensional perspective, conditioning on items clearly defined from a validity sector, have the potential to reduce these problems (Ackerman, 1992). Further, a multidimensional approach (Camilli, 1992) should also facilitate careful evaluation and explanation of DIF.
SIBTEST. Shealy and Stout (1993a, 1993b) have formulated a DIF detection procedure within a multidimensional conceptualization. They conceptualize a test as measuring a unidimensional target ability--that is, a valid trait or reference composite--influenced periodically by nuisance determinants. DIF is interpreted as the consequence of the differential effect of nuisance determinants functioning on an item or set of items.

The SIBTEST procedure employs factor analysis to identify a set of items that adheres to a defined validity sector. These items constitute the valid subtest, and the remaining items become the studied items. Examinees are divided into strata based upon the valid subtest score, and the DIF index is estimated as

    β̂ = Σ_k p̂_k (P̂_Rk - P̂_Fk)

where p̂_k is the pooled weighting of focal and reference group examinees who achieve valid-subtest score k, and P̂_Rk and P̂_Fk are the within-stratum proportions of reference and focal group examinees answering the studied item correctly. The β̂ value is identical to the value of P-DIF when the total number of examinees is used as the weighting group. Shealy and Stout (1993a) have referred to the standardization procedure as a "progenitor" (p. 161) of SIBTEST. They present a standard error for the index, SE(β̂), computed from the within-stratum proportions correct and the group sample sizes.
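The weighted index described above can be sketched as follows: examinees are stratified on the valid-subtest score, and the between-group differences in proportion correct on the studied item are pooled with weights p̂_k (here taken as the pooled proportion of examinees at each score level, so the result matches the standardization P-DIF value noted in the text). The data are invented for illustration.

```python
# Stratum-level summaries: (n_ref, correct_ref, n_foc, correct_foc) at each
# valid-subtest score level k (invented numbers for illustration only).
strata = [
    (40, 10, 60, 12),
    (80, 36, 90, 36),
    (90, 58, 70, 42),
    (50, 42, 30, 23),
]

n_total = sum(nr + nf for nr, _, nf, _ in strata)
beta_hat = 0.0
for n_ref, x_ref, n_foc, x_foc in strata:
    p_ref = x_ref / n_ref                 # proportion correct, reference group
    p_foc = x_foc / n_foc                 # proportion correct, focal group
    weight = (n_ref + n_foc) / n_total    # pooled weighting at this score level
    beta_hat += weight * (p_ref - p_foc)

print(f"weighted DIF index (P-DIF style) = {beta_hat:.3f}")
```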


With SIBTEST, the total score on the valid subtest serves as the conditioning criterion. The SIBTEST procedure therefore resembles methods in which an observed test score is the criterion; however, it incorporates an adjustment of the item means prior to comparing the groups on these means. This adjustment is an attempt to remove that portion of the group mean difference attributable to group mean differences on the valid targeted ability.

When the matching criterion is an observed score and the studied item is included in the criterion score, group differences in target ability will tend to statistically inflate the DIF estimate. Consequently, SIBTEST employs a correctional procedure based upon regression and true score theory. In effect, the purpose is to transform each group's observed mean score at each ability level into a transformed mean score, so that the transformed score removes that portion of the group mean differences attributable to group differences on the underlying targeted ability. This adjustment attempts to estimate the difference in subtest true scores for reference and focal group examinees matched on ability level. For this transformation to yield an unbiased estimate, the valid subtest must contain a minimum of 20 items (Shealy & Stout, 1993a).

SIBTEST is the only procedure based on conceptualizing DIF as a result of multidimensionality. Although it resembles the procedures that condition on observed scores, it offers a regression correction that conditions on estimated true scores. Under simulated conditions, the procedure demonstrates good adherence to nominal error rates even when group target ability distribution differences are extreme, and it has been shown to be as powerful as MH in the detection of uniform DIF (Shealy & Stout, 1993a). The multidimensional conceptualization and the identification of nuisance determinants can potentially lead to a greater understanding of the different causes of DIF (Camilli, 1992). The major weaknesses of SIBTEST are its inability to assist the user in detecting nonuniform DIF and the need for 20 or more items to fit a unidimensional validity sector. With a relatively short test or subtest, this latter weakness would be problematic under some practical testing conditions.
Methods Summary

After years of development, a plethora of sophisticated DIF procedures has been devised. Each method approaches DIF identification from a fundamentally different perspective, and each method contains advantages and limitations. Currently, no consensus among DIF researchers exists regarding a single theoretical or practical best method. The design of this study reflected that lack of consensus. Five different procedures, each possessing selected theoretical or practical appeal, were used to assess the item responses of the examinees. The design of the study was not to compare the reliability and validity of the methods themselves, but to assess the similarity of results obtained from the methods when subpopulations were defined in conceptually different ways.
Uncovering the Underlying Causes of DIF

The overwhelming majority of DIF researchers have focused on designing statistical procedures and evaluating their efficacy in detecting aberrant items. Few researchers have attempted to move beyond methodological issues to examine DIF's causes. The researchers broaching this topic have experienced few successes and many frustrations.

Schmitt et al. (1993) proposed that explanatory DIF studies can be classified into categories: (a) post hoc speculation, (b) hypothesis testing using item categories, and (c) hypothesis testing using item manipulations. DIF can be attributed to a complex interaction of variables between the item and the examinee (Scheuneman & Gerritz, 1990). Researchers are unlikely to find a single identifiable cause of DIF since it stems from both differences within examinees and item characteristics (Scheuneman, 1987). Researchers examining DIF from the perspective of examinee differences may uncover significant findings with policy implications for test-takers, educators, and policy makers. Scheuneman and Gerritz (1990) suggested that "prior learning, experience, and interest patterns between males and females and between Black and White examinees may [be] linked with DIF" (p. 129). Researchers examining DIF from the perspective of item characteristics may discover findings with strong implications for test developers and item writers; test developers may need to balance item content and item format to ensure fairness.

Post hoc evaluations, despite their limitations, dominate the literature (Freedle & Kostin, 1990; Linn & Harnisch, 1981; O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). Speculations about the causes of DIF have been offered in several of these studies (O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992).
Hypothesis testing using item categories is a second, more sophisticated means of uncovering explanations of DIF. Doolittle and Cleary (1987) and Harris and Carlton (1993) evaluated several DIF hypotheses on mathematics test items. Doolittle and Cleary (1987) employed ACT Assessment Mathematics Usage Test (ACT-M) items and a pseudo-DIF detection procedure to analyze differences across item categories and test forms. Male examinees performed better on geometry and mathematical reasoning items, whereas female examinees performed better on computation items. Harris and Carlton (1993), using SAT-Mathematics (SAT-M) items and the MH procedure, concluded that male examinees did better on application problems and female examinees did better on more textbook-type problems.

Scheuneman (1987) analyzed separate hypotheses concerning potential causes of DIF for black and white examinees by manipulating test items on the experimental portion of the general test. The hypotheses, analyzed through linear models, included examinee characteristics, such as test wiseness, and item characteristics, such as format. Complex interactions were found.
These results were consistent with an earlier post hoc review. Schmitt (1988) employed the STD P-DIF index with ANOVA and found that Hispanic examinees were favored on antonym items that included a common root in English and Spanish (a true cognate) and on reading passages containing material of interest to Hispanics. False cognates (words spelled similarly in both languages but containing different meanings) and homographs (words spelled alike in English but containing different meanings) tended to be more difficult for Hispanics. The differences were generally greater for Puerto Rican examinees as compared to Mexican-American examinees.

K. Tatsuoka, R. L. Linn, M. Tatsuoka, and Yamamoto (1988) studied DIF on a 40-item fractions test. They initially analyzed examinees by dividing them into groups based upon instructional methods. This procedure failed to provide an effective means of detecting DIF. However, upon subsequent review and analysis, they divided examinees into groups based upon the solution strategies used in solving the problems. With this grouping variable, they found DIF indices consistent with their a priori hypotheses. They concluded that the use of cognitive and instructional subgroup categories, although counter to traditional DIF practice, was promising.
Miller and Linn (1988) considered the invariance of item parameters on the Second International Mathematics Study (SIMS) examination across different levels of mathematical instructional coverage. Although their principal concern was the multidimensionality of achievement test data as related to instructional differences and model usefulness, they found that instructional differences could explain a significant portion of observed DIF. Using cluster analysis, they divided students into three instructional groups based upon teacher responses to an opportunity-to-learn questionnaire. The size of the differences in the ICCs of groups based upon instructional coverage was much greater than the differences observed in previously reported comparisons of black and white examinees. They interpreted these findings as supportive of Linn and Harnisch's (1981) postulation that what appears as item bias may in reality be "'instructional bias'" (p. 216).

Despite Miller and Linn's (1988) straightforward interpretation of instructional experiences, Doolittle (1984, 1985) found that instructional differences did not account for or parallel gender DIF on ACT-M items when high school math background was dichotomized into strong and weak groups. Items that tended to favor female examinees did not favor low background examinees, or vice versa. Correlations of the DIF indices were negative, suggesting that gender DIF was unrelated to math background DIF.

Muthen, Kao, and Burstein (1991), analyzing core items of the SIMS test, found several items to be sensitive to instructional effects. In approaching DIF from an alternative methodological perspective, they employed linear structural modeling to assess the effects of instruction on latent mathematics ability and on item performance. They found that instructional effects had negligible effects on math ability but had a significant influence on specific test items. Several items appeared particularly sensitive to instructional influences. They interpreted the identified items as less an indicator of general mathematics ability and more an indicator of exposure to a specified math content area.

In using linear structural modeling, Muthen et al. (1991) avoided arbitrariness in defining group categories and a situation where group membership varied across items. The SIMS data permitted estimation of instructional background for each of the core items. Under most testing conditions, estimating examinee background item by item is not feasible, but nuisance dimensions can be estimated. Analyzing the relationship of theoretical causes to nuisance dimensions combines the approach of Muthen et al. (1991) with that of Shealy and Stout (1993a, 1993b).
Summary

Researchers investigating the underlying causes of DIF have produced few significant results. After many years of DIF studies, conclusions about test wiseness (Scheuneman, 1987) or Hispanic tendencies on true cognates (Schmitt, 1988) must be interpreted as meager guidance for test developers and educators. These limited results can be explained by problems inherent in traditional DIF procedures (Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Indices derived using observed total scores as the conditioning variable have been observed to be confounded with item difficulty (Freedle & Kostin, 1990) and item discrimination (Linn, 1993; Masters, 1988). Indices derived from IRT models are conceptualized from a unidimensional perspective, yet DIF is a product of multidimensionality (Ackerman, 1992; Camilli, 1992). Consequently, DIF detection procedures have been criticized for a lack of reliability between methods and across samples (Hoover & Kolen, 1984; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988).

The uninterpretability of findings may also occur because group membership is only a weak surrogate for variables of greater psychological or educational significance. For example, demographic categories (e.g., women or blacks) lack any psychological or educational explanatory meaning. Moving beyond demographic subgroups to more meaningful categories would expedite understanding of the causes of DIF (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Although this conceptualization has been advocated, it has been used sparingly. Doolittle (1984, 1985), Miller and R. L. Linn (1988), Muthen et al. (1991), and K. Tatsuoka et al. (1988) used this conception and appeared to have reached promising, if incompatible, interpretations. Future researchers need to apply alternative approaches to DIF analyses to achieve greater explanatory power. The approaches advocated by Muthen et al. (1991) and Shealy and Stout (1993a, 1993b) provide sound methods that potentially permit modeling of differing influences on item responses.
Gender and Quantitative Aptitude

Educational and psychological researchers have been concerned with gender differences in scores on quantitative aptitude tests (Benbow, 1988; Benbow & Stanley, 1980). Quantitative aptitude has been described as a "critical filter" that prohibits many women from having access to high-paying and prestigious occupations (Sells, 1978). Although gender differences in quantitative ability interact with development--with elementary children demonstrating no differences or differences slightly favoring girls--by late adolescence and early adulthood, when college entrance examinations are taken and critical career decisions are made, slight differences appear favoring boys (Fennema & Sherman, 1977; Hyde, Fennema, & Lamon, 1990). In studies linking gender differences in quantitative test scores with the underrepresentation of women in prestigious technical careers, analyses should be limited to tests taken in late adolescence or early adulthood that significantly influence career decisions and opportunities.
Significant and Important Test Score Differences

Standardized achievement tests utilizing representative samples (e.g., the National Assessment of Educational Progress, High School and Beyond) and college admissions tests utilizing self-selected samples (e.g., SAT, ACT, GRE) have been analyzed to ascertain gender differences. Gender differences found in representative samples are systematically different from those found in self-selected samples (Feingold, 1992). Women appear less proficient in the self-selected samples, which consist of examinees who successfully matriculate through a process that relies heavily upon admissions test scores. Therefore, in studying quantitative differences with a primary concern related to career decisions and opportunities, self-selected admissions test scores are the most germane measures for analysis.

Gender differences in quantitative performance have been studied meta-analytically (Friedman, 1989; Hyde et al., 1990). M. C. Linn and Hyde (1989) concluded from meta-analytic evidence that "average quantitative gender differences have declined to essentially zero" (p. 19) and that differences in quantitative aptitude can no longer be used to justify the underrepresentation of women in technical professions. Feingold (1988) assessed gender differences on several cognitive measures of the Differential Aptitude Test (DAT) and on the SAT and concluded that gender differences are rapidly diminishing; the one exception to this finding was the SAT (Feingold, 1988). Although mean differences had either substantially diminished or vanished on DAT measures of numerical ability, abstract reasoning, space relations, and mechanical reasoning during the past years, SAT-M differences have remained relatively constant.

Despite the finding that gender differences are disappearing on many mathematical ability tests, gender differences persist on the major college entrance examinations; men score higher on the SAT-M than women (National Center for Education Statistics, 1993). This difference can also be stated in units of an effect size of 0.39 (which represents the difference between the means divided by the pooled standard deviation).
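The effect size referred to here is the standardized mean difference. A small sketch of the computation, using hypothetical group summary statistics rather than the figures cited in the text:

```python
import math

# Hypothetical group summary statistics (mean, standard deviation, n).
mean_m, sd_m, n_m = 500.0, 115.0, 52000
mean_w, sd_w, n_w = 455.0, 110.0, 50000

# Pooled standard deviation, then the standardized mean difference.
sd_pooled = math.sqrt(((n_m - 1) * sd_m**2 + (n_w - 1) * sd_w**2) / (n_m + n_w - 2))
d = (mean_m - mean_w) / sd_pooled
print(f"effect size d = {d:.2f}")
```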


The trends regarding gender differences on the ACT-M are similar. The ACT scale ranges up to 39 points, and the mean difference favoring male examinees from 1978 to 1987 was 2.33 points (an effect size of 0.33) (National Center for Education Statistics, 1993). This score differential has been relatively consistent and provides no indication of disappearing.

The greatest disparity between men's and women's mean scores occurs on the GRE-Quantitative (GRE-Q). For the 1986-1987 testing years, U.S. male examinees averaged substantially higher than U.S. female examinees (Educational Testing Service, 1991); transformed into effect sizes, these differences were as large as 0.62. Gender mean score differences on the GRE-Q, in large part, reflect gender differences in choice of major field. Particularly in the case of graduate admissions tests, mean scores are confounded with gender differences in choice of undergraduate major. Analyzing GRE-Q data by intended field of study illustrates this confounding: within one group of intended majors the effect size favoring men was .19, whereas for examinees intending to major in the humanities and education in the same testing years, mean score differences favoring men reached 37 points (an effect size of .31). Averaging across all identified intended fields of study, the mean score difference favoring men corresponded to d = .35 (Educational Testing Service, 1991). Although data were available for only the 1986-87 testing years, the mean score differences and effect sizes appear to indicate that U.S. male examinees tend to score higher than U.S. female examinees on the GRE-Q, a pattern consistent with the SAT-M and ACT-M.

Despite changes in curriculum and text materials that depict both genders in less stereotypic manners (Sherman, 1983) and reductions in gender differences on many mathematics tests (Feingold, 1988), gender differences on college admissions quantitative tests are significant and appear not to be diminishing. Due to the importance of these tests in college admission decisions and the awarding of financial aid, the lack of parity in scores tends to reduce opportunities for women (Rosser, 1989).
Predictive Validity Evidence

Mean score differences on quantitative admissions tests have been offered as evidence that admission tests are biased against women (Rosser, 1989). Defenders of the use of college admission tests argued that other relevant factors explain the phenomenon (McCornack & McLeod, 1988; Pallas & Alexander, 1983). They postulated that women tend to enroll in major fields where faculty tend to grade less rigorously; women are more likely to major in the humanities, whereas men are more likely to major in the sciences. Investigators who analyze the differential predictive validity of college admissions exams, therefore, must consider gender differences in course enrollment patterns.

McCornack and McLeod (1988) and Elliot and Strenta (1988) found that, when differential course taking patterns were considered, SAT-V and SAT-M coupled with high school grades were not biased in predicting achievement for men and women. McCornack and McLeod (1988) considered performance in introductory level college courses at a state university and used SAT composites with high school grade point average; they found no predictive bias when analyzing data at the course level. Elliot and Strenta (1988) considered performance in various college-level courses at a private university utilizing SAT composites with scores from a college placement examination and high school rank. Both studies were flawed in that they combined the various predictors; had they studied the predictors separately (e.g., SAT-M and high school grades), they might have arrived at a different interpretation.

Bridgeman and Wendler (1991) and Wainer and Steinberg (1992) conducted more extensive studies and concluded that, in equivalent mathematics courses, the SAT-M tends to underpredict the college performance of women. Bridgeman and Wendler (1991) studied the SAT-M as a predictor of college mathematics course performance at nine colleges and universities. They divided mathematics courses into three categories and found that, in algebra and pre-calculus courses, women's achievement was underpredicted and, in calculus courses, no underprediction occurred. The most extensive study to date concerning the predictive validity of the SAT was conducted by Wainer and Steinberg (1992). Analyzing nearly 47,000 students at 51 colleges and universities, they concluded that, for students taking the same relative course and receiving the same letter grade, the SAT-M underpredicted women's achievement. Using a backward regression model, they estimated that women, earning the same grades in similar courses, tended to score roughly 25-30 points less on the SAT-M.

Men thus score higher on quantitative admission exams, although women generally outperform men in high school and college courses. The principal explanation offered for this paradox is gender differences in course taking. Researchers investigating the relationship of quantitative admission tests to subsequent achievement, controlling for course taking patterns and course performance in mathematics courses, have concluded that, in equivalent courses, the tests underpredict women's achievement. Although the underprediction is not as large as the mean score differences, quantitative admission tests appear to be biased in underpredicting women's college achievement. It is recognized that predictive bias and DIF are fundamentally distinct; however, the determination of predictive bias in quantitative admission tests makes them an evocative instrument for DIF analysis.
Potential Explanations of DIF

This study will approach DIF from the perspective of examinee characteristics. When analyzing DIF explanations from this perspective, theoretical explanations of predictive bias offer a reasonable point of departure. Kimball (1989) presented three theoretical explanations for the paradoxical relationship between gender differences in admissions test scores and college grades: men take more mathematics courses than women; boys and girls tend to develop different learning styles; and men tend to prefer novel tasks whereas women tend to prefer familiar tasks. To these three theoretical explanations, I would submit a fourth explanation related to test-taking behavior--differences between men and women in test anxiety.
Differences in Mathematics Background

It is well documented that as students enter high school and proceed toward graduation, boys tend to take more mathematics courses than girls (Fennema & Sherman, 1977; Pallas & Alexander, 1983). During the 1980s, high school boys averaged 2.92 Carnegie units of mathematics, whereas high school girls averaged fewer Carnegie units (National Center for Education Statistics, 1993). Although high school girls entered the upper-track ninth grade mathematics curriculum in slightly greater numbers than boys, by graduation boys outnumbered girls in advanced courses such as calculus and trigonometry. High school boys were also more likely than girls to study computer science and physics. These trends continue as students enter college (National Center for Education Statistics, 1993). During the 1980s, men slightly outnumbered women in achieving undergraduate mathematics degrees and overwhelmingly outnumbered women in attaining undergraduate degrees in computer science and physics (National Center for Education Statistics, 1993).

Researchers investigating the relationship between mathematics background and test scores have found that, when enrollment differences are controlled, gender differences on mathematical reasoning tests are reduced (Fennema & Sherman, 1977; Pallas & Alexander, 1983; Ethington & Wolfle, 1984). Gender score differences on the SAT-M, when high school course taking was controlled, were reduced by approximately two-thirds (Pallas & Alexander, 1983) and by one-third (Ethington & Wolfle, 1984). These studies analyzed total score differences controlling for course background. Miller and Linn (1988) and Doolittle (1984, 1985) analyzed item differences controlling for instructional differences, but their results were contradictory. Background differences offer a plausible explanation that implores additional investigation.
Rote Versus Autonomous Learning Styles

Boys tend to develop a more autonomous learning style, which facilitates performance on mathematics reasoning problems, and girls tend to develop a rote learning style, which facilitates classroom performance (Fennema & Petersen, 1985). Students displaying autonomous learning behaviors work independently, are more motivated, and are more likely to persevere on difficult tasks presented in a novel format. Students displaying rote learning behavior tend to do well applying memorized algorithms learned under teacher direction. Often, these students are heavily dependent upon the teacher and, when given an option, tend to choose less challenging class tasks. This dichotomy is congruent with the finding that girls tend to perform better on computational problems and boys tend to perform better on application and reasoning problems (Doolittle & Cleary, 1988; Harris & Carlton, 1992).

The autonomous versus rote learning style theory is consistent with the literature addressing gender socialization patterns and standardized test performances. Before it can be further applied, however, it must be more completely operationalized (Kimball, 1989). To validate this theory, researchers must demonstrate that boys and girls approach the study of mathematics differently, and then relate learning styles to achievement on classroom assessments and standardized tests (Kimball, 1989).
Novelty Versus Familiarity

Kimball (1989) hypothesized that girls tend to be more motivated to do well and are more confident when working on familiar classroom assessments, and therefore demonstrate higher achievement there, whereas boys tend to do better on novel standardized tests. This theory is based on the work of Dweck and her colleagues (Dweck, 1986; Elliot & Dweck, 1987; Licht & Dweck, 1983), who related attributions to learning and achievement. Students with a performance orientation and low confidence tend to avoid difficult and threatening tasks; they prefer familiar, non-threatening tasks and seek to avoid failure. Students with a performance orientation and high confidence are more likely to select moderately challenging tasks.

Consistent findings demonstrate that girls tend to have less confidence in their mathematical abilities than boys (Eccles, Adler, & Meece, 1984; Licht & Dweck, 1983). Girls are also more likely on standardized tests to leave items unanswered or to mark "I don't know" when given this option (Linn, DeBenedictis, Delucchi, Harris, & Stage, 1987). Girls, more than boys, attribute their successes in mathematics to effort rather than ability and their failures to lack of ability (Fennema, 1985; Ryckman & Peckham, 1987). Therefore, girls generally have less confidence in their abilities, are less motivated on novel mathematical tasks because of that lower confidence, find such tasks more threatening, and perform less well.
Test Anxiety

Test anxiety influences examinee performance on aptitude and achievement tests. High test anxiety individuals tend to score lower than low test anxiety individuals of comparable ability (Hembree, 1988; Sarason, 1980). Because aptitude and achievement tests are not intended to include test anxiety as a component of the total score, and because millions of elementary and secondary pupils are estimated to have substantial test anxiety (Hill & Wigfield, 1984), test anxiety exemplifies a nuisance factor influencing item responses.

Test anxiety has been theorized in both cognitive and behavioral terms (Hembree, 1988; Sarason, 1984; Spielberger, Gonzales, Taylor, Algaze, & Anton, 1978; Wine, 1980). Liebert and Morris (1967) proposed a two dimensional theory of test anxiety, consisting of worry and emotionality. Worry includes expressions of concern about one's performance and the consequences stemming from inadequate performance. Emotionality refers to the autonomic reactions to test situations (e.g., increased heartrate, stomach pains, and perspiration). Hembree (1988) used meta-analysis of test anxiety studies and found that, although both dimensions related significantly to performance, worry was more strongly correlated with test scores. The mean correlation of emotionality with aptitude/achievement test performance was -0.15, with worry showing the stronger negative correlation.

Wine (1980) proposed a cognitive-attentional interpretation of test anxiety in which examinees who are high or low on test anxiety experience different thoughts when confronted with test situations. The low test anxious individual experiences relevant thoughts and attends to the task. The high test anxious individual experiences self-preoccupation and is absorbed in thoughts of failure. These task-irrelevant cognitions not only create unpleasant experiences, but serve as major distractions.

Sarason (1984) proposed the Reactions to Tests (RTT) scale based upon a cognitive, emotional, and behavioral model. The 40-item Likert-scaled questionnaire operationalized a four dimensional test anxiety model of worry, tension, bodily symptoms, and test-irrelevant thinking. Benson and Bandalos (1992), in a confirmatory factor analysis and cross-validation of the RTT, found a large number of items problematic; they speculated that similarly worded items contributed to the model misfit. Through item deletion, they found substantial support for the four-factor structure. To further validate the structure of test anxiety, Benson, Moulin-Julian, Schwarzer, Seipp, and El Zahhar (1991) combined the Test Anxiety Inventory and the RTT to formulate a new scale, the Revised Test Anxiety scale (RTA).

The cognitive and emotional structure of math anxiety is closely related to test anxiety. Richardson and Woolfolk (1980) demonstrated that math anxiety and test anxiety were highly related, and that mathematics testing provided a superb context for studying test anxiety. They reported correlations between inventories of test anxiety and math anxiety ranging to near 0.65. They commented that taking a mathematics test with a time limit under instructions to do as well as possible "appears to be nearly as threatening as a real-life test for most mathematics-anxious individuals" (p. 271).

Children in first and second grade indicate inconsequential test anxiety levels, but test anxiety emerges in third grade and increases in severity until sixth grade. Female students tend to possess higher test anxiety levels than male students at all grade levels (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988). Some behavioral and cognitive-behavioral treatments have been demonstrated to effectively reduce test anxiety and lead to increases in performance (Hembree, 1988). This finding supports the causal direction of test anxiety producing lower performance, as well as test anxiety's multidimensional structure.

In cases where test anxiety might unduly influence performance, such influence forces model misfit at the test item level. High test anxiety individuals may find some items differentially more difficult than other test items.
Summary

I have reviewed several different methods of identifying DIF. In large part because of its computational efficiency, MH has emerged as the most widely used method. It is limited in terms of flexibility, and as researchers continue to search for underlying explanations of DIF, its limitations will become more apparent. Logistic regression models (Swaminathan & Rogers, 1990) provide an efficient method that has greater flexibility than MH and potentially models theoretical causes of DIF. Raju's (1988) signed and unsigned area measures supply a theoretically sound method of contrasting item response patterns. Shealy and Stout's SIBTEST (1993a, 1993b) conceptualizes DIF as a multidimensional phenomenon and defines a validity sector as the conditioning variable. Its sound theoretical foundation, coupled with computational efficiency and explanatory potential, makes it perhaps the most comprehensive DIF procedure. These five approaches were employed in the study. Linear structural modeling contributed to the findings of the validation study; thus, the significance of the validation and the consistency of DIF estimation were considered.

Gender DIF on quantitative test items was taken as the context of this study. This context will serve because of the paradoxical finding that men tend to score higher on standardized tests of math reasoning, although women tend to achieve equivalent or higher course grades. In common with other categorical-variable DIF studies, examinees will be dichotomized into groups: substantial and weak mathematics background, and high and low test anxiety. These variables supplement gender. The study is based on the premise that gender differences serve as a surrogate for differences in mathematics background and test anxiety. The two variables were selected in an effort to explain DIF in terms consistent with theoretical explanations of gender differences in mathematics test scores and course achievement. Mathematics background has been applied in other DIF studies with inconsistent interpretations. Test anxiety is of interest to both educators and cognitive psychologists and is highly related to performance. This study is an attempt to determine the consistency of these DIF indices and detection methods and to improve the use of the grouping variable.
CHAPTER
METHODOLOGY

The present study was designed to investigate the inter-method consistency of five separate differential item functioning (DIF) indices and associated statistical tests when defining subpopulations by educationally significant variables as well as by the commonly used demographic variable of gender. The study was conducted in the context of college admission quantitative examinations and gender issues. The study was designed to evaluate the effect on DIF indices of defining subpopulations by gender, mathematics background, and test anxiety. Factor analytic procedures were used to define a structurally valid subtest of items. Following the identification of a valid subtest, the DIF analysis was repeated. The findings of the DIF analysis before validation were contrasted with the DIF analysis based on the valid subtest. A description of the examinees, instruments, and data analysis methods is presented in this chapter.
Examinees

The data pool to be analyzed consisted of test scores and item responses from 1263 undergraduate college students. The sample consisted of both women and men. I solicited the help of various instructors in the colleges of education and business, and in most cases students participated in the study during their class time. Of the total sample of examinees, some individuals were tested in classes in the college of education, some were tested in classes in the college of business, and some were tested at other sites on campus. Women examinees with little mathematics background were the largest group in the college of education classes, and men examinees with substantial mathematics background were the largest group in the college of business classes (see the table in the Appendix for examinee frequencies by test setting, gender, and mathematics background). The majority of students received class credit for participating. No remuneration was provided to any participant. All students had previously taken a college admission examination, and some of the students had also taken the Graduate Record Examination-Quantitative Test (GRE-Q).
Instruments

The operational definition of a collegiate-level quantitative aptitude test was a released form of the GRE-Q. Test anxiety was operationally defined by a widely used, standardized measure, the Revised Test Anxiety Scale (RTA). The mathematics background variable was measured using the dichotomous response to an item concerning whether or not the student had completed a particular advanced mathematics class at the college level (i.e., calculus). In the following sections, a more detailed description of each of these instruments is presented, accompanied by particular technical information that supports the purpose and use of the instruments in the study.

Released GRE

Each examinee completed a released form of the GRE-Q, a 30-item test supplied by the Educational Testing Service and administered as a 30-minute timed examination. The sample test included "many kinds of questions that are currently used" in GRE forms (ETS, 1993). The test was designed to measure the basic mathematical skills and concepts required to solve problems in quantitative settings. It was divided into sections. The first section required the examinee to reason accurately about the relative size of two quantities by comparing them, or to recognize when insufficient information had been provided to make such a comparison. The second section, employing multiple choice items, assessed the ability to perform computations and manipulations of quantitative symbols and to solve word problems in applied or abstract contexts. The instructional background required to answer the items was described as "arithmetic, algebra, geometry, and data analysis" and as "content areas usually studied in high school" (ETS, 1993, p. 18). The internal consistency of the test for the 1263 participants was relatively good as indexed by the KR-20 coefficient.
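KR-20 is the internal-consistency coefficient for dichotomously scored items. The following sketch (invented data, not responses from this study) shows how it is computed from an item-response matrix:

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a list of 0/1 response vectors (one per examinee)."""
    n_items = len(responses[0])
    n_examinees = len(responses)
    totals = [sum(r) for r in responses]
    mean_total = sum(totals) / n_examinees
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(r[j] for r in responses) / n_examinees   # item difficulty (proportion correct)
        pq_sum += p * (1.0 - p)
    return (n_items / (n_items - 1.0)) * (1.0 - pq_sum / var_total)

# Tiny invented example: five examinees, four items.
data = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
print(round(kr20(data), 3))
```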


In the pilot study, sample test correlations with the GRE-Q and with the Scholastic Aptitude Test-Mathematics (SAT-M), for examinees who had taken those tests, were 0.67 and .79, respectively, and the scores on the released GRE were similar to the scores examinees had earned on other college admission quantitative examinations.
Revised Test Anxiety Scale (RTA)

The RTA scale (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991) was formed by combining, within one theoretical framework, two recognized measures of test anxiety: the Test Anxiety Inventory (TAI) and the Reactions to Tests (RTT) (Sarason, 1984). The TAI was based upon a two-factor theoretical conception of test anxiety--worry and emotionality (Liebert & Morris, 1967). Sarason (1984) augmented this conceptualization with a four-factor model of test anxiety--worry, tension, bodily symptoms, and test irrelevant thinking. To capture the best qualities of both scales, Benson et al. (1991) combined the instruments to form the RTA scale. They intended that the combined scale would capture Sarason's four proposed factors. From the original combined pool of items, using a sample of college students from three countries, they eliminated items on the basis of items not loading on a single factor, having low item/factor correlations, or having low reliability. They retained items loading on the intended factor and containing high item reliability. The bodily symptoms subscale, containing only a few items, was problematic due to low internal reliability. Consequently, Benson and El-Zahhar (1994) further refined the RTA scale and developed a 20-item scale with four factors and relatively high internal reliability (see the table below).

With a sample of 562 college students from two countries, randomly split into two subsamples, they cross-validated the structure, examining the factor correlations and item uniquenesses. Descriptive statistics for each subscale of the RTA for Benson and El-Zahhar's (1994) American sample and for this study's sample are reported in the table below. The instrument was selected because the evidence of its reliability and construct validity compared favorably with that of other leading test anxiety scales used with college students.

Table
Descriptive Statistics for the 20-Item RTA Scale

Scale                       Benson & El-Zahhar           Study Sample
                            American Sample (N = 202)    (N = 1263)

Total Scale                 38.31 (10.40)                39.17 (9.37)
Worry                       11.61                        12.03 (3.50)
Tension                     (3.85)
Test Irrelevant Thinking     6.79
Bodily Symptoms              7.54 (2.79)                  7.35

Note. The first entry in each column is the subscale mean; the entry in parentheses is the standard deviation.
Mathematics


Background


Researchers


have


experienced


problems


selecting


best


approach


measure


subjects


' mathematics


background


(Doolittle,


subjects'


1984).


background


Typically,


include


methods


asking


cla


subjects


ssifying


to report


number


mathematics


credits


earned


or semesters


studied

Pajares


(Doolittle,

& Miller, 1


1984,


1985;


.994)


Hacket

asking


& Betts

subjects


, 1989;

a series


que


tion


related


to specific


courses


studied


(Chipman,


Marshall,


Scott,


1991).


Asking subjects questions concerning their course background implies that one or two "watershed" mathematics instructional courses qualitatively capture subjects' mathematics background. To decide which of these two options to employ in this study, I conducted a pilot study to ascertain whether measuring examinees' mathematics background quantitatively, by counting mathematics credits earned, or qualitatively, by identifying a watershed mathematics course, was more useful.


In a pilot study, undergraduates were asked to answer the five questions posed by Chipman et al. (1991) and to report the number of college credits earned in mathematics (see Appendix C for the questions and the scoring scheme used by Chipman et al., 1991).


Subjects were first divided into two background groups on the basis of the number of college mathematics credits they reported. The subjects were then divided using their responses to the single question about successful completion of a college calculus course.


The two methods of dividing subjects into two background groups had an 84% agreement rate; however, correlations of these two predictors with performance on the GRE-Q and SAT-M indicated that the dichotomous calculus completion question was more valid for the students in this study. The pattern of relationships between these tests, the calculus question, and mathematics credits indicated that, for these college students, calculus completion had a stronger relationship with the test scores (.51) than did the number of mathematics credits earned (.40) (see Table 5).


Table 5
Correlations of Calculus Completion, SAT-M, GRE-Q, and College Mathematics Credits

                          SAT-M        GRE-Q        Credits
Calculus Completion       .51 (58)     .50 (55)     .49 (141)
Total Credits             .08 (58)     .40 (55)

In a continuation of the pilot study, examinees were divided between those who reported that they had successfully taken a college calculus course and those who reported that they had not. The examinees reporting successful completion of a college calculus course had earned an average of 13.3 college mathematics credits, whereas the students reporting that they had not successfully completed a college calculus course had earned an average of 5.7 college mathematics credits.


Therefore, in this sample there was substantial evidence that calculus courses serve as a watershed to other, more advanced mathematics courses and that completion of a calculus course could be used to differentiate students in terms of mathematics background.
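The pilot comparison above reduces to correlating each candidate background measure with quantitative test scores. The fragment below is a sketch of that comparison with fabricated arrays standing in for the pilot data (none of these values come from the study); the dichotomous calculus-completion indicator is correlated with the score via a point-biserial coefficient.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 150
calculus = rng.integers(0, 2, size=n)                    # 1 = reported completing college calculus
credits = calculus * 8 + rng.poisson(5, size=n)          # completers tend to have earned more credits
score = 500 + 60 * calculus + 2 * credits + rng.normal(0, 80, size=n)  # synthetic quantitative score

r_calculus, _ = stats.pointbiserialr(calculus, score)
r_credits, _ = stats.pearsonr(credits, score)
print(f"calculus completion vs. score: r = {r_calculus:.2f}")
print(f"credits earned vs. score:      r = {r_credits:.2f}")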


Subsequently, mathematics background was operationalized by having each examinee answer the following question: "Have you successfully completed a college-level calculus course?" Examinees responding yes were classified as having a substantial background, and examinees responding no were classified as having little background. Utilizing examinees' responses to the calculus completion question was justified because of the high degree of agreement between calculus completion and students' college course backgrounds and because of the higher correlation of calculus completion with students' SAT-M and GRE-Q scores. Calculus completion was therefore used to define the sample's mathematics background subpopulations for applying the DIF procedures.


Analysis

Testing Procedures and Subpopulation Definitions


Prior to taking the released GRE-Q, examinees answered the Differential Item Functioning Questionnaire (see Appendix B), which included the RTA scale.


It contained demographic questions and provided information regarding each examinee's gender, mathematics background, and test anxiety. Examinees were classified as having substantial or little mathematics background by answering the question concerning completion of a college calculus course.


Of the 1263 participants, those who reported that they had completed a college calculus course were contrasted with those who reported that they had not. Frequency counts and percentages for mathematics background by gender are presented in Table 6.


Men and women did not possess similar mathematics backgrounds; in this sample, men reported completing a college calculus course at a higher rate than women.


High and low test anxious groups were formed in the following manner. Examinees scoring in approximately the highest portion of the distribution of RTA scores were defined as possessing high levels of test anxiety.











Table 6
Frequencies and Percentages for Gender and Mathematics Background
(frequencies and percentages of substantial and little mathematics background for women, men, and the total sample)


Examinees scoring in approximately the middle portion of the distribution were defined as possessing moderate levels of test anxiety, and examinees scoring in approximately the lowest portion of the distribution were defined as possessing low levels of test anxiety. For the DIF analyses, examinees classified as possessing moderate levels of test anxiety were not contrasted; comparisons were made between the high and low test anxiety examinees. Women tended to be classified as having high test anxiety at greater rates than men.
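Forming the anxiety groups amounts to a percentile split on total RTA scores. The cut points in the sketch below are assumptions (the exact percentages used in the study are not recoverable from this copy), and the scores are simulated.

import numpy as np

rng = np.random.default_rng(2)
rta_total = rng.normal(39, 9, size=1263).round()          # simulated 20-item RTA totals

# Assumed cut points: lowest, middle, and highest thirds of the distribution.
low_cut, high_cut = np.percentile(rta_total, [33.3, 66.7])
group = np.where(rta_total <= low_cut, "low",
        np.where(rta_total >= high_cut, "high", "moderate"))

# Only the low and high groups enter the test-anxiety DIF contrasts.
print({g: int((group == g).sum()) for g in ("low", "moderate", "high")})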


Following completion of the questionnaire, examinees answered the 30-item GRE-Q. Examinees received standard instructions and were told they had 30 minutes to complete the test. Examinees were requested to perform to the best of their ability and were told that, following the test, if they desired, they could learn their results.


DIF Estimation


The five different methods of estimating DIF were the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988), Item Response Theory-Signed Area (IRT-SA) and Item Response Theory-Unsigned Area (IRT-UA) (Raju, 1988, 1990), the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993b), and logistic regression (Swaminathan & Rogers, 1990).


A distinction was made between uniform and alternate measures. Uniform and nonuniform methods estimate DIF in fundamentally different ways; when nonuniform DIF exists, the two approaches produce unique findings (Shepard, Camilli, & Williams, 1984).


Consequently, the five methods were divided into two groups. Mantel-Haenszel, IRT-SA, and SIBTEST formed the uniform measures of DIF. Logistic regression and IRT-UA, along with MH, formed the alternate measures. Methods designed to measure nonuniform DIF have not been used extensively by test practitioners in actual testing circumstances, indicating that they assume nonuniform DIF is either trivial or a statistical artifact.


By examining the relationship between the DIF indices estimated by the uniform methods and those estimated by IRT-UA and logistic regression, researchers will be able to determine whether important information is lost when only uniform methods are used.
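For the uniform indices, the Mantel-Haenszel common odds ratio and its D-DIF transformation can be computed directly from a J x 2 x 2 layout of item data, conditioning on total score. The sketch below is a minimal illustration on simulated responses; it is not the SIBTEST program used in the study, and the item and score variables are assumptions.

import numpy as np

def mh_d_dif(item, total, group):
    """Mantel-Haenszel common odds ratio and D-DIF for one dichotomous item.
    item: 0/1 responses; total: matching score; group: 0 = reference, 1 = focal."""
    num = den = 0.0
    for s in np.unique(total):
        ref = (total == s) & (group == 0)
        foc = (total == s) & (group == 1)
        a, b = (item[ref] == 1).sum(), (item[ref] == 0).sum()   # reference right/wrong
        c, d = (item[foc] == 1).sum(), (item[foc] == 0).sum()   # focal right/wrong
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += c * b / t
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)   # negative D-DIF values indicate DIF against the focal group

rng = np.random.default_rng(3)
group = rng.integers(0, 2, size=2000)
theta = rng.normal(size=2000)
item = rng.binomial(1, 1 / (1 + np.exp(-(theta - 0.4 * group))))   # item is harder for the focal group
total = item + rng.binomial(29, 1 / (1 + np.exp(-theta)))          # crude 30-item number-correct score
alpha, d_dif = mh_d_dif(item, total, group)
print(round(alpha, 2), round(d_dif, 2))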


Mantel-Haenszel indices and tests of significance were estimated using the SIBTEST program (Stout & Roussos, 1992).


Item Response Theory signed and unsigned area indices and tests of significance were estimated using PC-BILOG (Mislevy & Bock, 1990) in combination with SAS 6.03 (SAS Institute, Inc., 1988).
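Raju's signed and unsigned area statistics compare the item characteristic curves estimated separately for the two groups. The sketch below only approximates those areas numerically for an illustrative pair of 3PL curves; it is not PC-BILOG output and does not use Raju's closed-form expressions, and the item parameters are invented.

import numpy as np

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic item characteristic curve."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def area_indices(ref_params, focal_params, lo=-4.0, hi=4.0, steps=2001):
    """Numerically approximated signed and unsigned areas between two ICCs."""
    theta = np.linspace(lo, hi, steps)
    diff = icc(theta, *ref_params) - icc(theta, *focal_params)
    return np.trapz(diff, theta), np.trapz(np.abs(diff), theta)

# Same discrimination and guessing; the item is 0.4 logits harder for the focal group.
signed, unsigned = area_indices((1.0, 0.0, 0.2), (1.0, 0.4, 0.2))
print(round(signed, 3), round(unsigned, 3))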


SIBTEST indices and tests of significance were estimated using the SIBTEST program (Stout & Roussos, 1992).


Logistic regression indices and tests of significance were estimated through SAS 6.03 (SAS Institute, Inc., 1988).


Thus, each of the 30 test items was analyzed with three different subpopulation definitions and five different procedures, producing for each item 15 distinct DIF indices and tests of significance.
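The logistic regression analysis can also be reproduced with general-purpose statistical libraries rather than SAS. The sketch below (simulated data and an assumed model setup, not the study's SAS code) fits the compact and augmented models of Swaminathan and Rogers (1990) and evaluates the joint uniform and nonuniform DIF effect with a two-degree-of-freedom likelihood ratio test.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 2000
group = rng.integers(0, 2, size=n)                           # 0 = reference, 1 = focal
ability = rng.normal(size=n)                                 # stand-in for the matching variable
p = 1 / (1 + np.exp(-(-0.2 + 1.1 * ability - 0.5 * group)))  # uniform DIF built into the simulated item
item = rng.binomial(1, p)

reduced = sm.add_constant(ability)                                            # intercept + ability
full = sm.add_constant(np.column_stack([ability, group, ability * group]))    # + group and interaction

reduced_fit = sm.Logit(item, reduced).fit(disp=0)
full_fit = sm.Logit(item, full).fit(disp=0)

lr = 2 * (full_fit.llf - reduced_fit.llf)                    # joint test of the group and interaction terms
print("LR chi-square =", round(lr, 2), " p =", round(stats.chi2.sf(lr, df=2), 4))
print("uniform and nonuniform coefficients:", np.round(full_fit.params[2:], 3))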


Structural Validation


The structural component of construct validation (Messick, 1988) is appraised by analyzing the interrelationships among test items.


The released GRE-Q was structurally validated through factor analysis of the matrix of tetrachoric coefficients among the test items for a subsample of examinees.


Initially, the sample of 1263 examinees was randomly split into two subsamples. The first subsample was used for the exploratory study, and the second subsample was used to cross-validate findings derived from the exploratory analysis.


The tetrachoric coefficient matrix was generated with PRELIS (Joreskog & Sorbom, 1989a), and unweighted least squares factor analytic model solutions were estimated through LISREL (Joreskog & Sorbom, 1989b) to assess item dimensionality and potential nuisance determinants.
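Dimensionality screening of this kind can also be approximated outside PRELIS and LISREL by factoring an inter-item correlation matrix and inspecting secondary loadings. The sketch below uses ordinary Pearson correlations of dichotomous scores and iterated principal-axis factoring as a rough stand-in for the tetrachoric, unweighted least squares analysis described above; the simulated data, the two-factor extraction, and the 0.30 cutoff are all assumptions.

import numpy as np

def principal_axis(corr, n_factors=2, iters=50):
    """Iterated principal-axis factoring of a correlation matrix."""
    r = corr.copy()
    comm = 1 - 1 / np.diag(np.linalg.inv(corr))        # initial communalities (squared multiple correlations)
    for _ in range(iters):
        np.fill_diagonal(r, comm)
        vals, vecs = np.linalg.eigh(r)
        idx = np.argsort(vals)[::-1][:n_factors]
        loadings = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))
        comm = (loadings ** 2).sum(axis=1)
    return loadings

rng = np.random.default_rng(5)
n, k = 600, 10
target = rng.normal(size=(n, 1))
nuisance = rng.normal(size=(n, 1))
items = 0.7 * target + rng.normal(scale=0.7, size=(n, k))
items[:, 8:] += 0.8 * nuisance                         # two items pick up a nuisance dimension
binary = (items > 0).astype(int)                       # dichotomous item scores

load = principal_axis(np.corrcoef(binary, rowvar=False))
flagged = np.where(np.abs(load[:, 1]) > 0.30)[0]       # items with sizeable secondary loadings
print(np.round(load, 2))
print("possible nuisance-contaminated items:", flagged)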


Research Design


Prior to validation, I assessed the consistency of the DIF indices derived from each combination of the five DIF methods and the three subpopulation definitions.


The inter-method consistency of DIF indices was assessed through a multitrait-multimethod (MTMM) matrix. The inter-method consistency of the significance tests was assessed by comparing percent-of-agreement rates between DIF methods when standard decision rules were applied to flag items.
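Once every method-by-subpopulation combination has produced a vector of item-level DIF indices and a vector of flagging decisions, both consistency summaries are mechanical to assemble. The sketch below uses random placeholder arrays (the real indices would come from the five programs named earlier) to build an MTMM-style correlation matrix and pairwise percent-of-agreement rates.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
combos = [f"{m}/{t}" for m in ("MH", "IRT-SA", "SIBTEST") for t in ("gender", "math", "anxiety")]
indices = {c: rng.normal(size=30) for c in combos}      # placeholder DIF index per item
flags = {c: np.abs(indices[c]) > 1.0 for c in combos}   # placeholder aberrant-item decisions

# Multitrait-multimethod style correlation matrix (9 x 9 combinations here).
mtmm = np.array([[np.corrcoef(indices[a], indices[b])[0, 1] for b in combos] for a in combos])
print(np.round(mtmm, 2))

# Percent-of-agreement rates for the inferential decisions.
for a, b in combinations(combos, 2):
    print(f"{a:16s} vs {b:16s}: {100 * np.mean(flags[a] == flags[b]):5.1f}% agreement")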











A subset of unidimensional items was identified by applying factor analytic procedures. Problematic items and items contaminated by nuisance determinants were identified.


Following structural validation, the DIF analysis was repeated.


Utilizing each combination of DIF methods and subpopulation definitions, DIF indices and significance tests were generated for the subset of items.


The consistency of the indices and the associated inferential statistics was assessed.


The findings assimilating the structural validation were compared with the preceding findings to appraise the effect of structural validation on the DIF analyses.

Research Questions


Research questions one through four addressed the consistency of DIF indices through two MTMM matrices of correlation coefficients.


Research questions one through four were first applied to the analysis of the uniform DIF procedures and the MTMM matrix derived from these coefficients (see Table 1 on page 9). The same set of questions was then applied to the alternate DIF procedures and the MTMM matrix derived from these coefficients (see Table 2 on page 10).


The first question applied to the uniform DIF procedures addressed the convergent, or monotrait-heteromethod, coefficients (e.g., the correlation between DIF indices when the subgroup trait is gender and the methods are MH and IRT-SA): Were the convergent coefficients based upon the subpopulations of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations?


Specific statistical hypotheses were formulated to provide criteria for addressing the research questions.

Let


represent


correlation


between


MH and


IRT-


SA DIF


indices


items


when


examinee


subpopulations


are


defined


gender.


Let ρ_MS(G) represent the correlation between the MH and SIBTEST DIF indices for the items when examinees are defined by gender.


Let ρ_IS(G) represent the correlation between the IRT-SA and SIBTEST DIF indices for the items when examinees are defined by gender.


Comparable notation will represent examinee subpopulations defined by mathematics background (M) and test anxiety (TA).


Three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the uniform methods. They were as follows:


H1a: ρ_MI(M) > ρ_MI(G)
H1b: ρ_MI(TA) > ρ_MI(G)
H2a: ρ_MS(M) > ρ_MS(G)
H2b: ρ_MS(TA) > ρ_MS(G)
H3a: ρ_IS(M) > ρ_IS(G)
H3b: ρ_IS(TA) > ρ_IS(G)

The first question applied to the alternate DIF procedures also addressed the convergent, or monotrait-heteromethod, coefficients: Were the convergent coefficients based upon the subgroups of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations?


Similarly, for the alternate procedures, let ρ_MI(G) represent the correlation between the MH and IRT-UA DIF indices for the items when examinee subpopulations are defined by gender. Let ρ_ML(G) represent the correlation between the MH and logistic regression DIF indices for the items when examinees are defined by gender. Let ρ_IL(G) represent the correlation between the IRT-UA and logistic regression DIF indices for the items when examinees are defined by gender.


Comparable notation will represent examinee subpopulations defined by mathematics background (M) and test anxiety (TA).


In a similar manner, three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the alternate methods. They were as follows:


H1a: ρ_MI(M) > ρ_MI(G)
H1b: ρ_MI(TA) > ρ_MI(G)
H2a: ρ_ML(M) > ρ_ML(G)
H2b: ρ_ML(TA) > ρ_ML(G)
H3a: ρ_IL(M) > ρ_IL(G)
H3b: ρ_IL(TA) > ρ_IL(G)
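Each pair of convergent coefficients in these hypotheses is computed over the same 30 items, so the two correlations being compared are dependent. One simple way to gauge such a difference, shown below purely as an illustration (this portion of the text does not state which inferential procedure the study used, and the index vectors here are fabricated), is a bootstrap over items.

import numpy as np

rng = np.random.default_rng(7)
n_items = 30
mh_gender, irt_gender = rng.normal(size=(2, n_items))    # placeholder indices, gender subpopulations
mh_math, irt_math = rng.normal(size=(2, n_items))        # placeholder indices, math-background subpopulations

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

diffs = []
for _ in range(5000):
    b = rng.integers(0, n_items, size=n_items)           # resample items with replacement
    diffs.append(corr(mh_math[b], irt_math[b]) - corr(mh_gender[b], irt_gender[b]))
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"bootstrap 95% CI for rho_MI(M) - rho_MI(G): [{lo:.2f}, {hi:.2f}]")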




Full Text

PAGE 1

GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS: EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND AND TEST ANXIETY USING MULTIPLE METHODS OF DIFFERENTIAL ITEM FUNCTIONING DETECTION By THOMAS E. LANGENFELD A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1995

PAGE 2

ACKNOWLEDGEMENTS I would like to express my sincerest appreciation to the individuals who have assisted me in completing this study. I am extremely indebted to Dr. Linda Crocker, chairperson of my doctoral committee, for helping me in the conceptualization, development, and writing of this dissertation. Her assistance and encouragement were extremely important in enabling me to achieve my doctorate. I also want to thank the other members of my committee, Dr. James Algina, Dr. Jin-win Hsu, Dr. Marc Mahlios, and Dr. Rodman Webb, for patiently reading the manuscript, offering constructive comments, providing editorial assistance, and giving continuous support. I further wish to thank Dr. David Miller, Dr. John Hall, and Scott Behrens for their assistance related to different aspects of this study. I want to express my deepest gratitude to my family for providing the emotional support that was so vital during my graduate experience. I want to thank my wife, Ann--in many ways this degree is as much hers as mine, for understanding and encouraging me during my graduate ii

PAGE 3

studies. Space limitations do not allow me to express the many personal sacrifices made by my wife so that I could complete this study. I also want to thank my daughter, Kathryn Louise, who was born in the early stages of this study and has come to provide a special type of support. iii

PAGE 4

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................ ii
LIST OF TABLES .......................................... vi
LIST OF FIGURES ......................................... viii
ABSTRACT ................................................ ix

CHAPTERS

1   INTRODUCTION ........................................ 1
        Statement of Problem ............................ 1
        The Measurement Context of the Study ............ 5
        The Research Problem ............................ 6
        Theoretical Rationale ........................... 13
        Limitations of the Study ........................ 17

2   REVIEW OF LITERATURE ................................ 19
        DIF Methodology ................................. 19
        Gender and Quantitative Aptitude ................ 57
        Potential Explanations of DIF ................... 64
        Summary ......................................... 72

3   METHODOLOGY ......................................... 74
        Examinees ....................................... 75
        Instruments ..................................... 76
        Analysis ........................................ 83
        Summary ......................................... 96

4   RESULTS AND DISCUSSION .............................. 98
        Descriptive Statistics .......................... 98
        Research Findings ............................... 101
        Additional Findings ............................. 140

iv

PAGE 5

5   SUMMARY AND CONCLUSIONS ............................. 147

APPENDICES

A   SUMMARY STATISTICAL TABLES .......................... 160
B   DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE
        INCLUDING THE REVISED TEST ANXIETY SCALE ........ 195
C   THE CHIPMAN, MARSHALL, AND SCOTT (1991) INSTRUMENT
        FOR ESTIMATING MATHEMATICS BACKGROUND ........... 200

REFERENCES .............................................. 201

BIOGRAPHICAL SKETCH ..................................... 213

v

PAGE 6

LIST OF TABLES Table 1 Proposed Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices.............. 9 2 Proposed Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices............ 10 3 Item Data for the 2 Groups and 2 Item Scores for the jth Ability Group................ 27 4 Descriptive Statistics for the 20-item RTA Scale .................................... 79 5 Correlations of Calculus Completion, SAT-M, GRE-Q, and College Mathematics Credits... 81 6 Frequencies and Percentages for Gender and Mathematics Background................... 84 7 Mean Scores of the Released GRE-Q and the Revised Test Anxiety Scale (RTA) by the Total Sample, Gender, and Mathematics Background............................... 99 8 Intercorrelations of the Released GRE-Q, RTA, and Mathematics Background for the Total Sample, Women, and Men................... 102 9 Multitrait-Multimethod Correlation Matrix: uniform DIF Indices...................... 105 10 Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 30-Item GRE-Q.................................... 110 11 Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices.................... 112 vi

PAGE 7

12 Tetrachoric Correlations and Standardized Estimates for the Four Problematic Items: Exploratory Sample................ 119 13 Multitrait-Multimethod Correlation Matrix of the Valid 26 Test Items: Uniform DIF Indices.............................. 130 14 Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 26-Item Valid Test. . . . . . . . . . . . . . . . 13 3 15 Multitrait-Multimethod Correlation Matrix of the Valid 26 Test Items: Alternate DIF Indices. . . . . . . . . . . . . . . 134 vii

PAGE 8

Figure 1 2 3 4 LIST OF FIGURES The Four Problematic Test Questions ....... LRCs for Women and Men on Item 2 ....... LRCs for Women and Men on Item 10 ......... LRCs for Women and Men on Item 6 .......... 122 123 123 124 5 LRCs for Examinees with Substantial and Little Mathematics Background on Item 6................................... 124 6 LRCs for Women and Men on Item 11.......... 125 7 LRCs for Examinees with Substantial and Little Mathematics Background on Item 11 .. 125 8 LRCs for Women and Men on Item 8 Illustrating the Symmetrical Nonuniform DIF Condition ................... 145 9 LRCs for women and Men on Item 20 Illustrating the More Typical Nonuniform DIF Condition............................. 145 viii

PAGE 9

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS: EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND AND TEST ANXIETY USING MULTIPLE METHODS OF DIFFERENTIAL ITEM FUNCTIONING DETECTION By Thomas E. Langenfeld August, 1995 Chairperson: Linda Crocker Major Department: Foundations of Education The purpose of this study was to discover whether defining examinee subpopulations by relevant educational or psychological variables, rather than gender, would yield item statistics that were more consistent across five methods for detection of differential item functioning (DIF). A subsidiary purpose of this study was to assess how the consistency of DIF estimates were affected when structural validation findings were incorporated into the analysis. The study was conducted in the context of college admission quantitative examinations and gender issues. Participants consisted of 1263 university students. For purposes of this study, their responses to a 30-item quantitative aptitude test ix

PAGE 10

were analyzed by categorizing examinees by their gender, mathematics backgrounds, and levels of test anxiety. The hypothesis that defining subpopulations by mathematics background or test anxiety would yield higher consistency of DIF estimation than defining subpopulations by gender was not substantiated. Results indicated that using mathematics background to define subpopulations and explain gender DIF had potential usefulness; however, in this study, the use of test anxiety to define subpopulations and explain DIF was ineffectual. The findings confirmed the importance of structural validation for DIF analyses. Results from using the entire test revealed that nonuniform DIF methods had low inter-method consistency and variance related to methods. When structural validation findings were used to define a valid subset of items, highly consistent DIF indices resulted across all methods and minimal variance related to methods. Results further suggested the need to use nonuniform DIF methods and the importance of jointly interpreting both DIF indices and significant tests. Implications and recommendations for research and practice are included. X

PAGE 11

CHAPTER 1 INTRODUCTION Statement of the Problem Differential item functioning (DIF), a statistical indication of item bias, occurs when equally proficient individuals, from different subpopulations, have different probabilities of answering an item correctly (R. L. Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard, Camilli, & Williams, 1984). Historically, researchers studying DIF have addressed two principal concerns. The first concern of researchers has been the development and evaluation of statistical methods for detecting "biased" items. The second concern has been to identify plausible explanations of item bias. In this study, both methodological and substantive educational issues concerning item bias and DIF were addressed. During the past four decades, a plethora of detection methods has been developed (for a comprehensive review of advances in item bias detection methods over the past ten years see Millsap and Everson, 1993). DIF methods have two major distinctions: (a) whether the conditioning 1

PAGE 12

variable is formed from an observed conditional score or an unobserved conditional estimate of latent ability and (b) whether they can detect nonuniform as well as uniform DIF. Researchers applying methods using an observed conditional score commonly sum the number of correct responses on the test or subsection of the test to estimate the ability of each examinee. Researchers using unobserved conditional estimates most frequently apply a unidimensional item response theory (IRT) model for estimating the latent ability of each examinee. 2 Uniform DIF occurs when there is no interaction between ability level and group membership. That is, the probability of answering an item correctly is greater for one group than the other group uniformly over all ability levels. Nonuniform DIF, detectable only by some methods, occurs when there is interaction between ability level and group membership. That is, the difference in the probabilities of a correct response for the two groups is not the same at all ability levels. In IRT terms, nonuniform DIF is indicated by "nonparallel" item characteristic curves. Among the various DIF procedures, the Mantel-Haenszel (MH) statistic, as applied by Holland and Thayer (1988),

PAGE 13

3 has emerged as the most widely used procedure (more because of the Educational Testing Service's usage than as a result of theoretical consensus), and it is frequently the method to which others are compared (Hambleton & Rogers, 1989; Raju, 1990; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The appeal of the MH procedure is its simple conceptualization, relative ease of use, chi-square test of significance, and desirable statistical properties (Dorans & Holland, 1993; Millsap & Everson, 1993). Researchers applying MH employ an observed score as the conditioning variable and recognize that MH is sensitive to only uniform DIF. Other methods compared with the MH procedure in this study included logistic regression (Swaminathan & Rogers, 1990), IRT-Signed Area (IRT-SA), IRT-Unsigned Area (IRT-UA) (Raju, 1988, 1990), and the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993a, 1993b). Logistic regression was designed to condition on observed scores analyzing item responses. With logistic regression, the user can detect both uniform and nonuniform DIF. IRT-SA and IRT-UA were devised to condition on latent ability estimates and assess the area between an item characteristic curve (ICC) estimated for one subgroup against an ICC estimated for a second subgroup. IRT-SA

PAGE 14

4 was developed to detect only uniform DIF, whereas IRT-UA was developed to detect both uniform and nonuniform DIF. SIBTEST was designed to conceptualize DIF as a multidimensional phenomenon where nuisance determinants adversely influence item responses (Shealy & Stout, 1993a, 1993b). Researchers using SIBTEST apply factor analysis to define a valid subtest and a regression correction procedure to estimate the criterion variable. SIBTEST was developed to detect only uniform DIF. In assessing different DIF indices with data from a curriculum-based, eighth grade mathematics test, Skaggs and Lissitz (1992) found that the consistency between methods was low, and no reasonable explanation for items manifesting DIF could be hypothesized. They posited that categorizing subpopulations by demographic characteristics such as gender or ethnicity for DIF studies was "not very helpful in conceptualizing cognitive issues and indicated nothing of the reasons for the differences" (p. 239). A number of researchers have suggested the need to explore DIF using subpopulations categorized by psychologically and educationally significant variables that correlate with gender or ethnicity and potentially influence item performance (R. L. Linn, 1993; Schmitt & Dorans, 1990;

PAGE 15

Skaggs & Lissitz, 1992; K. K. Tatsuoka, R. L. Linn, M. M. Tatsuoka, & Yamamoto, 1988). Thus, a major concern of the study was the consistency of results from different DIF estimation procedures when subpopulations are conceptualized by psychological or educational variables. Three methods of conceptualizing subpopulations were combined with five fundamentally different state-of-the-art procedures for assessing DIF. The Measurement Context of the Study 5 The substantive issue was the investigation of gender differences on a sample test containing items similar to those found on advanced college admission quantitative examinations. Generally, men tend to outperform women on the Scholastic Aptitude Test-Math (SAT-M), the American College Testing Assessment Mathematics Usage Test (ACT-M), and the Graduate Record Examination-Quantitative (GRE-Q). However, from a predictive validity perspective, these differences are problematic. For example, men tend to score approximately 0.4 standard deviation units higher on the SAT-M (National Center for Education Statistics, 1993), although women tend to perform at nearly the same level as men in equivalent college mathematics courses (Bridgeman & Wendler, 1991; Wainer & Steinberg, 1992) and

PAGE 16

6 tend to outperform men in general college courses (Young, 1991, 1994). A possible explanation for quantitative test score differences between men and women is background experience. Men tend to enroll in more years of mathematics (National Center for Education Statistics, 1993). A second explanation that could potentially explain the differential validity of such tests is test anxiety. Test anxiety relates to examinees' fears of negative evaluation and defensiveness (Hembree, 1988). Women generally report higher levels of test anxiety than men (Everson, Millsap, & Rodriquez, 1991; Hembree, 1988; Wigfield & Eccles, 1989). Thus, for high-stakes tests of mathematical aptitude, mathematics background and test anxiety could influence item responses differentially for each gender. The Research Problem In this study, I explored the feasibility of conceptualizing subpopulations by relevant educational or psychological variables in contrast to the use of traditional demographic variables. The vehicle to achieve this purpose was a released form of the GRE-Q. In this study, subpopulations were conceptualized by (a) gender groups, one of the traditional demographic groups, (b)

PAGE 17

7 examinees with substantial and little mathematics background, and (c) examinees high and low in test anxiety. DIF was assessed using five different measures. The DIF measures were MH, logistic regression, IRT-SA, IRT-UA, and SIBTEST. The DIF methods were classified into two groups--methods measuring uniform DIF and alternate methods. The uniform methods were MH, IRT-SA, and SIBTEST. Alternate methods included logistic regression and IRT-UA, along with MH. Logistic regression and IRT-UA were designed to measure both uniform and nonuniform DIF. Mantel-Haenszel was placed into both analysis groups because of its widespread use by test practitioners. Regarding the study's methodological issues, the results of five methods of estimating DIF will be contrasted within each of the three modes of defining subpopulation groups. The observation of interest was the DIF indices estimated for each item under a particular combination of subpopulation definition and DIF method. Replications were the 30 items on a released form of the GRE-Q test. For the research questions that follow, trait effects refer to the three subpopulation conceptualizations and method effects refer to the five methods of estimating DIF indices.

PAGE 18

8 The first four research questions address the consistency of DIF indices between methods when subpopulations are conceptualized using different traits. The uniform methods of MH, IRT-SA, and SIBTEST were combined with the traits gender, mathematics background, and test anxiety to yield a multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 1 for an illustration of a MTMM matrix with uniform measures.) Similarly, the alternate DIF estimation methods of MH, IRT-UA, and logistic regression were combined with the traits gender, mathematics background, and test anxiety to yield a second multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 2 for an illustration of a MTMM matrix with alternate measures.) Each of the following research questions was addressed twice; each question was answered for the uniform methods and the alternate methods, respectively: 1. Among the three sets of convergent coefficients, often termed the monotrait-heteromethod coefficients, (e.g., the correlation between the DIF indices obtained from the MH and IRT-SA methods when subpopulations are defined by the trait gender), will the coefficients base upon the subpopulations of mathematics background or test anxiety be significantly larger than the corresponding

PAGE 19

Table 1
Proposed Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices

                        I. MH-D            II. IRT-SA         III. SIBTEST-b
                        A    B    C        A    B    C        A    B    C
I.   MH-D
     A. Gender         ( )
     B. MathBkd        H-M  ( )
     C. TA             H-M  H-M  ( )
II.  IRT-SA
     A. Gender         M-H* H-H  H-H      ( )
     B. MathBkd        H-H  M-H* H-H      H-M  ( )
     C. TA             H-H  H-H  M-H*     H-M  H-M  ( )
III. SIBTEST-b
     A. Gender         M-H* H-H  H-H      M-H* H-H  H-H      ( )
     B. MathBkd        H-H  M-H* H-H      H-H  M-H* H-H      H-M  ( )
     C. TA             H-H  H-H  M-H*     H-H  H-H  M-H*     H-M  H-M  ( )

Note. ( ) = reliability coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-M = heterotrait-monomethod coefficients. H-H = heterotrait-heteromethod coefficients.

9

PAGE 20

Table 2
Proposed Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

                        I. MH-D            II. IRT-UA         III. Log Reg
                        A    B    C        A    B    C        A    B    C
I.   MH-D
     A. Gender         ( )
     B. MathBkd        H-M  ( )
     C. TA             H-M  H-M  ( )
II.  IRT-UA
     A. Gender         M-H* H-H  H-H      ( )
     B. MathBkd        H-H  M-H* H-H      H-M  ( )
     C. TA             H-H  H-H  M-H*     H-M  H-M  ( )
III. Log Reg
     A. Gender         M-H* H-H  H-H      M-H* H-H  H-H      ( )
     B. MathBkd        H-H  M-H* H-H      H-H  M-H* H-H      H-M  ( )
     C. TA             H-H  H-H  M-H*     H-H  H-H  M-H*     H-M  H-M  ( )

Note. ( ) = reliability coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-M = heterotrait-monomethod coefficients. H-H = heterotrait-heteromethod coefficients.

10

PAGE 21

coefficients when subpopulations are defined by gender? 11 2. Will the monotrait-heteromethod coefficients be higher than the coefficients for different traits measured by the same method (i.e., heterotrait-monomethod coefficients)? 3. Will the convergent correlation coefficients be higher than the discriminant coefficients measuring different traits by different methods(i.e., heterotrait heteromethod coefficients)? 4. Will the patterns of correlations among the three traits be similar over the three methods of DIF estimation? The final research question addressed the consistency of DIF procedures in identifying aberrant items when subpopulations are conceptualized in different ways. The question was applied twice; it was answered for the uniform methods and alternate methods. It was as follows: 5. For each DIF detection method respectively, using standard decision rules, what is the percent agreement about aberrant items when subgroups are based on gender and when subgroups are based on (a) mathematics background and (b) test anxiety?

PAGE 22

12 Following the analysis of uniform and alternate DIF methods, I conducted a structural analysis of the 30-item quantitative test. Shealy and Stout (1993a, 1993b) stressed that practitioners must carefully identify a valid subset of items prior to conducting DIF analyses. They argued that DIF occurs as a consequence of multidimensionality. The potential for DIF occurs when one or more nuisance dimensions interact with the valid dimension of a test (Ackerman, 1992; Camilli, 1992). Messick (1988) stressed the structural component of construct validation. The structural component concerned the extent to which items are combined into scores that reflect the structure of the underlying latent construct. Loevinger (1957) termed the purity of the internal relationships as structural fidelity, and it is appraised by analyzing the interitem structure of a test. I employed factor analytic procedures to define a structurally valid subset of unidimensional items and to identify problematic and multidimensional items. I hoped to define items measuring both the intended dimension and nuisance. After the identification of a structurally valid subset of items, I assimilated the findings and repeated the DIF analysis described above. Again, I assessed DIF

PAGE 23

13 using the five methods with subpopulations defined by gender, mathematics background, and test anxiety. Using DIF indices as the unit of analysis, two MTMM matrices of correlation coefficients were generated--one matrix for uniform methods and one matrix for alternate methods. I applied the five research questions to the MTMM matrices and inferential statistics using the structurally valid subset of items. I contrasted the findings of the analysis for the entire test with the findings of the analysis for the subset of test items. Theoretical Rationale The process of ensuring that high-stakes tests contain no items that function differentially for specific subpopulations is a fundamental concern of construct validation. Items that contain nuisance determinants correlated with an examinee subpopulation membership threaten the construct interpretations derived from test scores for that subpopulation. Psychometric researchers continue to examine the merits of numerous DIF detection procedures and explore theoretical explanations of DIF. However, to date, they have failed to reach consensus on methodological issues or to develop meaningful insight concerning its causes. In part, this failure is the consequence of inconsistent DIF

PAGE 24

14 identification with actual test data (R. L. Linn, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). These concerns were investigated from both a practical and theoretical perspective that has been suggested (R. L. Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; K. K. Tatsuoka et al., 1988) but rarely tested. Two significant premises underlie the study. The first premise is that there is nothing inherent in being a female examinee or a member of a specific ethnic group that predisposes an individual to find a particular item troublesome. Educational and psychological phenomena function in unique ways to disadvantage an individual on a specific item. Traditional DIF occurs when phenomena correlate with the demographic group of interest. Consequently, gender or ethnicity can be interpreted as a surrogate for educational or psychological variables that potentially explain DIF's causes. Skaggs and Lissitz (1992) posited that educational and psychological variables that influence item performance and correlate with ethnic or gender groups would be useful for conceptualizing subpopulations. Millsap and Everson (1993) commented that modeling variables such as educational background and test anxiety might assist in understanding DIF's causes. The

PAGE 25

15 educational and psychological variables in the study that were hypothesized as potentially explaining gender DIF on quantitative test items were mathematics background and test anxiety. Mathematics background was selected because it influences quantitative reasoning and problem solving. Further, high school and college men tend to enroll in more mathematics courses and study mathematics at more abstract levels than women (National Center for Educational Statistics, 1993). Researchers assessing overall SAT-M performance have found that gender differences decrease substantially when differences in high school mathematics background are taken into account, although background does not entirely explain score differences (Ethington & Wolfle, 1984; Fennema & Sherman, 1977; Pallas & Alexander, 1983). Quantitative aptitude test scores can be contaminated by familiarity with item context, the application of novel solutions, and the use of partial knowledge to solve complex problems (Kimball, 1989). These types of skills frequently are developed through background experiences. Test anxiety was selected because of its well documented debilitating influence on examinees' performance. (Hembree, 1988; Hill & Wigfield, 1984;

PAGE 26

16 Liebert & Morris, 1967; Tryon, 1980). For individuals possessing high levels of test anxiety, test scores frequently are depressed and construct interpretations become problematic (Everson et al., 1991; Hembree, 1988; Sarason, 1980). Consequently, test anxiety exemplifies a psychological variable that potentially contaminates the construct interpretations of scores. For examinees with high levels of test anxiety, tests of mathematical ability tend to induce extreme levels of anxiety. (Richardson & Woolfolk, 1980). Female students tend to report higher levels of test anxiety than male students at all grade levels including college (Everson et al., 1991; Hembree, 1988; Wigfield & Eccles, 1989). Over the past 20 years, several self-reported measures of test anxiety have been developed that demonstrate high reliability and well defined theoretical properties (Benson, Moulin-Julian, Schwarzer, Seipp, & El Zahhar, 1991; Sarason, 1984; Spielberger, Gonzalez, Taylor, Algaze, & Anton, 1978). Researchers have used the self-reported instruments to measure test anxiety and assess the efficacy of treatment programs (Sarason, 1980; Spielberger et al., 1978; Wine, 1980). For studying gender DIF on college admissions quantitative items, it was hypothesized that test anxiety possessed strong explanatory potential because of its

PAGE 27

threat to valid score interpretation, negative influence on tests of mathematics ability, and gender effects. 17 A second premise underlying the study is a fundamental tenet of educational measurement. Item responses are products of complex interactions between examinees and a set of items. In part, because of this complex interaction, examinees of approximately equivalent abilities who belong to different subpopulations occasionally have different likelihoods of answering a question correctly. This fascinating finding currently is understood only crudely. Before it can be better understood, the effect of DIF detection methods and different means of conceptualizing subpopulations on item responses must be examined. Limitations of the Study A salient limitation of the study was the nature of the performance task. Participants in the study were administered a sample GRE-Q and were told they would have 30 minutes to complete the test. They were told to perform to the best of their ability and they would be able to learn their results following testing. Although every effort was made to simulate the conditions of a high-stakes examination, if participants felt that the performance had little meaning for them, then their

PAGE 28

18 performance might not accurately reflect their performance on a high-stakes college admissions test. Further, if the participants believed that the examination had low-stakes, the levels of test anxiety felt by examinees while answering the sample GRE-Q would not be equivalent to levels of test anxiety experienced by examinees while answering a college admissions test. Finally, examinees in the study were predominantly undergraduate students taking classes in the colleges of education and business at a large, southern state university. For this reason, although the design, methodology, and analysis were conceived and executed to maximize the generalizability of findings, a degree of caution is recommended in generalizing to other populations or settings.

PAGE 29

CHAPTER 2 REVIEW OF LITERATURE The four central aspects of this study were Differential Item Functioning (DIF) methodology, gender differences in mathematical college-level aptitude testing, gender differences in mathematics background, and test anxiety. These four topics constitute the major themes for the organization of the literature review presented in this chapter. DIF Methodology A Conceptual Framework of DIF Tests for placement in education and selection in employment require scores be fair and representative for all individuals. Since the mid-1960s, measurement specialists have been concerned explicitly with the fairness of their instruments and the possibility that some tests may be biased (Cole & Moss, 1989). Bias studies initially were designed to investigate the assertions that disparities between various subpopulations on cognitive ability test scores were a product of cultural bias inherent in the measure (Angoff, 1993). Test critics charged that bias occurred whenever mean scores for two subpopulations were not equivalent. This position accepted a priori that 19

PAGE 30

20 subpopulations had equivalent score distributions on the construct measured and dismissed the possibility that actual differences may exist. Measurement specialists, however, have resolved that mean score differences do not necessarily reflect bias but indicate test impact (Dorans & Holland, 1993). Concerns about measurement bias are inherent to validity theory (Cole & Moss, 1989). A test score inference is considered sufficiently valid when various types of evidence justify its usage and eliminate other counterinterpretations (Messick, 1989; Moss, 1992). Bias has been characterized as "a source of invalidity that keeps some examinees with the trait or knowledge being measured from demonstrating that ability" (Shepard, Camilli, & Williams, 1985, p.79). If score-based inferences are not equally valid for all relevant subgroups, decisions derived from score inferences will not be fair for all individuals. Therefore, measurement bias occurs when score interpretations are differentially valid for any subgroup of test takers (Cole & Moss, 1989). To investigate the potential for measurement bias, researchers have examined test items as a source and explanation. The supposition is that biased items require knowledge and skills that examinees from a specified subgroup are less familiar with and possess fewer opportunities to learn (Angoff, 1993). The goals of item

PAGE 31

21 bias research are to identify and remove items detected as biased (Angoff, 1993) and to provide test developers with guidelines making future construction of biased items less likely (Scheuneman, 1987; Schmitt, Holland, & Dorans, 1993). Measurement specialists have defined item bias as occurring when individuals, from different subpopulations, who are equally proficient on the construct measured have different probabilities of successfully answering the item (Angoff, 1993; R. L. Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard et al., 1985). Researchers apply statistical methods to equate individuals on the construct, utilizing either observed scores or latent ability scores, and estimate for examinees of each group the probability of a correct response. These methods provide statistical evidence of bias. When a statistically biased item is identified, it might be interpreted as unfairly disadvantageous to a minority group for cultural and social reasons. On the other hand, the item might be interpreted as unrelated to cultural and social factors but related to an important educational outcome that is not equally known and understood by all groups. In this latter case, deleting the item for strictly statistical reasons may reduce validity. Consequently, social and statistical definitions of item bias have created considerable confusion within the debate over test fairness (Cole & Moss, 1989; Angoff, 1993).

PAGE 32

22 Researchers discovered that statistical analyses of item bias raised expectations and created confusion for an already obscure and volatile topic. The term differential item functioning (DIF) gradually has replaced item bias as the preferred term in research because of its more neutral and technical connotations (Angoff, 1993; Dorans & Kulick, 1986). Holland and Wainer (1993) distinguished between item bias and DIF stating, item bias refers to "an informed judgment about an item that takes into account the purpose of the test, the relevant experiences of certain subgroups of examinees taking it, and statistical information about the item" (p. xiv). DIF is a "relative term'' (p. xiv) and is a statistical indication of a differential response pattern. Shealy and Stout (1993a) proposed that the difference between item bias and DIF is ''the degree the user or researcher has embraced a construct validity argument" (p. 197). Shealy and Stout (1993a, 1993b) conceptualized DIF as a violation of the unidimensional nature of test items. They classified the intended dimension as the target ability and unintended dimensions as nuisance determinants. DIF occurred because of nuisance determinants existing in differing degrees among subgroups. Crocker and Algina (1986) postulated that DIF occurred if (a) for subgroups, items are affected by different sources of variance; and (b) among test takers who are at the same point on the

PAGE 33

construct, the distributions of irrelevant sources of variation are different for subgroups. Therefore, DIF can be conceptualized as a consequence of multidimensionality with differing sources of variation influencing subgroups' item responses. A Formal Definition of DIF 23 All DIF detection methods rely on assessment of response patterns of subgroups to test items. The subgroups, conceptualized in most studies on the basis of demographic characteristics (i.e., blacks and whites, women and men), form a categorical variable. When two groups are contrasted, the group of interest (e.g., blacks or women) is designated the focal group, and the group serving as the group for comparison (e.g., whites or men) is designated the reference group. Examinees are matched on a criterion variable, assumed to be a valid representation of the purported construct, and DIF methods assess differential group response patterns for individuals of equal ability. Denote the item score as Y, frequently scored as a dichotomous variable 0 or 1; denote X as the conditioning criterion; and denote g as the categorical variable of group membership. Lack of measurement bias or DIF for an item is defined as P_R(Y = 1 | X) = P_F(Y = 1 | X)

PAGE 34

24 for all values of X for the reference and focal groups. In this definition, P 9 (Y=l:x) is the conditional probability function for Y at all levels of X (Millsap & Everson, 1993). Although all DIF procedures operate from this definition, they differ on the basis of statistical models and possess various advantages. DIF procedures can be characterized as models using observed conditional invariance or models utilizing unobserved conditional invariance (Millsap & Everson, 1993). When observed conditional invariance is used, the criterion variable is the sum of the total number of correct responses on the test or a subset of the test. When unobserved conditional invariance is used, a unidimensional item response theory (IRT) model estimates a e parameter for each examinee that functions as the criterion variable. Other differences in DIF detection procedures are the capacity to detect nonuniform DIF, to test statistical significance, and to conceptualize DIF as a consequence of multidimensionality. Uniform DIF occurs when there is no interaction between group membership and the conditioning criterion regarding the probability of answering an item correctly. In other words, DIF functions in a uniform fashion across the ability spectrum. Nonuniform DIF refers to an interaction between group membership and the conditioning criterion. In this case, an item might differentially favor a subgroup of examinees at one end of

PAGE 35

25 the ability spectrum and disfavor the subgroup at the other end of the spectrum. All DIF procedures are used to estimate an index describing the magnitude of the differential response pattern for the groups on the item. Some procedures also provide statistical tests to detect if the DIF index differs significantly from zero. Finally, although DIF is perceived as a consequence of multidimensionality, every procedure except Shealy and Stout's Simultaneous Bias Test (SIBTEST) functions within an unidimensional framework. Many DIF detection methods have been developed during the past three decades. In this review, they are categorized as based upon observed conditional invariance or unobserved latent conditional invariance. Related issues, research problems, and potential usage are evaluated. Following the review of DIF detection methods, research efforts to explain the underlying causes of DIF are presented. DIF Methods Based Upon Observed Scores Angoff and Ford (1973) offered the first widely used DIF detection method called the delta-plot. The delta-plot procedure was problematic due to its tendency, under conditions of differing ability score distributions, to identify the most discriminating items as aberrant (Angoff, 1993). Scheuneman (1979) proposed a chi-square procedure for assessing DIF. This procedure was irrelevantly affected

PAGE 36

26 by sample size and was not based upon a chi-square sampling distributions, in effect, not a chi-square procedure at all (Baker, 1981). The full chi-square procedure (Bishop, Fienberg, & Holland, 1975) was a valid technique for testing DIF but required large sample sizes at each ability level to sustain statistical power. Holland and Thayer (1988) built upon these chi-square techniques when they applied the Mantel and Haenszel (1959) statistic, originally developed for medical research, to the detection of DIF. Mantel-Haenszel procedure. The Mantel-Haenszel (MH) statistic has become the most widely used method of DIF detection (Millsap & Everson, 1993). The MH procedure assesses the item data in a J-by-2-by-2 contingency table. At each score level j, individual item data are presented for the two groups and the two levels of item response, right or wrong (see Table 3). The null hypothesis for the MH procedure can be expressed as the odds of answering an item correctly at a given ability level are the same for both groups across all j ability levels. The alternative hypothesis is that the two groups do not have equal probability of answering the item correctly at some level of j.

PAGE 37

Table 3
Item Data for the 2 Groups and 2 Item Scores for the jth Ability Group

                        Score on Studied Item
                        1          0          Total
Group      R            A_Rj       B_Rj       n_Rj
           F            C_Fj       D_Fj       n_Fj
           Total        m_1j       m_0j       T_j

27

The MH statistic uses a constant odds ratio (α_MH) as an index of DIF. The estimate of the constant odds ratio is

    α_MH = [ Σ_j (A_Rj D_Fj / T_j) ] / [ Σ_j (C_Fj B_Rj / T_j) ],   j = 1, ..., J

The constant odds ratio ranges in value from zero to infinity. The estimated value of α_MH is 1 under the null condition. It is interpreted as the average factor by which the odds that a reference group examinee will answer the item correctly exceeds that of a focal group examinee. Consequently, an estimated constant odds ratio greater than 1 indicates the case where the item is functioning differentially against the focal group.

PAGE 38

28 The estimated value of α_MH frequently is transformed to the more easily interpreted Δ metric via

    MH D-DIF = -2.35 ln(α_MH)

Positive values of MH D-DIF favor the focal group, whereas negative values favor the reference group. The chi-square test of significance for MH is

    MH-χ² = [ |Σ_j A_Rj − Σ_j E(A_Rj)| − 0.5 ]² / Σ_j var(A_Rj),

where

    E(A_Rj) = n_Rj m_1j / T_j   and   var(A_Rj) = (n_Rj m_1j n_Fj m_0j) / [T_j² (T_j − 1)].

The MH chi-square is distributed approximately as a chi square with one degree of freedom. Holland and Thayer (1988) asserted that this test is "the uniformly most powerful unbiased test of H0 versus Ha" (p. 134).

PAGE 39

29 The advantages of MH are its computational simplicity (Holland & Thayer, 1988), statistical test of significance, and lack of sensitivity to subgroup differences in the distribution of ability (Donoghue, Holland, & Thayer, 1993; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The most frequently cited disadvantage is its lack of power to detect nonuniform DIF (Swaminathan & Rogers, 1990). It is further limited by its unidimensional conception and assumption that total test score provides a meaningful measure of the construct purported to be estimated. The standardization procedure. The standardization procedure (Dorans & Kulick, 1986) is based upon the nonparametric regression of test scores on item scores for two groups. Let ER(Y:x) define the expected item test nonparametric regression for the reference group, and let EF(Y:x) define the expected item test nonparametric regression for the focal group, where Y is the item score and Xis the test score. The DIF analysis at the individual score level is The statistic, Dj, is the fundamental measure of differences in item performance between the focal and reference group members who are matched at equivalent observed score levels. These differences are unexpected

PAGE 40

30 differences and cannot be explained by differences in the attribute tested. The standardization procedure derived its name from the standardization group that functions to supply a set of weights, one at each ability level, that will be used to weight each individual Dj. The standardized p-difference (STD P-DIF) is or STD-P = J L wJ (EFJ ER) j = l STD P = -=-j_ = l __ The essence of standardization is the weighting function. The specific weight implemented for standardization depends upon the nature of the study (Dorans & Kulick, 1986). Plausible options of weighting include the number of examinees in the total group at each level of j, the number of examinees in the focal group at each level of j, or the number of examinees in the reference group at each level of j. In practice, the number of examinees in the focal group

PAGE 41

31 is used; thus, STD-Pis defined as the difference between the observed performance and the expected performance of the focal group on an item (Dorans & Kulick, 1986). The standardization procedure contains a significance test. The standard error using focal group weighting is SE(STD-P)= where Pp is the proportion of focal group members correctly answering the item, and where PF* is thought of as the performance of the focal group members predicted from the reference group's item test regression curve and The standardization procedure is a flexible method of investigating DIF (Dorans & Holland, 1993), and it has been applied to assessing differential functioning distractors (Dorans, Schmitt, & Bleistein, 1992) and the differential effect of speededness (Schmitt & Dorans, 1990). DIF findings from the standardization procedure will be in close agreement with the MH procedure (Millsap & Everson, 1993) with the choice of we~ghting creating slight variations (Dorans & Holland, 1993). Because the two procedures are nearly identical, the advantages and disadvantages for the

PAGE 42

32 standardization method are much the same as for MH. The most commonly cited deficiency of both methods is their inability to detect nonuniform DIF. Donoghue et al. (1993) determined that both methods require approximately 19 or more items in the conditioning score, the studied item should be included in determining the conditioning score, and extreme ranges in item difficulty can adversely influence DIF estimation. R. L. Linn (1993) observed that DIF estimates using these procedures appear to be confounded with item discrimination. Logistic regression models. Swaminathan and Rogers (1990) applied logistic regression to DIF analysis. Logistic regression models, unlike least squares regression, permit categorical variables as dependent variables. Thus, it permits the analysis of dichotomously scored item data. It has additional flexibility by including the analysis of interaction between group and ability, as well as allowing for the inclusion of other categorical and continuous independent variables in the model. A fundamental concept of analysis with linear models is the assessment of the consistency between a model and a set of data (Darlington, 1990). Consistency between the model and the data set is measured by the likelihood or probability that the mo?el correctly represents the observed data. When the dependent variable is measured dichotomously, scored 0 or 1, a model asserts that each
examinee will have a probability between 0 and 1 of answering an item correctly. By the multiplicative law of independent probabilities, an overall probability for a group of examinees answering in a specific pattern can be estimated. For example, if the probability of four individuals each answering an item correctly is 0.9, and three of the subjects answer correctly, the overall probability of this pattern occurring is 0.9 x 0.9 x 0.9 x (1 - 0.9), or 0.0729. Therefore, for item i, the likelihood function of a set of examinee responses, each with ability level θ, is determined by

    L(Data_i) = Π_{n=1}^{N} P(u_i | θ)^{u_i} [1 - P(u_i | θ)]^{1 - u_i},

where u_i has a value of 1 for a correct response and a value of 0 for an incorrect response. The logistic regression model for predicting the probability of a correct answer is

    P(u = 1 | θ) = exp(β_0 + β_1 θ) / [1 + exp(β_0 + β_1 θ)],

where u is the response to the item given the ability level θ, β_0 is the intercept parameter, and β_1 is the slope parameter. If the categorical group variable g is added to the model for the analysis of DIF, the model becomes
    P(u = 1 | θ) = exp(β_0 + β_1 θ + β_2 g + β_3 θg) / [1 + exp(β_0 + β_1 θ + β_2 g + β_3 θg)],

where β_2 is the estimate of the uniform difference between groups, and β_3 is the estimated interaction between group and ability. If only β_0 and β_1 deviate from zero, the item is interpreted as containing no DIF. If β_2 does not equal zero, and β_3 equals zero, uniform DIF is indicated. If β_3 does not equal zero, nonuniform DIF is inferred. Estimation of the parameters β_0, β_1, β_2, and β_3 is carried out for each item using a maximum likelihood procedure. The two null hypotheses can be tested jointly by the statistic

    χ² = (Cβ)' [C Σ C']^{-1} (Cβ),   where   C = [ 0 0 1 0 ; 0 0 0 1 ],

β = (β_0, β_1, β_2, β_3)', and Σ is the estimated covariance matrix of the parameter estimates. The test has a chi-square distribution with 2 degrees of freedom. When the test is significant, the hypothesis of no DIF is rejected (Swaminathan & Rogers, 1990).
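As an illustration, the following is a minimal sketch of how this model could be fitted for a single studied item with a general-purpose statistics library; the data frame and column names are assumptions for the example, and the joint hypothesis is tested here with a likelihood-ratio statistic, which in large samples is equivalent to the chi-square test described above.

    import pandas as pd
    import statsmodels.formula.api as smf

    def logistic_dif_test(df):
        """df columns assumed: item (0/1), score (matching total), group (0/1)."""
        # Augmented model: intercept, ability, group, and group-by-ability terms
        full = smf.logit("item ~ score + group + score:group", data=df).fit(disp=0)
        # Compact model: conditioning on ability only
        compact = smf.logit("item ~ score", data=df).fit(disp=0)
        # Joint test of the group and interaction coefficients (2 df)
        chi_square = 2 * (full.llf - compact.llf)
        return chi_square, full.params["group"], full.params["score:group"]

A significant statistic carried by the group coefficient alone points to uniform DIF; a significant statistic with a nonzero interaction coefficient points to nonuniform DIF.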
The logistic regression procedure offers a powerful approach for testing the presence of both uniform and nonuniform DIF. In sample sizes of 250 and 500 examinees per group and with 40, 60, and 80 test items serving as the criterion, Swaminathan and Rogers (1990) concluded under conditions of uniform DIF that the logistic regression procedure had power similar to the MH procedure and controlled Type I errors almost as well. The logistic regression procedure also had effective power in identifying nonuniform DIF, whereas the MH procedure was virtually powerless to do so. In demonstrating the ineffectiveness of the MH procedure to detect nonuniform DIF, Swaminathan and Rogers (1990) simulated data keeping item difficulties equal and varying the discrimination parameter. In effect, they simulated nonuniform symmetrical DIF. Their simulation created a set of conditions where theoretically the MH procedure has no power. Researchers must ask whether such symmetrical interactions occur with actual test data. Millsap and Everson (1993) commented that Swaminathan and Rogers (1990) utilized large numbers of items, and they conjectured that in cases with a small number of homogeneous items forming the criterion variable, false positive rates would increase unacceptably above nominal levels. Under conditions of uniform and nonuniform DIF, logistic regression facilitates the plotting of data and examining the differential response patterns to determine at
which ability levels DIF is observed. The logistic procedure, although developed from a unidimensional perspective, provides a flexible model that can incorporate a diversity of independent categorical and continuous variables. Millsap and Everson (1993) observed that the procedure "allows for the inclusion of curvilinear terms and other factors--such as examinee characteristics like test anxiety or instructional opportunity--that may be relevant factors for exploring the possible causes of DIF" (p. 306).

DIF Methods Based Upon Latent Ability Estimation

DIF detection methods conditioning on latent ability are developed through various IRT models. IRT approaches describe the relationship between individual item responses and the construct measured by a test or subtest. When applied to DIF analyses, IRT permits the use of estimates of true ability as the criterion variable as opposed to the more subjective measure of observed scores. Despite their theoretical appeal, IRT approaches possess the inherent disadvantages of requiring large sample sizes, being computationally complex and costly, and including the stringent assumption of unidimensionality (Oshima, 1989). The most widely used IRT models are the Rasch model, or one-parameter model, the two-parameter logistic model (2PL), and the three-parameter logistic model (3PL). Holland and Thayer (1988) demonstrated that when the Rasch model's assumptions are met and none of the items constituting the
ability score, except possibly the studied item, contain DIF, MH provides a DIF index proportional to the index estimated by the Rasch model. Therefore, methods based upon the Rasch model will not be reviewed, and the more complex 2PL and 3PL models will be reviewed regarding their potential. The central components of IRT models are the unobserved latent trait estimate, termed θ, and a trace line for each item response, often termed the item characteristic curve (ICC). The ICC will take a specified monotonically increasing function. In the 2PL model, the probability of a correct response to item i as a function of θ is

    P(u_i = 1 | θ) = exp[D a_i (θ - b_i)] / {1 + exp[D a_i (θ - b_i)]},

where the item parameters a_i and b_i are item discrimination and difficulty, respectively, and D is a constant set at 1.7 in order to convert the logistic scale into an approximate probit scale (Hambleton & Swaminathan, 1985). In the 3PL model, the probability of a correct response is

    P(u_i = 1 | θ) = c_i + (1 - c_i) exp[D a_i (θ - b_i)] / {1 + exp[D a_i (θ - b_i)]},

and includes the pseudo-chance or guessing parameter, c_i.
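The two trace lines are straightforward to compute; the following is a minimal sketch in which the parameter names follow the notation above and the values passed in the example call are purely illustrative.

    import numpy as np

    D = 1.7  # scaling constant converting the logistic metric to an approximate probit metric

    def icc_2pl(theta, a, b):
        """Two-parameter logistic item characteristic curve."""
        return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

    def icc_3pl(theta, a, b, c):
        """Three-parameter logistic ICC with pseudo-chance parameter c."""
        return c + (1.0 - c) * icc_2pl(theta, a, b)

    # Example: probability of a correct response at theta = 0 for a hypothetical
    # item with a = 1.2, b = 0.5, c = 0.2
    print(icc_3pl(np.array([0.0]), 1.2, 0.5, 0.2))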
The general procedure for estimating DIF using a 3PL IRT model includes (a) combining both groups and estimating item parameters utilizing either a maximum likelihood or Bayesian procedure, (b) fixing the c_i parameter for all items, (c) after dividing the examinees into reference and focal group members, estimating the a_i and b_i parameters, (d) equating the parameters from the focal group scale to the reference group scale or vice versa, (e) calculating the DIF index and significance test, and (f) utilizing a purification procedure (Lord, 1980; Park & Lautenschlager, 1990) to further examine and enhance the analysis. Purification procedures, which extract potential DIF items and reestimate ability levels without the potential DIF items included, will not be elaborated upon. DIF indices and statistical tests based upon latent ability proceed by either analyzing the difference between the item parameters (a_i, b_i) or analyzing the area between the groups' ICCs.

Lord's chi-square and IRT-LR. Lord's (1980) chi-square and IRT-Likelihood Ratio (IRT-LR) procedures simultaneously test the dual hypothesis that a_Ri = a_Fi and b_Ri = b_Fi. Because the pseudo-chance parameter and its standard errors are not accurately estimated for the separate groups (Kim, Cohen, & Kim, 1994), it is usually not tested with either procedure. For a single parameter, Lord's chi-square contrasts the difference between the estimated b's and their standard errors. Since sample sizes used in IRT estimation are so
large as to effectively assume an infinite number of degrees of freedom, the test becomes

    z = (b_R - b_F) / [ var(b_R) + var(b_F) ]^{1/2}.

Alternately, z² will be distributed as a chi-square statistic with one degree of freedom (Thissen, Steinberg, & Wainer, 1988). The simultaneous test of the discrimination and difficulty parameters is based upon the Mahalanobis distance (D²) between the parameter vectors for the groups. The test statistic becomes

    χ² = V' Σ^{-1} V,

in which V is the vector of differences between the parameter estimates (a_R - a_F and b_R - b_F) and Σ is the estimated covariance matrix. The test is distributed as a chi-square with two degrees of freedom.
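For illustration, the following is a minimal sketch of these two tests for a single item, assuming the group-specific parameter estimates have already been placed on a common scale; the input names are illustrative, and the covariance matrix of the difference vector is taken as the sum of the two groups' covariance matrices.

    import numpy as np
    from scipy import stats

    def lord_difficulty_z(b_ref, b_foc, var_b_ref, var_b_foc):
        """Single-parameter test contrasting the estimated difficulties."""
        return (b_ref - b_foc) / np.sqrt(var_b_ref + var_b_foc)

    def lord_chi_square(params_ref, params_foc, cov_ref, cov_foc):
        """Joint test of (a, b); returns the chi-square statistic and p-value (2 df)."""
        v = np.asarray(params_ref, float) - np.asarray(params_foc, float)
        sigma = np.asarray(cov_ref, float) + np.asarray(cov_foc, float)
        chi2 = float(v @ np.linalg.inv(sigma) @ v)
        return chi2, stats.chi2.sf(chi2, df=2)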
The same hypothesis tested by Lord's chi-square can be tested with IRT-LR (Thissen, Steinberg, & Wainer, 1993). The null hypothesis with IRT-LR is tested through three steps. The IRT model is fitted simultaneously for both groups to the data, and a set of valid "anchor" items, containing no DIF, is utilized as the conditioning variable. In the first step, no constraints are placed on the data concerning the equality of the b's or a's. The model fit is assessed by maximum likelihood statistics, and G²_1 = -2(loglikelihood). The IRT model is then refitted under the constraint that the b and a parameters are equal for both groups, and G²_2 = -2(loglikelihood). The likelihood ratio test of significance is the difference between the fit of the two models,

    G² = G²_2 - G²_1.

The likelihood ratio test assesses significant improvement in model fit as a consequence of allowing the two parameters to fluctuate. If the likelihood ratio is significant, either the b parameter or the a parameter is different for the two groups, and DIF is detected. In this example of simultaneously testing for differences in both parameters, the test statistic is distributed as a chi-square with two degrees of freedom. In the situation of testing for significance only in item difficulty, the statistic would be distributed as a chi-square with one degree of freedom. Both Lord's chi-square and IRT-LR assume multivariate normality. The two procedures differ in the estimation of the covariance matrix. Lord's chi-square is based upon
second-derivative approximations of the standard errors of estimated item parameters as a part of the maximum likelihood estimation. The IRT-LR procedure does not require estimated error variances and covariances. It results from computing the likelihood at the overall mode under the equality constraints placed upon the data and then estimating the probability under the null hypothesis (Thissen et al., 1988). Lord's chi-square and IRT-LR are capable of detecting nonuniform DIF and possess good statistical power (Cohen & Kim, 1993). Because of the requisite large sample sizes, they tend to be expensive and yield false positive rates above the nominal levels (Kim et al., 1994). R. L. Linn et al. (1981), with simulated data, and Shepard, Camilli, and Williams (1984), with actual data, demonstrated that significant differences detected by Lord's chi-square occurred even when the plotted ICCs were nearly identical. An additional problem when employing IRT-LR is the need for a set of truly unbiased anchor items (Millsap & Everson, 1993).

Procedures estimating area between ICCs. Eight different DIF procedures have been developed to estimate the area between the reference group's ICC and the focal group's ICC. Area measures differ by whether they employ (a) signed or unsigned areas, (b) bounded or unbounded ability
intervals, (c) continuous integration or discrete approximation, and (d) weighting (Millsap & Everson, 1993). The first area procedures utilized bounded intervals with discrete approximations. Rudner (1977) suggested an unsigned index

    R = Σ_{θ_j = -3}^{3} |P_R(θ_j) - P_F(θ_j)| · n,

with discrete intervals from θ_j = -3 to θ_j = 3. Rudner (1977) used small interval distances (e.g., n = .005) and summed across 600 intervals. The estimated R is converted to a signed index by removing the absolute value operator. Shepard et al. (1984) extended the signed and unsigned area procedures by introducing four techniques that included sums of squared values, weights based upon the number of examinees in each interval along the θ scale, and weighting initial differences by the inverse of the estimated standard error of the difference. They determined that distinctively different interpretations occurred when signed area indices were estimated as compared to unsigned indices. They further found that various weighting procedures influence interpretations only slightly, and they concluded that item interpretations were only moderately influenced by decisions related to using a weighting procedure.
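A discrete approximation of this kind reduces to summing the (absolute) differences between the two groups' ICCs over a grid of θ values. The sketch below is illustrative, uses a locally defined 3PL trace line, and assumes the two groups' item parameters have already been placed on a common scale; the grid bounds and width echo the values mentioned above.

    import numpy as np

    def icc_3pl(theta, a, b, c, D=1.7):
        return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

    def area_indices(a_ref, b_ref, a_foc, b_foc, c, width=0.005, lo=-3.0, hi=3.0):
        """Discrete signed and unsigned area between two ICCs on [lo, hi]."""
        theta = np.arange(lo, hi, width)
        diff = icc_3pl(theta, a_ref, b_ref, c) - icc_3pl(theta, a_foc, b_foc, c)
        signed = np.sum(diff * width)
        unsigned = np.sum(np.abs(diff) * width)
        return signed, unsigned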
All of the area indices proposed by Shepard et al. (1984) utilized discrete approximations and lacked sample standard errors to permit significance tests. Raju (1988, 1990) augmented these procedures by devising an index to measure continuous integration over unbounded intervals and derived standard errors permitting significance tests. Raju (1988) proposed setting the c parameter equal for both groups and estimating the signed area by

    SA = (1 - c)(b_F - b_R).

The unsigned area is estimated by

    UA = (1 - c) | [2(a_F - a_R) / (D a_R a_F)] ln{1 + exp[D a_R a_F (b_F - b_R) / (a_F - a_R)]} - (b_F - b_R) |.

Raju (1990) derived asymptotic standard error formulas for the signed and unsigned area measures that can be used to generate z tests to determine significance levels of DIF under conditions of normality. Theoretically, Raju's procedure for measuring and testing the significance of the area between the ICCs for two groups is a significant advancement over procedures utilizing discrete intervals. Raju (1990) interpreted results derived from this procedure as sensible and found that the signed index and significance test provided results consistent with the MH procedure. Raju, Drasgow, and Slinde
(1993), analyzing data from a 45-item vocabulary trial test contrasting girls and boys and black and white students, found that the significance tests of the area measures identified the identical set of aberrant items as Lord's chi-square. Raju et al. (1993) set the alpha rate at 0.001 to control for Type I errors. Cohen and Kim (1993) found that in comparing Lord's chi-square to Raju's SA and UA, the two procedures produced similar results, although Lord's chi-square appeared slightly more powerful in identifying simulated DIF.

DIF as a Consequence of Multidimensionality

In all of the procedures thus far reviewed, researchers have either conditioned an item response on an observed test score or on a latent ability estimate. Procedures using observed scores assumed that the total score has valid meaning in terms of the purported construct measured. The IRT procedures assumed responses to a set of items are unidimensional even though examinees' scores may reflect a composite of abilities. The potential for DIF can be conceptualized as occurring when a test consists of a targeted ability, θ, and item responses are influenced by one or more nuisance determinants, η (Shealy & Stout, 1993a, 1993b). Under this circumstance, an item may be misinterpreted due to IRT model misspecification (Ackerman, 1992; Camilli, 1992; Oshima, 1989). If a misspecified, unidimensional IRT model is employed, the potential for DIF occurs if (a) the θ
means are not equal, (b) the η means are not equal, (c) the ratios σ_η/σ_θ are not equal, and (d) the correlations between the valid and nuisance dimensions are not equal (Ackerman, 1992). The presence of multidimensionality in a set of items does not necessarily lead to DIF. For example, a quantitative ability test used to predict future college achievement may contain mathematical word problems requiring proficiency in reading skills. The test contains one primary dimension--quantitative ability; however, a second requisite measured skill--reading ability--is valid for the specific usage. A unidimensional analysis applied to such multidimensional data would weight the relative discriminations of the multiple traits to form a reference composite (Ackerman, 1992; Camilli, 1992). If the focal and reference groups share a common reference composite, DIF is not possible. Since any test containing two or more items will to a degree be multidimensional, practitioners should define a validity sector to identify test items measuring approximately the same composite of abilities (Ackerman, 1992). In DIF studies, the conditioning variable should consist only of items measuring the same composite of abilities. If the dimensionality of the conditioning variable is not carefully defined, the DIF analysis is matching focal and reference group examinees on different
composites of ability. This creates, in essence, the problem of trying to compare apples to oranges. The potential effect of this is to confound DIF with impact, resulting in spurious interpretations (Camilli, 1992). The effect of multidimensionality in DIF analyses has resulted in limited consistency across methods (Skaggs & Lissitz, 1992) and across differing definitions of the conditioning variable (Clauser, Mazor, & Hambleton, 1991). Further, R. L. Linn (1993) observed that rigorous DIF implementation to identify a proper set of test items may restrict validity. For example, on the SAT-Verbal (SAT-V), items with large biserial correlations with total score were more likely to be flagged by MH than items with average or below average biserial correlations. This finding suggested that traditional unidimensional DIF analyses, in part, might be statistical artifacts confounding group ability differences and item discrimination. Differential item functioning procedures based upon a multidimensional perspective and conditioning on items clearly defined from a validity sector have the potential to reduce these problems (Ackerman, 1992). Further, a multidimensional approach should also facilitate DIF explanation (Camilli, 1992). Careful evaluation and conceptualization of multiple dimensions influencing item responses could aid in the identification and isolation of DIF's causes.
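To illustrate the mechanism, the following small simulation (not drawn from the study; all parameter values are arbitrary) generates two groups with identical distributions of the target ability θ but different means on a nuisance dimension η. A studied item that loads on η then shows a focal-reference difference even among examinees matched on the observed score from items measuring θ alone.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    for group, eta_mean in [("reference", 0.0), ("focal", -0.5)]:
        theta = rng.normal(0.0, 1.0, n)        # target ability: same in both groups
        eta = rng.normal(eta_mean, 1.0, n)     # nuisance dimension: groups differ
        # Twenty matching items measure theta only (Rasch-like, difficulty 0)
        p_valid = 1.0 / (1.0 + np.exp(-theta[:, None]))
        valid_score = (rng.random((n, 20)) < p_valid).sum(axis=1)
        # The studied item taps both theta and the nuisance dimension
        p_studied = 1.0 / (1.0 + np.exp(-(0.7 * theta + 0.7 * eta)))
        studied = (rng.random(n) < p_studied).astype(int)
        matched = valid_score == 10            # examinees matched at a middle score
        print(group, round(studied[matched].mean(), 3))

Matched at the same observed score, focal group examinees answer the studied item correctly less often, which is exactly the pattern the procedures reviewed above are designed to flag.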
SIBTEST. Shealy and Stout (1993a, 1993b) have formulated a DIF detection procedure within a multidimensional conceptualization. They conceptualize a test as measuring a unidimensional trait or reference composite--the target ability--that is influenced periodically by nuisance determinants. DIF is interpreted as the consequence of the differential effect of nuisance determinants functioning on an item or set of items. The SIBTEST procedure employs factor analysis to identify a set of items that adheres to a defined validity sector. These items constitute the valid subtest, and the remaining items become the studied items. Examinees are divided into J strata based upon the valid subtest score, and the DIF index is estimated by

    β_U = Σ_{j=1}^{J} p_j (Ȳ_Rj - Ȳ_Fj),

where p_j is the pooled weighting of focal and reference group examinees who achieve X = j. The value of β_U is identical to the value of STD P-DIF when the total number of examinees is the weighting group. Shealy and Stout (1993a) have referred to the standardization procedure as the "progenitor" (p. 161) of SIBTEST. They present a significance test in which β_U is referred to its estimated standard error, computed from the within-stratum variances of the studied item scores for the two groups.
With SIBTEST the total score on the valid subtest serves as the conditioning criterion. The SIBTEST procedure resembles methods in which an observed test score is the criterion, although it incorporates an adjustment to the item means prior to comparing groups on those means. This adjustment is an attempt to remove that portion of the group mean difference attributable to group mean differences on the valid targeted ability. When the matching criterion is an observed score and the studied item is not included in the criterion score, group differences in target ability will tend to statistically inflate β_U. Consequently, SIBTEST employs a correctional procedure based upon regression and IRT theory. In effect, the purpose is to transform each observed group and ability level score mean, Ȳ_gj, into a transformed mean so that the transformed score, Ȳ*_gj, is a valid estimate of the ability level score mean. This adjustment attempts to remove that portion of group mean differences that is attributable to group differences in the underlying targeted ability. Thus,

    Ȳ*_Rj - Ȳ*_Fj
is an estimate of the difference in subtest true scores for the reference and focal groups with examinees matched on ability levels. For this transformation to yield an unbiased estimate, the valid subtest must contain a minimum of 20 items (Shealy & Stout, 1993a). SIBTEST is the only procedure based on conceptualizing DIF as a result of multidimensionality. Although it resembles the procedures that condition on observed scores, it offers a regression correction procedure that allows for conditioning on estimated true scores. Under simulated conditions it demonstrates good adherence to nominal error rates even when group target ability distribution differences are extreme, and it has been shown to be as powerful as MH in the detection of uniform DIF (Shealy & Stout, 1993a). Its multidimensional conceptualization potentially can lead to the identification of different nuisance determinants and greater understanding of DIF's causes (Camilli, 1992). The major weaknesses of SIBTEST are its inability to assist the user in detecting nonuniform DIF and the need for 20 or more items to fit a unidimensional validity sector. With a relatively short test or subtest, this latter weakness would be problematic under some practical testing situations.
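For illustration, the following is a minimal sketch of the uncorrected β_U index described above, assuming a numpy matrix of scored valid-subtest items and a vector of studied-item responses; the regression correction that replaces the observed stratum means with adjusted means, and the standard error, are omitted for brevity, and all names are illustrative rather than taken from the SIBTEST program.

    import numpy as np

    def sibtest_beta_u(valid_items, studied_item, group, focal="F", reference="R"):
        """Uncorrected SIBTEST index: pooled-weighted sum of stratum differences."""
        x = valid_items.sum(axis=1)          # valid-subtest score defines the strata
        n_total = len(group)
        beta_u = 0.0
        for j in np.unique(x):
            in_ref = (group == reference) & (x == j)
            in_foc = (group == focal) & (x == j)
            if in_ref.sum() == 0 or in_foc.sum() == 0:
                continue  # stratum must be represented in both groups
            p_j = (in_ref.sum() + in_foc.sum()) / n_total   # pooled weight
            beta_u += p_j * (studied_item[in_ref].mean() - studied_item[in_foc].mean())
        return beta_u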
Methods Summary

After 20 years of development, a plethora of sophisticated DIF procedures has been devised. Each method approaches DIF identification from a fundamentally different perspective, and each method contains advantages and limitations. Currently, no consensus exists among DIF researchers regarding a single theoretical or practical best method. The design of this study reflected this lack of consensus. I selected five different procedures, each possessing theoretical or practical appeal, to assess the item responses of examinees. The design of the study was not to compare the reliability and validity of the methods themselves, but to assess the similarity of results obtained from the methods when subpopulations were defined in conceptually different ways.

Uncovering the Underlying Causes of DIF

The overwhelming majority of DIF researchers have focused on designing statistical procedures and evaluating their efficacy in detecting aberrant items. Few researchers have attempted to move beyond the methodological issues and examine DIF's causes. The researchers broaching this topic have experienced few successes and many frustrations. Schmitt et al. (1993) proposed that explanatory DIF studies should begin with post hoc explorations of aberrant items and proceed to confirmatory experimental analyses. Researchers assessing DIF's causes use procedures that can
51 be classified as (a) post hoc speculations, (b) hypothesis testing of item categories, (c) hypothesis testing using item manipulations, and (d) manipulation of other variables. DIF can be attributed to a complex interaction between the item and the examinee (Scheuneman & Gerritz, 1990). Researchers are unlikely to find a single identifiable cause of DIF since it stems from both differences within examinees and item characteristics (Scheuneman, 1987). Researchers examining DIF from the perspective of examinee differences may uncover significant findings with implications for test takers, educators, and policy makers. Scheuneman and Gerritz (1990) suggested that "prior learning, experience, and interest patterns between males and females and between Black and White examinees may be linked with DIF" (p. 129). Researchers examining DIF from the perspective of item characteristics may discover findings with strong implications for test developers and item writers. Test developers may need to balance content and item format to ensure fairness. Post hoc evaluations, despite their limitations, dominate the literature (Freedle & Kostin, 1990; R. L. Linn & Harnisch, 1981; O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). Speculations for causes of DIF begin with an interpretation of content similarities and patterns. For many researchers, the interpretation fails to go beyond a description of the observed item patterns
52 (O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). Hypothesis testing of item categories is a second, more sophisticated, means of uncovering explanations of DIF. Doolittle and Cleary (1987) and Harris and Carlton (1993) evaluated several DIF hypotheses on math test items. Doolittle and Cleary (1987) employed ACT Assessment Mathematics Usage Test (ACT-M) items and a pseudo-IRT detection procedure to analyze differences across item categories and test forms. Male examinees performed better on geometry and mathematical reasoning items, whereas female examinees performed better on computation items. Harris and Carlton (1993), using SAT-Mathematics (SAT-M) items and the MH procedure, concluded that male examinees did better on application problems and female examinees did better on more textbook-type problems. Scheuneman (1987) analyzed 16 separate hypotheses concerning potential causes of DIF for black and white examinees by manipulating test items on the experimental portion of the GRE general test. The hypotheses, analyzed through log linear models, included examinee characteristics, such as test wiseness, and item characteristics, such as format. Complex interactions across groups, item pairs, and test content were observed. Schmitt (1988) manipulated SAT-V items for white and Hispanic examinees to test four hypotheses derived from an
earlier post hoc review. She employed the STD P-DIF index with ANOVA and found that Hispanic examinees were favored on antonym items that included a true cognate, a word with a common root in English and Spanish, and on reading passages containing material of interest to Hispanics. False cognates, words spelled similarly in both languages but containing different meanings, and homographs, words spelled alike in English but containing different meanings, tended to be more difficult for Hispanics. The differences were greater for Puerto Rican examinees, a group generally more dependent on Spanish, as compared to Mexican-American examinees. K. K. Tatsuoka, R. L. Linn, M. M. Tatsuoka, and Yamamoto (1988) studied DIF on a 40-item fractions test. They initially analyzed examinees by dividing them into two groups based upon instructional methods. This procedure failed to provide an effective means of detecting DIF. However, upon subsequent review and analysis, they divided examinees into groups based upon the solution strategies used in solving problems. With this grouping variable, they found DIF indices consistent with their a priori hypotheses. They concluded that the use of cognitive and instructional subgroup categories, although counter to traditional DIF research, contained potential for explaining DIF and diagnosing examinees' misunderstandings.
Miller and R. L. Linn (1988) considered the invariance of item parameters for the Second International Mathematics Study (SIMS) examination across different levels of mathematical instructional coverage. Although their principal concern was the multidimensionality of achievement test data as related to instructional differences and IRT model usefulness, they found that instructional differences could explain a significant portion of observed DIF. Using cluster analysis, they divided students into three instructional groups based upon teacher responses to an opportunity-to-learn questionnaire. The size of the differences in the ICCs for the instructional groups was much greater than the differences observed in previously reported comparisons of black and white examinees. They interpreted these findings as supportive of R. L. Linn and Harnisch's (1981) postulation that what appears as item bias may in reality be "instructional bias" (p. 216). Despite Miller and R. L. Linn's (1988) straightforward interpretation of instructional experiences, Doolittle (1984, 1985) found that instructional differences did not account for or parallel gender DIF on ACT-M items. He dichotomized high school math background into strong and weak, and compared a gender DIF analysis to a math background DIF analysis. In each analysis, approximately an equal number of items were detected; however, items that
tended to favor female examinees did not favor low background examinees and vice versa. Correlations of DIF indices were negative, suggesting that gender DIF was unrelated to math background DIF. 55 Muthen, Kao, and Burstein (1991), analyzing the 40 core items of the SIMS test, found several items to be sensitive to instructional effects. In approaching DIF from an alternative methodological perspective, they employed linear structural modeling to assess the effects of instruction on latent mathematics ability and item performance. They found that instructional effects had negligible effects on math ability, but had significant influence on specific test items. Several items appeared particularly sensitive to instructional influences. They interpreted the identified items as less an indicator of general mathematics ability and more an indicator of exposure to a specified math content area. In using linear structural modeling, Muthen et al. (1991) avoided the arbitrariness of defining group categories in a situation where group membership varied across items. The SIMS data permitted the estimation of instructional background for each of the 40 core items. Under most testing conditions, estimating examinee background differences to each item is impossible. However, general educational and psychological background variables can be modeled, and their relationship to unintended or
56 nuisance dimensions estimated. Analyzing the relationship of theoretical causes of DIF to nuisance dimensions combines the approaches of Muthen et al. (1991) with Shealy and Stout (1993a, 1993b). Summary Researchers investigating the underlying causes of DIF have produced few significant results. After more than 10 years of DIF studies, conclusions of test wiseness (Scheuneman, 1987) or Hispanic tendencies on true and false cognates (Schmitt, 1988) must be interpreted as meager guidance for test developers and educators. These limited results can be explained by problems inherent in traditional DIF procedures (Skaggs & Lissitz, 1992; K. K. Tatsuoka et al., 1988). Indices derived using observed total scores as the conditioning variable have been observed to be confounded with item difficulty (Freedle & Kostin, 1990) and item discrimination (R. L. Linn, 1993; Masters, 1988). Indices derived from IRT models are conceptualized from an unidimensional perspective, yet DIF is a product of multidimensionality (Ackerman, 1992; Camilli, 1992). Consequently, DIF detection procedures have been criticized for a lack of reliability between methods and across samples (Hoover & Kolen, 1984; Skaggs & Lissitz, 1992). The traditional conceptualization of dividing examinees by demographic characteristics limits DIF's explanatory potential (Skaggs & Lissitz, 1992; K. K. Tatsuoka et al.,
57 1988). The uninterpretability of findings may be because group membership is only a weak surrogate for variables of greater psychological or educational significance. For example, demographic categories (e.g., women or blacks) lack any psychological or educational explanatory meaning. Moving beyond demographic subgroups to more meaningful categories would expedite understanding of DIF's causes (R. L. Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; K. K. Tatsuoka et al., 1988). Although this conceptualization has been advocated, it has been used sparingly. Doolittle (1984, 1985), Miller and R.L. Linn (1988), Muthen et al. (1991) and K. K. Tatsuoka et al. (1988) used this conception and appeared to have reached promising, if incompatible, interpretations. Future researchers need to apply alternative approaches to DIF analyses to achieve explanatory power. Approaches advocated by Muthen et al. (1991) and Shealy and Stout (1993a, 1993b) provide sound methods that potentially permit the modeling of differing influences on item responses. Gender and Quantitative Aptitude Educational and psychological researchers have been concerned with gender differences in scores on quantitative aptitude tests (Benbow, 1988; Benbow & Stanley, 1980; Friedman, 1989; Hyde, 1981; Maccoby & Jacklin, 1974), and their implications for career opportunities (M. c. Linn & Hyde, 1989). Mathematics has been termed the "critical
58 filter" that prohibits many women from having access to high-paying and prestigious occupations (Sells, 1978). Although gender differences in quantitative ability interact with development, with elementary children demonstrating no difference or differences slightly favoring girls, by late adolescence and early adulthood, when college entrance examinations are taken and critical career decisions are made, slight differences appear favoring boys (Fennema & Sherman, 1977; Hyde, Fennema, & Lamon, 1990). In studies linking gender differences in quantitative test scores with the underrepresentation of women in prestigious technical careers, analyses should be limited to tests taken in late adolescence or early adulthood that significantly influence career decisions and opportunities. Significant and Important Test Score Differences Standardized achievement tests utilizing representative samples (e.g., National Assessment of Educational Progress, High School and Beyond) and college admissions tests utilizing self-selected samples (e.g., SAT, ACT, and GRE) have been analyzed to ascertain gender differences. Gender differences found in representative samples are systematically different from those found in self-selected samples (Feingold, 1992). Women appear less proficient, relative to men, in tests of self-selected samples of applicants as compared to representative samples. However, female students interested in technical professions must
59 successfully matriculate through a process that relies heavily upon admissions test scores. Therefore, in studying quantitative differences with the primary concern related to career decisions and opportunities, self-selected admission test scores are the most germane measures for analysis. M. c. Linn and Hyde (1989) concluded from meta-analytic studies (Friedman, 1989; Hyde et al., 1990) that "average quantitative gender differences have declined to essentially zero" (p.19), and differences in quantitative aptitude can no longer be used to justify the underrepresentation of women in technical professions. Feingold (1988) assessing gender differences in several cognitive measures on the Differential Aptitude Test (DAT) and the SAT concluded that gender differences are rapidly diminishing in all areas. The one exception to this finding was the SAT-M (Feingold, 1988). Although mean differences had either substantially diminished or vanished on DAT measures of numerical ability, abstract reasoning, space relations, and mechanical reasoning, during the past 30 years, SAT-M differences have remained relatively constant. Despite the finding that gender differences are disappearing on many mathematical ability tests, on the major college entrance examinations gender differences remain large (Halpern, 1992). over the past three decades, gender differences on the SAT-M have remained between 40 and 50 points. During this period, men averaged 46.5 points
higher on the SAT-M than women (National Center for Education Statistics, 1993). This difference can also be stated in units of an effect size of 0.39 d (in which d represents the difference between the means divided by the pooled standard deviation). The trends regarding gender differences on the ACT-M are similar. The ACT-M scale ranges from 1 to 39 points, and the mean difference favoring male examinees from 1978 to 1987 was 2.33 points, or 0.33 d (National Center for Education Statistics, 1993). This score differential has been relatively consistent and provides no indication of disappearing. The greatest disparity between men's and women's mean scores occurs on the GRE-Quantitative (GRE-Q). For the 1986-87 and 1987-88 testing years, U.S. male examinees averaged 86 and 80 points higher than U.S. female examinees (Educational Testing Service, 1991). Transformed into effect sizes, these differences are 0.67 d and 0.62 d, respectively. Gender mean score differences on the GRE-Q, in large part, reflect gender differences in choice of major field. Particularly in the case of graduate admissions tests, mean scores are confounded with gender differences in choice of undergraduate major. Analyzing GRE-Q data by intended field of study provides a more accurate comparison. For examinees intending to major in mathematics, the sciences, or engineering in 1986-87, mean score
differences favoring men were 37 and 18 points, respectively (d = .35 and .19). For examinees intending to major in the humanities and education in the same testing year, mean score differences favoring men were 44 and 37 points, respectively (d = .36 and .31). Averaging across 11 identified intended fields of study, mean score differences favoring men were 40 points (d = .35) (Educational Testing Service, 1991). Although data were available for only the 1986-87 testing year, the mean score differences and effect sizes appear to indicate that U.S. male examinees tend to score higher than U.S. female examinees on the GRE-Q in a pattern consistent with the SAT-M and ACT-M. Despite changes in the curriculum and text materials that depict both genders in less stereotypic manners (Sherman, 1983) and reductions in gender differences on many mathematics tests (Feingold, 1988), on college admissions quantitative tests gender differences are significant and appear not to be diminishing. Due to the importance of these tests regarding college admission decisions and the awarding of financial aid, the disparity in scores tends to reduce opportunities for women (Rosser, 1989).

Predictive Validity Evidence

Although mean scores on quantitative admission tests are higher for men than women, women tend to earn higher grades in high school and college (Kimball, 1989; Young, 1994). Test critics cited this paradox as principal
evidence that admission tests are biased against women (Rosser, 1989). Defenders of the use of college admission tests argued that other relevant factors explain this phenomenon (McCornack & McLeod, 1988; Pallas & Alexander, 1983). They postulated that women tend to enroll in major fields where faculty tend to grade less rigorously (e.g., women are more likely to major in the humanities whereas men are more likely to major in the sciences). Investigators analyzing the differential predictive validity of college admissions exams, therefore, must consider gender differences in course enrollment patterns. McCornack and McLeod (1988) and R. Elliot and Strenta (1988) generally found that, when differential course-taking patterns were considered, the SAT-V and -M coupled with high school grades were not biased in predicting achievement for men and women. McCornack and McLeod (1988) considered performance in introductory-level college courses at a state university and used SAT composites with high school grade point average. They found no predictive bias when analyzing data at the course level. R. Elliot and Strenta (1988) considered performance in various college-level courses at a private university and utilized SAT composites with scores from a college placement examination and high school rank. They also found no gender bias in prediction. However, they interpreted the SAT-M, when used in isolation, as underpredictive of women's college achievement. Both studies
were flawed in that they combined various predictors and found no bias. Had they separately studied the SAT-M and high school grades, they might have arrived at a different interpretation. Bridgeman and Wendler (1991) and Wainer and Steinberg (1992) conducted more extensive studies and concluded that, for equivalent mathematics courses, the SAT-M tends to underpredict college performance for women. Bridgeman and Wendler (1991) studied the SAT-M as a predictor of college mathematics course performance at nine colleges and universities. They divided mathematics courses into three categories and found that, in algebra and pre-calculus courses, women's achievement was underpredicted and, in calculus courses, no underprediction occurred. The most extensive study to date concerning the predictive validity of the SAT-M was conducted by Wainer and Steinberg (1992). Analyzing nearly 47,000 students at 51 colleges and universities, they concluded that, for students in the same relative course receiving the same letter grade, the SAT-M underpredicted women's achievement. Using a backward regression model, they estimated that women, earning the same grades in similar courses, tended to score roughly 25-30 points lower on the SAT-M.

Summary

Researchers and test users have been troubled by the consistent finding that men tend to outperform women on
quantitative admissions exams, although women generally outperform men in high school and college courses. The principal explanation offered for this paradox is gender differences in course taking. Researchers investigating the relationship of quantitative admission tests and subsequent achievement, controlling for course-taking patterns and course performance, have concluded that, in equivalent mathematics courses, the tests underpredict women's achievement. Although the underprediction is not as large as the mean score differences, quantitative admission tests do appear to be biased in underpredicting women's college achievement. It is recognized that predictive bias and DIF are fundamentally distinct; however, the determination of predictive bias in quantitative admission tests makes them an evocative instrument for analysis.

Potential Explanations of DIF

This study will approach DIF from the perspective of examinee characteristics. When analyzing DIF explanations from this perspective, theoretical explanations of predictive bias offer a reasonable point of departure. Kimball (1989) presented three theoretical explanations for the paradoxical relationship of gender differences on admissions test scores and college grades: (a) men have greater mathematical experience, which enables them to more easily solve novel problems, (b) women tend to develop rote learning styles whereas men tend to develop autonomous
learning styles, and (c) men tend to prefer novel tasks whereas women tend to prefer familiar tasks. To these three theoretical explanations, I would submit a fourth explanation related to test-taking behavior--differences between men and women in test anxiety.

Differences in Mathematics Background

It is well documented that as students enter high school and proceed toward graduation boys tend to take more mathematics courses than girls (Fennema & Sherman, 1977; Pallas & Alexander, 1983). During the 1980s, high school boys averaged 2.92 Carnegie units of mathematics whereas high school girls averaged 2.82 Carnegie units (National Center for Education Statistics, 1993). Although high school girls entered the upper-track ninth grade mathematics curriculum in slightly greater numbers than boys, by graduation, boys outnumbered girls in advanced courses such as calculus and trigonometry. High school boys were more likely to study computer science and physics than girls (National Center for Education Statistics, 1993). These trends continue as students enter college. During the 1980s, men slightly outnumbered women in achieving undergraduate mathematics degrees, and overwhelmingly outnumbered women in attaining undergraduate degrees in engineering, computer science, and physics. Gender disparities became even greater in the attainment of graduate degrees in mathematics, engineering, computer
science, and physics (National Center for Education Statistics, 1993). Researchers investigating the relationship between mathematics background and test scores have found that, when enrollment differences are controlled, gender differences on mathematical reasoning tests are reduced (Fennema & Sherman, 1977; Pallas & Alexander, 1983; Ethington & Wolfle, 1984). Gender score differences on the SAT-M, when high school course taking was controlled, were reduced by approximately two-thirds (Pallas & Alexander, 1983) and by one-third (Ethington & Wolfle, 1984). These studies analyzed total score differences controlling for course background. Miller and R. L. Linn (1988) and Doolittle (1984, 1985) analyzed item differences controlling for instructional differences, but their results were contradictory. Background differences offer a plausible explanation for DIF that invites additional investigation.

Rote Versus Autonomous Learning Styles

Boys tend to develop a more autonomous learning style, which facilitates performance on mathematics reasoning problems, and girls tend to develop a rote learning style, which facilitates classroom performance (Fennema & Petersen, 1985). Socialization patterns at home and in school tend to create these two distinct, gender-based learning styles. Students displaying an autonomous learning style tend to
67 do better, are more motivated, and are more likely to persevere on difficult tasks presented in a novel and independent format. Students displaying rote learning behavior tend to do well applying memorized algorithms learned in class and are heavily dependent upon teacher direction. Often, these students tend to choose less challenging tasks when given an option. This dichotomy is congruent with the finding that girls tend to perform better on computational problems and boys tend to perform better on application and reasoning problems (Doolittle & Cleary, 1988; Harris & Carlton, 1992). The autonomous versus rote learning style theory is consistent with the literature addressing gender socialization patterns and standardized test performances. Before it can be further applied, however, it must be more completely operationalized (Kimball, 1989). To validate this theory, researchers must demonstrate that boys and girls approach the study of mathematics differently, and then relate learning styles to achievement on classroom assessments and standardized tests (Kimball, 1989). Novelty Versus Familiarity Kimball (1989) hypothesized that girls tend to be more motivated to do well and are more confident when working with familiar subject matter. Boys, on the other hand, tend to work harder and are more confident on novel tasks. Subsequently, girls tend to demonstrate higher achievement
68 on familiar classroom assessments and boys tend to demonstrate higher achievement on novel standardized tests. This theory is based on the work of Dweck and her colleagues (Dweck, 1986; E. s. Elliot & Dweck. 1987; Licht & Dweck, 1983) who related attributions to learning and achievement. Students with a performance orientation and low confidence tend to avoid difficult and threatening tasks. They prefer familiar, non-threatening tasks and seek to avoid failure. Students with a performance orientation and high confidence are more likely to select moderately challenging tasks. Consistent findings demonstrate that girls tend to have less confidence in their mathematical abilities than boys (Eccles, Adler, & Meece, 1984; Licht & Dweck, 1983). Girls are also more likely on standardized tests to leave items unanswered or mark "I don't know" when given this option (M. c. Linn, DeBenedictis, Delucchi, Harris, & Stage, 1987). Girls, more so than boys, attribute their success in mathematics to effort rather than ability and their failures to lack of ability (Fennema, 1985; Ryckman & Peckham, 1987). Therefore, due to less confidence in their abilities, girls generally are less motivated on novel mathematical tasks, find them more threatening, and perform less well. Test Anxiety Test anxiety has been hypothesized to adversely influence examinees' total scores on IQ tests, aptitude, and
69 achievement tests. High test anxiety individuals tend to score lower than low test anxiety individuals of comparable ability (Hembree, 1988; Sarason, 1980). Because aptitude and achievement tests are not intended to include test anxiety as a component of total score, and because an estimated 10 million elementary and secondary pupils have substantial test anxiety (Hill & Wigfield, 1984), it exemplifies a nuisance factor influencing item responses. Test anxiety has been theorized in both cognitive and behavioral terms (Hembree, 1988; Sarason, 1984; Spielberger, Gonzales, Taylor, Algaze, & Anton, 1978; Wine, 1980). Liebert and Morris (1967) proposed a two dimensional theory of test anxiety, consisting of worry and emotionality. Worry includes any expression of concern about one's performance and consequences stemming from inadequate performance. Emotionality refers to the autonomic reactions to test situations (e.g., increased heartrate, stomach pains, and perspiration). Hembree (1988) used meta analysis for 562 test anxiety studies and found that, although both dimensions related significantly to performance, worry was more strongly correlated to test scores. The mean correlations for worry and emotionality to aptitude/achievement tests were -0.31 and -0.15, respectively. Based upon a two dimensional model of test anxiety, Spielberger et al. (1978) proposed the Test Anxiety Inventory (TAI).
70 Wine (1980) proposed a cognitive-attentional interpretation of test anxiety in which examinees who are high or low on test anxiety experience different thoughts when confronted by test situations. The low test anxious individual experiences relevant thoughts and attends to the task. The high test anxious individual experiences self preoccupation and is absorbed in thoughts of failure. These task irrelevant cognitions not only create unpleasant experiences, but act as major distractions. Sarason (1984) proposed the Reactions to Test (RTT) scale based upon a cognitive, emotional, and behavioral model. The 40-item Likert-scaled questionnaire operationalized a four dimensional test anxiety model of (a) worry, (b) tension, (c) bodily symptoms, and (d) test-irrelevant thinking. Benson and Bandalos (1992), in a confirmatory cross validation, found the four-factor structure of the RTT problematic. They speculated that misfit resulted from the large number of similarly worded items. Through a process of item deletion, they found substantial support for a 20item four-factor model. To further validate the structure of test anxiety, Benson, Moulin-Julian, Schwarzer, Seipp, and El Zahhar (1991) combined the TAI and the RTT to formulate a new scale. The Revised Test Anxiety scale (RTA) was validated with multi-national samples and further refined (Benson & El Zahhar, 1994).
71 The cognitive and emotional structure of math anxiety is closely related to test anxiety. Richardson and Woolfolk (1980) demonstrated that math anxiety and test anxiety were highly related, and mathematical testing provided a superb context for studying test anxiety. They reported correlations between inventories of test anxiety and math anxiety ranging near 0.65. They commented that "(t)aking a mathematics test with a time limit under instructions to do as well as possible appears to be nearly as threatening as a real-life test for most mathematics-anxious individuals" (p. 271). Children in first and second grade indicate inconsequential test anxiety levels, but by third grade test anxiety emerges and increases in severity until sixth grade. Female students at all age levels tend to possess higher test anxiety levels than male students at all grade levels (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988). Some behavioral and cognitive-behavioral treatments have been demonstrated to effectively reduce test anxiety and lead to increases in performance (Hembree, 1988). This finding supports the causal direction of test anxiety producing lower performance and test anxiety's multidimensional structure. The preponderance of research on test anxiety has focused on the relationship of test anxiety to total score performance. Harnisch & R. L. Linn (1981) speculated that
in cases of model misfit, forces such as test anxiety might unduly influence performance at the item level. High test anxiety individuals may find some items differentially more difficult than other test items.

Summary

I have reviewed several different methods of identifying DIF. The MH procedure, in large part because of its computational efficiency, has emerged as the most widely used method. It is limited in terms of its flexibility, and as researchers continue to search for underlying explanations of DIF, its limitations will become more apparent. Logistic regression models (Swaminathan & Rogers, 1990) provide an efficient method that has greater flexibility than MH and potentially models theoretical causes of DIF. Raju's (1988) IRT signed and unsigned area measures supply a theoretically sound method of contrasting item response patterns. Shealy and Stout's (1993a, 1993b) SIBTEST conceptualizes DIF as a multidimensional phenomenon and defines a validity sector as the conditioning variable. Its sound theoretical foundation, coupled with its computational efficiency and explanatory potential, makes it perhaps the most comprehensive DIF procedure. These five approaches were employed for the study. Linear structural modeling was used to factor analyze item responses and define a valid subset of test items. The five DIF methods were applied both before validation and after incorporating the
findings of the validation study. Thus, the significance of validation for the consistency of DIF estimation was considered. Gender DIF on quantitative test items will serve as the context for this study. This context was chosen because of the paradoxical finding that men tend to score higher on standardized tests of math reasoning, although women tend to achieve equivalent or higher course grades. Gender, a common categorical variable in DIF studies, will be supplemented by dichotomizing examinees into substantial and weak mathematics background and into high and low test anxiety. This study is based on the premise that gender differences serve as a surrogate for differences in background and test anxiety. The two variables were selected in an effort to explain DIF in terms consistent with theoretical explanations of gender differences in mathematics test scores and course achievement. Mathematics background has been applied in other DIF studies with inconsistent interpretations. Test anxiety is of interest to both educators and cognitive psychologists and is highly related to performance. The study attempts to determine whether the use of these variables serves to improve the consistency of DIF indices across detection methods and to aid in illuminating DIF's causes.
CHAPTER 3 METHODOLOGY The present study was designed to investigate the inter-method consistency of five separate differential item functioning (DIF) indices and associated statistical tests when defining subpopulations by educationally significant variables as well as the commonly used demographic variable of gender. The study was conducted in the context of college admission quantitative examinations and gender issues. The study was designed to evaluate the effect on DIF indices of defining subpopulations by gender, mathematics background, and test anxiety. Factor analytic procedures were used to define a structurally valid subtest of items. Following the identification of a valid subtest, the DIF analysis was repeated. The findings of the DIF analysis before validation were contrasted with the DIF analysis based on the valid subset. A description of examinees, instruments, and data analysis methods is presented in this chapter. 74
75 Examinees The data pool to be analyzed consisted of test scores and item responses from 1263 undergraduate college students. The sample consisted of 754 women and 509 men. I solicited the help of various instructors in the colleges of education and business, and in most cases, students participated in the study during their class time. Of the total sample of examinees, 658 individuals were tested in classes of the college of education, 483 individuals were tested in classes in the college of business, and 122 individuals were tested at other sites on campus. women and examinees with little mathematics background were the largest groups in the college of education classes, and men and examinees with substantial mathematics background were the largest groups in the college of business classes (see Table A.1 of Appendix A for examinee frequencies by test setting, gender, and mathematics background). The majority of students received class credit for participating. No remuneration was provided to any participant. All students had previously taken a college admissions examination, and some of the students (approximately 37 percent) had taken the Graduate Record Examination-Quantitative Test (GRE-Q).


Instruments

The operational definition of a collegiate-level quantitative aptitude test was a released form of the GRE-Q. Test anxiety was operationally defined by a widely used, standardized measure, the Revised Test Anxiety Scale (RTA). The mathematics background variable was measured using the dichotomous response to an item concerning whether or not students had completed a particular advanced mathematics class at the college level (i.e., calculus). In the following sections, a more detailed description of each of these instruments is presented, accompanied by technical information that supports use of the particular instruments or item for the purpose of the study.

Released GRE-Q

Each examinee completed a released form of the GRE-Q. The 30-item test, supplied by Educational Testing Service (ETS), was a 30-minute timed examination. The sample test contained "many of the kinds of questions that are included in currently used forms" (ETS, 1993, p. 39) of the GRE-Q. The test was designed to measure basic mathematical skills and concepts required to solve problems in quantitative settings. It was divided into two sections. The format of the first section, quantitative comparison items, measured the ability to


reason accurately in comparing the relative size of two quantities or to recognize when insufficient information had been provided to make such a comparison. The format of the second section, employing multiple-choice items, assessed the ability to perform computations and manipulations of quantitative symbols and to solve word problems in applied or abstract contexts. The instructional background required to answer items was described as "arithmetic, algebra, geometry, and data analysis," and as "content areas usually studied in high school" (ETS, 1993, p. 18). The internal consistency of the test for the 1263 participants was relatively good, KR-20 = 0.79. In a pilot study, the sample test correlations with the GRE-Q for 55 examinees and with the Scholastic Aptitude Test-Mathematics (SAT-M) for 58 examinees were 0.67 and 0.79, respectively. Thus, the scores on the released GRE-Q were similar to scores examinees earned on other college admissions quantitative examinations.

Revised Test Anxiety Scale (RTA)

The RTA scale (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991) was formed by combining the theoretical framework of two recognized measures of test anxiety--the Test Anxiety Inventory (TAI) (Spielberger, Gonzales, Taylor, Algaze, and Anton, 1978) and the


Reactions to Tests (RTT) (Sarason, 1984). The TAI, based upon a two-factor theoretical conception of test anxiety--worry and emotionality (Liebert & Morris, 1967)--contained 20 items. Sarason (1984) augmented this conceptualization with a four-factor model of test anxiety--worry, tension, bodily symptoms, and test-irrelevant thinking. To capture the best qualities of both scales, Benson et al. (1991) combined the instruments to form the RTA scale. They intended that the combined scale would capture Sarason's four proposed factors. From the original combined set of 60 items, using a sample of more than 800 college students from three countries, they eliminated items on the basis of (a) not loading on a single factor, (b) having low item/factor correlations, and (c) having low reliability. They retained 18 items, each loading on the intended factor and containing high item reliability. The bodily symptoms subscale, containing only 3 items, was problematic due to low internal reliability. Consequently, Benson and El-Zahhar (1994) further refined the RTA scale and developed a 20-item scale with four factors and relatively high subscale internal reliability (see Table 4). With a sample of 562 college students from two countries, randomly split into two samples, they cross-validated the RTA scale and found approximately equivalent item-factor loadings, factor


correlations, and item uniquenesses. Descriptive statistics for each subscale of the RTA for Benson and El-Zahhar's (1994) American sample and this study's sample are reported in Table 4. The instrument was selected because evidence of its reliability and construct validity compared favorably with that of other leading test anxiety scales used with college students.

Table 4
Descriptive Statistics for the 20-Item RTA Scale

                                 Benson & El-Zahhar            Study Sample
                                 American Sample (N = 202)     (N = 1263)
Scale                            M       SD      Alpha         M       SD      Alpha
Total Scale                      38.31   10.40   .91           39.17   9.37    .89
Worry (6)                        11.61    3.59   .81           12.03   3.50    .80
Tension (5)                      12.81    3.85   .87           13.01   3.68    .84
Test Irrelevant Thinking (4)      6.61    2.53   .81            6.79   2.60    .83
Bodily Symptoms (5)               7.54    2.79   .76            7.35   2.55    .76

Note. Number of items per subscale is in parentheses. First entry in each column is the mean, second entry is the standard deviation, and the third entry is Cronbach's alpha.
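The subscale reliabilities in Table 4 are Cronbach's alpha coefficients. As a point of reference only, the short sketch below shows how alpha is computed from an examinees-by-items score matrix; the simulated Likert responses are illustrative and are not the study's data. For the dichotomously scored GRE-Q items, the same coefficient reduces to the KR-20 value of 0.79 reported earlier.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for an (examinees x items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_var = scores.var(axis=0, ddof=1)         # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total scores
        return (k / (k - 1)) * (1 - item_var.sum() / total_var)

    # Illustrative data: 200 examinees answering a 6-item subscale on a 1-4 scale.
    rng = np.random.default_rng(0)
    trait = rng.normal(size=(200, 1))                 # latent anxiety level
    responses = np.clip(np.round(2.5 + trait + rng.normal(scale=0.8, size=(200, 6))), 1, 4)

    print(round(cronbach_alpha(responses), 2))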


80 Mathematics Background Researchers have experienced problems selecting the best approach to measure subjects' mathematics background (Doolittle, 1984). Typically, methods for classifying subjects' background include (a) asking subjects to report the number of mathematics credits earned or semesters studied (Doolittle, 1984, 1985; Hacket & Betts, 1989; Pajares & Miller, 1994) or (b) asking subjects a series of questions related to specific courses studied (Chipman, Marshall, & Scott, 1991). Asking subjects questions concerning their course background implies that one or two "watershed" mathematics courses qualitatively capture subjects' instructional background. To decide which of these two options to employ in this study, I conducted a pilot study to ascertain whether measuring examinees' mathematics background by quantitatively counting mathematics credits earned or by qualitatively identifying a watershed mathematics course was more useful. In a pilot study, 121 undergraduates were asked to answer the five questions posed by Chipman et al. (1991) and report the number of college credits earned in mathematics (see Appendix C for questions and the scoring scheme used with Chipman et al., 1991). Subjects were divided at the median into two groups and classified as possessing substantial or little mathematics background.


The subjects were then divided by using their responses to the single question about successful completion of a college calculus course. The two methods of dividing the 121 subjects into two background groups had an 84% agreement rate; however, correlations of these two predictors with performance on the GRE-Q and SAT-M indicated that the dichotomous calculus completion question was more valid for students in this study. The pattern of relationships between these tests, the calculus question, and the number of mathematics credits earned indicated that for these college students, calculus completion had a stronger relationship to the test scores (r = .50 to .51) than the number of mathematics credits earned (r = .08 to .40) (see Table 5).

Table 5
Correlations of Calculus Completion, SAT-M, GRE-Q, and College Mathematics Credits

              Calculus Completion    Total Credits
SAT-M         .51 (58)               .08 (58)
GRE-Q         .50 (55)               .40 (55)
Credits       .49 (141)

Note. The number in parentheses represents the number of subjects each correlation is based upon.

In a continuation of the pilot study, 41 examinees reported they had successfully taken a college calculus


course, and 100 examinees reported they had not successfully taken a college calculus course. The 41 examinees reporting successful completion of a college calculus course had earned an average of 13.3 college mathematics credits. The 100 students reporting they had not successfully completed a college calculus course had earned an average of 5.7 college mathematics credits. Therefore, for this sample there was substantial evidence that calculus courses serve as a watershed to other more advanced mathematics courses, and that completion of a calculus course could be used to differentiate students in terms of mathematics background. Subsequently, mathematics background was operationalized by having each examinee answer the following question: "Have you successfully completed a college-level calculus course?" Examinees responding yes were classified as having a substantial background, and examinees responding no were classified as having little background. Utilizing examinee responses to the question of calculus completion was justified because of (a) the high degree of agreement between calculus completion and students' college course backgrounds, (b) the higher correlation of calculus completion to students' SAT-M and GRE-Q scores than total mathematics credits to students' SAT-M and GRE-Q scores, and (c) the need to dichotomize


the sample by mathematics background in applying the DIF procedures. Analysis Testing Procedures and Subpopulation Definitions 83 Prior to taking the released GRE-Q, examinees answered the Differential Item Function Questionnaire (see Appendix B). It contained demographic questions and the RTA scale. Examinees provided information regarding their gender, mathematics background, and test anxiety. Examinees were classified as having substantial or little mathematics background by answering the question concerning completion of a college calculus course. Of the 1263 participants, 626 reported that they had completed a college calculus course and 637 reported that they had not completed a college calculus course. Frequency counts and percentages of mathematics background by gender are presented in Table 6. Men and women did not possess similar mathematics backgrounds. In the sample, 64% of the men reported completing a college calculus class, whereas 40% of the women reported completing a college calculus class. High and low test anxious groups were formed in the following manner. Examinees scoring in approximately the highest 45 percent of the distribution on the RTA scale were defined as possessing high levels of test anxiety.


Table 6
Frequencies and Percentages for Gender and Mathematics Background

                    Mathematics Background
              Substantial      Little       Total
Women   n         301            453          754
        Pct.      23.8           35.9         59.7
Men     n         325            184          509
        Pct.      25.7           14.6         40.3
Total   n         626            637         1263
        Pct.      49.6           50.4         100

Examinees scoring in the middle 10 percent of the distribution were defined as possessing moderate levels of test anxiety. Examinees scoring in approximately the lowest 45 percent of the distribution were defined as possessing low levels of test anxiety. For the analysis, examinees classified as possessing moderate levels of test anxiety were deleted, and item responses of high test anxiety examinees were compared to item responses of low


test anxiety examinees. Women tended to be classified as having high test anxiety at greater rates than men. Following the completion of the questionnaire, examinees answered the 30-item GRE-Q. Examinees received a standard set of instructions and were told they had 30 minutes to complete the test. Examinees were requested to do their best, and following the test, if they desired, they could learn their results.

DIF Estimation

The five different methods for estimating DIF were Mantel-Haenszel (MH) (Holland & Thayer, 1988), Item Response Theory-Signed Area (IRT-SA) and Item Response Theory-Unsigned Area (IRT-UA) (Raju, 1988, 1990), Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993b), and logistic regression (Swaminathan & Rogers, 1990). A distinction was made between uniform and alternate measures. Uniform and nonuniform methods estimate DIF in fundamentally different ways. If nonuniform DIF exists, the two approaches produce unique findings (Shepard, Camilli, & Williams, 1984). Consequently, the five methods were divided into two groups. Mantel-Haenszel, IRT-SA, and SIBTEST formed the uniform measures of DIF. Logistic regression and IRT-UA, methods capable of detecting nonuniform DIF, coupled with MH, formed the alternate measures of DIF. Although MH was


86 not designed to measure nonuniform DIF, test practitioners have used it extensively indicating that in actual testing circumstances they assume nonuniform DIF is either trivial or a statistical artifact. By examining the relationships between the DIF indices estimated by MH to those estimated by IRT-UA and logistic regression, researchers will be able to determine if important information is lost when only uniform methods are used. Mantel-Haenszel indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Item Response Theory signed and unsigned indices and tests of significance were estimated using PC-BILOG 3 (Mislevy & Bock, 1990) in combination with SAS 6.03 (SAS Institute, Inc., 1988). SIBTEST indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Logistic regression indices and tests of significance were estimated through SAS 6.03 (SAS Institute Inc., 1988). Thus, each of the 30 test items was analyzed with three different subpopulation definitions and five different DIF procedures, producing for each item 15 distinct indices and significance tests. Structural Validation The structural component of construct validation concerned the extent to which items are combined into scores that reflect the underlying latent construct


(Messick, 1988). The structural component is appraised by analyzing the interrelationships of test items. The released GRE-Q was structurally validated through factor analysis of the matrix of tetrachoric coefficients for the 30-item test for a subsample of examinees. Initially, the sample of 1263 examinees was randomly split into two subsamples. The first subsample was used for the exploratory study, and the second subsample was used to cross-validate the findings derived from the exploratory analysis. The tetrachoric coefficient matrix was generated with PRELIS (Joreskog & Sorbom, 1989a). Factor analytic models using an unweighted least squares solution through LISREL 7 (Joreskog & Sorbom, 1989b) were used to assess item dimensionality and potential nuisance determinants.

Research Design

Prior to validation, I assessed the consistency of the combination of five DIF methods and three subpopulation definitions. The inter-method consistency of DIF indices was assessed through a multitrait-multimethod (MTMM) matrix. The inter-method consistency of DIF significance tests was assessed by comparing percent-of-agreement rates between DIF methods when subpopulations were defined by gender, mathematics background, and test anxiety.
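A minimal sketch of how such an MTMM matrix can be assembled, assuming each DIF method has already produced one index per item under each subpopulation definition. The random placeholder values stand in for the actual indices; the method and trait labels simply mirror the design described above.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    methods = ["MH", "IRT-SA", "SIBTEST"]       # uniform DIF procedures
    traits = ["Gender", "MathBkd", "TA"]        # subpopulation definitions
    n_items = 30

    # One DIF index per item for every (method, trait) combination; placeholders here.
    indices = {f"{m}-{t}": rng.normal(size=n_items) for m in methods for t in traits}

    mtmm = pd.DataFrame(indices).corr()         # 9 x 9 multitrait-multimethod matrix

    # Convergent (monotrait-heteromethod) coefficients: same trait, different methods.
    for t in traits:
        for i, m1 in enumerate(methods):
            for m2 in methods[i + 1:]:
                r = mtmm.loc[f"{m1}-{t}", f"{m2}-{t}"]
                print(f"{t:8s} {m1} vs {m2}: r = {r:+.2f}")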


88 A subset of unidimensional items was identified by applying factor analytic procedures. Problematic items and items contaminated by nuisance determinants were identified. Following structural validation, the DIF analysis was repeated. Utilizing the combination of DIF methods and subpopulation definitions, DIF indices and significant tests were generated for the subset of items. The consistency of the DIF indices and associated inferential statistics was assessed. The findings assimilating validation were compared to the preceding findings to appraise the effect of structural validation on DIF analyses. DIF Research Questions Research questions one through four addressed the consistency of DIF indices through two MTMM matrices of correlation coefficients. Research questions one through four were first applied to the analysis of uniform DIF procedures and the MTMM matrix derived from these coefficients (see Table 1 on page 9). The same set of questions were then applied to the alternate DIF procedures and the MTMM matrix derived from these coefficients (see Table 2 on page 10). The first question applied to the uniform DIF procedures focused on the convergent validity coefficients often termed the monotrait-heteromethod coefficients


(e.g., the correlation of DIF indices when the subgroup or trait is gender and the methods are MH and IRT-SA). Were the convergent coefficients based upon the subpopulations of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations? Specific statistical hypotheses were formulated to provide criteria for addressing the research questions. Let ρ_MI(G) represent the correlation between the MH and IRT-SA DIF indices for the 30 items when examinee subpopulations are defined by gender. Let ρ_MS(G) represent the correlation between the MH and SIBTEST indices for the 30 items when examinees are defined by gender. Let ρ_IS(G) represent the correlation between the IRT-SA and SIBTEST indices for the 30 items when examinees are defined by gender. Comparable notation will represent examinee subpopulations defined by mathematics background (M) and test anxiety (TA). Three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the uniform methods. They were as follows:

H1a: ρ_MI(M) > ρ_MI(G), H1b: ρ_MI(TA) > ρ_MI(G),
H2a: ρ_MS(M) > ρ_MS(G), H2b: ρ_MS(TA) > ρ_MS(G),
H3a: ρ_IS(M) > ρ_IS(G), and H3b: ρ_IS(TA) > ρ_IS(G).

The first question applied to the alternate DIF procedures also addressed the convergent or monotrait-heteromethod coefficients. Were the convergent coefficients based upon the subgroups of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations? Similarly, for the alternate procedures let ρ_MI, ρ_ML, and ρ_IL represent the correlations between the MH and IRT-UA indices, between the MH and logistic regression indices, and between the IRT-UA and logistic regression indices, with the parenthetical subscript again denoting the subpopulation definition. Three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the alternate methods. They were as follows:

H1a: ρ_MI(M) > ρ_MI(G), H1b: ρ_MI(TA) > ρ_MI(G),
H2a: ρ_ML(M) > ρ_ML(G), H2b: ρ_ML(TA) > ρ_ML(G),
H3a: ρ_IL(M) > ρ_IL(G), and H3b: ρ_IL(TA) > ρ_IL(G).

The most efficient statistical test of two dependent correlations within a correlational matrix that do not share a common variable is

    z* = (z_jk - z_hm) sqrt[(N - 3) / (2 - 2s_jk,hm)],

where N is the number of observations on which each correlation is based, z_jk and z_hm are Fisher z transformations of values taken from the MTMM matrix, and s_jk,hm is the asymptotic covariance of r_jk and r_hm (Steiger, 1980). This statistic has a z distribution and is easily interpreted. Steiger's modified z* was combined with a Bonferroni-Holm procedure to control Type I errors. Using directional hypotheses, nominal Type I error rates were set within each family of hypotheses at .025 (.05/2) for the larger z*-value and .05 for the smaller z*-value. The second research question addressed whether the convergent coefficients (monotrait-heteromethod coefficients) were higher than the discriminant validity coefficients measuring different traits by identical methods. Campbell and Fiske (1959) maintained that when heterotrait-monomethod coefficients become larger than convergent coefficients a strong method effect is


92 apparent. This criterion required each convergent coefficient to be higher than the four comparison coefficients of the corresponding triangular submatrices. The analysis of this question was applied to the uniform MTMM matrix and the alternate MTMM matrix. The third research question focused on whether the convergent coefficients (monotrait-heteromethod coefficients) were higher than the discriminant validity coefficients measuring different traits by different methods. Convergent coefficients lower than heterotrait heteromethod coefficients imply that agreement on a particular trait is not independent of agreement on other traits (Campbell & Fiske, 1959). This criterion required each convergent coefficient to be higher than the other four coefficients in the same row and column of the square submatrix. The analysis of this question was applied to the uniform MTMM matrix and the alternate MTMM matrix. The fourth research question required the pattern between the three traits to be similar for the same and different methods. When the number of traits is small (e.g., three or four), this criterion is usually examined by inspection of the rank order of the correlations (Marsh, 1988). Fulfillment of this criterion provided evidence of true trait correlations independent of the method of assessment (Campbell & Fiske, 1959). Again, the


analysis of this question was applied to both MTMM matrices. The final research questions involved the consistency of DIF significance testing between methods when subpopulations are defined in different ways. Each of the five DIF methods included a statistical test to determine items exhibiting significant levels of DIF. Using the conventional alpha level of .05, the 30 items were classified as differentially functioning or nondifferentially functioning in each of the 15 cases. Within each of the three ways of conceptualizing subpopulations, the percent-of-agreement in classifying items was determined between the three uniform methods and the three alternate methods. Three percent-of-agreement rates for the uniform methods and three percent-of-agreement rates for the alternate methods were calculated for gender. Percent-of-agreement rates for the uniform and alternate methods also were calculated for the mathematics background and test anxiety DIF analyses. It was hypothesized that for both the uniform and alternate methods the percent-of-agreement rates between methods would be higher for the mathematics background analysis and the test anxiety analysis as compared to the gender analysis.
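The percent-of-agreement computation itself is simple; the sketch below compares the significance decisions of two hypothetical methods over 30 items, where 1 flags an item detected as differentially functioning at the .05 level. The flag vectors are invented for the example.

    # DIF flags (1 = significant at .05, 0 = not) for 30 items under two methods.
    method_a = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
                0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    method_b = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
                0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

    agreements = sum(a == b for a, b in zip(method_a, method_b))
    pct_agreement = 100 * agreements / len(method_a)
    print(f"{pct_agreement:.1f}% agreement")   # 29 of 30 classifications match -> 96.7%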


94 Structural Validation Study Since the purpose of the study was to investigate the structure of the released GRE-Q and to validate these relationships, the sample was randomly split into two subsamples. The first subsample consisted of 669 examinees--393 women and 276 men, and the second subsample consisted of 594 examinees--361 women and 233 men. The first subsample was used to investigate the dimensionality of the GRE-Q and to identify a subset of unidimensional items. The second subsample was used to cross-validate the findings derived from the exploratory study. Linear structural equation modeling (LISREL) was implemented to assess item dimensionality. My objective was to identify a subset of unidimensionally valid items and problematic items for the GRE-Q. Problematic items were defined as items that possessed substantial loadings on more than a single factor or items that did not have an adequate loading on the dominant factor. An adequate loading was operationalized as 0.30 or greater. A matrix of item tetrachoric coefficients initially was analyzed with a one-factor solution using an unweighted least squares procedure. To learn which items might be potentially problematic, I analyzed the item standardized estimates and residuals. For items seen as possibly problematic, I appraised their relationship to


95 the remaining items. To evaluate the goodness-of-fit for the unidimensional model, I interpreted the Bentler-Bonett GFI (1980) and the Tucker-Lewis GFI (1973). Based upon the findings derived from the unidimensional model, several alternate models were hypothesized. The alternate models contained multidimensional items representing nuisance. To assess the accuracy of the hypothesized multidimensional models, I evaluated goodness of fit indices, interfactor correlations, item standardized estimates, and item residuals. I classified items as unidimensional if, throughout the analyses, they maintained adequate loadings on the dominant factor and low loadings on other factors. Following the analysis of the hypothesized multidimensional models, I defined a subset of unidimensional items. After the subset of unidimensional items was defined, the DIF analysis was repeated. DIF indices and significant tests were generated for the structurally valid subtest using the five methods and three subpopulation definitions. To assess the consistency of DIF indices, I generated a MTMM matrix for the uniform methods and a MTMM matrix for the alternate methods. I applied the four research questions evaluating DIF indices through the MTMM matrix. To assess the consistency of DIF


96 significance test, I compared the percent-of-agreement rates between methods for gender, mathematics background, and test anxiety. The findings from the DIF analysis following validation then were contrasted with the findings of the DIF analysis using the full test with no structural validation. The primary objective was to identify the differences between the two analyses and the influence of validation on DIF methods. Summary The MTMM matrices were used to investigate the relationship between three measures of uniform DIF and three alternate measures of DIF with subpopulations defined by gender, mathematics background, and test anxiety. A primary concern was to detect if conceptualizing DIF in terms of relevant educational and psychological variables improved the consistency of DIF methods. DIF significance tests were assessed by contrasting item classification percent-of-agreement rates within subpopulation definitions and between DIF methods. A second concern of the study was to learn the influence of validation on the consistency of DIF methods. A structural validation procedure was conducted to identify unidimensional items and problematic items. Problematic items were defined as items possessing large


loadings on more than a single factor or items having inadequate loadings on the dominant factor. Following validation, using the valid, unidimensional items, the DIF analysis was repeated to determine the consequence of validation on the results of the study.
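Because the Mantel-Haenszel index anchors both groups of DIF procedures described in this chapter, a compact illustration of its core computation may be useful before turning to the results: the common odds ratio is pooled over strata of the matching total score and then transformed to the ETS delta (D-DIF) scale. The stratified counts below are fabricated solely so the example runs; the actual MH estimates were obtained with the SIBTEST program as noted above.

    import math

    # Each stratum of the matching variable: (reference correct, reference incorrect,
    #                                          focal correct, focal incorrect).
    strata = [
        (40, 10, 30, 20),
        (55, 15, 45, 25),
        (60, 20, 50, 30),
    ]

    num = 0.0   # running sum of A*D/N over strata
    den = 0.0   # running sum of B*C/N over strata
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n

    alpha_mh = num / den                   # Mantel-Haenszel common odds ratio
    mh_d_dif = -2.35 * math.log(alpha_mh)  # D-DIF scale; negative values mean the item is harder for the focal group
    print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")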


CHAPTER 4
RESULTS AND DISCUSSION

In this chapter, I first provide descriptive statistics and frequency distributions of the principal variables. Second, I present and discuss findings of the data analyses relevant to each hypothesis. Last, I offer results of data analyses not directly related to the major questions but that were insightful and informed the theoretical implications in the study.

Descriptive Statistics

Table 7 presents the means and standard deviations of the released Graduate Record Examination-Quantitative (GRE-Q) and the Revised Test Anxiety Scale (RTA) for the total sample, men and women, and examinees possessing substantial and little background in mathematics. The released GRE-Q contained 30 dichotomously scored items. The RTA contained 20 items, each scored on a 4-point Likert scale. The mean score on the released GRE-Q for men was 2.87 points higher than the mean score for women or, stated in units of an effect size, d = 0.60 (in which d represents the difference between the means divided by the pooled


Table 7
Mean Scores of the Released GRE-Q and the Revised Test Anxiety Scale (RTA) by the Total Sample, Gender, and Mathematics Background

Group / Variable            n         M        SD
Total Sample              1263
  GRE-Q                            17.65      4.96
  RTA                              39.17      9.37
Women                      754
  GRE-Q                            16.49      4.61
  RTA                              40.28      9.59
Men                        509
  GRE-Q                            19.36      4.99
  RTA                              37.52      8.80
Substantial Math Bkd       626
  GRE-Q                            19.58      4.62
  RTA                              38.43      9.39
Little Math Bkd            637
  GRE-Q                            15.76      4.55
  RTA                              39.92      9.31


standard deviation). This finding was consistent with GRE published data (Educational Testing Service, 1993). The mean score on the released GRE-Q was 3.82 points higher for examinees possessing substantial mathematics background as compared to those possessing little mathematics background (d = .83). The mean score and standard deviation on the released GRE-Q for the 542 examinees classified as having low test anxiety were 18.97 and 4.92, respectively. The mean score and standard deviation on the released GRE-Q for the 558 examinees classified as having high test anxiety were 16.30 and 4.76, respectively. (Examinee score frequencies on the GRE-Q are presented in Table A.2 of Appendix A by gender, mathematics background, and test anxiety.) The 30 items on the released GRE-Q had an average item difficulty of 0.59. Item biserial correlations ranged from 0.09 to 0.59 with a mean of 0.39. The mean biserial correlation for women was 0.36, and the mean biserial correlation for men was 0.42. Generally, the item biserial correlations were above 0.30, although four items had lower biserial correlations (Item 2: 0.23; Item 6: 0.28; Item 10: 0.23; Item 11: 0.09). Item difficulties and biserial correlations are reported in Table A.3 of Appendix A.
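The effect sizes quoted in this section divide the mean difference by the pooled standard deviation. A small sketch of that computation, using only the group summaries from Table 7:

    import math

    def pooled_sd(sd1, n1, sd2, n2):
        """Pooled standard deviation from two group summaries."""
        return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

    def effect_size_d(mean1, sd1, n1, mean2, sd2, n2):
        return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

    # GRE-Q means and standard deviations from Table 7.
    d_gender = effect_size_d(19.36, 4.99, 509, 16.49, 4.61, 754)    # men vs. women
    d_mathbkd = effect_size_d(19.58, 4.62, 626, 15.76, 4.55, 637)   # substantial vs. little background

    print(round(d_gender, 2), round(d_mathbkd, 2))   # approximately 0.60 and 0.83, as reported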


101 The mean score on the RTA for women was 2.76 points higher than the mean score for men (d = .30). The mean score on the RTA for examinees with little mathematics background was 1.49 points higher than the mean score for examinees with substantial mathematics background (d = .16). Thus, women tended to score lower than men on the GRE-Q and tended to possess higher levels of test anxiety. Furthermore, examinees with substantial mathematics background tended to score considerably higher on the released GRE-Q than examinees with little background, although, they tended to possess only slightly less test anxiety. The intercorrelations of the released GRE-Q, RTA, and mathematics background for the total sample, and for women and men are presented in Table 8. Performance on the released GRE-Q was negatively related to test anxiety and positively related to mathematics background, but the relationship of test anxiety and mathematics background, although statistically significant, was comparatively weaker. Across all three groups, the relationship between GRE-Q performance and mathematics background was the strongest. Research Findings It was of principal concern to learn whether defining subpopulations by relevant educational or psychological


Table 8
Intercorrelations of the Released GRE-Q, RTA, and Mathematics Background for the Total Sample, Women, and Men

Subscale                      1        2        3
Total Sample (n = 1263)
  1. GRE-Q                   --
  2. RTA                    -.28      --
  3. Math Background         .39     -.08      --
Women (n = 754)
  1. GRE-Q                   --
  2. RTA                    -.24      --
  3. Math Background         .35     -.06      --
Men (n = 509)
  1. GRE-Q                   --
  2. RTA                    -.28      --
  3. Math Background         .34     -.04      --


103 variables, rather than by gender, would yield results that were more consistent in magnitude across various DIF detection methods and more consistent in decisions regarding items that were classified as ''biased." Secondarily, because theorists argue that DIF is a consequence of item multidimensionality, it was important to determine the effect of structural validation (i.e., unidimensionality) on the consistency of DIF estimation. I designed four unique contexts to examine this problem. In the first two contexts, I evaluated uniform and alternate DIF estimates prior to structural validation. In the second two contexts, I evaluated uniform and alternate DIF estimates applying the findings of the validation study. Within each context, DIF estimation results were assessed through five research questions. A multitrait multimethod (MTMM) matrix was employed to answer the first four questions. The observation of interest was the DIF index estimated for each item under a combination of subpopulation definition and DIF method. Trait effects were the three subpopulation definitions, and method effects were the five DIF procedures. DIF procedures were divided into uniform procedures and alternate procedures. A MTMM matrix was estimated for the uniform DIF procedures, and a MTMM matrix was estimated for the


104 alternate DIF procedures. The fifth question involved the comparison of percent-of-agreement rates in detecting aberrant items for each method when subpopulations were defined in succession by gender, mathematics background, and test anxiety. Following uniform and alternate DIF estimation under each trait and subpopulation definition, the dimensionality of the 30 items was evaluated through factor analysis. I attempted to create a more unidimensional test and identify potential nuisance dimensions that detracted from item validity. The goal was to devise a more unidimensional subtest and apply the five research questions incorporating the findings. Thus, the first facet of the study focused on the influence of using educationally relevant variables on the consistency of DIF estimation; and the second facet of the study evaluated the significance of structural validation on the consistency of DIF estimation. DIF Analysis Prior to Validation Uniform DIF procedures. For the three uniform DIF procedures, Mantel Haenszel (MH), Item Response Theory Signed Area (IRT-SA), and SIBTEST, indices were estimated for subpopulations defined by gender, mathematics background, and test anxiety. The item DIF indices are reported in Tables A.4, A.5, and A.6 of Appendix A. The


nine uniform indices for the 30 items were correlated and formed the MTMM matrix presented in Table 9.

Table 9
Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices

                       I. MH-D                  II. IRT-SA               III. SIBTEST-b
                   A       B       C        A       B       C        A       B       C
I. MH-D
  A. Gender        --
  B. MathBkd      .10      --
  C. TA          -.18    -.20      --
II. IRT-SA
  A. Gender       .80     .12    -.03       --
  B. MathBkd      .26     .72    -.14      .42      --
  C. TA          -.07    -.32     .59      .00    -.21      --
III. SIBTEST-b
  A. Gender       .93     .10    -.18      .77     .27    -.09       --
  B. MathBkd      .14     .94    -.30      .13     .65    -.34      .15      --
  C. TA          -.11    -.30     .91      .00    -.21     .61     -.08    -.36      --

Note: Because the sign of the MH-D is diametrically reversed from SIBTEST-b and IRT-SA, the positive and negative signs for values of MH-D are reversed when used with the other two methods.
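The comparisons of these convergent coefficients reported below rely on Steiger's (1980) modified z*, introduced in Chapter 3. The sketch below implements its general form; the covariance term must be estimated from the full matrix using Steiger's asymptotic expressions, so the value passed in here is an arbitrary placeholder rather than one computed from Table 9.

    import math

    def fisher_z(r):
        """Fisher r-to-z transformation."""
        return 0.5 * math.log((1 + r) / (1 - r))

    def steiger_z(r_jk, r_hm, s_jk_hm, n):
        """Modified z* for two dependent correlations that share no variable.

        r_jk, r_hm : the two correlations being compared (e.g., convergent coefficients)
        s_jk_hm    : estimated covariance of the two Fisher-transformed correlations
        n          : number of observations behind each correlation (here, 30 items)
        """
        return (fisher_z(r_jk) - fisher_z(r_hm)) * math.sqrt((n - 3) / (2 - 2 * s_jk_hm))

    # Example: comparing the MH/IRT-SA convergent coefficient for mathematics background
    # (.72) with the one for gender (.80); the covariance value 0.10 is a placeholder.
    print(round(steiger_z(0.72, 0.80, 0.10, 30), 2))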


As I noted earlier, DIF theorists have speculated that conceptualizing subpopulations by relevant educational or psychological variables would enhance the consistency of procedures. For this reason, I hypothesized that the procedures would demonstrate greater consistency when subpopulations were defined by mathematics background and test anxiety than by gender. The first research question was whether there would be stronger relationships evidenced by convergent validity coefficients based on mathematics background and test anxiety than by the corresponding convergent validity coefficients based on gender. Convergent coefficients for MH and IRT-SA methods were 0.80, 0.72, and 0.59 with subpopulations defined by gender, mathematics background, and test anxiety, respectively. Convergent coefficients for MH and SIBTEST methods were 0.93, 0.94, and 0.91 with subpopulations defined by gender, mathematics background, and test anxiety, respectively. Convergent coefficients for the IRT-SA and SIBTEST methods were 0.77, 0.65, and 0.61 with subpopulations defined by gender, mathematics background, and test anxiety, respectively. Steiger's modified z* was used to test, within each pairwise method combination, whether the coefficients for mathematics background and test anxiety were higher than the corresponding


107 coefficients for gender. None of the tests was significant. The test statistics are reported in Table A.7 of Appendix A. Therefore, within each of the three possible pairwise combinations of DIF estimation methods, defining subpopulations by mathematics background or test anxiety as compared to gender failed to produce more consistent DIF index estimation. The second research question required that each convergent coefficient be higher than the four comparison coefficients of the corresponding triangular submatrices. All nine convergent coefficients were higher than their four comparison heterotrait-monomethod coefficients. For example, the convergent coefficient using MH and IRT-SA methods for the trait mathematics background was 0.72. The comparison heterotrait-monomethod coefficients were 0.10, -0.20, 0.42, and -0.21. Finding all nine convergent coefficients to be higher than the comparison coefficients indicated that uniform DIF indices exhibited minimal variance related to methods of DIF estimation. The third research question required each convergent coefficient to be higher than the other four coefficients in the same row and column of the square submatrix. All nine convergent coefficients were higher than the other four coefficients in the same row and column of the square submatrix. For example, the convergent coefficient using


108 IRT-SA and SIBTEST methods for the trait test anxiety was 0.61. The four comparison heterotrait-heteromethod coefficients were -0.09, -0.34, 0.00, and -0.21. Finding all nine convergent coefficients to be higher than the comparison coefficients provided strong evidence of agreement on particular traits. The fourth research question required the pattern between the three traits to be similar for the same and different methods. This question was answered by analyzing the rank order of the correlations in the heterotrait submatrix triangles. The heterotrait coefficients of gender and mathematics background ranked highest in all submatrix triangles (M = 0.19); the heterotrait coefficients of gender and test anxiety ranked second highest in all submatrices (M = -0.08); and the heterotrait coefficients of mathematics background and test anxiety ranked lowest in all submatrices (M = -0.26). The low values of the heterotrait coefficients indicated minimal method variance. Utilizing the three subpopulation definitions and a MTMM matrix for analysis, the uniform DIF indices demonstrated good consistency and were minimally influenced by method variance. The full MH-SIBTEST submatrix illustrated the consistency of estimation and the limited method influence. Examining this submatrix


109 indicated that all three convergent coefficients were high (0.93, 0.94, 0.91), and the heterotrait coefficients were low ranging from -0.30 to 0.14. Campbell and Fiske (1959) commented that often when researchers observe high convergent coefficients, the heterotrait coefficients are correspondingly large. They posited that such a finding indicates convergent validity coefficients inflated by method variance. In the MH-SIBTEST submatrix, however, the heterotrait coefficients were generally low and the high convergence coefficients indicated true agreement on defined traits. Although the other convergent coefficients are not as high as the ones estimated for the MH-SIBTEST submatrix, they are relatively high and indicate good consistency and little method variance. The analysis of the MTMM matrix of uniform DIF indices with the released GRE-Q supported the use of the methods with the subpopulation definitions. The final research question for uniform methods was designed to assess the consistency of DIF significant tests between methods when subpopulations are defined respectively by gender, mathematics background, and test anxiety. Inferential statistics for the 30 items are reported in Tables A.8, A.9, and A.10 of Appendix A. With subpopulations defined by gender, MH chi-square identified 5 aberrant items, IRT-SA z-statistic identified 5 aberrant


items, and SIBTEST z-statistic identified 6 aberrant items. With subpopulations defined by mathematics background, MH chi-square identified 8 aberrant items, IRT-SA z-statistic identified 8 aberrant items, and SIBTEST z-statistic identified 6 aberrant items. With subpopulations defined by test anxiety, each uniform procedure identified 2 items as aberrant. The item classification percent-of-agreement rates are presented in Table 10.

Table 10
Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 30-Item GRE-Q

Procedure Combination        Gender    Mathematics Background     TA
I. Uniform Methods
  MH / IRT-SA                 80.0             73.3              96.6
  MH / SIBTEST                96.7             86.7             100
  IRT-SA / SIBTEST            83.3             73.3              96.6
II. Alternate Methods
  MH / IRT-UA                 83.3             70.0              93.3
  MH / Log Reg                76.7             70.0              90.0
  IRT-UA / Log Reg            76.7             56.7              96.7


The highest percent-of-agreement rate occurred when subpopulations were defined by test anxiety; the next highest rate occurred when subpopulations were defined by gender; and the lowest rate occurred when subpopulations were defined by mathematics background. Not coincidentally, the rank ordering of percent-of-agreement rates was the reverse of the rank ordering of aberrant items detected by subpopulation definitions. The interpretation of percent-of-agreement rates in this context is confounded by the number of items detected within each subpopulation definition. Consequently, it is impossible to disentangle the findings and conclude that greater detection consistency occurred with subpopulations defined by test anxiety. An unanticipated finding of the study was that, when subpopulations were defined by mathematics background, all methods identified a larger number of differentially functioning items. This unanticipated finding will be addressed later in the chapter.

Alternate DIF procedures. For the three alternate DIF methods, MH, Item Response Theory-Unsigned Area (IRT-UA), and logistic regression, indices were estimated for subgroups defined by gender, mathematics background, and test anxiety. The item DIF indices are reported in Tables A.4, A.5, and A.6 of Appendix A. The nine alternate


indices for the 30 items were correlated and formed the MTMM matrix presented in Table 11.

Table 11
Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

                       I. MH-D                  II. IRT-UA               III. Log Reg
                   A       B       C        A       B       C        A       B       C
I. MH-D
  A. Gender        --
  B. MathBkd      .03      --
  C. TA          -.17     .11      --
II. IRT-UA
  A. Gender       .55     .11    -.05       --
  B. MathBkd     -.05     .49     .12      .34      --
  C. TA          -.22    -.05     .32     -.01     .05      --
III. Log Reg
  A. Gender       .20     .10     .21      .44     .21     .64       --
  B. MathBkd     -.04     .21     .20      .13     .41     .69      .86      --
  C. TA          -.17    -.09     .18     -.09     .05     .74      .79     .84      --

Note. The absolute value of the MH-D index for each of the 30 items was used in estimating the correlation coefficients for this table.
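Swaminathan and Rogers's (1990) logistic regression approach, one of the alternate procedures correlated in Table 11, models the probability of a correct response from the total score, group membership, and their interaction; the group term carries uniform DIF and the interaction term carries nonuniform DIF. The sketch below fits that model to simulated responses (not the study's data) and forms a model-comparison chi-square of the kind used as the logistic regression DIF index.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 1000
    group = rng.integers(0, 2, n)                                    # 0 = reference, 1 = focal
    total = np.clip(np.round(15 + 5 * rng.normal(size=n)), 0, 30)    # matching total score

    # Simulate one item showing uniform DIF against the focal group.
    logit = -1.0 + 0.25 * total - 0.6 * group
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    def loglik(design):
        return sm.Logit(y, sm.add_constant(design)).fit(disp=0).llf

    ll_score = loglik(np.column_stack([total]))                        # conditioning score only
    ll_full = loglik(np.column_stack([total, group, total * group]))   # + group and interaction

    chi2_dif = 2 * (ll_full - ll_score)   # 2-df test of the uniform and nonuniform DIF terms together
    print(round(chi2_dif, 2))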


113 The first set of hypotheses for alternate procedures was that there would be a stronger relationship for convergent validity coefficients based on mathematics background and test anxiety than convergent validity coefficients based on gender. Convergent validity coefficients for MH and IRT-UA methods were 0.55, 0.49, and 0.32 withsubpopulations defined by gender, mathematics background, and test anxiety, respectively. Convergent validity coefficients for MH and logistic regression methods were 0.20, 0.21, and 0.18 with subpopulations defined by gender, mathematics background, and test anxiety, respectively. Convergent coefficients for IRT-UA and logistic regression methods were 0.44, 0.41, and 0.74 with subpopulations defined by gender, mathematics background, and test anxiety, respectively. Steiger's modified z* was used to test, within each pairwise method combination, whether the coefficients for mathematics background and test anxiety were higher than the corresponding coefficient for gender. None of the tests was significant. The test statistics are reported in Table A.7 of Appendix A. Thus, for each of the three possible pairwise combinations of the DIF estimation methods, defining subpopulations by mathematics background or test anxiety failed to improve the consistency of the alternate DIF indices.


114 The second research question required that each convergent coefficient be higher than the four comparison coefficients of the corresponding triangular submatrices. The coefficient pattern of the MTMM matrix did not meet this requirement. Convergent validity coefficients for all three traits measured by MH and logistic regression and by IRT-UA and logistic regression were problematic. For example, the coefficient estimated for the trait mathematics background using MH and logistic regression was 0.41 and was less than the heterotrait-monomethod coefficients of 0.86 and 0.84. The heterotrait-monomethod coefficients for logistic regression of 0.86, 0.79, and 0.84 were higher than any of the comparable convergent coefficients. The high heterotrait-monomethod coefficients indicated considerable variance related to logistic regression. However, this finding will change dramatically with the inclusion of the validation study. The third research question required that each convergent validity coefficient be higher than the other four coefficients in the same row and column of the square submatrix. Using this criteria, only 5 of the 9 convergent coefficients were acceptable. This inadequate pattern of convergent coefficients indicated a lack of convergence of DIF indices related to specific subpopulation definitions.


115 The fourth research question required the pattern between the three traits to be similar for the same and different methods. The question was answered by analyzing the rank order of the correlations in the heterotrait submatrix triangles. Assessing the rank order of the correlations produced no discernible pattern. For example, discriminant coefficients related to the traits gender and test anxiety ranged in value from -0.22 to 0.79. The most stable discriminant coefficients were related to the traits gender and mathematics background, but their coefficients demonstrated an extreme range from -0.05 to 0.86. Consequently, the MTMM matrix for alternate DIF indices did not meet the requirements of the fourth research question. The alternate MTMM matrix lacked the clear trait convergence that was apparent in the MTMM matrix of uniform DIF indices. The low convergent coefficients indicated poor consistency of DIF estimation. Variance related to methods was problematic particularly with logistic regression. A lack of agreement upon specific traits further indicated poor consistency within subpopulation definitions. The general pattern of the coefficients in the MTMM matrix for the alternate methods raised serious questions concerning their application when


subpopulations are defined by gender, mathematics background, or test anxiety. 116 The final research question for the alternate methods was designed to assess the consistency of DIF significant tests between methods when subpopulations are defined by gender, mathematics background, and test anxiety. Inferential statistics for the 30 items are presented in Tables A.8, A.9, and A.10 of Appendix A. When subpopulations were defined by gender, MH chi-square identified five aberrant items, IRT-UA z-statistic identified five aberrant items, and logistic regression chi-square identified four items. When subpopulations were defined by mathematics background, MH chi-square identified 8 aberrant items, IRT-UA z-statistic identified 11 aberrant items, and logistic regression chi-square identified 12 aberrant items. When subpopulations were defined by test anxiety, MH chi-square procedure identified two aberrant items, IRT-UA z-statistic identified no aberrant items, and logistic regression chi square identified one aberrant item. The item classification percent-of-agreement rates for the alternate methods are reported in Table 10. The highest percent-of-agreement rate occurred when subpopulations were defined by test anxiety; the next highest rate occurred when subpopulations were defined by


117 gender; the lowest rate occurred when subpopulations were defined by mathematics background. The rank ordering was the reverse as the number of aberrant items detected by subpopulation definition. This phenomenon was observed when percent-of-agreement rates were assessed for the uniform methods. Once again, analyzing the percent-of agreement rates was confounded with the number of aberrant items detected within each subpopulation definition. For this reason, no accurate conclusions can be made concerning the differences of DIF estimation under different subpopulation definitions. Test Validation and Dimensionality Prior to conducting test validation, the total sample was randomly divided into two subsamples. The first subsample consisted of 669 examinees, and the second subsample consisted of 594 examinees. The first subsample provided information to explore the dimensionality of the test and define a valid subtest. The second subsample supplied information to cross-validate interpretations derived from the first subsample. The exploratory study. I hoped to define through the exploratory study a subset of items that assessed the intended-to-be-measured dimension of the GRE-Q and potential nuisance determinants that hindered item validity. Theorists have posited that DIF is a


118 consequence of one or more nuisance determinants interacting with the intended-to-be-measured dimension differentially for a defined subpopulation (Ackerman, 1992; Camilli, 1992; Shealy & Stout, 1993b). To initially study the dimensionality of the 30-item test, I factor analyzed the matrix of item tetrachoric coefficients. To assess the fit of various models and potential nuisance determinants, I analyzed goodness-of-fit indices, item standardized estimates, and item residuals. A unidimensional model for the 30-item test indicated limited model fit (Bentler-Bonett GFI = 0.81, Tucker-Lewis GFI = 0.82). The unidimensional standardized estimates are reported in Table A.11 of Appendix A. Four problematic items (2, 6, 10, and 11) exhibited small loadings and large residuals. The tetrachoric coefficients of the four items to the remaining items were extremely small and, at times, negative. The low and negative coefficients of the problematic items impeded model fit and indicated that examinees responded to these items differently than to the remaining items. Furthermore, the interrelationships of the four items were low or negative indicating no single nuisance determinant. Table 12 presents the interrelationships of the four problematic items plus the unidimensional standardized estimates. Combinations of the four items were tried as


nuisance determinants, but none contained adequate model fit. For this reason, each item detracted independently from unidimensionality and was a self-contained source of unique nuisance variation.

Table 12
Tetrachoric Correlations and Standardized Estimates for the Four Problematic Items: Exploratory Sample

             Item 2     Item 6     Item 10     Item 11
Item 2       (.15)       .126       -.068       -.062
Item 6                   (.24)       .115        .170
Item 10                              (.23)       .120
Item 11                                          (.16)

Note. Numbers in the diagonal of the matrix are the unidimensional standardized estimates.

I factor analyzed the remaining 26 items with a unidimensional model and assessed the goodness-of-fit indices, standardized estimates, and residuals to determine potential nuisance dimensions. The unidimensional model indicated improved fit over the total test (Bentler-Bonett GFI = 0.87, Tucker-Lewis GFI = 0.87). Standardized estimates for all items were greater than 0.30 and are reported in Table A.12 of Appendix A. Based upon item content, estimates, and residuals, various


120 models with hypothesized nuisance determinants were factor analyzed, but none indicated better fit than the unidimensional model. Consequently, the most interpretable solution was that the GRE-Q contained 26 unidimensional valid items and 4 problematic items each being an independent source of nuisance variation. Cross-validation. To examine the stability of the finding that the 4 items were problematic and the remaining 26 items formed a unidimensional valid test, a cross-validation was done using the second subsample. In the cross-validation, the same factor pattern was tested with the second subsample. The general item and factor patterns suggested by the exploratory study were specified, but item loadings and residuals were estimated without constraints. Initially, employing all 30 items, a unidimensional model was generated. Fit for this model was comparable to that for the exploratory sample (Bentler-Bonett GFI = .80, Tucker-Lewis GFI = .82). The standardized estimates are reported in Table A.11 of Appendix A. Although some item estimates changed in cross-validation, the estimates for three of the four problematic items remained low(< 0.30). Notwithstanding, the standardized estimate of Item 6 was 0.32. The four problematic items from the exploratory study were deleted, and a unidimensional confirmatory factor


analysis was generated for the 26 items with the cross-validation sample. Model fit improved compared to the full test, but degenerated slightly in comparison to the exploratory study (Bentler-Bonett GFI = 0.84, Tucker-Lewis GFI = 0.85). The standardized estimates are reported in Table A.12 of Appendix A. The estimates ranged from 0.33 to 0.65. The general findings of the cross-validation confirmed that items 2, 10, and 11 were problematic and impeded test interpretation. The remaining items had substantial loadings on the dominant factor. Item 6 was borderline problematic and of questionable validity.

The problematic items. Test items 2, 6, 10, and 11 are presented in Figure 1. To better understand these items, Figures 2, 3, 4, 5, 6, and 7 present plotted logistic regression curves (LRCs). The total score for the 26 valid items was plotted on the horizontal axis, and the probability of a correct response was plotted on the vertical axis. LRCs were used instead of IRT item characteristic curves (ICCs) because they are less restrictive than the two- or three-parameter ICCs. ICCs assume the lower asymptotes are zero or equal for both subpopulations, the upper asymptote is one, and the function of the ICC is monotonically increasing.
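The LRCs in Figures 2 through 7 are, in effect, separate logistic regressions of the item response on the valid-subtest total score fitted within each group and then plotted over the score range. The sketch below reproduces that idea with simulated data; the group labels, coefficients, and sample sizes are all invented for illustration.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 600
    group = rng.integers(0, 2, n)                  # 0 = women, 1 = men (plot labels only)
    score = rng.integers(0, 27, n).astype(float)   # total score on the 26 valid items

    # Simulated item responses in which the item is relatively harder for group 0.
    p_true = 1 / (1 + np.exp(-(-4 + 0.18 * score + 0.08 * score * group)))
    y = (rng.random(n) < p_true).astype(int)

    xs = np.arange(0, 27, dtype=float)
    for g, label in [(0, "Women"), (1, "Men")]:
        mask = group == g
        fit = sm.Logit(y[mask], sm.add_constant(score[mask])).fit(disp=0)
        plt.plot(xs, fit.predict(sm.add_constant(xs)), label=label)

    plt.xlabel("Score on the 26 valid items")
    plt.ylabel("Estimated probability of a correct response")
    plt.legend()
    plt.show()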


Directions: Each of the questions consists of two quantities, one in Column A and one in Column B. You are to compare the two quantities and choose A if the quantity in Column A is greater; B if the quantity in Column B is greater; C if the two quantities are equal; D if the relationship cannot be determined from the information given.

Item 2.   Column A: 100.010 - 0.009        Column B: 100.000 + 0.002

Item 6.   p is the probability that a certain event will occur, and p(1-p) is not equal to 0.
          Column A: p                      Column B: p^2

Item 10.  The average (arithmetic mean) of 2 positive integers is equal to 31 and each of the integers is greater than 26.
          Column A: The greater of the two integers        Column B: 36

Item 11.  [This item referred to an accompanying geometric figure that cannot be reproduced from the scanned page; the compared quantities were x (Column A) and y (Column B), and the figure included the value 15.]

Figure 1: The Four Problematic Test Questions.


Figure 2: LRCs for Women and Men on Item 2.
Figure 3: LRCs for Women and Men on Item 10.
Figure 4: LRCs for Women and Men on Item 6.
Figure 5: LRCs for Examinees with Substantial and Little Mathematics Background on Item 6.
Figure 6: LRCs for Women and Men on Item 11.
Figure 7: LRCs for Examinees with Substantial and Little Mathematics Background on Item 11.

(Figures 2 through 7 plotted the estimated probability of a correct response against the score on the 26 valid items; only the captions are recoverable from the scanned pages.)


126 Figures 2 and 3 present the LRCs for Item 2 and Item 10 by gender. Neither item was identified as containing significant levels of DIF, but both items had low discrimination as indicated by their low biserial correlations and plotted LRCs. Examinees of all abilities found Item 2 relatively easy (p = 0.85). Despite the difficulty of Item 10 being ideal (p = 0.57), as Figure 3 illustrates, examinees at the lowest ability level had a 35 percent probability of answering it correctly and examinees at the highest ability levels had less than a 70 percent probability of answering it correctly. Thus, Item 10 discriminated little between high ability and low ability examinees. Figures 4 and 5 present the LRCs for Item 6 by gender and mathematics background. Conventional DIF analysis by gender and mathematics background found large and significant effects. The two figures are nearly identical and show little DIF for average to below average examinees, but, as ability increases, men and examinees with substantial mathematics background have higher likelihoods of correctly answering the item. The item is difficult (p = 0.24), however, it has questionable structural validity, is differentially functioning, and exemplifies gender DIF that can be explained by


127 mathematics background. It should be noted, however, that in interpreting Figure 5 few examinees with substantial mathematics background scored 10 or less points on the GRE-Q. Consequently, the LRC for examinees with substantial mathematics background at low ability levels is estimated with extremely sparse data (see Table A.2 of Appendix A). From an educational perspective, students who have experience working with probabilities and performing mathematical operations on numbers between zero and one should be able to solve the item. Thus, specific curricular experiences should facilitate answering it. Men being more likely to enroll in mathematics and statistics courses possessed greater likelihoods, at above average ability levels, of correctly answering the item. Figures 6 and 7 present the LRCs for Item 11 by gender and mathematics background. Item 11 had the highest difficulty and the lowest discrimination of all test items. High ability examinees were as unlikely to correctly answer it as low ability examinees. As Figure 7 illustrates, examinees of low ability with little mathematics background had slightly higher probabilities of answering it correctly than examinees of high ability with little mathematics background. Educationally, this


item required knowledge of a geometric concept usually taught but not emphasized in high school geometry. Nearly all examinees failed to recall this concept and were forced to guess. Due to its content and statistical properties, Item 11 is a poor item. It does not correlate with other items on the test, and DIF methods fail to identify it as aberrant. Items so obscure that nearly all examinees are guessing lack any definable dimensionality and are problematic for conventional DIF analyses.

DIF Analysis Incorporating Structural Validation

In the follow-up DIF analysis, it was of principal interest to determine the influence of the structural validation study on DIF procedures and the changes that might have occurred in the MTMM matrix and item detection. The five DIF indices were generated for the 26 unidimensional items with subpopulations defined by gender, mathematics background, and test anxiety. Utilizing the 26 items as the valid criterion, DIF indices were also estimated using MH, SIBTEST, and logistic regression for the 4 problematic items. DIF indices are reported in Tables A.13, A.14, and A.15 of Appendix A. The estimated DIF indices for Item 11 using logistic regression were 110.45, 228.50, and 242.67 by gender, mathematics background, and test anxiety, respectively. The other methods did not


produce such exceedingly high estimates for Item 11. (I will comment later in this chapter on logistic regression's unique values for Item 11.) Because of the exceedingly high DIF index values with logistic regression for Item 11, in assimilating the findings of validation and assessing the consistency of DIF procedures, only the 26 unidimensional items were evaluated.

Uniform DIF procedures. Using the 26 items as the unit of analysis, a MTMM matrix of correlation coefficients was generated for the uniform DIF methods of MH, IRT-SA, and SIBTEST by the traits gender, mathematics background, and test anxiety. The MTMM matrix is presented in Table 13. The pattern of coefficients in the 26-item MTMM matrix for uniform DIF indices was similar to that of the matrix for the entire test, and the answers to the specific research questions were generally the same. The first research question asked whether there would be a stronger relationship for convergent validity coefficients based on the subpopulation definitions of mathematics background and test anxiety than for corresponding validity coefficients based on the subpopulation definition of gender. Convergent validity coefficients based on gender were equivalent to or slightly higher than convergent validity


130 Table 13 Multitrait-Multimethod Correlation Matrix of the Valid 26 Test Items: Uniform DIF Indices I.MH-D A.Gender B.MathBkd -.03 C.TA -.04 II.IRT-SA A.Gender .82 B.MathBkd .19 C.TA .13 III.SIBTEST-b A.Gender .90 B.MathBkd -.05 -.03 -.06 .08 .70 .13 -.27 .53 .00 -.08 .95 -.12 C.TA .01 -.10 .90 IRT-SA SIBTEST-b 2 6 .11 -.19 .72 2 0 .02 -.06 .74 -.31 -.01 .08 -.14 .47 .00 -.15 Note. Because the sign of the MH-D is reversed from the SIBTEST-b and the IRT-SA, the positive or negative signs for values of the MH-D are reversed when used with the other two methods.


coefficients based on mathematics background and test anxiety. Consequently, following validation with subpopulations defined by mathematics background and test anxiety as compared to gender, greater consistency of DIF estimation was not achieved. The convergent coefficients for the 26-item test were slightly lower than the convergent coefficients for the entire test. The lower coefficients were most evident for the subpopulation group based on test anxiety. To a large degree, the lower observed coefficients were attributed to a reduction of variance in DIF indices when using the 26-item test. The 26-item MTMM matrix for the uniform DIF indices successfully met the Campbell and Fiske (1959) guidelines for evaluating a MTMM matrix. All nine convergent coefficients were higher than their comparable heterotrait-monomethod coefficients. All nine convergent coefficients were also higher than their comparable heterotrait-heteromethod coefficients. Like the MTMM matrix of uniform DIF indices based on the entire test, the 26-item MTMM matrix of uniform DIF indices indicated good convergence on the defined traits and minimal method variance, and it supported the use of DIF indices in the context of the subpopulation definitions.
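For readers who wish to reproduce this kind of analysis, the sketch below shows how a MTMM matrix of DIF indices can be assembled and how the two Campbell and Fiske comparisons can be checked for a single convergent coefficient. It is illustrative only: random numbers stand in for the actual indices in Tables A.13 through A.15, and the method and trait labels are simply those used in Table 13.

    # Illustrative only: assemble a 9 x 9 MTMM correlation matrix from item-level
    # DIF indices (three methods x three traits) and apply the two Campbell and
    # Fiske comparisons to one convergent coefficient.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    cols = pd.MultiIndex.from_product([("MH", "IRT-SA", "SIBTEST"),
                                       ("Gender", "MathBkd", "TA")])
    dif = pd.DataFrame(rng.normal(size=(26, 9)), columns=cols)   # 26 valid items
    mtmm = dif.corr()

    # Convergent coefficient: same trait (Gender), different methods (MH, IRT-SA)
    convergent = mtmm.loc[("MH", "Gender"), ("IRT-SA", "Gender")]
    # Heterotrait-monomethod: same method (MH), different traits
    het_mono = [abs(mtmm.loc[("MH", "Gender"), ("MH", t)]) for t in ("MathBkd", "TA")]
    # Heterotrait-heteromethod: different method and different trait
    het_het = [abs(mtmm.loc[("MH", "Gender"), ("IRT-SA", t)]) for t in ("MathBkd", "TA")]
    meets_guidelines = convergent > max(het_mono) and convergent > max(het_het)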


The final research question was designed to assess the consistency of DIF significance tests. The inferential statistics are reported in Tables A.16, A.17, and A.18 of Appendix A. With subpopulations defined by gender, the MH chi-square identified four aberrant items, the IRT-SA z-statistic identified three aberrant items, and the SIBTEST z-statistic identified five aberrant items. With subpopulations defined by mathematics background, the MH chi-square identified seven aberrant items, the IRT-SA z-statistic identified eight aberrant items, and the SIBTEST z-statistic identified nine aberrant items. With subpopulations defined by test anxiety, the MH chi-square and the SIBTEST z-statistic both identified two items as aberrant, and the IRT-SA z-statistic identified one item as aberrant. Table 14 presents the item classification percent-of-agreement rates. After assimilating the structural validation findings, the rates were approximately the same as the findings before validation.

Alternate DIF procedures. Utilizing the 26 items as the unit of analysis, a MTMM matrix of correlation coefficients was generated for the alternate DIF methods of MH, IRT-UA, and logistic regression by the traits gender, mathematics background, and test anxiety. The MTMM matrix is presented in Table 15.


Table 14

Percent-of-Agreement Rates of Inferential Tests by Gender, Mathematics Background, and Test Anxiety Between DIF Methods: 26-Item Valid Test

Procedure Combination        Gender    Mathematics Background    TA
I. Uniform Methods
  MH / IRT-SA                 80.8              80.8             96.2
  MH / SIBTEST                96.2              84.6            100
  IRT-SA / SIBTEST            84.6              80.8             96.2
II. Alternate Methods
  MH / IRT-UA                 80.8              84.6             96.2
  MH / Log Reg                76.9              69.2             84.6
  IRT-UA / Log Reg            96.2              76.9             88.5

Several significant differences were observed when the MTMM matrix utilizing the validation study was compared to the MTMM matrix estimated before structural validation presented in Table 11. The convergent coefficients for subpopulations defined by gender remained approximately equivalent to or greater than the convergent validity coefficients for subpopulations defined by mathematics background and test anxiety. Thus, the consistency of DIF estimation with the alternate


134 Table 15 Multitrait-Multimethod Correlation Matrix of the Valid 26 Test Items: Alternate DIF Indices I.MH-D A.Gender B.MathBkd C.TA II.IRT-UA A.Gender B.MathBkd C.TA III.LogReg A.Gender IRT-UA LogReg -.14 -.31 -.04 .53 -.15 -.16 -.01 .52 -.32 .00 -.31 -.02 .09 -.25 -.20 .52 -.22 -.19 .97 .03 -.21 B.MathBkd -.10 .51 -.28 .00 .95 -.14 -.03 C.TA -.25 .00 .03 -.04 -.05 .75 .01 .03 Note. The absolute value of the MH-D index for each of the 26 items was used in estimating the correlation coefficients.


procedures did not improve when subpopulations were defined by mathematics background or test anxiety. Although the answer to this question remained the same following structural validation, the answers to the other questions changed dramatically. Eight of the nine convergent coefficients were higher than the comparable heterotrait-monomethod coefficients. Convergent coefficients higher than heterotrait-monomethod coefficients indicated an agreement upon traits and minimal method variance. The elimination of method variance attributed to logistic regression was a significant change following validation. The heterotrait-monomethod coefficients for logistic regression prior to validation were 0.86, 0.79, and 0.84. These high coefficients were reduced to near-zero levels (-0.03, 0.01, and 0.03) using the valid 26-item test, indicating no method variance. Thus, the method variance that was observed with the total test disappeared when the validation study was incorporated into the DIF analysis. This significant finding is coupled with the extremely high convergent coefficients for IRT-UA and logistic regression (0.97, 0.95, and 0.75), indicating a high level of consistency between the two methods.


The reasons for the dramatic changes in the convergent coefficients were related primarily to the elimination of Item 11 from the test and an increase in the variance of DIF indices following structural validation. The elimination of Item 11 was significant because the logistic regression procedure consistently found it to be problematic while the other methods were incapable of detecting high levels of DIF under its unique conditions. Item 11 was the most difficult and least discriminating item on the GRE-Q (p = 0.13 and rb = 0.09). When its LRC was plotted by gender and mathematics background subpopulations (Figures 6 and 7), it contained an interaction between group membership and ability. Although the LRC for Item 11 indicated considerable nonuniform DIF, under its unique conditions, IRT-based methods were unable to identify it as problematic because of the restrictiveness of the IRT model. For example, using the 30-item test, DIF indices for Item 11 with subpopulations defined by mathematics background were 0.50 and 5.67 for IRT-UA and logistic regression, respectively. MH and SIBTEST were also inadequate because they were not designed to identify nonuniform DIF. Consequently, when DIF indices estimated through logistic regression were correlated with DIF indices estimated by the other methods, the inclusion


of Item 11, a dramatic outlier, greatly reduced their magnitudes. It is interesting to note that although the conditions of Item 11 produced problems in evaluating the magnitude of the convergent coefficients for IRT-UA and logistic regression, the conditions of Item 6 did not display a similar effect. Item 6 had item characteristics that were similar to, but not as extreme as, those of Item 11 (p = 0.24 and rb = 0.28). Its LRC also suggested an interaction between group membership and ability (Figures 4 and 5). Despite these similarities, the nonuniform DIF indices for Item 6 were highly consistent. Therefore, I must emphasize that IRT model restrictions become troublesome only under the extreme conditions of high or low item difficulty values and exceedingly low item biserial correlations. Incorporating the findings of the structural validation study and redefining the criterion measure produced greater variability among the estimated DIF indices, and this greater variance allowed for potentially higher correlation coefficients.
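The sensitivity of logistic regression to an item such as Item 11 follows from its model, which adds group membership and a group-by-score interaction to the matching score; the interaction term is what registers nonuniform DIF. A minimal sketch of that test is given below, using statsmodels as an assumed tool and placeholder variable names; it illustrates the general form of the procedure rather than the exact implementation used in this study.

    # Illustrative only: logistic regression DIF for one item. The chi-square
    # compares a model with only the matching score against one that adds group
    # membership and the group x score interaction (which carries nonuniform DIF).
    import numpy as np
    import statsmodels.api as sm

    def log_reg_dif(item, total, group):
        reduced = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
        design = sm.add_constant(np.column_stack([total, group, total * group]))
        full = sm.Logit(item, design).fit(disp=0)
        return 2 * (full.llf - reduced.llf)   # likelihood-ratio chi-square, 2 df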


Returning to the application of the Campbell and Fiske guidelines for evaluating a MTMM matrix, all nine convergent coefficients were now higher than the comparable heterotrait-heteromethod coefficients. This finding indicated strong convergence on specific traits. Generally, the matrix based upon the 26 unidimensional items met the Campbell and Fiske (1959) guidelines for evaluating a MTMM matrix. The revised matrix provided evidence of convergence on specified traits and minimal method variance.

The final question based on the 26-item valid test was designed to assess the consistency of DIF significance tests. The inferential statistics are reported in Tables A.16, A.17, and A.18 of Appendix A. With subpopulations defined by gender, the MH chi-square identified four aberrant items, the IRT-UA z-statistic identified three aberrant items, and the logistic regression chi-square identified two aberrant items. With subpopulations defined by mathematics background, the MH chi-square identified seven aberrant items, the IRT-UA z-statistic identified nine aberrant items, and the logistic regression chi-square identified nine aberrant items. With subpopulations defined by test anxiety, the MH chi-square and the logistic regression chi-square identified two aberrant items, and the IRT-UA z-statistic identified one


aberrant item. The item classification percent-of-agreement rates are reported in Table 14. The percent-of-agreement rates following structural validation tended to increase. The one exception to this trend occurred with subpopulations defined by test anxiety. This exception contrasts with the substantially higher rates observed for IRT-UA and logistic regression by gender and mathematics background; under those conditions, the percentages increased from 76.7 to 96.2 and from 56.7 to 76.9, respectively.

Two problems remained with the MTMM matrix for alternate methods despite the inclusion of structural validation. Low magnitudes of convergent coefficients based upon test anxiety raised questions about the trait's usefulness. Comparatively mediocre convergent coefficients for MH and IRT-UA and for MH and logistic regression illustrated the theoretical differences between uniform and nonuniform methods. The relatively low coefficients were an indication of the amount of nonuniform DIF present in the test data. These low coefficients demonstrated that strictly using uniform measures in this context would not be fully adequate. The findings regarding the nature of nonuniform DIF are addressed under Additional Findings below.
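For clarity, a percent-of-agreement rate such as those in Table 14 is simply the proportion of the 26 items on which two methods reach the same flag/no-flag decision. The sketch below illustrates the computation; the flagged item numbers are invented for the example.

    # Illustrative only: percent-of-agreement between two methods' decisions.
    def agreement_rate(flags_a, flags_b):
        matches = sum(a == b for a, b in zip(flags_a, flags_b))
        return 100.0 * matches / len(flags_a)

    items = range(1, 27)                                    # the 26 valid items
    method_a = [1 if i in (4, 18) else 0 for i in items]    # method A flags 2 items
    method_b = [1 if i in (4, 18, 20) else 0 for i in items] # method B flags 3 items
    print(round(agreement_rate(method_a, method_b), 1))     # 96.2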


Additional Findings

Although not part of the focus of this investigation, other findings proved interesting. They are reported and discussed here because they serve to inform DIF theory and help clarify the interpretations of the study.

Subpopulation Differences in DIF Detection

Defining subpopulations by gender, mathematics background, and test anxiety affected the number of aberrant items detected. Regardless of DIF method, more aberrant items were detected when subpopulations were defined by mathematics background than by gender. Likewise, regardless of method, more aberrant items were detected when subpopulations were defined by gender than by test anxiety. When subpopulations were defined by test anxiety, with all methods, two or fewer items were detected as differentially functioning. Because the validation study failed to define a nuisance determinant, it was impossible to fully account for why mathematics background led to a greater number of items detected. Perhaps the consistent finding that mathematics background resulted in more items detected can be attributed to multidimensionality creating the potential for DIF. As I stated in Chapter 2, theorists have suggested that if a misspecified, unidimensional


model is employed, the potential for DIF is present with any of the following four conditions (Ackerman, 1992; Camilli, 1992):

1. True ability means differ by group.
2. Nuisance ability means differ by group.
3. The ratios of the standard deviation for nuisance ability to the standard deviation for true ability differ by group.
4. The correlations of true ability and nuisance ability differ by group.

The first two conditions could easily have influenced the findings. The first condition, mean differences in true ability, was apparent in that true ability mean differences were largest with subpopulations defined by mathematics background. The second condition, nuisance ability mean differences, could relate to the subpopulation definitions and the nature of the study. On the GRE-Q, a test designed to measure quantitative aptitude, college mathematics instruction interacted with specific items. Specific instructional experiences assisted examinees in answering some items. I posited that instructional experience explained much of the observed gender DIF for Item 6. If item multidimensionality was related to instructional


experiences, dividing examinees by mathematics background would manifest vastly different means on the nuisance determinant.

The finding of more aberrant items with subpopulations defined by mathematics background raises questions concerning the traditional DIF methods. Simple unidimensional mathematical models do not appear to represent the complexity of examinee responses to specific test questions adequately. Multidimensional models that incorporate demographic characteristics along with examinee background experiences, such as opportunity-to-learn, might enable researchers to better evaluate the fairness of specific items.

With regard to test anxiety and nuisance determinants, I must acknowledge that the test had no real consequences for examinees. Consequently, it may not have provoked the intense levels of anxiety that normally are experienced by examinees taking college admissions tests. Because under these testing conditions high test anxiety examinees did not experience significantly higher levels of anxiety than low test anxiety examinees, the mean difference on the nuisance determinant associated with test anxiety was slight. For this reason, the influence of test anxiety on performance


was small, it interacted minimally with specific items, and it appeared unrelated to nuisance.

Dimensionality and DIF

Although the validation study identified 26 items that were significantly associated with the dominant dimension of the test and four items that were problematic, it failed to identify items that were multidimensional. DIF theorists have proposed models of multidimensionality (Shealy & Stout, 1993a, 1993b), and they have discussed validity sectors and theoretical causes of multidimensionality (Ackerman, 1992; Kok, 1988). However, when applied to the test data, the analysis of standardized estimates, validity sectors, and residuals to identify multidimensional items was not productive. In the context of the study, I was able to identify items that were highly associated with the dominant dimension of the test and items that contained nuisance. Several multidimensional models were hypothesized and studied, and with the 26-item test, the interfactor correlations were 0.90 or greater. This finding indicated unidimensionality.

Nonuniform DIF

All DIF procedures are designed to assess uniform DIF, but only a few procedures are capable of detecting


nonuniform DIF. Procedures assessing only uniform DIF imply that nonuniform DIF is either trivial or a statistical artifact. To study the efficacy of different methods, researchers simulate both uniform and nonuniform DIF. Swaminathan and Rogers (1990), in an attempt to demonstrate MH's inability to identify nonuniform DIF, simulated data through a two-parameter IRT model. For items with nonuniform DIF, they set the difficulty parameter to zero and varied the discrimination parameter. In effect, they simulated nonuniform DIF with ICCs that crossed symmetrically. Such interactions created the worst-case scenario, in which uniform methods have virtually no power. Although such simulations demonstrate conditions under which uniform procedures have little power, researchers must ask whether symmetrical interactions occur with actual test data. In the context of this study, Item 8 with subpopulations defined by gender exhibited a symmetrical interaction. Item 8's LRC is presented in Figure 8. Although Item 8 represented the nonuniform pattern simulated by Swaminathan and Rogers (1990), the more common nonuniform DIF found in the study had LRCs that crossed at extreme ability levels. This more typical nonuniform DIF is illustrated in Figure 9 by


Figure 8: LRCs for Women and Men on Item 8 Illustrating the Symmetrical Nonuniform DIF Condition.

Figure 9: LRCs for Women and Men on Item 20 Illustrating the More Typical Nonuniform DIF Condition.


Item 20. Under this more typical case, the uniform methods possess considerably more power to detect DIF.

To further understand nonuniform DIF with actual data, IRT-SA indices were compared to IRT-UA indices. The unsigned area between the ICCs must be equal to or greater than the absolute value of the signed area. If the unsigned area is equal to the absolute value of the signed area, the DIF between the two ICCs is completely uniform. The difference between the value of the unsigned area and the absolute value of the signed area can therefore be interpreted as a crude indicator of the degree of nonuniformity. The mean absolute value of IRT-SA for the 30 items estimated under the three conditions was 0.29. The mean value of IRT-UA for the 30 items estimated under the three conditions was 0.41. Comparing these values indicated that a substantial amount of area between the two ICCs was unaccounted for by IRT-SA. In the context of this study, nonuniform DIF did not occur as frequently as uniform DIF, and symmetrical patterns were very unusual. However, in assessing the test data, important information would be lost if only uniform measures were used.
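The signed and unsigned area indices compared above can be approximated numerically from two estimated ICCs. The sketch below uses two-parameter logistic ICCs with invented parameter values (equal difficulty, unequal discrimination, the symmetric-crossing case) and is an illustration of the idea rather than the estimation procedure used in this study.

    # Illustrative only: signed area (IRT-SA analogue), unsigned area (IRT-UA
    # analogue), and the crude nonuniformity indicator (unsigned minus |signed|)
    # for two two-parameter logistic ICCs with equal b but unequal a.
    import numpy as np

    def icc(theta, a, b):
        return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

    theta = np.linspace(-4, 4, 801)
    diff = icc(theta, a=1.0, b=0.0) - icc(theta, a=0.4, b=0.0)
    signed = np.trapz(diff, theta)              # near zero for symmetric crossing
    unsigned = np.trapz(np.abs(diff), theta)    # clearly positive
    nonuniformity = unsigned - abs(signed)      # crude degree of nonuniform DIF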


CHAPTER 5
SUMMARY AND CONCLUSIONS

The two primary purposes of this study were to compare the consistency of five differential item functioning (DIF) methods (a) when subpopulations were defined by gender, mathematics background, and test anxiety, and (b) when the findings of a structural validation study were incorporated in the analysis. The study was conducted in the context of college admission quantitative examinations and gender issues. I chose this context because of the problematic predictive validity associated with mean differences between men and women on such tests. Within this context, the findings would inform DIF researchers about the usefulness of defining subpopulations by psychological or educational variables and about the effects of structural validation on DIF estimation. The claim by Skaggs and Lissitz (1992) that defining subpopulations by psychologically or educationally relevant variables would yield more consistent DIF estimation was not substantiated; however, the findings indeed confirmed the importance of careful structural


validation as a part of DIF studies (Ackerman, 1992; Camilli, 1992). Regarding the usefulness of defining subpopulations by psychological or educational variables, the attenuated coefficients of the multitrait-multimethod (MTMM) matrices suggested equivalent or better consistency when subpopulations were defined by gender as compared to mathematics background or test anxiety. The consistency of DIF indices with subpopulations defined by gender was approximately equivalent to the consistency of DIF indices with subpopulations defined by mathematics background. When the validation findings were incorporated, both modes of defining subpopulations were consistent. I had hypothesized that defining subpopulations by psychological or educational variables would result in higher consistency of results from various DIF estimation methods. In part, the failure to find significantly higher convergent validity coefficients with subpopulations defined by mathematics background was attributed to the reliability of measurement. The measurement of an examinee's gender possessed near-perfect reliability. Although the method for dividing examinees into two groups based on their mathematics background was validated, its measurement reliability could not match the near-perfect reliability achieved in measuring gender.
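The attenuation argument can be made concrete with the classical correction formula, in which an observed correlation is divided by the square root of the product of the two measures' reliabilities. The numbers in the sketch below are hypothetical and are not estimates from this study.

    # Illustrative only: classical correction for attenuation,
    # r_true = r_observed / sqrt(rxx * ryy). Values are hypothetical.
    def disattenuate(r_observed, rel_x, rel_y):
        return r_observed / (rel_x * rel_y) ** 0.5

    print(round(disattenuate(0.70, 0.80, 1.00), 2))   # 0.78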


Consequently, the convergent coefficients related to mathematics background were attenuated, whereas the coefficients related to gender were unattenuated.

The analysis of examinee responses to Item 6 provided an exemplary case of gender DIF that could be explained by mathematics background. With subpopulations defined by gender and mathematics background, the DIF indices for Item 6 were large and significant. Comparing the logistic regression curves (LRCs) of Figures 4 and 5 indicated that the response patterns for men and for examinees with substantial mathematics backgrounds were nearly interchangeable. Similarly, comparing the LRCs for women and examinees with little mathematics backgrounds suggested that their response patterns were nearly identical. For Item 6, the observed gender DIF was attributed to differences in mathematics background. Unfortunately, only Item 6 permitted such a straightforward interpretation. More frequently, items with significant gender DIF did not possess significant levels of DIF by mathematics background. As found in similar studies, the comparison of LRCs by gender and mathematics background was confusing and contradictory (Doolittle, 1985, 1986). Although the LRCs for Item 6 corroborated the positions of Miller and Linn (1988) and


Muthen, Kao, and Burstein (1991) in the use of educational background as a mode for investigating DIF, the results were mixed and unclear. If the findings for studying DIF from the perspective of mathematics background produced conflicting interpretations, the findings for studying DIF from the perspective of test anxiety were straightforward. The consistency of DIF indices with subpopulations defined by gender was generally higher than the consistency of DIF indices with subpopulations defined by test anxiety. When structural validation findings were incorporated, each convergent coefficient by gender was higher than its corresponding coefficient by test anxiety. The comparatively low convergent coefficients by test anxiety had two probable causes. The first cause related to differences in measurement reliability for gender as opposed to test anxiety. Whereas gender was measured with near-perfect reliability, the Revised Test Anxiety Scale's estimated Cronbach's alpha was 0.89. The second and more detrimental cause was the lack of variance in DIF estimation with subpopulations defined by test anxiety. Item DIF indices for test anxiety subpopulations were extremely low, and they were rarely statistically significant. The low DIF indices suggested that, in the


context of the study, defining subpopulations by test anxiety was of limited use for studying DIF. Because no actual consequences resulted from an examinee's test performance, participants did not feel levels of test anxiety comparable to those provoked by actual college admission tests. Consequently, many individuals who under true test conditions might have experienced extremely high levels of test anxiety felt only moderate or low levels of test anxiety in the context of the study. For these reasons, the consistency of DIF estimation by test anxiety was comparatively low, and test anxiety lacked explanatory power.

The consistency of DIF significance testing, as measured by item classification percent-of-agreement rates by gender, mathematics background, and test anxiety, ranged from mediocre to perfect. For uniform DIF methods before validation, percent-of-agreement rates ranged from 73.3 to 100. For alternate DIF methods before validation, rates ranged from 56.7 to 96.7. For uniform DIF methods assimilating the findings of validation, rates ranged from 80.8 to 100. For alternate methods assimilating the findings of validation, rates ranged from 69.2 to 96.2. Under all method combinations, the rates were the highest with subpopulations defined by test anxiety. With


subpopulations defined by gender, the rates were generally the second highest. With subpopulations defined by mathematics background, the rates were generally the lowest. The rank ordering of percent-of-agreement rates was the reverse of the rank ordering of aberrant items detected by subpopulation definitions. Therefore, it was impossible to disentangle these relationships and conclude that one subpopulation definition produced more consistent detection than another. The finding that item classification percent-of-agreement rates for the study averaged in the mid-80s was better than the item classification rates observed by Hambleton and Rogers (1988). In comparing the MH to a method similar to IRT-UA, they found item classification rates near 80 percent. The finding of item classification agreement rates between 80 and 85 percent supports the position of Skaggs and Lissitz (1992) that DIF detection methods are mediocre at best. To illustrate, for a 30-item test using two methods each identifying 5 aberrant items, the worst-case scenario of no agreement on any aberrant items still results in a rate of 66.7 percent, because the two methods necessarily agree on the 20 items that neither flags (20/30). Statistical methods that improve this worst-case scenario rate to 80 or 85 percent have achieved limited improvement. Current DIF significance tests appear method


dependent. For this reason, the findings supported researchers who advocate the importance of interpreting both the DIF index and the significance test in evaluating the fairness of test items (Angoff, 1993; Burton & Burton, 1993; Dorans & Holland, 1993; Uttaro & Millsap, 1994).

The most prominent finding of the investigation was the effect of structural validation on DIF estimation. Before validation, the MTMM matrix of uniform DIF indices met the validity guidelines of Campbell and Fiske (1959). Subsequently, I found that the three uniform methods possessed high consistency and minimal variance related to methods. However, I observed that when the guidelines of Campbell and Fiske were applied to the MTMM matrix of alternate DIF indices, severe problems appeared. The three alternate methods appeared to have low consistency and high levels of variance related to methods. Factor analysis was employed to estimate item dimensionality and define a valid subset of items. In the exploratory study, 26 items were identified as having strong relationships to the dominant trait and 4 items were defined as problematic. Cross-validation generally supported this interpretation. Therefore, I deleted the 4 problematic items from the study and reevaluated DIF estimation.
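The screening step described here can be approximated with a simple principal-component check: items with weak loadings on the first component of the inter-item correlation matrix are candidates for the problematic set. The sketch below, run on synthetic 0/1 responses, is a rough stand-in for the confirmatory factor analysis actually used, and the loading cutoff is arbitrary.

    # Illustrative only: flag items with weak loadings on the first principal
    # component of the inter-item correlation matrix (a crude unidimensionality
    # screen, not the confirmatory factor analysis used in the study).
    import numpy as np

    rng = np.random.default_rng(4)
    theta = rng.normal(size=(1200, 1))
    load = rng.uniform(0.3, 0.9, size=(1, 30))              # synthetic item quality
    responses = (rng.random((1200, 30)) < 1 / (1 + np.exp(-theta * load))).astype(int)

    corr = np.corrcoef(responses, rowvar=False)              # 30 x 30 correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)                  # ascending eigenvalues
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])         # first-component loadings
    loadings = loadings * np.sign(loadings.sum())            # fix arbitrary sign
    flagged = np.where(loadings < 0.20)[0] + 1               # item numbers with weak loadings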


The MTMM matrix of uniform DIF indices for the 26-item subtest was similar to the matrix before validation. The MTMM matrix met the validity guidelines of Campbell and Fiske (1959), indicating good consistency and minimal variance related to methods. The MTMM matrix for alternate methods changed dramatically after employing the validation findings. Variance related to methods, as identified by the heterotrait-monomethod coefficients, was reduced to near zero. The convergent validity coefficients between the IRT-UA and logistic regression methods went from 0.44, 0.41, and 0.74 by gender, mathematics background, and test anxiety, respectively, to 0.97, 0.95, and 0.75 for the same subpopulation definitions. Theoretically, IRT-UA and logistic regression are highly related. Nevertheless, the findings indicated that if a researcher ignored test validation and passively utilized the full test, DIF indices using IRT-UA or logistic regression could be inconsistent and lead to poor decision making.

Regarding the dramatic improvement of the MTMM matrix for alternate methods, it should be noted that the procedures I utilized differed from those employed by Ackerman (1992). Ackerman (1992) utilized structural validation to define a subset of valid items and then


reestimated DIF indices for all items. He found generally low levels of DIF for the valid subset and concluded that DIF tended to disappear from the valid subset of items after validation. I utilized structural validation and defined a unidimensional subset of items, but reestimated DIF indices for only the valid items. Dropping the problematic items from the study resulted in more consistent nonuniform DIF indices. If I had maintained the problematic items in the analysis, the convergent coefficients between IRT-UA and logistic regression would have remained high and difficult to explain. This was a result of the erratic DIF estimates for the problematic items. For items unrelated to other items on a test, DIF indices appear to lack consistency.

The finding that defining subpopulations by mathematics background led to the detection of more aberrant items, along with the erratic DIF indices found after validation, raises questions concerning traditional DIF methods. Traditional DIF methods that are based on simple unidimensional models with subpopulations defined by demographic characteristics tend to result in inadequate model fit with actual test data. The simplistic models do not represent the complex interactions between examinees and test items. Examinee


responses to test items are influenced by differences in educational backgrounds, opportunities-to-learn, and other life experiences. Traditional DIF models that consider only one trait or background differences are likely to be inadequate representations of actual test data. Future DIF research should incorporate multidimensional models similar to those developed by Reckase (1985, 1986) and include operationalized variables similar to the educational background variables defined by Miller and Linn (1988) and Muthen et al. (1991).

The findings further informed researchers concerning the need to measure and interpret nonuniform DIF. Numerous DIF studies have functioned from the perspective that nonuniform DIF is trivial or a statistical artifact (Burton & Burton, 1993; Freedle & Kostin, 1990; Harris & Carlton, 1993; Holland & Thayer, 1987; Scheuneman & Gerritz, 1990; Schmitt, 1988; Schmitt & Dorans, 1990; Zwick & Ercikan, 1989). Nonetheless, other researchers have simulated symmetrical interactions to demonstrate the need for nonuniform methods (Swaminathan & Rogers, 1990). In this study, examining the responses of examinees to the 30-item test, nonuniform DIF did not occur as frequently as uniform DIF, and symmetrical nonuniform DIF occurred only once (Item 8 with subpopulations defined by gender).


Despite this observation, more subtle cases of nonuniform DIF occurred throughout the analysis, and ignoring these could result in poor item analysis. In analyzing items displaying nonuniform DIF, LRCs tended to intersect at either high or low levels of ability. In attempting to make valid decisions concerning the fairness of items, researchers need to evaluate possible nonuniform DIF in relationship to subpopulation ability distributions. Depending upon the differences in the distributions of the subpopulations, ignoring nonuniform DIF could result in the inclusion of items unfair to a specific group. Recent efforts by Li and Stout (1994, 1995) to extend SIBTEST to measure nonuniform DIF are significant advances in DIF theory and item validation.

Finally, a simplistic interpretation of the study's findings might be that uniform methods are easier to use, result in higher levels of consistency, do not require extensive validation, and are preferable to the nonuniform methods. The comparison of IRT-SA and IRT-UA, along with the plotting of the LRCs, suggested that significant information would be lost if only uniform methods were employed. A thorough analysis of the findings confirmed the importance of careful test validation before DIF estimation, the significance of


interpreting nonuniform DIF, and the need to interpret both the DIF index and the significance test before eliminating items. The findings also suggested that unidimensional DIF models are inadequate representations of the complex interactions between examinees and test items. This study further indicated that using mathematics background to define subpopulations and explain gender DIF has potential usefulness, although, in the context of this study, the use of test anxiety to define subpopulations and explain DIF was ineffectual.


APPENDIX A SUMMARY STATISTICAL TABLES


Table A.1 Examinee Frequencies and Percentages for Gender and Mathematics Background by Test Location Women n Pct. Men n Total n Women n Men n Pct. Total n Pct. Mathematics Background Substantial Little Total College of Education Classes (n=658) 147 22.3 74 11.2 221 33.6 346 52.6 91 13.8 437 66.4 493 74.9 165 25.1 658 100 College of Business Classes (n=483) 128 26 .5 191 39.5 319 66.0 160 76 15.7 88 18.2 164 34.0 204 42.2 279 57.8 483 100


161 Table A.1 -continued. Mathematics Background Total Substantial Little Other Test Locations (n=122) Women n. 26 31 57 Pct. 21.3 25.4 46.7 Men n. 60 5 65 Pct. 49.2 4.1 53.3 Total n. 86 36 122 Pct. 70.5 29.5 100


Table A.2 GRE-Q Score Frequency Distributions by Gender, Mathematics Background, and Test Anxiety Score Freq Percent Score Freq Percent Women (n=754) 3 0 0.0 17 62 8.2 4 0 0.0 18 42 5.6 5 3 0.4 19 61 8.1 6 8 1.1 20 48 6.4 7 7 0.9 21 45 6.0 8 14 1.9 22 34 4.5 9 20 2.7 23 22 2.9 10 26 3.4 24 19 2.5 11 23 3.1 25 10 1.3 12 45 6.0 26 13 1.7 13 53 7.0 27 5 0.7 14 54 7.2 28 2 0.3 15 64 8.5 29 1 0.1 16 72 9.5 30 1 0.1 Score Freq Percent Score Freq Percent Men (n=509) 3 1 0.2 17 42 8.3 4 1 0.2 18 28 5.5 5 0 0.0 19 36 7 .1 6 2 0.4 20 34 6.7 7 4 0.8 21 43 8.4 8 4 0.8 22 44 8.6 9 5 1.0 23 35 6.9 10 11 2.2 24 34 6.7 11 5 1.0 25 24 4.7 12 17 3.3 26 21 4.1 13 20 3.9 27 14 2.8 14 17 3.3 28 10 2.0 15 21 4.1 29 6 1.2 16 29 5.7 30 1 0.2 162


163 Table A.2 -continued. Score Freq Percent Score Freq Percent Substantial Math Background (n=626) 3 0 0.0 17 47 7.5 4 0 0.0 18 35 5.6 5 0 0.0 19 49 7.8 6 0 o.o 20 53 8.5 7 3 0.5 21 58 9.3 8 5 0.8 22 52 8.3 9 7 1.1 23 42 6.7 10 8 1.3 24 40 6.4 11 2 0.3 25 27 4.3 12 23 3.7 26 30 4.8 13 16 2.6 27 15 2.4 14 29 4.6 28 11 1.8 15 31 5.0 29 6 1.0 16 35 5.6 30 2 0.3 Score Freq Percent Score Freq Percent Little Math Background (n=637) 3 1 0.2 17 57 8.9 4 1 0.2 18 35 5.5 5 3 0.5 19 48 7.5 6 10 1.6 20 29 4.6 7 8 1.3 21 30 4.7 8 13 2.0 22 26 4.1 9 18 2.8 23 15 2.4 10 29 4.6 24 13 2.0 11 26 4.1 25 7 1.1 12 39 6.1 26 4 0.6 13 57 8.9 27 4 0.6 14 42 6.6 28 1 0.2 15 54 8.5 29 1 0.2 16 66 10.4 30 0 0.3


164 Table A.2 -continued. Score Freq Percent Score Freq Percent Low Test Anxiety (n=542) 3 0 0.0 17 36 6.6 4 0 0.0 18 28 5.2 5 0 0.0 19 41 7.6 6 2 0.4 20 48 8.9 7 5 0.9 21 42 7.7 8 4 0.7 22 42 7.7 9 5 0.9 23 26 4.8 10 9 1.7 24 31 5.7 11 13 2.4 25 23 4.2 12 22 4.1 26 24 4.4 13 19 3.5 27 12 2.2 14 28 5.2 28 10 1.8 15 25 4.6 29 5 0.9 16 40 7.4 30 2 0.4 Score Freq Percent Score Freq Percent High Test Anxiety (n=558) 3 1 0.2 17 52 9.3 4 1 0.2 18 29 5.2 5 3 0.5 19 44 7.9 6 8 1.4 20 23 4.1 7 4 0.7 21 36 6.5 8 11 2.0 22 26 4.7 9 20 3.6 23 24 4.3 10 24 4.3 24 16 2.9 11 11 2.0 25 8 1.4 12 38 6.8 26 6 1.1 13 41 7.3 27 3 0.5 14 35 6.3 28 1 0.1 15 43 7.7 29 1 0.1 16 49 8.8 30 0 0.0


165 Table A.3 Released GRE-Q Item Difficulty Values (p) and Biserial Correlations

166 Table A.3 -continued. Total Sample Women Men Item 16 .84 .59 .81 .57 .89 .57 17 .80 .52 .77 .52 .85 .51 18 .81 .43 .78 .40 .84 .46 19 .76 .44 .72 .45 .83 .40 20 .69 .39 .66 .32 .72 .so 21 .77 .37 .73 .36 .83 .33 22 .64 40 .58 .34 .72 .47 23 .34 .38 .32 .35 .37 .42 24 .63 .56 .59 .53 .70 .60 25 .85 .so .83 .46 .88 .62 26 .29 .31 .25 .21 .36 .34 27 .55 .40 .so .33 .62 .45 28 .46 .42 .41 .34 .53 .46 29 .23 .36 .19 .31 .29 .35 30 .32 .41 2 6 .33 .41 .44


167 Table A.4 DIF Indices for the 30 Items of the GRE-Q with Population Groups Defined by Gender Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 1 1.76 -.062 -0.80 0.80 0.84 2 -0.23 .020 0.22 0.28 0.23 3 -0.71 .050 1.41 1. 60 1.72 4 0.08 -.016 -0.15 0.32 0.07 5 0.3 1 -.021 -0.28 0.28 0.32 6 -1.02 .081 1.05 1.21 1.98 7 -0.03 .007 0.07 0.25 0.15 8 -0.17 .024 -0.15 0.48 0.75 9 -0.89 .088 0.15 0.30 0.37 10 0.22 -.013 -0.16 0.47 0.13 11 0.15 .002 -0.04 0.28 3.92 12 -0.22 .022 0.08 0.31 0.39 13 0.74 -.065 -0.36 0.36 0.28 14 0.51 -.043 -0.27 0.27 0.18 15 0.14 -.019 -0.21 0.31 0.07


168 Table A.4 -continued. Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 16 -0.21 .005 0.18 0.31 0.32 17 0.18 -.012 0.11 0.32 0.21 18 0.04 -.008 -0.14 0.15 0.07 19 -0.38 .031 0.38 0.55 0.47 20 0.54 -.030 -0.44 0.51 0.54 21 -0.24 .030 0.39 0.54 0.50 22 -0.50 .046 0.00 0.23 0.35 23 0.88 -.066 -0.45 0.45 0.30 24 0.25 -.019 -0.23 0.23 0.12 25 0.48 -.034 -0.36 0.37 0.39 26 -0.20 .033 0.29 0.40 0.54 27 -0.08 .014 -0.08 0.18 0.24 28 0.05 -.006 -0.14 0.26 0.23 29 -0.09 .003 -0.18 0.22 0.78 30 -0.47 040 0.16 0.22 0.17


169 Table A.5 DIF Indices for the 30 Items of the GRE-0 with Population Groups Defined by Mathematics Background Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 1 -0.32 .006 0.11 0.13 0.09 2 -0.06 .006 -0.69 0.78 1.09 3 0.27 -.012 0.60 0.97 0.65 4 -1.43 .145 0.52 0.87 0.76 5 0.13 -.009 -0.11 0.26 0.07 6 -1.42 .122 1.42 1.48 2.93 7 0.38 -.027 -0.15 0.30 0.23 8 0.71 -.050 -0.59 0.59 0.60 9 -0.26 .040 -0.06 0.19 0.40 10 -0.01 .018 -0.17 0.32 0.79 11 0.42 -.003 -0.10 0.50 5.67 12 0.86 -.076 -0.53 0.56 0.57 13 0.42 -.043 -0 40 0.41 0.42 14 0.63 -.061 -0.55 0.59 0.56 15 0.11 -.004 -0.34 0.40 0.12


170 Table A.5 -continued. Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 16 -1.10 .030 0.39 0.45 0.41 17 -0.03 -.015 0.06 0.26 0.11 18 -0.88 .038 1.49 1.79 1.67 19 0.35 -.022 0.06 0.46 0.24 20 -1.49 .140 0.43 0.57 0.75 21 0.46 -.038 0.22 0.75 0.48 22 0.29 -.029 -0.19 0.35 0.13 23 0.26 -.007 -0.30 0.31 0.41 24 -0.36 .025 -0.06 0.10 0.12 25 1.16 -.065 -0.30 0.32 0.29 26 -0.17 .033 -0.03 0.03 0.67 27 0.39 -.024 -0.33 0.35 0.19 28 0.26 -.010 -0.31 0.31 0.11 29 0.58 -.044 -0.26 0.34 0.54 30 -0.29 .039 0.16 0.53 0.78


171 Table A.6 DIF Indices for the 30 Items of the GRE-Q with Population Groups Defined by Test Anxiety Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 1 -0.20 .004 -0.08 0.20 0.26 2 0.30 -.009 0.25 0.43 0.14 3 -0.30 .015 0.06 0.27 0.11 4 0.25 -.031 -0.21 0.30 0.12 5 -0.25 .024 0.08 0.24 0.19 6 0.60 -.040 -0.38 0.45 0.42 7 0.25 -.017 0.44 0.52 0.58 8 -0.14 .039 0.07 0.31 0.24 9 -0.18 .013 -0.06 0.22 0.16 10 -0.03 .005 -0.19 0.36 1.37 11 -0.58 .038 0.67 0.74 6.20 12 -0.18 .016 -0.03 0.25 0.09 13 0.07 -.013 -0.09 0.23 0.06 14 -0.40 .025 0.10 0.20 0.14 15 0.42 -.036 -0.24 0.26 0.17


172 Table A.6 -continued. Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 16 -0.82 .022 0 .20 0.23 0.20 17 -1.16 .054 0.40 0.41 0.39 18 0.37 -.018 -0.01 0.20 0.12 19 0.08 -.008 0.16 0.27 0.21 20 0.24 -.027 -0.18 0.27 0.28 21 -0.29 .013 -0.26 0.30 0.72 22 0.36 -.030 -0.28 0.33 0.47 23 -0.01 .000 -0.23 0.34 0.05 24 -0.32 .011 -0.03 0.14 0.24 25 -0.36 .006 0.33 0.35 0.24 26 0.24 -.009 -0.01 0.31 0 .58 27 0.71 -.066 -0.37 0.41 0.36 28 0.16 -.018 -0.15 0.26 0.19 29 -0.03 -.006 0.19 0.30 0.78 30 -0.07 .003 -0.10 0 25 0.17


Table A.7 Inferential tests of convergent validity coefficients Methods and Traits I. Uniform DIF Procedures A.MH and IRT-SA Math Bkd. vs. Gender TA vs. Gender B.MH and SIBTEST Math Bkd. vs. Gender TA vs. Gender C.IRT-SA and SIBTEST Math Bkd. vs. Gender TA vs. Gender II. Alternate DIF Procedures A.M-H and IRT-UA Math Bkd. vs. Gender TA vs. Gender B.M-H and SIBTEST Math Bkd. vs. Gender TA vs. Gender C.IRT-SA and SIBTEST Math Bkd vs. Gender TA vs. Gender Steiger's z* -0.73 -1.54 0.30 -0.48 -0.92 -1.14 -0.30 -1.05 0.04 -0.07 -0.16 1. 65 .4654 .1236 .7642 .6312 .3576 .2542 .7642 .2938 .9680 .9442 .8728 .0990 173


174 Table A.8 Inferential Statistical Tests for the 30 Items of the GRE-Q with Population Groups Defined by Gender SIBTEST Item chi-sq. z 1 10.49** -3.27** 2 0.21 0.97 3 2.64 2.28* 4 0.04 -0.56 5 0.78 -0.74 6 8.16** 3.18** 7 0.00 0.27 8 0.25 0.82 9 7.86** 3.09** 10 0.44 -0.42 11 0.07 0.08 12 0.41 0.77 13 4.85* -2.31* 14 1.87 -1.67 15 0.12 -0.69 Note. p < .05 ** p < .01 IRT-SA z -1.64 0.30 2.08* -0.96 -1.38 1.52 0.17 -0.68 1.03 -0.64 -0.03 0.11 -2.85** -1.90 -1.95 IRT-UA _z_ 1.63 0.36 2.16* 1.95 1.38 1.08 0.58 1.15 1.14 1.10 0.81 1.35 4.80** 1.90 4.27** LogReg chi-sq. 2.76 0.07 10.93** 0.15 2.37 9.01* 0.28 7.31* 1.98 0.04 4.21 4.76 1.34 0.37 0.36


175 Table A.8 -continued. SIBTEST IRT-SA IRT-UA LogReg Item chi-sq. z_ z _z chi-sq. 16 0.11 0.22 0.82 1.25 4.97 17 0.13 -0.52 0.46 1.33 2.19 18 0.00 -0.37 -0.52 0.93 0.09 19 0.85 1.29 1.34 1.55 4.64 20 2.47 -1.06 -2.15* 1.33 7.54* 21 0.31 1.24 1.02 1.14 3.12 22 2.26 1.62 0.00 1.00 3.11 23 6.68* -2.45* -2.23* 2.26* 2.68 24 0.40 -0.70 -1.97* 1.97* 0.46 25 0.96 -1.61 -1.55 1.48 3.85 26 0.31 1.23 0.69 0.60 2.20 27 0.04 0.46 -0.52 0.54 2.10 28 0.01 -0.21 -0.89 0.99 2.59 29 0.03 0.12 -0.47 0.42 0.01 30 1.77 1.50 0.64 0.69 2.65 Note. p < .05 ** p < .01


Table A.9 Inferential Statistical Tests for the 30 Items of the GRE-Q with Population Groups Defined by Mathematics Background SIBTEST Item chi-sq. z_ 1 0.22 0.34 2 0.00 0.27 3 0.33 -0.53 4 21.29** 4.92** 5 0.11 -0.34 6 14.95** 4.75** 7 1.02 -1.07 8 4.85* -1. 71 9 0.55 1.35 10 0.00 0.63 11 0.77 -0.15 12 6.70** -2.61** 13 1.50 -1.52 14 2.97 -2.37* 15 0.07 -0.15 Note. p < .05 ** p < .01 IRT-SA z 0.18 -0.82 1.06 2.26* -1.38 1.44 -0.41 -2.86** -0.36 -0.62 -0.07 -3.79** -2.98** -3.92** -3.08** IRT-UA _z 0.21 0.86 1.59 1.94 4.86** 1.09 2.70** 2.66** 0.79 0.80 0.36 2.60** 2.05* 2.58** 9.13** LogReg chi-sq. 0.07 2.77 3.50 9.40** 0.03 7.68* 0.16 5.12* 6.30* 3.74 8.61* 14.99** 9.56** 18.43** 1.36 176


177 Table A.9 -continued. SIBTEST IRT-SA IRT-UA LogReg Item chi-sq. z_ z _z chi-sq. 16 4.95* 1.47 1.43 1.48 2.19 17 0.00 -0.64 0.27 1.14 0.40 18 4.61* 1. 67 2.47* 2.72** 25.31** 19 0.73 -0.95 0.23 1.73 2.02 20 20.26** 5.08** 1.52 1.98* 4.40 21 1.30 -1.55 0.58 1. 81 4.54 22 0.67 -1.07 0.99 0.96 0.43 23 0.51 -0.24 -1.20 1.65 6.13* 24 0.93 0.96 0.08 0.32 0.24 25 6.17* -3.16** -1.14 4.93** 0.26 26 0.18 1.17 -0.06 0.06 3.36 27 1.33 -0.84 -2.13* 0.76 0.43 28 0.56 -0.35 -1.94 2.64** 0.41 29 2.03 -1.58 -0.61 1.32 8.46* 30 0.61 1.35 0.48 1.39 13.79** Note. p < .05 ** p < .01


Table A.10 Inferential Statistical Tests for the 30 Items of the GRE-Q with Population Groups Defined by Test Anxiety SIBTEST IRT-SA IRT-UA LogReg Item chi-sq. z_ z _z chi-sq. 1 0.06 0.21 -0.15 0.08 0.55 2 0.38 -0.41 0.32 0.48 0.05 3 0.42 0.61 0.12 0.06 0.10 4 0.49 -1.02 -1.28 0.21 1. 86 5 0.46 0.84 0.38 0.08 0.78 6 2.32 -1.45 -0.86 0.38 2.87 7 0.36 -0.64 1.09 0.87 4.03 8 0.15 1.26 0.31 0.08 0.38 9 0.21 0.45 -0.38 0.10 0.88 10 0.00 0.17 -0.67 0.96 9.65** 11 1.37 1. 73 0.46 0.68 3.26 12 0.23 0.55 -0.16 0.20 0.11 13 0.02 -0.42 -0.71 0.21 0.13 14 1.07 0.90 0.63 0.12 0.20 15 1.28 -1.27 -2.11* 0.23 2.83 Note. p < .05 ** p < .01 178


179 Table A.lo -continued. SIBTEST IRT-SA IRT-UA LogReg Item chi-sq. z_ z _z chi-sq. 16 2.43 1.07 1.03 0.21 0.21 17 6.79** 2.40* 1.84 0.42 0.95 18 0.64 -0.75 -0.42 0.14 0.36 19 0.02 -0.32 0.23 0.37 1.38 20 0.39 -0.96 -0.88 0.19 1.90 21 0.50 0.51 -0.85 0.46 9.19 22 1.06 -1.03 -1.47 0.31 5.84 23 0.00 0.00 -1.02 0.41 0.10 24 0.70 0.39 -0.30 0.17 4.88 25 0.45 0.26 1.12 0.39 1.11 26 0.37 -0.34 -0.03 0.25 3.43 27 4.45* -2.21* -2.46* 0.37 2.64 28 0.18 -0.60 -0.95 0.15 1.60 29 o.oo -0.24 0.42 0.35 5.93 30 0.01 0.12 -0.44 0.10 0.69 Note. p < .05 ** p < .01


Table A.11 Unidimensional Standardized Estimates for the 30-item Test: Exploratory and Cross-Validation Samples Item Item 1 0.45 (0.42) 16 0.66 (0.62) 2 0.15 (0.28) 17 0.61 (0.51) 3 0.36 (0.36) 18 0.49 (0.50) 4 0.44 (0.46) 19 0.52 (0.44) 5 0.46 (0.37) 20 0.45 (0.42) 6 0.24 (0.32) 21 0.40 (0.42) 7 0.35 (0.37) 22 0.49 (0.39) 8 0.34 (0.32) 23 0.51 (0.35) 9 0.48 (0.52) 24 0.65 (0.60) 10 0.23 (0.27) 25 0.58 (0.53) 11 0.16 (0.01) 26 0.31 (0.37) 12 a.so (0.41) 27 0.49 (0.40) 13 0.49 (0.52) 28 0.45 (0.51) 14 0.57 (0.55) 29 0.39 (0.43) 15 0.64 (0.64) 30 0.48 (0.43) Note. Estimates from the cross-validation sample are in parentheses. 180


Table A.12 Unidimensional Standardized Estimates for the Amended 26-Item Test: Exploratory and Cross-Validation Samples Item Item 1 0.52 (0.40) 18 0.48 (0.51) 3 0.36 (0.37) 19 0.51 (0.45) 4 0.46 (0.44) 20 0.47 (0.42) 5 0.44 (0.36) 21 0.38 (0.43) 7 0.33 (0.37) 22 0.49 (0.37) 8 0.33 (0.33) 23 0.50 (0.34) 9 0.48 (0.52) 24 0.67 (0.61) 12 0.50 (0.41) 25 0.56 (0.54) 13 0.49 (0.52) 26 0.32 (0.36) 14 0.56 (0.55) 27 0.48 (0.39) 15 0.64 (0.65) 28 0.44 (0.51) 16 0.65 (0.62) 29 0.42 (0.42) 17 0.58 (0.51) 30 0.49 (0.44) Note. Estimates from the cross validation sample are in parentheses. 181


Table A.13 DIF Indices Based Upon the 26 Valid Items of the GRE-Q with Population Groups Defined by Gender Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 1 1. 73 -.052 -0.95 0.95 0.92 2 -0.21 .015 0.52 3 -0.74 .052 1.39 1.54 1.73 4 0.05 -.008 -0.04 0.20 0.02 5 0.24 -.012 -0.29 0.30 0.42 6 -1.03 .080 5.80 7 -0.12 .005 0.04 0.13 0.14 8 -0.25 .014 -0.10 0.56 0.59 9 -0.91 .091 0.26 0.30 0.39 10 0.23 -.007 0.39 11 0.13 -.002 110.45 12 -0.28 .029 0.10 0.43 0.40 13 0.73 -.065 -0.31 0.31 0.27 14 0.45 -.034 -0.25 0.25 0.16 15 0.14 -.021 -0.13 0.25 0.09 182


183 Table A.13 -continued. Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 16 -0.35 .005 0.20 0.29 0.34 17 0.09 -.009 0.12 0.28 0.24 18 0.14 .003 -0.17 0.36 0.06 19 -0.34 .033 0.34 0.44 0.40 20 0.51 -.029 -0.45 0.61 0.56 21 -0.28 .023 0.38 0.46 0.50 22 -0.49 .047 0.04 0.36 0.39 23 0.78 -.074 -0.31 0.32 0.32 24 0.26 -.027 -0.19 0.19 0.12 25 0.50 -.030 -0.42 0.43 0.39 26 -0.24 .032 0.48 0.61 0.57 27 -0.12 .014 -0.02 0.29 0.26 28 0.09 .000 -0.04 0.40 0.34 29 -0.13 -.001 0.02 0.03 0.14 30 -0.46 .031 0.30 0.39 0.42


184 Table A.14 DIF Indices Based Upon the 26 Valid Items of the GRE-Q with Population Groups Defined by Mathematics Background Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 1 -0.27 .002 -0.24 0.27 0.23 2 -0.08 .017 2.63 3 0.15 -.017 0.35 0.66 0.55 4 -1.49 .141 0.75 0.84 0.77 5 0.04 -.006 -0.16 0.16 0.07 6 -1.43 .126 11. 75 7 0.37 -.043 -0.33 0.33 0.22 8 0.69 -.073 -0.64 0.75 0.60 9 -0.38 .032 0.10 0.44 0.40 10 -0.01 .032 2.65 11 0.54 -.011 228.50 12 0.80 -.076 -0.48 0.65 0.54 13 0.39 -.041 -0.37 0.52 0.41 14 0.61 -.060 -0.61 0.74 0.54 15 0.14 -.018 -0.23 0.23 0.05


185 Table A.14 -continued. Item MH-D SIBTEST-b IRT-SA IRT-UA LogReg 16 -1.11 .025 0.33 0.36 0.49 17 -0.08 .002 -0.04 0.11 0.12 18 -0.96 .059 1.47 1.72 1.69 19 0.30 -.049 -0.07 0.28 0.22 20 -1.58 .142 0.41 0.87 0.78 21 0.42 -.025 0.02 0.47 0.34 22 0.24 -.041 -0.22 0.22 0.13 23 0.27 -.025 -0.05 0.45 0.47 24 -0.50 .014 -0.05 0.09 0.14 25 1.16 -.064 -0.52 0.52 0.32 26 -0.28 .027 0.38 0.52 0.81 27 0.38 -.030 -0.30 0.30 0.19 28 0.17 -.006 -0.18 0.27 0.21 29 0.60 -.058 0.19 0.68 0.68 30 -0.24 .031 0.47 0.95 0.77


Table A.15 DIF Indices Based Upon the 26 Valid Items of the GRE-Q with Population Groups Defined by Test Anxiety SIBTEST-b IRT-SA IRT-UA LogReg 1 -0.18 .004 -0.18 0.20 0.31 2 0.38 -.012 0.48 3 -0.24 .001 -0.04 0.10 0.21 4 0.30 -.028 -0.15 0.21 0.27 5 -0.29 .040 0.12 0.12 0.15 6 0.56 -.021 0.76 7 0.22 -.023 0.36 0.72 0.42 8 -0.16 .028 0.07 0.18 0.30 9 -0.12 .018 0.01 0.03 0.16 10 -0.05 .026 4.80 11 -0.60 .037 242.67 12 -0.13 .017 0.02 0.14 0.09 13 0.15 -.011 -0.07 0.12 0.04 14 -0.40 .035 0.12 0.12 0.15 15 0.45 -.036 -0.19 0.19 0.21 186


187 Table A.15 -continued. Item SIBTEST-b IRT-SA IRT-UA LogReg 16 -0.68 .026 0.18 0.18 0.19 17 -1.12 .068 0.39 0.40 0.37 18 0.27 -.017 -0.18 0.18 0.19 19 0.13 -.011 0.15 0.33 0. 21 20 0.23 -.018 -0.21 0.26 0.33 21 -0.22 .032 -0.31 0.57 0.76 22 0.35 -.029 -0.24 0.28 0.37 23 -0.02 .008 -0.14 0.32 0.03 24 -0.37 .005 -0.01 0.21 0.22 25 -0.25 -.003 0.30 0.35 0.25 26 0.19 .002 0.12 0.40 0.62 27 0.74 -.071 -0.34 0.34 0.34 28 0.16 -.010 -0.10 0.14 0.18 29 0.05 -.003 0.37 0.55 0.80 30 -0.05 .006 -0.03 0.05 0.15


188 Table A.16 Item Inferential Statistical Tests Based Upon the 26 Valid Items with Population Groups Defined by Gender SIBTEST chi-sq. z_ 1 9.98** -2.87** 3 2.92 2.46* 4 0.01 -0.28 5 0.42 -0.41 7 0.07 0.21 8 0.55 0.46 9 8.13** 3.09** 12 0.69 0.99 13 4.72* -2.25* 14 1.52 -1.26 15 0.12 -0.78 16 0.45 0.22 17 0.02 -0.39 Note. p < .05 ** p < .01 IRT-SA z -1. 83 2.04* -0.21 -1.36 0.10 -0.44 1.70 0.64 -2.41* -1. 72 -1.22 0.84 0.49 IRT-UA _z 1.82 2.04* 0.71 1.18 0.30 1.36 1.68 1.62 2.41* 0.60 1.32 1.19 1.11 LogReg chi-sq. 3.55 11.64** 0.01 3.24 0.24 5.24 1.86 5.14 1.38 0.42 0.81 5.27 2.39


189 Table A.16 -continued. SIBTEST IRT-SA IRT-UA LogReg chi-sq. z_ z _z chi-sq. 18 0.07 0.13 -0.60 0.60 0.0 1 19 0.65 1.35 1.21 1.32 3.55 20 2.13 -1.05 -2.12* 1.99* 7.00* 21 0.47 0.92 0.97 0.98 3.04 22 2.16 1.64 0.20 1.43 3.56 23 5.12* -2.73** -1.53 1.86 4.02 24 0.45 -0.97 -1. 65 1. 64 0.75 25 0.96 -1.40 -1. 79 1.72 4.30 26 0.42 1.16 1.13 1.14 2.52 27 0.10 0.47 -0.13 1.00 2.46 28 0.04 -0.01 -0.27 1.58 5.60 29 0.09 -0.04 -0.06 0.21 0.08 30 1.71 1.17 1.24 1.27 3.24 Note. p < .05 ** p < .01


190 Table A.17 Item Inferential Statistical Tests Based Upon the 26 Valid Items with Population Groups Defined by Mathematics Background SIBTEST chi-sq. z_ 1 0.16 0.14 3 0.08 -0.77 4 23.51** 4.68** 5 0.00 -0.22 7 0.98 -1.67 8 4.70* -2.45* 9 1.28 1.06 12 6.07* -2.61** 13 1.30 -1.42 14 2.87 -2.25* 15 0.12 -0.65 16 5.34* 1.24 17 0.01 0.11 Note. p < .05 ** p < 01 IRT-SA z -0.37 0.61 3.14** -0.62 -0.87 -3.13** 0.51 -3.34** -2.73** -4.27** -2.10* 1.12 -0.17 IRT-UA _z 0.10 0.41 2.53* 0.62 0.87 2.30* 1.54 2.84** 2.39* 4.13** 2.15* 1.15 0.63 LogReg chi-sq. 0.45 2.62 8.01* 0.10 0.21 6.03* 5.81 14.19** 9.89** 18.43** 0.14 2.61 0.33


Table A.17 -continued. SIBTEST chi-sq. z_ 18 5.04* 2.82** 19 0.53 -2.02* 20 21.95** 5.05** 21 1.14 -1.02 22 0.47 -1.51 23 0.55 -0.91 24 1.91 0.54 25 6.07* -3.21** 26 0.57 0.94 27 1.29 -1.02 28 0.23 -0.20 29 2.25 -2.30* 30 0.42 1.09 Note. p < .05 ** p < .01 IRT-SA IRT-UA _z_ _z 2.31* 2.49* -0.24 1.17 1.39 2.31* 0.05 1.25 -1.12 1.13 -0.18 0.52 -0.38 0.53 -2.05* 2.05* -0.84 1.25 -1. 88 1.77 -1.10 1.06 -0.45 0.41 1.45 1. 72 LogReg chi-sq. 24.28** 1.76 4.72 2.30 0.25 7.00* 0.40 0.40 5.12 0.78 2.19 13.28** 16.47** 191


192 Table A.18 Item Inferential Statistical Tests Based Upon the 26 Valid Items with Population Groups Defined by Test Anxiety SIBTEST chi-sq. z_ 1 0.04 0.26 3 0.24 0.06 4 0.74 -0.92 5 0.64 1.38 7 0.27 -0.87 8 0.19 0.92 9 0.09 0.61 12 0.11 0.56 13 0.14 -0.36 14 1.12 1.29 15 1.47 -1.26 16 1.74 1.23 17 6.61** 2.99* Note. p < .OS. ** p < 01. IRT-SA z -0.33 -0.09 -0.86 0.54 0.90 0.32 0.05 0.10 -0.53 0.72 -1. 71 0.90 1.73 IRT-UA _z 0.35 0.22 0.95 0.54 1.45 0.50 0.14 0.54 0.67 0.68 1.93 0.88 1.67 LogReg chi-sq. 0.70 0.42 2.61 0.27 2.20 0.98 0.90 0.14 0.07 0.16 5.22 0.13 0.56


193 Table A.18 -continued. SIBTEST IRT-SA IRT-UA LogReg chi-sq. z_ _z_ _z chi-sq. 18 0.31 -0.71 -0.73 0.73 0.35 19 0.06 -0.44 0.60 1.13 1.48 20 0.34 -0.63 -0.97 0.86 2.72 21 0.27 1.22 -1.01 1.52 10.32** 22 1.01 -0.98 -1.26 1.04 3.33 23 0.00 0.28 -0.61 0.95 0.03 24 1.04 0.18 -0.13 1.36 4.53 25 0.19 -0.13 1.05 1.11 1.50 26 0.23 0.08 0.29 0.77 3.99 27 4.72* -2.38* -2.26* 2.28* 1. 72 28 0.18 -0.30 -0.64 0.73 1.59 29 0.00 -0.12 0.80 1.12 7.61* 30 0.00 0.21 -0.14 0.32 0.80 Note. p < .05 ** p < .01


APPENDIX B DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE INCLUDING THE REVISED TEST ANXIETY SCALE


DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE

Please carefully bubble in the last five digits of your parents' telephone number. Please bubble in your age in the first two columns labeled section. In the second two columns labeled section, bubble in your college classification according to the following criteria:

00 non-degree
01 Freshman
02 Sophomore
03 Junior
04 Senior
05 Master's student
06 Doctoral student

Directions (questions 1-6): Answer each of the following questions.

1. Your sex:
   a. Female  b. Male

2. Your ethnic group:
   a. African-American  b. Asian-American  c. Hispanic-American  d. White-American  e. Other

3. Your college major could be best classified under which of the following:
   a. humanities, fine arts
   b. social science, psychology, education
   c. business
   d. biological sciences
   e. physical sciences, mathematics, engineering


4. Have you successfully completed a college-level calculus course?
   a. yes  b. no

5. Have you successfully completed a college-level statistics course?
   a. yes  b. no

6. Have you previously taken the GRE?
   a. yes  b. no

Directions: The following items refer to how you feel when taking a test. Use the scale below to rate items 7 through 26 in terms of how you feel when taking tests in GENERAL.

1 = almost never   2 = sometimes   3 = often   4 = almost always

7. Thinking about my grade in a course interferes with my work on tests ... 1 2 3 4
8. I seem to defeat myself while taking important tests ... 1 2 3 4
9. During tests I find myself thinking about the consequences of failing ... 1 2 3 4
10. I start feeling very uneasy just before getting a test paper back ... 1 2 3 4
11. During tests I feel very tense ... 1 2 3 4
12. I worry a great deal before taking an important exam ... 1 2 3 4


13. During tests I find myself thinking of things unrelated to the material being tested ... 1 2 3 4
14. While taking tests, I find myself thinking how much brighter the other people are ... 1 2 3 4
15. I think about current events during a test ... 1 2 3 4
16. I get a headache during an important test ... 1 2 3 4
17. While taking a test, I often think about how difficult it is ... 1 2 3 4
18. I am anxious about tests ... 1 2 3 4
19. While taking tests I sometimes think about being somewhere else ... 1 2 3 4
20. During tests I find I am distracted by thoughts of upcoming events ... 1 2 3 4
21. My mouth feels dry during a test ... 1 2 3 4
22. I sometimes find myself trembling before or during tests ... 1 2 3 4
23. While taking a test my muscles are very tight ... 1 2 3 4
24. I have difficulty breathing while taking a test ... 1 2 3 4

25. During the test I think about how I should have prepared for the test .................. 1 2 3 4
26. I worry before the test because I do not know what to expect ........................... 1 2 3 4
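Items 7 through 26 above constitute the Revised Test Anxiety scale portion of the questionnaire, each rated from 1 (almost never) to 4 (almost always). As a hedged illustration only, the sketch below assumes the conventional sum-of-ratings scoring, in which a respondent's ratings are simply added to form a total test anxiety score; the scoring and any subscale structure actually used in this study are described in the body of the dissertation, not here.

    def total_test_anxiety_score(ratings):
        """Sum the 1-4 ratings for the 20 Revised Test Anxiety items (items 7-26).

        ratings: a sequence of 20 integers, each between 1 and 4.
        Returns the total score, which ranges from 20 to 80 under this scheme.
        """
        if len(ratings) != 20:
            raise ValueError("expected ratings for all 20 test anxiety items")
        if any(r not in (1, 2, 3, 4) for r in ratings):
            raise ValueError("each rating must be 1, 2, 3, or 4")
        return sum(ratings)

    # Example: answering 'sometimes' (2) to every item yields a total of 40.
    print(total_test_anxiety_score([2] * 20))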

APPENDIX C
THE CHIPMAN, MARSHALL, AND SCOTT (1991) INSTRUMENT FOR ESTIMATING MATHEMATICS BACKGROUND

Mathematics Background

1. How many semesters of high school mathematics did you successfully complete?
2. Did you successfully complete a high school calculus course?
3. How many semester credits of college mathematics have you earned?
4. Have you successfully completed a college calculus course?
5. Have you successfully completed a college course in:
   a. physics?
   b. computer science programming?
   c. engineering?

Composite mathematics background scores were determined by giving students one point for each semester of high school mathematics; two points for successfully completing a high school calculus course; one point for each credit of college mathematics achieved up to a total of ten; two points for successfully completing a college calculus course; and one point each for completing a college course in physics, computer science, or engineering.
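The composite scoring rule just described can be written out directly. The sketch below is an illustrative implementation of that rule (the function and argument names are invented for the example): one point per semester of high school mathematics, two points for high school calculus, one point per college mathematics credit up to ten, two points for college calculus, and one point each for a college physics, computer science programming, or engineering course.

    def composite_math_background(hs_math_semesters, hs_calculus, college_math_credits,
                                  college_calculus, physics, computer_science, engineering):
        """Composite mathematics background score per the rule stated above."""
        score = hs_math_semesters                  # one point per HS math semester
        score += 2 if hs_calculus else 0           # two points for HS calculus
        score += min(college_math_credits, 10)     # college math credits, capped at ten
        score += 2 if college_calculus else 0      # two points for college calculus
        score += sum(1 for taken in (physics, computer_science, engineering) if taken)
        return score

    # Example: 8 HS semesters, HS calculus, 12 college credits (capped at 10),
    # college calculus, physics and computer science but no engineering:
    # 8 + 2 + 10 + 2 + 2 = 24.
    print(composite_math_background(8, True, 12, True, True, True, False))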

REFERENCES

Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.

Angoff, W.H. (1993). Perspectives on differential item functioning methodology. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum.

Angoff, W.H., & Ford, S.F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10, 95-106.

Baker, F.B. (1981). A criticism of Scheuneman's item bias technique. Journal of Educational Measurement, 18, 59-62.

Benbow, C.P. (1988). Sex differences in mathematical reasoning ability in intellectually talented preadolescents: Their nature, effects, and possible causes. Behavioral and Brain Sciences, 11, 169-232.

Benbow, C.P., & Stanley, J.C. (1980). Sex differences in mathematical ability: Fact or artifact? Science, 210, 1262-1264.

Benson, J., & Bandalos, D. (1992). Second-order confirmatory factor analysis of the Reactions to Tests scale with cross-validation. Multivariate Behavioral Research, 27, 459-487.

Benson, J., & El Zahhar, N. (1994). Further refinement and validation of the Revised Test Anxiety scale. Structural Equation Modeling: A Multidisciplinary Journal, 1, 203-221.

Benson, J., Moulin-Julian, M., Schwarzer, C., Seipp, B., & El Zahhar, N. (1991). Cross-validation of a revised test anxiety scale using multi-national samples. In K. Hagtvet & T.B. Johnson (Eds.), Advances in test anxiety research (Vol. 7, pp. 62-83). Lisse, The Netherlands: Swets & Zeitlinger.

Bentler, P.M., & Bonett, D.G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.

Bishop, Y.M., Fienberg, S.E., & Holland, P.W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.

Bridgeman, B., & Wendler, C. (1991). Gender differences in predictors of college mathematics performance and in college mathematics course grades. Journal of Educational Psychology, 83, 275-284.

Burton, E., & Burton, N.W. (1993). The effects of item screening on test scores and test characteristics. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 321-335). Hillsdale, NJ: Lawrence Erlbaum.

Camilli, G. (1992). A conceptual analysis of differential item functioning in terms of a multidimensional item response model. Applied Psychological Measurement, 16, 129-147.

Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Chipman, S.F., Marshall, S.P., & Scott, P.A. (1991). Content effects on word problem performance: A possible source of test bias? American Educational Research Journal, 28, 897-915.

Clauser, B.E., Mazor, K., & Hambleton, R.K. (1991). Influence of criterion variable on the identification of differentially functioning test items using the Mantel-Haenszel statistic. Applied Psychological Measurement, 15, 353-359.

Cohen, A.S., & Kim, S.-H. (1993). A comparison of Lord's chi-square and Raju's area measure in detection of DIF. Applied Psychological Measurement, 17, 39-52.

Cole, N.S., & Moss, P.A. (1989). Bias in test use. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-219). New York: American Council on Education/Macmillan.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston.

Darlington, R.B. (1990). Regression and linear models. New York: McGraw-Hill.

Donoghue, J.R., Holland, P.W., & Thayer, D.T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 137-166). Hillsdale, NJ: Lawrence Erlbaum.

Doolittle, A.E. (1984, April). Interpretation of differential item performance accompanied by gender differences in academic background. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.

Doolittle, A.E. (1985, April). Understanding differential item performance as a consequence of gender differences in academic background. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Doolittle, A.E., & Cleary, T.A. (1987). Gender-based differential item performance in mathematics achievement items. Journal of Educational Measurement, 24, 157-166.

Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum.

Dorans, N.J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.

Dorans, N.J., Schmitt, A.P., & Bleistein, C.A. (1992). The standardization approach to assessing comprehensive differential item functioning. Journal of Educational Measurement, 29, 309-319.

Dweck, C.S. (1986). Motivational processes affecting learning. American Psychologist, 41, 1040-1048.

Eccles, J., Adler, T., & Meece, J.L. (1984). Sex differences in achievement: A test of alternate theories. Journal of Personality and Social Psychology, 46, 26-43.

Educational Testing Service. (1991). Sex, race, ethnicity, and performance on the GRE general test: A technical report. Princeton, NJ: Author.

Educational Testing Service. (1993). GRE: 1993-94 information and registration bulletin. Princeton, NJ: Author.

Elliot, E.S., & Dweck, C.S. (1988). Goals: An approach to motivation and achievement. Journal of Personality and Social Psychology, 54, 5-12.

Elliot, R., & Strenta, A.C. (1988). Effects of improving the reliability of the GPA on prediction generally and on comparative predictions for gender and race particularly. Journal of Educational Measurement, 25, 333-347.

Ethington, C.A., & Wolfle, L.M. (1984). Sex differences in a causal model of mathematics achievement. Journal for Research in Mathematics Education, 15, 361-377.

Everson, H.T., Millsap, R.E., & Rodriguez, C.M. (1991). Isolating gender differences in test anxiety: A confirmatory factor analysis of the Test Anxiety Inventory. Educational and Psychological Measurement, 51, 243-251.

Feingold, A. (1988). Cognitive gender differences are disappearing. American Psychologist, 43, 95-103.

Feingold, A. (1992). Sex differences in variability in intellectual abilities. Review of Educational Research, 62, 61-84.

Fennema, E. (1985). Attribution theory and achievement in mathematics. In S.R. Yussen (Ed.), The growth of reflection in children (pp. 245-265). New York: Academic Press.

Fennema, E., & Petersen, P. (1985). Autonomous learning behavior: A possible explanation of gender-related differences in mathematics. In L.C. Wilkinson & C.B. Marrett (Eds.), Gender influences in classroom interaction (pp. 17-35). New York: Academic Press.

Fennema, E., & Sherman, J. (1977). Sex-related differences in mathematics achievement, spatial visualization and affective factors. American Educational Research Journal, 14, 51-71.

Freedle, R., & Kostin, I. (1990). Item difficulty of four verbal item types and an index of differential item functioning for black and white examinees. Journal of Educational Measurement, 27, 329-343.

Friedman, L. (1989). Mathematics and the gender gap: A meta-analysis of recent studies on sex differences in mathematical tasks. Review of Educational Research, 59, 185-213.

Hackett, G., & Betz, N.E. (1989). An exploration of the mathematics self-efficacy/mathematics performance correspondence. Journal for Research in Mathematics Education, 20, 79-83.

Halpern, D.F. (1992). Sex differences in cognitive abilities (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Hambleton, R.K., & Rogers, H.J. (1989). Detecting potentially biased test items: Comparison of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2, 313-334.

Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Harnisch, D.L., & Linn, R.L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133-146.

Harris, A.M., & Carlton, S.T. (1993). Patterns of gender differences on mathematics items on the Scholastic Aptitude Test. Applied Measurement in Education, 6, 137-151.

Hembree, R. (1988). Correlates, causes, effects, and treatment of test anxiety. Review of Educational Research, 58, 47-77.

Hill, K., & Wigfield, A. (1984). Test anxiety: A major educational problem and what can be done about it. Elementary School Journal, 85, 105-126.

Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.

Holland, P.W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Hoover, H.D., & Kolen, M.J. (1984). The reliability of six item bias indices. Applied Psychological Measurement, 8, 173-181.

Hyde, J.S. (1981). How large are cognitive gender differences? American Psychologist, 36, 892-901.

Hyde, J.S., Fennema, E., & Lamon, S.J. (1990). Gender differences in mathematics performance: A meta-analysis. Psychological Bulletin, 107, 139-155.

Joreskog, K.G., & Sorbom, D. (1989a). PRELIS: A preliminary guide for analysing linear structural relationships [Computer program]. Chicago: Scientific Software.

Joreskog, K.G., & Sorbom, D. (1989b). LISREL 7: Analysis of linear structural relationships by the method of maximum likelihood [Computer program]. Chicago: Scientific Software.

Kim, S.-H., Cohen, A.S., & Kim, H.-O. (1994, April). An investigation of Lord's procedure for detection of differential item functioning. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.

Kimball, M.M. (1989). A new perspective on women's math achievement. Psychological Bulletin, 105, 198-214.

Kok, F. (1988). Item bias and test multidimensionality. In R. Langeheine & J. Rost (Eds.), Latent trait and latent class models (pp. 263-275). New York: Plenum Press.

Li, H., & Stout, W. (1993, April). A new procedure for detection of crossing DIF/bias. Paper presented at the annual meeting of the American Educational Research Association, Atlanta.

Li, H., & Stout, W. (1994, April). Detecting crossing item bias/DIF: Comparison of logistic regression and crossing SIBTEST procedures. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.

Licht, B.G., & Dweck, C.S. (1983). Sex differences in achievement orientations: Consequences for academic choices and attainments. In M. Marland (Ed.), Sex differentiation and schooling (pp. 72-97). London: Heinemann Educational Books.

Liebert, R.M., & Morris, L.W. (1967). Cognitive and emotional components of test anxiety: A distinction and some initial data. Psychological Reports, 20, 975-978.

Linn, M.C., DeBenedictis, T., Delucchi, K., Harris, A., & Stage, E. (1987). Gender differences in national assessment of educational progress science items: What does "I don't know" really mean? Journal of Research in Science Teaching, 24, 267-278.

Linn, M.C., & Hyde, J.S. (1989). Gender, mathematics, and science. Educational Researcher, 18(8), 17-27.

Linn, R.L. (1993). The use of differential item functioning statistics: A discussion of current practice and future implications. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 349-364). Hillsdale, NJ: Lawrence Erlbaum.

Linn, R.L., & Harnisch, D.L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118.

Linn, R.L., Levine, M.V., Hastings, C.N., & Wardrop, J.L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159-173.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Maccoby, E.E., & Jacklin, C.N. (1974). The psychology of sex differences. Stanford, CA: Stanford University Press.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

Marsh, H.W. (1988). Multitrait-multimethod analysis. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook (pp. 570-580). New York: Pergamon.

Masters, G.N. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, 25, 15-29.

McCornack, R.L., & McLeod, M.M. (1988). Gender bias in the prediction of college course performance. Journal of Educational Measurement, 25, 321-331.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Miller, M.D., & Linn, R.L. (1988). Invariance of item characteristic functions with variation in instructional coverage. Journal of Educational Measurement, 25, 205-219.

Millsap, R.E., & Everson, H.T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334.

Mislevy, R.J., & Bock, R.D. (1990). BILOG III: Item analysis and test scoring with binary logistic models [Computer program]. Mooresville, IN: Scientific Software.

Moss, P.A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.

Muthen, B.O., Kao, C.-F., & Burstein, L. (1991). Instructionally sensitive psychometrics: Application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement, 28, 1-22.

National Center for Education Statistics. (1993). Digest of education statistics 1993. Washington, DC: U.S. Department of Education.

O'Neill, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Lawrence Erlbaum.

Oshima, T. (1989/1990). The effect of multidimensionality on item bias detection based on item response theory (Doctoral dissertation, University of Florida, 1989). Dissertation Abstracts International, 51, 829A.

Pajares, F., & Miller, M.D. (1994). Role of self-efficacy and self-concept beliefs in mathematical problem solving: A path analysis. Journal of Educational Psychology, 86, 193-203.

Pallas, A.M., & Alexander, K.L. (1983). Sex differences in quantitative SAT performance: New evidence on the differential coursework hypothesis. American Educational Research Journal, 20, 165-182.

Park, D.G., & Lautenschlager, G.J. (1990). Improving IRT item bias detection with iterative linking and ability scale purification. Applied Psychological Measurement, 14, 163-173.

Raju, N.S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Raju, N.S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.

Raju, N.S., Drasgow, F., & Slinde, J.A. (1993). An empirical comparison of the area methods, Lord's chi-square test, and the Mantel-Haenszel technique for assessing differential item functioning. Educational and Psychological Measurement, 53, 301-314.

Reckase, M.D. (1985, April). The difficulty of test items that measure more than one ability. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Reckase, M.D. (1986, April). The discriminating power of items that measure more than one dimension. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Richardson, F.C., & Woolfolk, R.L. (1980). Mathematics anxiety. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications (pp. 271-288). Hillsdale, NJ: Lawrence Erlbaum.

Rosser, P. (1989). Sex bias in college admission tests: Why women lose out. Cambridge, MA: FairTest.

Rudner, L.M. (1977). An approach to biased item identification using latent trait measurement theory. Paper presented at the annual meeting of the American Educational Research Association, New York.

Ryckman, D.B., & Peckham, P. (1987). Gender differences in attributions for success and failure situations across subject areas. Journal of Educational Research, 81, 120-125.

Sarason, I.G. (1980). Introduction to the study of test anxiety. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications (pp. 3-14). Hillsdale, NJ: Lawrence Erlbaum.

Sarason, I.G. (1984). Stress, anxiety, and cognitive interference: Reactions to tests. Journal of Personality and Social Psychology, 46, 929-938.

SAS Institute, Inc. (1988). SAS user's guide: Statistics (Version 6.03) [Computer program]. Cary, NC: Author.

Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement, 16, 143-152.

Scheuneman, J.D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97-118.

Scheuneman, J.D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109-131.

Schmitt, A.P. (1988). Language and cultural characteristics that explain differential item functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of Educational Measurement, 25, 1-13.

Schmitt, A.P., & Dorans, N.J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67-81.

Schmitt, A.P., Holland, P.W., & Dorans, N.J. (1993). Evaluating hypotheses about differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 281-315). Hillsdale, NJ: Lawrence Erlbaum.

Sells, L. (1978). Mathematics--A critical filter. The Science Teacher, 45, 28-29.

Shealy, R., & Stout, W. (1993a). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.

Shealy, R.T., & Stout, W.F. (1993b). An item response theory model for test bias and differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197-239). Hillsdale, NJ: Lawrence Erlbaum.

Shepard, L.A., Camilli, G., & Williams, D.M. (1984). Accounting for statistical artifacts in item bias research. Journal of Educational Statistics, 9, 93-128.

Shepard, L.A., Camilli, G., & Williams, D.M. (1985). Validity of approximation techniques for detecting item bias. Journal of Educational Measurement, 22, 77-105.

Sherman, J. (1983). Factors predicting girls' and boys' enrollment in college preparatory mathematics. Psychology of Women Quarterly, 7, 272-281.

Skaggs, G., & Lissitz, R.W. (1992). The consistency of detecting item bias across different test administrations: Implications of another failure. Journal of Educational Measurement, 29, 227-242.

Spielberger, C.D., Gonzalez, H.P., Taylor, C.J., Algaze, B., & Anton, W.D. (1978). Examination stress and test anxiety. In C.D. Spielberger & I.G. Sarason (Eds.), Stress and anxiety (Vol. 5, pp. 167-191). New York: Hemisphere/Wiley.

Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245-251.

Stout, W., & Roussos, L. (1992). SIBTEST user manual [Computer program]. Champaign: University of Illinois.

Swaminathan, H., & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Tatsuoka, K.K., Linn, R.L., Tatsuoka, M.M., & Yamamoto, K. (1988). Differential item functioning resulting from the use of different solution strategies. Journal of Educational Measurement, 25, 301-319.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Lawrence Erlbaum.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Tryon, G.S. (1980). The measurement and treatment of test anxiety. Review of Educational Research, 50, 343-372.

Tucker, L.R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.

Uttaro, T., & Millsap, R.E. (1994). Factors influencing the Mantel-Haenszel procedure in the detection of differential item functioning. Applied Psychological Measurement, 18, 15-25.

Wainer, H., & Steinberg, L. (1992). Sex differences in performance on the mathematics section of the Scholastic Aptitude Test: A bidirectional validity study. Harvard Educational Review, 62, 323-336.

Wigfield, A., & Eccles, J.S. (1989). Test anxiety in elementary and secondary school students. Educational Psychologist, 24, 159-183.

Wine, J.D. (1980). Cognitive-attentional theory of test anxiety. In I.G. Sarason (Ed.), Test anxiety: Theory, research, and applications (pp. 349-385). Hillsdale, NJ: Lawrence Erlbaum.

Young, J.W. (1991). Gender bias in predicting college academic performance: A new approach using item response theory. Journal of Educational Measurement, 28, 37-47.

Young, J.W. (1994, April). Differential prediction of college grades by gender and ethnicity: A replication study. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.

Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55-66.

BIOGRAPHICAL SKETCH

Thomas Edward Langenfeld was born in Des Moines, Iowa. He received a Bachelor of Arts in history and education from Iowa State University. Subsequently, he worked as a high school social studies teacher in Storm Lake, Iowa. He later received a Master of Arts in history from the University of Iowa. While teaching at Storm Lake High School, he was cited by the White House Commission on Presidential Scholars for excellence in teaching. After 14 years of public school teaching, he returned to graduate studies to pursue a doctorate. In 1991 he entered the graduate program in research and evaluation methodology in the Department of Foundations of Education at the University of Florida. He received his Ph.D. in 1995 and accepted a position as assistant professor at West Georgia College.

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Linda Crocker, Chair
Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

James Algina
Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Rodman Webb
Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Hsu
Assistant Professor of Foundations of Education

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Marc Mahlios
Professor of Instruction and Curriculum

This dissertation was submitted to the Graduate Faculty of the College of Education and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August, 1995

Chairman, Foundations of Education

Dean, College of Education

Dean, Graduate School


