Gender differences on college admission test items


Material Information

Title:
Gender differences on college admission test items exploring the role of mathematical background and test anxiety using multiple methods of differential item functioning detection
Physical Description:
x, 213 leaves ; 29 cm.
Language:
English
Creator:
Langenfeld, Thomas E
Publication Date:

Subjects

Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1995.
Bibliography:
Includes bibliographical references (leaves 201-212).
Statement of Responsibility:
by Thomas E. Langenfeld.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002056850
notis - AKP4870
oclc - 33815423
System ID:
AA00002044:00001

Full Text














GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS:
EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND
AND TEST ANXIETY USING MULTIPLE METHODS
OF DIFFERENTIAL ITEM FUNCTIONING DETECTION







By


THOMAS E. LANGENFELD


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

















ACKNOWLEDGEMENTS

I would like to express my sincerest appreciation to the individuals who have assisted me in completing this study. I am extremely indebted to Dr. Linda Crocker, chairperson of my doctoral committee, for helping in the conceptualization, development, and writing of this dissertation. Her assistance and encouragement were extremely important in enabling me to achieve my doctorate. I also want to thank the other members of my committee, James Algina, Jin-win Hsu, Marc Mahlios, and Rodman Webb, for patiently reading the manuscript, offering constructive comments, providing editorial assistance, and giving continuous support. I further wish to thank David Miller, John Hall, and Scott Behrens for their assistance related to different aspects of this study.

I want to express my deepest gratitude to my family for providing the emotional support that was so vital during the graduate experience. I want to thank my wife, Ann--in many ways this degree is as much hers as mine. Space limitations do not allow me to express the many personal sacrifices made by my wife so that I could complete this study. I also want to thank my daughter, Kathryn Louise, who was born in the early stages of this study and has come to provide a special type of support.
















TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1  INTRODUCTION
     Statement of the Problem
     The Measurement Context of the Study
     The Research Problem
     Theoretical Rationale
     Limitations of the Study

2  REVIEW OF LITERATURE
     DIF Methodology
     Gender and Quantitative Aptitude
     Potential Explanations of DIF
     Summary

3  METHODOLOGY
     Examinees
     Instruments
     Analysis
     Summary

4  RESULTS AND DISCUSSION
     Descriptive Statistics
     Research Findings

5  SUMMARY AND CONCLUSIONS

APPENDICES

     SUMMARY OF STATISTICAL TABLES
     DIFFERENTIAL ITEM FUNCTIONING QUESTIONNAIRE INCLUDING THE REVISED TEST ANXIETY SCALE
     THE CHIPMAN, MARSHALL, AND SCOTT (1991) INSTRUMENT FOR ESTIMATING MATHEMATICS BACKGROUND

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF TABLES

Table

Proposed Multitrait-Multimethod Correlation Matrix: Uniform Indices

Proposed Multitrait-Multimethod Correlation Matrix: Alternate Indices

Item Data by Group and Item Scores by Ability Group

Descriptive Statistics for the Revised Test Anxiety (RTA) Scale Items

Correlations of the GRE with Calculus Completion, College Mathematics Credits, and the SAT-M

Frequencies and Percentages for Mathematics Background

Mean Scores on the Revised Test Anxiety (RTA) Scale and the Released GRE for the Total Sample, Gender, and Mathematics Background

Intercorrelations of the Released GRE-Q, RTA, and Mathematics Background for the Total Sample, Women, and Men

Multitrait-Multimethod Correlation Matrix: Uniform DIF Indices

Percent-of-Agreement Rates and Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 30-Item GRE-Q

Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

Tetrachoric Correlation Estimates for Four Standardized Problematic Items: Exploratory Sample

Multitrait-Multimethod Correlation Matrix for the Valid Test Items: Uniform DIF Indices

Percent-of-Agreement Rates and Inferential Tests by Gender, Mathematics Background, and TA Between DIF Methods: 26-Item Valid Test

Multitrait-Multimethod Correlation Matrix for the Valid Test Items: Alternate DIF Indices
















LIST OF FIGURES

Figure

The Four Problematic Test Questions

Item LRCs for Women and Men

Item LRCs for Women and Men

Item LRCs for Examinees with Substantial and Little Mathematics Background

Item LRCs for Women and Men

Item LRCs for Examinees with Substantial and Little Mathematics Background

Item LRCs for Women and Men Illustrating the Symmetrical Nonuniform DIF Condition

Item LRCs for Women and Men Illustrating the More Typical Nonuniform DIF Condition
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

GENDER DIFFERENCES ON COLLEGE ADMISSION TEST ITEMS:
EXPLORING THE ROLE OF MATHEMATICAL BACKGROUND
AND TEST ANXIETY USING MULTIPLE METHODS
OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

By

Thomas E. Langenfeld

August, 1995

Chairperson: Linda Crocker
Major Department: Foundations of Education

The purpose of this study was to discover whether defining examinee subpopulations by a relevant educational or psychological variable, rather than gender, would yield item statistics that were more consistent across five methods of differential item functioning (DIF) detection. A subsidiary purpose of this study was to assess how the consistency of DIF estimates was affected when structural validation findings were incorporated into the analyses. The study was conducted in the context of college admission quantitative examinations and gender issues. Participants consisted of 1263 university students. For purposes of this study, item responses were analyzed by categorizing examinees by their gender, mathematics backgrounds, and levels of test anxiety.

The hypothesis that defining subpopulations by mathematics background or test anxiety would yield higher consistency of estimation than defining subpopulations by gender was not substantiated. Results indicated that using mathematics background to define subpopulations to explain gender DIF had potential usefulness; however, in this study, the use of test anxiety to define subpopulations to explain DIF was ineffectual. The findings confirmed the importance of structural validation analyses. Results from using the entire test revealed that nonuniform DIF methods had low inter-method consistency and substantial variance related to methods. When structural validation findings were used to define a valid subset of items, highly consistent DIF indices resulted across methods with minimal variance related to methods. Results for the nonuniform methods further suggested the need for, and the importance of, jointly interpreting both DIF indices and significance tests. Implications and recommendations for research and practice are included.

















CHAPTER 1
INTRODUCTION

Statement of the Problem

Differential item functioning (DIF), a statistical indication of item bias, occurs when equally proficient individuals from different subpopulations have different probabilities of answering an item correctly (Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard, Camilli, & Williams, 1984). Historically, researchers studying item bias have addressed two principal concerns. The first concern of researchers has been the development and evaluation of statistical methods for detecting "biased" items. The second concern has been to identify plausible explanations of item bias. In this study, both methodological and substantive educational issues concerning item bias and DIF were addressed.

During the past four decades, a plethora of detection methods has been developed (for a comprehensive review of advances in item bias detection methods over the past ten years, see Millsap & Everson, 1993). DIF methods differ in whether the conditioning variable is formed from an observed conditional score or an unobserved conditional estimate of latent ability, and in whether they can detect nonuniform as well as uniform DIF. Researchers applying methods that use an observed conditional score most commonly use the number-correct sum on the test, or on a subsection of the test, to estimate the ability of each examinee. Researchers using unobserved conditional estimates most frequently apply a unidimensional item response theory (IRT) model for estimating the latent ability of each examinee. Uniform DIF occurs when there is no interaction between ability level and group membership. That is, the probability of answering an item correctly is greater for one group than for the other group uniformly over all ability levels. Nonuniform DIF, detectable by only some methods, occurs when there is an interaction between ability level and group membership. That is, the difference in the probabilities of a correct response for the two groups is not the same at all ability levels. In IRT terms, nonuniform DIF is indicated by "nonparallel" item characteristic curves.
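The distinction between uniform and nonuniform DIF can be made concrete with a simple logistic model of the probability of a correct response: a group main effect produces uniform DIF, and an ability-by-group interaction produces nonuniform DIF. The coefficients below are invented for illustration, not values from this study:

```python
import math

def p_correct(theta, group, b0=-0.5, b1=1.0, b2=0.0, b3=0.0):
    """Logistic model of a correct response: the group main effect (b2)
    carries uniform DIF; the ability-by-group interaction (b3) carries
    nonuniform DIF."""
    z = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-z))

def dif_gap(theta, **coefs):
    """Difference in success probability between groups at a fixed ability."""
    return p_correct(theta, 1, **coefs) - p_correct(theta, 0, **coefs)

# Uniform DIF: the focal group (group=1) is disadvantaged in the same
# direction at every ability level.
uniform = [dif_gap(t, b2=-0.8) for t in (-2, 0, 2)]

# Nonuniform DIF: the sign of the gap reverses across the ability range,
# so the two item characteristic curves cross ("nonparallel" curves).
nonuniform = [dif_gap(t, b3=-0.8) for t in (-2, 0, 2)]
```

Evaluating the gap at low, middle, and high ability shows the uniform case keeping one sign throughout, while the nonuniform case changes sign as the curves cross.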














The Mantel-Haenszel (MH) procedure has emerged as the most widely used procedure (more because of Educational Testing Service's usage than as a result of theoretical consensus), and it is frequently the method to which others are compared (Hambleton & Rogers, 1989; Raju, 1990; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990). The appeal of the MH procedure lies in its simple conceptualization, chi-square test of significance, relative ease of use, and desirable statistical properties (Dorans & Holland, 1993; Millsap & Everson, 1993). Researchers applying MH employ an observed score as the conditioning variable and recognize that MH is sensitive to only uniform DIF.
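The MH statistic conditions on total-score strata and pools the two-by-two odds ratios across them. A minimal sketch with hypothetical counts follows; the rescaling to the delta metric uses the usual -2.35 ln convention, and nothing here is taken from the study's data:

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability strata.
    Each stratum is (A, B, C, D): reference right, reference wrong,
    focal right, focal wrong; alpha = sum(A*D/N) / sum(B*C/N)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def mh_delta(alpha):
    """Delta-metric rescaling, MH D-DIF = -2.35 ln(alpha); negative
    values indicate the item favors the reference group."""
    return -2.35 * math.log(alpha)

# Synthetic counts for three total-score strata (illustration only):
strata = [(40, 60, 30, 70), (60, 40, 50, 50), (80, 20, 70, 30)]
alpha = mh_odds_ratio(strata)
```

With these counts the reference group is more likely to answer correctly at every stratum, so the common odds ratio exceeds one and the delta index is negative.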


Other methods compared with the MH procedure in this study included logistic regression (Swaminathan & Rogers, 1990), IRT-Signed Area (IRT-SA) and IRT-Unsigned Area (IRT-UA) (Raju, 1988, 1990), and the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993a, 1993b). Logistic regression was designed to condition on observed scores and analyze item responses. With logistic regression, users can detect both uniform and nonuniform DIF. IRT-SA and IRT-UA were devised to condition on latent ability estimates and assess the area between an item's characteristic curves for the two groups. IRT-SA was developed to detect only uniform DIF, whereas IRT-UA was developed to detect both uniform and nonuniform DIF. SIBTEST was designed to conceptualize DIF as a multidimensional phenomenon where nuisance determinants adversely influence item responses (Shealy & Stout, 1993a, 1993b). Researchers using SIBTEST apply factor analysis to define a valid subtest and a regression correction procedure to estimate the criterion variable. SIBTEST was developed to detect only uniform DIF.
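The signed and unsigned area indices can be illustrated numerically: the area between two groups' item characteristic curves is accumulated with and without absolute values, so crossing curves (nonuniform DIF) cancel in the signed index but not in the unsigned one. The item parameters below are invented, and the quadrature is a plain Riemann sum rather than Raju's closed-form expressions:

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def area_indices(params_ref, params_foc, lo=-4.0, hi=4.0, n=2000):
    """Signed and unsigned area between two groups' ICCs, approximated
    by a midpoint Riemann sum over the ability range."""
    step = (hi - lo) / n
    signed = unsigned = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * step
        diff = icc(t, *params_ref) - icc(t, *params_foc)
        signed += diff * step
        unsigned += abs(diff) * step
    return signed, unsigned

# Uniform DIF (same discrimination, shifted difficulty): the curves never
# cross, so the signed and unsigned areas agree.
s_u, u_u = area_indices((1.0, 0.0), (1.0, 0.5))

# Nonuniform DIF (different discriminations, same difficulty): the curves
# cross, so cancellation drives the signed area toward zero while the
# unsigned area stays clearly positive.
s_n, u_n = area_indices((1.5, 0.0), (0.7, 0.0))
```

For equal discriminations the signed area equals the difficulty shift, which is why the first pair of indices both come out near 0.5 here; the crossing-curves case is the one a signed index alone would miss.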


In a study designed to assess different indices with data from a curriculum-based, eighth grade mathematics test, Skaggs and Lissitz (1992) found that the consistency between methods was low, and no reasonable explanation for the items manifesting DIF could be hypothesized. They posited that categorizing subpopulations by demographic characteristics such as gender or ethnicity in DIF studies was "not very helpful in conceptualizing cognitive issues and indicated nothing about the reasons for the differences" (p. 239). A number of researchers have suggested the need to explore using subpopulations categorized by psychologically and educationally significant variables that correlate with gender or ethnicity and potentially influence item responses (Skaggs & Lissitz, 1992; Tatsuoka, Linn, K. M. Tatsuoka, & Yamamoto, 1988). Thus, a major concern of the study was the consistency of results from different DIF estimation procedures when subpopulations are conceptualized by psychological or educational variables. Three methods of conceptualizing subpopulations were combined with five fundamentally different state-of-the-art procedures to assess DIF.


The Measurement Context of the Study

The substantive issue of the investigation was gender differences on a sample test containing items similar to those found on advanced college admission quantitative examinations. Generally, men tend to outperform women on the Scholastic Aptitude Test-Math (SAT-M), the American College Testing Assessment Mathematics Usage Test (ACT-M), and the Graduate Record Examination-Quantitative (GRE-Q). However, from a predictive validity perspective, these differences are problematic. For example, men tend to score higher on the SAT-M by a fraction of a standard deviation (National Center for Education Statistics, 1993), although women tend to perform at nearly the same level in college mathematics courses and tend to outperform men in general college courses (Young, 1991, 1994).

One possible explanation for quantitative test score differences between men and women is background experience. Men tend to enroll in more years of mathematics (National Center for Education Statistics, 1993). A second explanation that could potentially explain the differential validity of such tests is test anxiety. Test anxiety relates to examinees' fear of negative evaluation and defensiveness (Hembree, 1988). Women generally report higher levels of test anxiety than men (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988; Wigfield & Eccles, 1989). Thus, for high-stakes tests of mathematical aptitude, mathematics background and test anxiety could influence item responses differentially for each gender.


The Research Problem

In this study, I explored the feasibility of conceptualizing subpopulations by relevant psychological and educational variables, in contrast to the use of traditional demographic variables. The vehicle to achieve this purpose was a released form of the GRE-Q. In this study, DIF was assessed for subpopulations defined by gender, for examinees with substantial and little mathematics background, and for examinees high and low in test anxiety. DIF was assessed using five different measures. The five DIF measures were MH, logistic regression, IRT-SA, IRT-UA, and SIBTEST. The DIF methods were classified into two groups--methods measuring uniform DIF and alternate methods. The uniform methods were MH, IRT-SA, and SIBTEST. Alternate methods, designed to measure both uniform and nonuniform DIF, included logistic regression and IRT-UA. Mantel-Haenszel was placed into both analysis groups because of its widespread use by testing practitioners.

Regarding the study's methodological issues, the results of the five methods of estimating DIF were contrasted within each of three modes of defining subpopulation groups. The observation of interest was the DIF index estimated for each item under a particular combination of subpopulation definition and DIF method. Replications were the items on a released form of the GRE test. For the research questions that follow, trait effects refer to the three subpopulation conceptualizations, and method effects refer to the DIF estimation procedures.












The first four research questions address the consistency of DIF indices between methods when subpopulations are conceptualized using different traits. The uniform methods of MH, IRT-SA, and SIBTEST were combined with the traits of gender, mathematics background, and test anxiety to yield a multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 1 for an illustration of a MTMM matrix with uniform measures.) Similarly, the alternate estimation methods of MH, IRT-UA, and logistic regression were combined with the traits of gender, mathematics background, and test anxiety to yield a second multitrait-multimethod (MTMM) matrix of correlation coefficients. (See Table 2 for an illustration of a MTMM matrix with alternate measures.) Each of the following research questions was addressed twice; each question was answered for the uniform methods and the alternate methods, respectively:

1. Among the three sets of convergent coefficients, often termed monotrait-heteromethod coefficients (e.g., the correlation between the indices obtained from two methods when subpopulations are defined by the trait of gender), will the coefficients based on mathematics background and test anxiety be higher than the coefficients when subpopulations are defined by gender?

2. Will monotrait-heteromethod coefficients be higher than coefficients for different traits measured by the same method (i.e., heterotrait-monomethod coefficients)?

3. Will convergent correlation coefficients be higher than discriminant coefficients measuring different traits by different methods (i.e., heterotrait-heteromethod coefficients)?

4. Will the pattern of correlations among the three traits be similar over the three methods of DIF estimation?

The final research question addressed the consistency of the procedures in identifying aberrant items when subpopulations are conceptualized in different ways. The question was applied twice; it was answered for the uniform methods and the alternate methods, respectively. It was as follows: For standard DIF detection decision rules, what is the percent of agreement for each method about aberrant items when subgroups are based on gender and when subgroups are based on mathematics background?

Table 1
Proposed Multitrait-Multimethod Correlation Matrix: Uniform Indices

                     I. MH-D         II. IRT-SA      III. SIBTEST-b
                     A    B    C     A    B    C     A    B    C
I. MH-D
   A. Gender         --
   B. MathBkd        H-M  --
   C. TA             H-M  H-M  --
II. IRT-SA
   A. Gender         M-H* H-H  H-H   --
   B. MathBkd        H-H  M-H* H-H   H-M  --
   C. TA             H-H  H-H  M-H*  H-M  H-M  --
III. SIBTEST-b
   A. Gender         M-H* H-H  H-H   M-H* H-H  H-H   --
   B. MathBkd        H-H  M-H* H-H   H-H  M-H* H-H   H-M  --
   C. TA             H-H  H-H  M-H*  H-H  H-H  M-H*  H-M  H-M  --

Note. Diagonal entries are reliability coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-M = heterotrait-monomethod coefficients. H-H = heterotrait-heteromethod coefficients.

Table 2
Proposed Multitrait-Multimethod Correlation Matrix: Alternate DIF Indices

                     I. MH-D         II. IRT-UA      III. Log Reg
                     A    B    C     A    B    C     A    B    C
I. MH-D
   A. Gender         --
   B. MathBkd        H-M  --
   C. TA             H-M  H-M  --
II. IRT-UA
   A. Gender         M-H* H-H  H-H   --
   B. MathBkd        H-H  M-H* H-H   H-M  --
   C. TA             H-H  H-H  M-H*  H-M  H-M  --
III. Log Reg
   A. Gender         M-H* H-H  H-H   M-H* H-H  H-H   --
   B. MathBkd        H-H  M-H* H-H   H-H  M-H* H-H   H-M  --
   C. TA             H-H  H-H  M-H*  H-H  H-H  M-H*  H-M  H-M  --

Note. Diagonal entries are reliability coefficients. M-H* = monotrait-heteromethod or the convergent validity coefficients. H-M = heterotrait-monomethod coefficients. H-H = heterotrait-heteromethod coefficients.


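The percent-of-agreement rate in the final research question can be computed directly from each procedure's flagging decisions on the same item set. The flag vectors below are hypothetical, not results from the study:

```python
def percent_agreement(flags_a, flags_b):
    """Percent-of-agreement rate between two DIF detection procedures:
    the share of items that both procedures classify the same way
    (flagged as aberrant, or not flagged)."""
    if len(flags_a) != len(flags_b):
        raise ValueError("procedures must rate the same items")
    matches = sum(a == b for a, b in zip(flags_a, flags_b))
    return 100.0 * matches / len(flags_a)

# Hypothetical flags for a 10-item test under two methods (1 = flagged):
mh_flags  = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
sib_flags = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
rate = percent_agreement(mh_flags, sib_flags)
```

Here the two methods agree on eight of ten items, a rate of 80 percent; note that agreement on non-flagged items counts toward the rate, which inflates it when flagged items are rare.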












Following analysis by the uniform and alternate methods, I conducted a structural analysis of the 30-item quantitative test. Shealy and Stout (1993a, 1993b) stressed that practitioners must carefully identify a valid subset of items prior to conducting DIF analyses. They argued that DIF occurs as a consequence of multidimensionality. Potential DIF occurs when one or more nuisance dimensions interact with the valid dimension of a test (Ackerman, 1992; Camilli, 1992). Messick (1988) stressed the structural component of construct validation. The structural component is concerned with the extent to which items are combined into scores that reflect the structure underlying the latent construct. Loevinger (1957) termed the purity of the internal relationships of a test its structural fidelity, which is appraised by analyzing interitem structure. I employed factor analytic procedures to define a structurally valid subset of unidimensional items and to identify problematic multidimensional items. I hoped to identify items measuring both the intended dimension and nuisance dimensions. After identification of a structurally valid subset, DIF indices were estimated using the five methods with subpopulations defined by gender, mathematics background, and test anxiety. Using indices as the unit of analysis, two MTMM matrices of correlation coefficients were generated--one matrix for uniform methods and one matrix for alternate methods. I applied the five research questions to the MTMM matrices and inferential statistics using the structurally valid items. I contrasted findings of the analyses for the entire test with findings of the analysis for the valid subset of test items.
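The idea of screening for a structurally valid subset can be sketched with a toy inter-item correlation matrix: items that load weakly on the dominant dimension are candidates for the problematic, possibly multidimensional set. This is only an illustration (power iteration on an invented four-item matrix), not the factor analytic procedure actually used in the study:

```python
def principal_loadings(corr, iters=200):
    """Leading eigenvector of an inter-item correlation matrix via power
    iteration; items with weak components on this dominant dimension are
    candidates for the problematic, multidimensional subset."""
    n = len(corr)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Synthetic 4-item correlation matrix: items 0-2 hang together, while
# item 3 barely relates to the rest (a stand-in for a nuisance-driven item).
corr = [
    [1.0, 0.6, 0.6, 0.1],
    [0.6, 1.0, 0.6, 0.1],
    [0.6, 0.6, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
loads = principal_loadings(corr)
flagged = [i for i, l in enumerate(loads) if abs(l) < 0.3]
```

The cutoff of 0.3 is arbitrary here; in practice the flagged items would be examined substantively before being excluded from the valid subset.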


Theoretical Rationale

The process of ensuring that high-stakes tests contain no items that function differentially for specific subpopulations is a fundamental concern of construct validation. Items that contain nuisance determinants correlated with an examinee's subpopulation membership threaten construct interpretations derived from test scores for that subpopulation. Psychometric researchers continue to examine the merits of numerous DIF detection procedures and to explore theoretical explanations of DIF. However, to date, they have failed to reach consensus on methodological issues or to develop substantive explanations of DIF identified with actual test data (Linn, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). These concerns were investigated from both a practical and a theoretical perspective that has been suggested (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988) but rarely tested.

Two significant premises underlie the study. The first premise is that there is nothing inherent in being a female examinee or a member of a specific ethnic group that predisposes an individual to find a particular item troublesome. Educational and psychological phenomena function in unique ways to disadvantage an individual on a specific item. Traditional DIF occurs when these phenomena correlate with the demographic group of interest. Consequently, gender or ethnicity can be interpreted as surrogates for educational or psychological variables that potentially explain DIF's causes. Skaggs and Lissitz (1992) posited that educational and psychological variables that influence item performance and correlate with ethnic or gender groups would be useful for conceptualizing subpopulations. Millsap and Everson (1993) commented that modeling such variables directly could help clarify the causes of DIF.












The two educational and psychological variables in the study that were hypothesized as potentially explaining gender DIF on quantitative test items were mathematics background and test anxiety. Mathematics background was selected because it influences quantitative reasoning and problem solving. Further, in high school and college, men tend to enroll in more mathematics courses and to study mathematics at a more abstract level than women (National Center for Education Statistics, 1993). Researchers assessing overall SAT-M performance have found that gender differences decrease substantially when differences in high school mathematics background are taken into account, although differences in background do not entirely explain score differences (Ethington & Wolfle, 1984; Fennema & Sherman, 1977; Pallas & Alexander, 1983). Quantitative aptitude test scores can be contaminated by familiarity with item context, the application of novel solutions, and the use of partial knowledge to solve complex problems (Kimball, 1989). These types of skills frequently are developed through background experiences.

Test anxiety was selected because it is a well-studied psychological construct (Liebert & Morris, 1967; Tryon, 1980). For individuals possessing high levels of test anxiety, test scores frequently are depressed and construct interpretations become problematic (Everson et al., 1991; Hembree, 1988; Sarason, 1980). Consequently, test anxiety exemplifies a psychological variable that potentially contaminates construct interpretations of scores. For examinees with high levels of test anxiety, tests of mathematical ability tend to induce extreme anxiety (Richardson & Woolfolk, 1980). Female students tend to report higher levels of test anxiety than male students at all grade levels, including college (Everson et al., 1991; Hembree, 1988; Wigfield & Eccles, 1989). Over the past 20 years, several self-reported measures of test anxiety have been developed that demonstrate high reliability and well-defined theoretical properties (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991; Sarason, 1984; Spielberger, Gonzalez, Taylor, Algaze, & Anton, 1978). Researchers have used these self-reported instruments to measure test anxiety and to assess the efficacy of treatment programs (Sarason, 1980; Spielberger et al., 1978; Wine, 1980). For these reasons, in studying gender DIF on college admissions test items, test anxiety was examined as a threat to valid score interpretation, a negative influence on tests of mathematics ability, and a potential source of gender effects.


The second fundamental premise is a tenet underlying educational measurement. Item responses of examinees are products of a set of complex interactions between examinees and items. In part because of this complex interaction, examinees of approximately equivalent abilities who belong to different subpopulations occasionally have different likelihoods of answering a question correctly. This fascinating finding currently is understood only crudely. Before it can be better understood, the different effects of methods of DIF detection, and of means of conceptualizing subpopulations, on item responses must be examined.


Limitations of the Study

A salient limitation of the study was the nature of the performance task. Participants in the study were administered a sample GRE and were told they would have a set number of minutes to complete the test. They were told to perform to the best of their ability and that they would be able to learn their results following testing. Although every effort was made to simulate the conditions of an operational administration, there is no guarantee that participants' performance would accurately reflect their performance on a high-stakes college admissions test. Further, because the participants believed that the examination had low stakes, the level of test anxiety examinees felt while answering the sample GRE-Q would not be equivalent to the level of test anxiety experienced by examinees while answering a college admissions test. Finally, examinees in the study were predominantly undergraduate students taking classes in the colleges of education and business at a large, southern state university. For this reason, although the design, methodology, and analysis were conceived and executed to maximize the generalizability of findings, a degree of caution is recommended in generalizing to other populations or settings.















CHAPTER


REVIEW


OF LITERATURE


The four central aspects of this study were differential item functioning (DIF) methodology, gender differences in mathematical college-level aptitude testing, gender differences in mathematics background, and test anxiety. These four topics constitute the major themes in the organization of the literature review presented in this chapter.


DIF Methodology

A Conceptual Framework for DIF


Tests used for placement in education and for selection in employment require scores that are fair to and representative of individuals. Since the mid-1960s, measurement specialists have been concerned explicitly with the fairness of their instruments and the possibility that some tests may be biased (Cole & Moss, 1989).


Bias studies initially were designed to investigate the assertions that disparities between various subpopulations on cognitive ability test scores were the product of cultural bias inherent in the measures (Angoff, 1993). Test critics who charged that tests were biased assumed that subpopulations had equivalent score distributions on the construct measured and dismissed the possibility that actual differences may exist. Measurement specialists, however, have resolved that mean test score differences do not necessarily indicate bias but may reflect impact (Dorans & Holland, 1993).


Concerns about measurement bias are inherent in validity theory (Cole & Moss, 1989). A test score inference is considered sufficiently valid when various types of evidence justify its usage and eliminate other counterinterpretations (Messick, 1989; Moss, 1992). Bias has been characterized as a source of invalidity that keeps some examinees who possess the trait or knowledge being measured from demonstrating that ability (Shepard, Camilli, & Williams, 1985). Unless score-based inferences are equally valid for relevant subgroups, decisions derived from score inferences will not be fair to individuals. Therefore, measurement bias occurs when score interpretations are differentially valid for subgroups of test takers (Cole & Moss, 1989).


To investigate measurement bias, researchers have examined potential test items as a source of explanation. The supposition is that biased items produce biased measurement. The goals of bias research are to identify and remove biased test items when they are detected (Angoff, 1993) and to provide test developers with guidelines making future construction of biased items less likely (Scheuneman, 1987; Schmitt, Holland, & Dorans, 1993).


Measurement specialists have defined item bias as occurring when individuals from different subpopulations who are equally proficient on the construct measured have different probabilities of successfully answering the item (Angoff, 1993; Linn, Levine, Hastings, & Wardrop, 1981; Scheuneman, 1979; Shepard et al., 1985). Researchers apply statistical methods to equate individuals on the construct, utilizing either observed scores or latent ability scores, and estimate for examinees in each group the probability of a correct response. These methods provide statistical evidence of bias.


When a statistically biased item is identified, it might be interpreted as unfairly disadvantageous to a minority group for cultural and social reasons. On the other hand, the item might be interpreted as measuring an important and understood educational outcome, one unrelated to cultural and social factors but not known equally by the groups. In this latter case, deleting the item for strictly statistical reasons may reduce validity.












Researchers discovered that statistical analyses of item bias raised expectations and created confusion in an already obscure and volatile topic. The term differential item functioning (DIF) gradually replaced item bias as the preferred technical term because of its more neutral research connotations (Angoff, 1993; Dorans & Kulick, 1986).


Holland and Wainer (1993) distinguished between item bias and DIF by stating that item bias refers to "an informed judgment about an item that takes into account the purpose of the test, the relevant experiences of certain subgroups of examinees taking it, and statistical information about the item" (p. xiv). DIF is a "relative term" (p. xiv), a statistical indication of a differential response pattern.


Shealy and Stout (1993a) proposed that the difference between item bias and DIF is "the degree to which the user or researcher has embraced a construct validity argument" (p. 197). Shealy and Stout (1993a, 1993b) conceptualized DIF as a violation of the unidimensional nature of test items. They classified the intended dimension as the target ability and unintended dimensions as nuisance determinants; DIF occurred because nuisance determinants existed in differing degrees among subgroups. Crocker and Algina observed that when, in addition to the intended construct, the distributions of irrelevant sources of variation are different across subgroups, DIF can result. Therefore, DIF can be conceptualized as a consequence of multidimensionality, with differing sources of variation influencing subgroups' item responses.


A Formal Definition of DIF


All DIF detection methods rely on assessment of subgroup response patterns to test items. The subgroups, conceptualized in most studies on the basis of demographic characteristics (e.g., blacks and whites, women and men), form a categorical variable. When two groups are contrasted, the group of interest (e.g., blacks or women) is designated the focal group, and the group serving as the comparison (e.g., whites or men) is designated the reference group. Examinees are matched on a criterion variable, assumed to be a valid representation of the purported construct, and DIF methods assess differential response patterns for individuals of equal ability.


Denote the item score as Y, frequently scored as a dichotomous variable; denote X as the conditioning criterion; and denote G as the categorical variable of group membership. Lack of measurement bias, or of DIF, in an item is defined as

    P(Y = 1 | X, G = R) = P(Y = 1 | X, G = F)

for all values of X for the reference and focal groups. In this definition, Pg(Y = 1 | X) is the conditional probability function for group g at all levels of X (Millsap & Everson, 1993).
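A minimal sketch of this definition, using hypothetical right/total counts at three matched score levels, checks whether the two groups' conditional probabilities of success are equal; all counts below are invented for illustration.

```python
# Illustrative sketch of the formal DIF definition (hypothetical counts).
# An item is free of DIF when P(Y=1 | X=x, G=R) = P(Y=1 | X=x, G=F) for all x.

def conditional_p(correct, total):
    """Proportion correct at one matched score level."""
    return correct / total

# (right, total) counts at each matched score level x, invented for illustration
reference = {0: (10, 40), 1: (30, 60), 2: (45, 50)}
focal     = {0: (5, 20),  1: (15, 30), 2: (18, 20)}

def shows_dif(ref, foc, tol=0.05):
    """Flag the item if the conditional probabilities differ beyond tol at any level."""
    return any(abs(conditional_p(*ref[x]) - conditional_p(*foc[x])) > tol
               for x in ref)

print(shows_dif(reference, focal))  # these hypothetical groups match at every level
```

In practice the comparison is made with a formal statistic rather than a raw tolerance, as the procedures reviewed below illustrate.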


Although all DIF procedures operate from this definition, they differ on the basis of their statistical models and possess various advantages.


DIF procedures can be characterized as models using observed conditional invariance or models utilizing unobserved conditional invariance (Millsap & Everson, 1993). When observed conditional invariance is used, the criterion variable is the sum of the total number of correct responses on the test or on a subset of the test. When unobserved conditional invariance is used, a unidimensional item response theory (IRT) model estimates a θ parameter for each examinee that functions as the criterion variable.


Other differences among detection procedures are their capacity to detect nonuniform DIF, to test statistical significance, and to conceptualize DIF as a consequence of multidimensionality.


Uniform DIF occurs when there is no interaction between group membership and the conditioning criterion regarding the probability of answering an item correctly. In other words, the DIF functions in a uniform fashion across the ability spectrum. Nonuniform DIF refers to an interaction in which an item may favor a subgroup at one end of the ability spectrum and disfavor the same subgroup at the other end of the spectrum.


All DIF procedures are used to estimate an index describing the magnitude of the differential response pattern between the groups on an item. Some procedures provide statistical tests to detect whether the DIF index differs significantly from zero.


Finally, although DIF is perceived as a consequence of multidimensionality, every procedure except Shealy and Stout's Simultaneous Item Bias Test (SIBTEST) functions within a unidimensional framework.


Many DIF detection methods have been developed during the past three decades. In this review, they are categorized as based upon observed conditional invariance or unobserved (latent) conditional invariance. Following the review of DIF detection methods, related issues and problems, research efforts to explain the underlying causes of DIF, and the potential usage of the methods are presented and evaluated.


Methods Based Upon Observed Scores

Angoff and Ford (1973) offered the first widely used DIF detection method, called the delta-plot. The delta-plot procedure was problematic due to its tendency, under conditions of differing ability score distributions, to yield artifactual indications of DIF. Scheuneman's (1979) chi-square index was sensitive to sample size and was not based upon a chi-square sampling distribution; in effect, it was not a chi-square procedure at all (Baker, 1981).


The full chi-square procedure (Bishop, Fienberg, & Holland, 1975) was a valid significance-testing technique but required large sample sizes at each ability level to sustain statistical power. Holland and Thayer (1988) built upon these chi-square techniques when they applied the Mantel and Haenszel (1959) statistic, originally developed for medical research, to the detection of DIF.

The Mantel-Haenszel procedure.


The Mantel-Haenszel (MH) statistic has become the most widely used method of DIF detection (Millsap & Everson, 1993). The MH procedure assesses the item data in a J-by-2-by-2 contingency table. At each score level j, individual item data are presented for the two groups by the two levels of item response, right or wrong (see Table 1).


The null hypothesis of the MH procedure can be expressed as follows: the odds of answering the item correctly at a given ability level j are the same in both groups across all ability levels. The alternative hypothesis is that the two groups do not have an equal probability of answering the item correctly at some level of j.










Table 1

Item Data for Two Groups at Ability Level j

                 Score on Studied Item
Group            1        0        Total
Reference        A_j      B_j      n_Rj
Focal            C_j      D_j      n_Fj
Total            m_1j     m_0j     T_j


The MH statistic uses a constant odds ratio (α) as an index of DIF. The estimate of the constant odds ratio is

    α_MH = [Σ_j (A_j D_j / T_j)] / [Σ_j (B_j C_j / T_j)].

The constant odds ratio ranges in value from zero to infinity. The estimated value is 1.0 under the null condition. It is interpreted as the average factor by which the odds that a reference group examinee will answer the item correctly exceed the odds for a focal group examinee.
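The constant odds ratio can be computed directly from the per-level cell counts; the sketch below uses hypothetical 2x2 tables, one per score level, in the A/B/C/D notation above.

```python
# Sketch of the Mantel-Haenszel constant odds ratio estimate (hypothetical data).
# At each score level j the 2x2 cell counts are:
#   A = reference right, B = reference wrong, C = focal right, D = focal wrong.

def mh_alpha(tables):
    """alpha_MH = sum_j(A_j * D_j / T_j) / sum_j(B_j * C_j / T_j)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

tables = [(30, 10, 25, 15), (50, 20, 40, 30), (20, 5, 15, 10)]
alpha = mh_alpha(tables)
print(alpha)  # values above 1.0 favor the reference group
```

With balanced cells at every level the estimate is exactly 1.0, the null value described in the text.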










The estimated value α_MH is frequently transformed to a more easily interpreted metric via

    MH D-DIF = -2.35 ln[α_MH].

Positive values of MH D-DIF favor the focal group, whereas negative values favor the reference group. The chi-square test of significance presented by Holland and Thayer (1988) is

    MH χ² = [ |Σ_j A_j - Σ_j E(A_j)| - .5 ]² / Σ_j Var(A_j),

where

    E(A_j) = n_Rj m_1j / T_j

and

    Var(A_j) = n_Rj n_Fj m_1j m_0j / [T_j² (T_j - 1)].

The MH chi-square is distributed approximately as a chi-square with one degree of freedom.










The advantages of the MH procedure are its computational simplicity (Holland & Thayer, 1988), its statistical test of significance, and its lack of sensitivity to subgroup differences in the distribution of ability (Donoghue, Holland, & Thayer, 1993; Shealy & Stout, 1993a; Swaminathan & Rogers, 1990).


The most frequently cited disadvantage is the procedure's lack of power to detect nonuniform DIF (Swaminathan & Rogers, 1990). It is further limited by its unidimensional conception and by the assumption that the total test score provides a meaningful measure of the construct purported to be estimated.


The standardization procedure. The standardization procedure (Dorans & Kulick, 1986) is based upon the nonparametric regression of item scores on test scores for the two groups.


Let E_R(Y|X) define the expected item-test nonparametric regression for the reference group, and let E_F(Y|X) define the expected item-test nonparametric regression for the focal group, where Y is the item score and X is the test score. DIF analysis proceeds at the individual score level j. The statistic D_j is the fundamental measure of DIF at score level j:

    D_j = E_Fj - E_Rj.

Nonzero values of D_j reflect group differences in item performance that cannot be explained by differences in the attribute being tested.


The standardization procedure derived its name from the standardization group that functions to supply a set of weights, one at each ability level, that will be used to weight each individual D_j. The standardized p-difference (STD P-DIF) is

    STD P-DIF = Σ_j W_j (E_Fj - E_Rj) / Σ_j W_j.


The

The


essence

specific


of standard


weight


zation


implemented


the


weighting


function.


tandardization


depends


upon


nature


study


(Doran


& Kulick,


1986)


Plaus


ible


options


of weighting


include


number


examinees


the


total


group


at each


level


of j


, the


number












In most applications the focal group weights are used; thus, STD P-DIF is the difference between the focal group's observed performance on an item and its expected performance (Dorans & Kulick, 1986). The standardization procedure also contains a significance test.
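A sketch of STD P-DIF with focal-group weights follows; the level proportions and focal-group counts are hypothetical.

```python
# Sketch of STD P-DIF with focal-group weights (hypothetical values).
# e_focal[j], e_ref[j]: focal and reference proportions correct at score level j;
# weights[j]: number of focal-group examinees at level j.

def std_p_dif(e_focal, e_ref, weights):
    """STD P-DIF = sum_j w_j * (E_Fj - E_Rj) / sum_j w_j."""
    num = sum(w * (f - r) for f, r, w in zip(e_focal, e_ref, weights))
    return num / sum(weights)

e_focal = [0.20, 0.45, 0.70, 0.90]
e_ref   = [0.25, 0.50, 0.80, 0.95]
weights = [40, 60, 60, 40]          # focal-group counts per level

print(std_p_dif(e_focal, e_ref, weights))  # negative: the item disadvantages the focal group
```

Because the index is a weighted average of the D_j values, it stays on the proportion-correct metric and is easy to interpret directly.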


The standard error using focal group weighting is

    SE(STD P-DIF) = sqrt[ P_F (1 - P_F) / N_F + Var(P_F*) ],

where P_F is the proportion of focal group members correctly answering the item, P_F* can be thought of as the performance of focal group members predicted from the reference group's item-test regression curve, and

    Var(P_F*) = Σ_j (N_Fj / N_F)² P_Rj (1 - P_Rj) / N_Rj.


The standardization procedure is a flexible method for investigating DIF (Dorans & Holland, 1993), and it has been applied to assessing the differential functioning of distractors (Dorans, Schmitt, & Bleistein, 1992) and the differential effect of speededness (Schmitt & Dorans, 1990). DIF findings from the standardization procedure will be in close agreement with those from the MH procedure (Millsap & Everson, 1993).












The limitations of the standardization method are much the same as those of the MH procedure. The most commonly cited deficiency of both methods is their inability to detect nonuniform DIF.


Donoghue et al. (1993) determined that both methods require an adequate number of items in the conditioning score, that the studied item should be included in the conditioning score, and that extreme ranges of item difficulty in the conditioning score can adversely influence DIF estimation. Linn (1993) observed that DIF estimates using these procedures appear to be confounded with item discrimination.


Logistic regression model. Swaminathan and Rogers (1990) applied logistic regression to DIF analysis. Logistic regression models, unlike least squares regression, permit categorical variables as dependent variables. Thus, logistic regression permits the analysis of dichotomously scored item data. It has additional flexibility, including the analysis of the interaction between group and ability, as well as allowing the inclusion of other categorical and continuous independent variables in the model.


A fundamental concept in analysis with linear models is assessment of the consistency between a model and a set of data (Darlington, 1990). Consistency between the model and the data is measured by the likelihood. Each examinee will have a probability between 0 and 1 of answering an item correctly, and by the multiplicative law of independent probabilities, an overall probability of a group of examinees answering in a specific pattern can be estimated.


For example, if the probability of each of four individuals answering an item correctly is 0.9, and three of the subjects answer the item correctly, the overall probability of this pattern occurring is 0.9 × 0.9 × 0.9 × (1 - 0.9), or 0.0729.
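The arithmetic of the worked example can be checked directly; the helper below multiplies p for each correct response and 1 - p for each incorrect one.

```python
def pattern_probability(p, responses):
    """Multiply p for each correct (1) response and 1 - p for each incorrect (0)."""
    prob = 1.0
    for u in responses:
        prob *= p if u == 1 else 1 - p
    return prob

# Four examinees, each with p = 0.9; three answer correctly and one does not.
print(pattern_probability(0.9, [1, 1, 1, 0]))  # 0.0729, as in the text
```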


Therefore, for an item, the likelihood function for a set of N examinee responses, each examinee n with ability level θ_n, is determined by

    L(Data | θ) = Π_{n=1}^{N} P(u_n | θ_n)^{u_n} [1 - P(u_n | θ_n)]^{1 - u_n},

where u_n has a value of 1 for a correct response and a value of 0 for an incorrect response.


The logistic regression model for predicting the probability of a correct answer is

    P(u = 1 | θ) = exp(β0 + β1θ) / [1 + exp(β0 + β1θ)],

where u is the response to the item given ability level θ, β0 is the intercept parameter, and β1 is the slope parameter.


To detect DIF, the model is extended to

    P(u = 1 | θ, g) = exp(β0 + β1θ + β2g + β3θg) / [1 + exp(β0 + β1θ + β2g + β3θg)],

where β2 is the estimate of the uniform difference between the groups and β3 is the estimated interaction between group and ability.


If only β0 and β1 deviate from zero, the item is interpreted as containing no DIF. If β2 does not equal zero and β3 equals zero, uniform DIF is indicated. If β3 does not equal zero, nonuniform DIF is inferred.
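The model and the interpretation rules can be sketched as follows; the coefficients passed in are hypothetical illustrations, not fitted values.

```python
import math

def p_correct(theta, g, b0, b1, b2, b3):
    """P(u=1 | theta, g) with z = b0 + b1*theta + b2*g + b3*theta*g; g is 0/1."""
    z = b0 + b1 * theta + b2 * g + b3 * theta * g
    return math.exp(z) / (1 + math.exp(z))

def classify(b2, b3, tol=1e-8):
    """Interpretation rule from the text, applied to the group coefficients."""
    if abs(b3) > tol:
        return "nonuniform DIF"
    if abs(b2) > tol:
        return "uniform DIF"
    return "no DIF"

print(classify(0.0, 0.0))   # no DIF
print(classify(-0.5, 0.0))  # uniform DIF
print(classify(0.2, 0.4))   # nonuniform DIF
```

In an actual analysis the coefficients would be estimated by maximum likelihood and the classification based on their significance tests rather than on raw magnitudes.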


Estimation of the parameters is carried out for each item using a maximum likelihood procedure. The two null hypotheses can be tested jointly with

    χ² = (Cβ)'(C Σ C')⁻¹ (Cβ),

where C is the contrast matrix specifying the hypotheses and Σ is the estimated covariance matrix of the parameter estimates β. The test has a chi-square distribution with two degrees of freedom.












The logistic regression procedure offers a powerful approach for testing the presence of both uniform and nonuniform DIF. In a simulation with varying numbers of examinees per group and varying numbers of test items serving as the criterion, Swaminathan and Rogers (1990) concluded under conditions of uniform DIF that the logistic regression procedure had power similar to the MH procedure and controlled Type I errors almost as well. The logistic regression procedure had effective power in identifying nonuniform DIF, whereas the MH procedure was virtually powerless to do so.


In demonstrating the ineffectiveness of the MH procedure in detecting nonuniform DIF, Swaminathan and Rogers (1990) simulated data keeping item difficulties equal and varying the discrimination parameter. In effect, they simulated nonuniform symmetrical DIF. Their simulation created a set of conditions where theoretically the MH procedure has no power. Researchers must ask whether such symmetrical interactions occur with actual test data.


Millsap and Everson (1993) commented that Swaminathan and Rogers (1990) utilized large numbers of items, and they conjectured that in cases with a small number of homogeneous items forming the criterion variable, false positive rates would increase unacceptably above nominal levels.












The logistic regression procedure, in which ability is observed through the test score, although developed from a unidimensional perspective, provides a flexible model that can incorporate a diversity of independent categorical and continuous variables.


Millsap and Everson (1993) observed that the procedure "allows inclusion of curvilinear terms or other factors--such as examinee characteristics like test anxiety or instructional opportunity--that may be relevant factors in exploring possible causes of DIF" (p. 306).


Methods Based Upon Latent Ability Estimation

DIF detection methods have also been developed through various models conditioning on latent ability. IRT approaches describe the relationship between individual item responses and the construct measured by the test or subtest. When applied to DIF analyses, IRT permits the use of estimates of true ability as the criterion measure, as opposed to the more subjective observed scores. Despite their theoretical appeal, IRT approaches possess the inherent disadvantages of requiring large sample sizes, being computationally complex and costly, and including the stringent assumption of unidimensionality (Oshima, 1989).


The most widely used IRT models are the Rasch model, or one-parameter model, the two-parameter logistic model (2PL), and the three-parameter logistic model (3PL).










When no items contributing to the ability score, except possibly the studied item, contain DIF, the MH procedure provides a DIF index proportional to the index estimated by the Rasch model. Therefore, methods based upon the Rasch model will not be reviewed; the more complex 2PL and 3PL models will be reviewed regarding their potential.


The central components of IRT models are an unobserved latent trait estimate, termed θ, and a trace line for each item response, often termed the item characteristic curve (ICC). The ICC will take a specified monotonically increasing function.


In the 2PL model, the probability of a correct response to item i as a function of θ is

    P_i(θ) = exp[D a_i (θ - b_i)] / (1 + exp[D a_i (θ - b_i)]),

where the item parameters a_i and b_i are item discrimination and difficulty, respectively, and D is a constant used to convert the logistic scale into an approximate probit scale (Hambleton & Swaminathan, 1985).


In the 3PL model, the probability of a correct response is

    P(u_i = 1 | θ) = c_i + (1 - c_i) exp[D a_i (θ - b_i)] / (1 + exp[D a_i (θ - b_i)]),

where c_i is the pseudo-chance parameter.
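The two curves can be sketched directly; the scaling constant D is taken here as 1.7, the value conventionally used to approximate the normal ogive, and the item parameters are hypothetical.

```python
import math

D = 1.7  # conventional logistic-to-probit scaling constant

def icc_2pl(theta, a, b):
    """2PL ICC: exp(D*a*(theta - b)) / (1 + exp(D*a*(theta - b)))."""
    z = D * a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def icc_3pl(theta, a, b, c):
    """3PL ICC: c + (1 - c) * 2PL(theta); c is the lower asymptote."""
    return c + (1 - c) * icc_2pl(theta, a, b)

# At theta = b, the 2PL gives 0.5 and the 3PL gives c + (1 - c)/2.
print(icc_2pl(0.0, 1.0, 0.0))       # 0.5
print(icc_3pl(0.0, 1.0, 0.0, 0.2))  # 0.6
```

Both functions are monotonically increasing in θ, which is the trace-line property noted above.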










The general IRT DIF procedure includes (a) combining both groups and estimating item parameters using a 3PL model, utilizing either a maximum likelihood or a Bayesian procedure; (b) fixing the c_i parameter for all items; (c) dividing examinees into reference and focal group members and estimating the a_i and b_i parameters within each group; (d) equating parameters from the focal group scale to the reference group scale, or vice versa; (e) calculating the DIF index and significance test; and (f) utilizing a purification procedure (Lord, 1980; Park & Lautenschlager, 1990) to further examine and enhance the analysis.


Purification procedures, which extract potential DIF items and reestimate ability level without the potential DIF items included, will not be elaborated upon.


DIF indices and statistical tests based upon latent ability proceed by either analyzing the differences between the groups' item parameters (a_i, b_i) or analyzing the area between the groups' ICCs.

Lord's chi-square and IRT-LR. Lord's (1980) chi-square and the IRT-Likelihood Ratio (IRT-LR) procedures simultaneously test the dual hypothesis a_R = a_F and b_R = b_F.


Because pseudo-chance (c) parameter standard errors are not accurately estimated in separate groups (Kim, Cohen, & Kim, 1994), the c parameter is not usually tested with either procedure.












When sample sizes are large enough to effectively assume an infinite number of degrees of freedom, the test becomes

    z = (b_R - b_F) / sqrt[ var(b_R) + var(b_F) ].

Alternately, z² will be distributed as a chi-square statistic with one degree of freedom (Thissen, Steinberg, & Wainer, 1988).
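The difficulty-only test is a one-line computation; the difficulty estimates and error variances below are hypothetical.

```python
import math

def z_statistic(b_ref, b_focal, var_ref, var_focal):
    """z = (b_R - b_F) / sqrt(var(b_R) + var(b_F)); z**2 is chi-square, 1 df."""
    return (b_ref - b_focal) / math.sqrt(var_ref + var_focal)

# Hypothetical difficulty estimates and estimated error variances
z = z_statistic(b_ref=0.10, b_focal=-0.40, var_ref=0.02, var_focal=0.03)
print(z, z ** 2)
```

A z value this far from zero (z² = 5.0 here) would exceed the usual critical value for one degree of freedom.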


A simultaneous test of the discrimination and difficulty parameters is based upon the Mahalanobis distance between the parameter vectors of the two groups. The test statistic becomes

    χ² = v' Σ⁻¹ v,

in which v is the vector of differences between the parameter estimates (a_R - a_F, b_R - b_F) and Σ is the estimated covariance matrix. The test is distributed as a chi-square with two degrees of freedom.
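For a two-parameter vector the quadratic form can be computed with the explicit 2x2 inverse; the difference vector and covariance matrix below are hypothetical.

```python
# Sketch of the joint chi-square on (a, b) differences (hypothetical values).
# chi2 = v' S^{-1} v, with v = (a_R - a_F, b_R - b_F) and S the covariance matrix.

def chi_square_2df(v, S):
    """Quadratic form v' S^{-1} v for a 2x2 covariance matrix S."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    w = (inv[0][0] * v[0] + inv[0][1] * v[1],
         inv[1][0] * v[0] + inv[1][1] * v[1])
    return v[0] * w[0] + v[1] * w[1]

v = (0.30, -0.25)                  # hypothetical parameter differences
S = ((0.04, 0.00), (0.00, 0.05))   # hypothetical covariance matrix
print(chi_square_2df(v, S))        # compare to a chi-square with 2 df
```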


The same hypothesis tested by Lord's chi-square can be tested with the IRT-LR (Thissen, Steinberg, & Wainer, 1993). The null hypothesis with IRT-LR is tested through three steps. First, a model is fitted simultaneously to both groups' data, with a set of valid "anchor" items, containing no DIF, constrained to be equal across groups and the studied item free of equality constraints on its b's or a's.


The model is assessed by the maximum likelihood statistic -2(loglikelihood). Second, the model is refitted under the constraint that the b and a parameters are equal for both groups, again yielding a value of -2(loglikelihood). Third, the likelihood ratio test of significance is the difference between the two models' -2(loglikelihood) values; the likelihood ratio test assesses the significant improvement in model fit as a consequence of allowing the two parameters to fluctuate.


If the likelihood ratio is significant, DIF--that is, a b parameter or an a parameter that is different between the groups--is detected. In this example, simultaneously testing differences in both parameters, the test statistic is distributed as a chi-square with two degrees of freedom. In a situation testing the significance of only item difficulty, the statistic would be distributed as a chi-square with one degree of freedom.
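The three steps reduce to a subtraction and a tail-probability lookup; the -2(loglikelihood) values below are hypothetical, and the two-degree-of-freedom chi-square survival function has the closed form exp(-x/2).

```python
import math

def likelihood_ratio(neg2ll_constrained, neg2ll_free):
    """G2 = difference in -2(loglikelihood) between constrained and free models."""
    return neg2ll_constrained - neg2ll_free

def p_value_2df(g2):
    """Chi-square survival function with 2 df: exp(-x/2)."""
    return math.exp(-g2 / 2)

# Hypothetical fit statistics for the constrained and free models
g2 = likelihood_ratio(neg2ll_constrained=10452.6, neg2ll_free=10443.2)
print(g2, p_value_2df(g2))
```

A G² this large relative to two degrees of freedom would lead to rejecting the equality constraints and flagging the studied item.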










Lord's chi-square requires second-derivative approximations of the standard errors of the maximum likelihood item parameter estimates as a part of estimation. The IRT-LR procedure does not require estimated error variances and covariances; its test results from computing the likelihood of the overall model under the equality constraints placed upon the data and then estimating the probability under the null hypothesis (Thissen et al., 1988).


Lord's chi-square and IRT-LR are capable of detecting nonuniform DIF, and they possess good statistical power (Cohen & Kim, 1993). Because they require large sample sizes, they tend to be expensive, and they yield false positive rates above the nominal levels (Kim et al., 1994).


Linn et al. (1981), with simulated data, and Shepard, Camilli, and Williams (1984), with actual data, demonstrated that significant differences detected by Lord's chi-square occurred even when the plotted ICCs were nearly identical.


An additional problem when employing IRT-LR is the need for a set of truly unbiased anchor items (Millsap & Everson, 1993).


Procedures for estimating the area between ICCs. Eight different DIF procedures have been developed to estimate the area between the reference group's and the focal group's ICCs. They differ according to (a) signed or unsigned differences, (b) a bounded or unbounded interval, (c) continuous integration or discrete approximation, and (d) weighting (Millsap & Everson, 1993).


The first area procedures utilized a bounded interval with discrete approximations. Rudner (1977) suggested the unsigned index

    UA = Σ_θ | P_R(θ) - P_F(θ) | Δθ,

with a discrete interval from θ = -3 to θ = +3. Rudner (1977) used small interval distances (e.g., .005) summed across the interval. The estimated index is converted to a signed index by removing the absolute value operator.
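Rudner's discrete approximation is a direct summation; the sketch below uses hypothetical 2PL item parameters for the two groups.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic ICC with D = 1.7."""
    z = 1.7 * a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def unsigned_area(a_ref, b_ref, a_foc, b_foc, step=0.005):
    """Sum |P_R(theta) - P_F(theta)| * step over theta in [-3, 3]."""
    n = int(6 / step)
    return sum(abs(icc(-3 + k * step, a_ref, b_ref) -
                   icc(-3 + k * step, a_foc, b_foc)) * step
               for k in range(n + 1))

# Equal discriminations, difficulties shifted by 0.5 (hypothetical values)
print(unsigned_area(1.0, 0.0, 1.0, 0.5))
```

With equal a parameters the bounded estimate approaches the difficulty difference (here 0.5), shrinking slightly because the interval is truncated at ±3.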


Shepard et al. (1984) extended the area procedures by introducing four signed and unsigned techniques that included sums of squared values of P_R(θ) - P_F(θ), weights based upon the number of examinees in each interval along the θ scale, and weighting of the initial differences by the inverse of the estimated standard error of the difference. They determined that distinctively different interpretations occurred when signed area indices were estimated as compared to unsigned indices. They further found that the various weighting procedures influenced item interpretations only slightly.










All of the area indices proposed by Shepard et al. (1984) utilized discrete approximations and lacked standard errors to permit significance tests. Raju (1988, 1990) augmented these procedures by devising an index measured by continuous integration over an unbounded interval and derived standard errors permitting significance tests. Raju (1988) proposed setting the pseudo-chance parameter c equal for both groups and estimating the signed area between the groups' ICCs as

    SA = (1 - c)(b_F - b_R),

and the unsigned area as

    UA = (1 - c) | [2(a_F - a_R) / (D a_F a_R)] ln{1 + exp[D a_F a_R (b_F - b_R) / (a_F - a_R)]} - (b_F - b_R) |.
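The closed forms can be sketched as below, under the common-c assumption stated above; the item parameters are hypothetical, and the equal-slope case reduces to the absolute difficulty difference.

```python
import math

D = 1.7  # conventional logistic scaling constant

def signed_area(b_ref, b_foc, c=0.0):
    """SA = (1 - c) * (b_F - b_R)."""
    return (1 - c) * (b_foc - b_ref)

def unsigned_area(a_ref, b_ref, a_foc, b_foc, c=0.0):
    """UA = (1-c)*|2(a_F-a_R)/(D a_F a_R) * ln(1 + exp[D a_F a_R (b_F-b_R)/(a_F-a_R)]) - (b_F-b_R)|."""
    if a_ref == a_foc:  # equal slopes: the area reduces to |SA|
        return abs(signed_area(b_ref, b_foc, c))
    k = D * a_foc * a_ref
    term = 2 * (a_foc - a_ref) / k * math.log(
        1 + math.exp(k * (b_foc - b_ref) / (a_foc - a_ref)))
    return (1 - c) * abs(term - (b_foc - b_ref))

print(signed_area(0.0, 0.5, c=0.2))          # hypothetical b shift of 0.5
print(unsigned_area(1.0, 0.0, 1.2, 0.5, c=0.2))
```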


Raju (1990) derived asymptotic standard error formulas for the signed and unsigned area measures that can be used to generate tests to determine the significance of DIF under conditions of normality.


Theoretically, Raju's procedure for measuring and testing the significance of the area between the ICCs of the two groups utilizing a z score is a significant advancement over the earlier bounded-interval procedures, and it is readily interpreted.










Raju et al. (1993), analyzing data from a 45-item vocabulary trial test contrasting girls and boys and black and white students, found that significance tests of the area measures identified the identical aberrant items as Lord's chi-square. Raju et al. (1993) set the alpha rate at 0.001 to control Type I errors. Cohen and Kim (1993), comparing the two procedures, found that Lord's chi-square produced results similar to Raju's SA and UA, although Lord's chi-square appeared slightly more powerful in identifying simulated DIF.


DIF as a Consequence of Multidimensionality


In all of the procedures reviewed thus far, researchers have conditioned an item response on either an observed test score or a latent ability estimate. Procedures using observed scores assumed that the total score has valid meaning in terms of the purported construct measured. IRT procedures assumed that responses to a set of items are unidimensional even though examinees' scores may reflect composite abilities. The potential for DIF can be conceptualized as occurring when a test consists of one targeted ability but item responses are also influenced by one or more nuisance determinants (Shealy & Stout, 1993a, 1993b).


Under this circumstance, an item may be misinterpreted as biased when the groups' target ability means are not equal, when their nuisance determinant means are not equal, when the ratios of the target to nuisance standard deviations are not equal, or when the correlations between the valid and nuisance dimensions are not equal (Ackerman, 1992).


The presence of multidimensionality in a set of items does not necessarily lead to DIF.


For example, a quantitative ability test used to predict future college achievement may contain mathematical word problems requiring reading proficiency skills. The test contains one primary dimension--quantitative ability; however, a second requisite skill--reading ability--is also measured and is valid for the specific usage.


A unidimensional analysis applied to such multidimensional data would weight the relative discrimination of the multiple traits to form a reference composite (Ackerman, 1992; Camilli, 1992). Unless the focal and reference groups share a common reference composite, valid comparison is not possible.


Since


any


test


containing


two


or more


items


will


degree


multidimens


ional,


practitioners


should


define


validity


sector


approximately


to identify


the


same


test


composite


items

of ab


measuring


,ilitie


(Ackerman,


1992).


In DIF


studies,


conditioning


variable


should


consist


only


items


means


urging


the


same


compo


site


- a a


a1 1


1


-










composites


problem


of ability.


trying


This


compare


creates,


apple


in essence,


oranges.


the

The


potential


effect


this


to confound


DIF


with


impact


resulting


in spurious


interpretations


(Camilli,


1992).


The effect of multidimensionality on DIF analyses has resulted in limited consistency across methods (Skaggs & Lissitz, 1992) and across differing definitions of the conditioning variable (Clauser, Mazor, & Hambleton, 1991). Further, Linn (1993) observed that rigorous implementation to identify a proper set of test items may restrict validity. For example, on the SAT-Verbal (SAT-V), items with large biserial correlations with total score were more likely to be flagged than items with average or below average biserial correlations. This finding suggested that traditional DIF analyses, using unidimensional procedures, might in part be statistical artifacts confounding group ability differences with item discrimination. Differential item functioning procedures based upon a multidimensional perspective, conditioning on items clearly defined from a validity sector, have the potential to reduce these problems (Ackerman, 1992). A multidimensional approach should also facilitate careful evaluation and explanation of DIF (Camilli, 1992).










SIBTEST. Shealy and Stout (1993a, 1993b) have formulated a DIF detection procedure within a multidimensional conceptualization. They conceptualize the test as measuring a unidimensional target ability--a target trait or reference composite--influenced periodically by nuisance determinants. DIF is interpreted as the consequence of the differential effect of nuisance determinants functioning on an item or set of items. The SIBTEST procedure employs factor analysis to identify a set of items that adheres to a defined validity sector. These items constitute the valid subtest, and the remaining items become the studied items. Examinees are divided into strata based upon the valid subtest score, and the DIF index is estimated as a weighted sum, across strata, of the differences between the reference and focal group proportions correct on the studied item, the weighting being the pooled proportion of focal and reference group examinees who achieve each valid subtest score. The value is identical to the value of P-DIF when the total number of examinees is used as the weighting group. Shealy and Stout (1993a) have referred to the standardization procedure as a "progenitor" (p. 161) of SIBTEST. They present an accompanying standard error estimate based on the within-stratum binomial variances, of the form

SE(B) = [ SUM_k w_k^2 ( p_Rk(1 - p_Rk)/N_Rk + p_Fk(1 - p_Fk)/N_Fk ) ]^(1/2),

where p_Rk and p_Fk are the reference and focal group proportions correct on the studied item in stratum k, and N_Rk and N_Fk are the corresponding group sample sizes.


With SIBTEST, the total score on the valid subtest serves as the conditioning criterion. The SIBTEST procedure thus resembles the methods in which an observed test score is the criterion, although it incorporates an adjustment of the item means prior to comparing the groups on these means. This adjustment is an attempt to remove that portion of the group mean difference attributable to group mean differences on the valid targeted ability.


When the matching criterion is an observed score and the studied item is included in the criterion score, group differences in target ability will statistically tend to inflate the index. Consequently, SIBTEST employs a correctional procedure based upon regression theory. In effect, the purpose is to transform each observed mean score, at each group ability level, into a transformed mean score, so that the transformed score removes that portion of the group mean differences attributable to group differences on the underlying targeted ability. This adjustment attempts to estimate the difference in valid subtest true scores for reference and focal group examinees matched on ability level. For this transformation to yield an unbiased estimate, the valid subtest must contain a minimum of 20 items (Shealy & Stout, 1993a).
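The uncorrected, standardization-style index that SIBTEST builds upon can be sketched briefly in code. The sketch below uses invented data and variable names; it computes only the simple weighted difference of proportions correct across valid-subtest strata and omits the regression-based true-score correction that distinguishes SIBTEST proper.

```python
# Sketch of the uncorrected, standardization-style DIF index underlying
# SIBTEST (hypothetical data; SIBTEST's true-score correction is omitted).

def weighted_dif_index(ref_scores, foc_scores, ref_item, foc_item, n_strata):
    """ref_scores/foc_scores: valid-subtest scores (0 .. n_strata-1 here);
    ref_item/foc_item: 0/1 responses to the studied item.
    Returns sum_k w_k * (pbar_Rk - pbar_Fk), where w_k is the pooled
    proportion of examinees falling in stratum k."""
    total = len(ref_scores) + len(foc_scores)
    index = 0.0
    for k in range(n_strata):
        r = [i for s, i in zip(ref_scores, ref_item) if s == k]
        f = [i for s, i in zip(foc_scores, foc_item) if s == k]
        if not r or not f:          # skip strata lacking one of the groups
            continue
        w_k = (len(r) + len(f)) / total
        index += w_k * (sum(r) / len(r) - sum(f) / len(f))
    return index

# Tiny invented example: two strata; the reference group answers the
# studied item correctly more often within each stratum.
ref_scores = [0, 0, 1, 1]
foc_scores = [0, 0, 1, 1]
ref_item   = [1, 0, 1, 1]
foc_item   = [0, 0, 1, 0]
print(weighted_dif_index(ref_scores, foc_scores, ref_item, foc_item, 2))
```

A positive value indicates the studied item favors the reference group after matching on the valid-subtest score; weighting by the pooled stratum proportions is what makes the index coincide with P-DIF under total-group weighting.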


SIBTEST is the only procedure based on conceptualizing DIF as a result of multidimensionality. Although it resembles the procedures that condition on observed scores, it offers a regression correction that conditions on estimated true scores. Under simulated conditions, the procedure demonstrates good adherence to nominal error rates even when group target ability distribution differences are extreme, and it has been shown to be as powerful as MH in the detection of uniform DIF (Shealy & Stout, 1993a). The multidimensional conceptualization of potential nuisance determinants can lead to the identification and greater understanding of different DIF causes (Camilli, 1992). The major weaknesses of SIBTEST are its inability to assist the user in detecting nonuniform DIF and its need for 20 or more items that fit a unidimensional validity sector. With a relatively short test or subtest, this latter weakness would be problematic under some practical testing conditions.










Methods Summary

After years of development, a plethora of sophisticated DIF procedures has been devised. Each method approaches DIF identification from a fundamentally different perspective, and each method contains advantages and limitations. Currently, no consensus exists among DIF researchers regarding a single theoretical or practical best method. The design of this study reflected this lack of consensus. Five different procedures, each possessing selected theoretical or practical appeal, were used to assess the item responses of examinees. The design of the study was not to compare the reliability and validity of the methods themselves, but to assess the similarity of results obtained from the methods when subpopulations were defined in conceptually different ways.


Uncovering the Underlying Causes of DIF


The overwhelming majority of DIF researchers have focused on designing statistical procedures and evaluating their efficacy in detecting aberrant items. Few researchers have attempted to move beyond methodological issues to examine DIF's causes. The researchers broaching this topic have experienced few successes and many frustrations. Schmitt et al. (1993) proposed that explanatory DIF studies can be classified into the categories of (a) post hoc speculation, (b) hypothesis testing using item categories, and (c) hypothesis testing using item manipulations. DIF can be attributed to a complex interaction of other variables between the item and the examinee (Scheuneman & Gerritz, 1990).


Researchers are unlikely to find a single identifiable cause of DIF since it stems from both differences within examinees and item characteristics (Scheuneman, 1987). Researchers examining DIF from the perspective of examinee differences may uncover significant findings with implications for test-takers, educators, and policy makers. Scheuneman and Gerritz (1990) suggested that "prior learning, experience, and interest patterns between males and females and between Black and White examinees may be linked with DIF" (p. 129).


Researchers examining DIF from the perspective of item characteristics may discover findings with strong implications for test developers and item writers. Test developers may need to balance item content and item format to ensure fairness.


Post hoc evaluations, despite their limitations, dominate the literature (Freedle & Kostin, 1990; Linn & Harnisch, 1981; O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992). Speculations about the causes of DIF arising from such evaluations, however, have yielded few confirmed explanations (O'Neill & McPeek, 1993; Shepard et al., 1984; Skaggs & Lissitz, 1992).


Hypothesis testing with item categories is a second, more sophisticated, means of uncovering explanations of DIF. Doolittle and Cleary (1987) and Harris and Carlton (1993) evaluated several DIF hypotheses on math test items.


Doolittle and Cleary (1987) employed a DIF detection procedure to analyze ACT Assessment Mathematics Usage Test (ACT-M) items for differences across item categories and test forms. Male examinees performed better on geometry and mathematical reasoning items, whereas female examinees performed better on computation items. Harris and Carlton (1993), using SAT-Mathematics (SAT-M) items and the MH procedure, concluded that male examinees did better on application problems and female examinees did better on more textbook-type problems.


Scheuneman


concerning


(1987)


potential


analyzed


separate


causes


black


hypotheses


white


examinees


manipulating


test


items


on the


experimental


portion


the


general


test


The


hypoth


eses


, analyzed


through


linear


models


included


examinee


character


, such


test


seness,


and


item


character


such


format.


Complex


interactions










earlier


post


review.


Schmitt (1988) employed the STD P-DIF index with ANOVA and found that Hispanic examinees were favored on antonym items that included a common English and Spanish root (a true cognate word) and on reading passages containing material of interest to Hispanics. False cognates, words spelled similarly in both languages but with different meanings, and homographs, words spelled alike in English and Spanish but having different meanings, tended to be more difficult for Hispanics. The differences generally were greater for Puerto Rican examinees, a group more dependent on Spanish, as compared to Mexican-American examinees.


Tatsuoka, Linn, Tatsuoka, and Yamamoto (1988) studied DIF on a 40-item fractions test. They initially analyzed examinees by dividing them into groups based upon instructional methods. This procedure failed to provide an effective means of detecting DIF. However, upon subsequent review and analysis, they divided examinees into groups based upon the solution strategies used in solving the problems. With this grouping variable, they found DIF indices consistent with their a priori hypotheses. They concluded that the use of cognitive and instructional subgroup categories, although counter to traditional DIF practice, offered a promising means of studying DIF.










Miller and Linn (1988) considered the invariance of item parameters on the Second International Mathematics Study (SIMS) examination across different levels of mathematical instructional coverage. Although their principal concern was the multidimensionality of achievement test data as related to instructional differences and model usefulness, they found that instructional differences could explain a significant portion of observed DIF. Using cluster analysis, they divided students into three instructional groups based upon teacher responses to an opportunity-to-learn questionnaire. The size of the differences in the ICCs between groups based upon instructional coverage was much greater than the differences observed in previously reported comparisons of black and white examinees. They interpreted these findings as supportive of Linn and Harnisch's (1981) postulation that what appears to be item bias may in reality be "'instructional bias'" (p. 216).


Despite Miller and Linn's (1988) straightforward interpretation of instructional experiences, Doolittle (1984, 1985) found that instructional differences did not account for or parallel gender DIF on ACT-M items. He dichotomized high school math background into strong and weak groups; items that tended to favor female examinees did not favor low background examinees, or vice versa. Correlations of the DIF indices were negative, suggesting that gender DIF was unrelated to math background DIF.


Muthen, Kao, and Burstein (1991), analyzing core items of the SIMS test, found several items to be sensitive to instructional effects. In approaching DIF from an alternative methodological perspective, they employed linear structural modeling to assess the effects of instruction on latent mathematics ability and item performance. They found that instructional effects had negligible effects on math ability but had significant influence on specific test items. Several items appeared particularly sensitive to instructional influences. They interpreted the identified items as less an indicator of general mathematics ability and more an indicator of exposure to a specified math content area. In using linear structural modeling, Muthen et al. (1991) avoided arbitrariness in defining group categories--a situation where group membership varied across items. The SIMS data permitted estimation of instructional background for each of the core items. Under most testing conditions, such examinee background information is unavailable, and nuisance dimensions must instead be estimated. Analyzing the relationship of theoretical causes to estimated nuisance dimensions combines the approach of Muthen et al. (1991) with that of Shealy and Stout (1993a, 1993b).


Summary


Researchers investigating the underlying causes of DIF have produced few significant results. After many years of DIF studies, conclusions regarding test wiseness (Scheuneman, 1987) or Hispanic tendencies on true cognates (Schmitt, 1988) must be interpreted as meager guidance for test developers and educators. These limited results can be explained by problems inherent in traditional DIF procedures (Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Indices derived using observed total scores as the conditioning variable have been observed to be confounded with item difficulty (Freedle & Kostin, 1990) and item discrimination (Linn, 1993; Masters, 1988). Indices derived from IRT models are conceptualized from a unidimensional perspective, yet DIF is a product of multidimensionality (Ackerman, 1992; Camilli, 1992). Consequently, DIF detection procedures have been criticized for a lack of reliability between methods and across samples (Hoover & Kolen, 1984; Skaggs & Lissitz, 1992).


The uninterpretability of findings may be because group membership is only a weak surrogate for variables of greater psychological or educational significance. For example, demographic categories (e.g., women or blacks) lack any psychological or educational explanatory meaning. Moving beyond demographic subgroups to more meaningful categories would expedite understanding of DIF causes (Linn, 1993; Schmitt & Dorans, 1990; Skaggs & Lissitz, 1992; Tatsuoka et al., 1988). Although this conceptualization has been advocated, it has been used sparingly. Doolittle (1984, 1985), Miller and R. L. Linn (1988), Muthen et al. (1991), and K. Tatsuoka et al. (1988) used this conception and appeared to have reached promising, but incompatible, interpretations. Future researchers need to apply alternative approaches to DIF analyses to achieve explanatory power. The approaches advocated by Muthen et al. (1991) and Shealy and Stout (1993a, 1993b) provide sound methods that potentially permit modeling of differing influences on item responses.


Gender and Quantitative Aptitude


Educational and psychological researchers have been concerned with gender differences in scores on quantitative aptitude tests (Benbow, 1988; Benbow & Stanley, 1980). Mathematics has been described as a "critical filter" that prohibits many women from having access to high-paying and prestigious occupations (Sells, 1978).


Although gender differences in quantitative ability interact with development--elementary school children demonstrate no differences or differences slightly favoring girls--by late adolescence and early adulthood, when college entrance examinations are taken and critical career decisions are made, slight differences appear favoring boys (Fennema & Sherman, 1977; Hyde, Fennema, & Lamon, 1990). In studies linking gender differences in quantitative test scores with the underrepresentation of women in prestigious technical careers, analyses should be limited to tests taken in late adolescence or early adulthood that significantly influence career decisions and opportunities.


Significant and Important Test Score Differences


Standardized achievement tests utilizing representative samples (e.g., the National Assessment of Educational Progress, High School and Beyond) and college admissions tests utilizing self-selected samples (e.g., SAT, ACT, GRE) have been analyzed to ascertain gender differences. Gender differences found in representative samples are systematically different from those found in self-selected samples (Feingold, 1992). Women appear less proficient on the self-selected admissions tests, yet successfully matriculate through a process that relies heavily upon admissions test scores. Therefore, in studying quantitative differences with a primary concern related to career decisions and opportunities, self-selected admissions test scores are the most germane measures for analysis.


Quantitative gender differences have been studied meta-analytically (Friedman, 1989; Hyde et al., 1990). Linn and Hyde (1989) concluded that "average quantitative gender differences have declined to essentially zero" (p. 19) and that differences in quantitative aptitude can no longer be used to justify the underrepresentation of women in technical professions.


Feingold (1988) assessed gender differences on several cognitive measures from the Differential Aptitude Test (DAT) and the SAT and concluded that gender differences are rapidly diminishing; the one exception to this finding was the SAT (Feingold, 1988). Although mean differences had either substantially diminished or vanished on DAT measures of numerical ability, abstract reasoning, space relations, and mechanical reasoning over the period studied, SAT-M differences have remained relatively constant.


Despite the finding that gender differences are disappearing on many mathematical ability tests, on the major college entrance examinations gender differences persist: men average higher scores on the SAT-M than women (National Center for Education Statistics, 1993). This difference can also be stated in units of an effect size (0.39, which represents the difference between the means divided by the pooled standard deviation).
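The effect size referred to here divides the mean difference by the pooled standard deviation. A minimal sketch, with invented summary statistics chosen only for illustration:

```python
from math import sqrt

def effect_size(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference: (mean1 - mean2) divided by the
    pooled standard deviation of the two groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)

# Invented groups with equal spread: a 45-point mean difference against
# a 115-point pooled standard deviation.
print(round(effect_size(500, 115, 1000, 455, 115, 1000), 2))
```

With equal group standard deviations the pooled value reduces to the common standard deviation, so the index is simply the mean gap expressed in standard-deviation units.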


The trends regarding gender differences on the ACT-M are similar. The ACT scale ranges up to 39 points; the mean difference favoring male examinees from 1978 to 1987 was 2.33 points (an effect size of 0.33) (National Center for Education Statistics, 1993). This score differential has been relatively consistent and provides no indication of disappearing. The greatest disparity between men's and women's mean scores occurs on the GRE-Quantitative (GRE-Q). For the 1986 and 1987 testing years, U.S. male examinees averaged higher scores than U.S. female examinees (Educational Testing Service, 1991); transformed into effect sizes, these differences were as large as 0.62.


Gender mean score differences on the GRE-Q, in large part, reflect gender differences in choice of major field. Particularly in the case of graduate admissions tests, mean scores are confounded with gender differences in choice of undergraduate major. Analyzing GRE-Q data by intended field of study illustrates this: for examinees intending to major in the humanities and education in the same testing year, mean score differences favoring men were as large as 37 points (effect sizes ranging from roughly .19 to .31), and averaging across all identified intended fields of study, the mean score difference favoring men corresponded to an effect size of .35 (Educational Testing Service, 1991). Although data were available only for the 1986-87 testing year, the mean score differences and effect sizes appear to indicate that U.S. male examinees tend to score higher than U.S. female examinees on the GRE-Q, a pattern consistent with the SAT-M and ACT-M.


Despite changes in curriculum and text materials that depict both genders in less stereotypic manners (Sherman, 1983) and reductions in gender differences on many mathematics tests (Feingold, 1988), gender differences on college admissions quantitative tests are significant and appear not to be diminishing. Due to the importance of these tests regarding college admission decisions and the awarding of financial aid, disparity in scores tends to reduce opportunities for women (Rosser, 1989).


Predictive Validity Evidence


Although mean score differences on quantitative admissions tests have been interpreted as evidence that admission tests are biased against women (Rosser, 1989), defenders of the use of college admission tests argued that other relevant factors explain the phenomenon (McCornack & McLeod, 1988; Pallas & Alexander, 1983).


They postulated that women tend to enroll in major fields where faculty tend to grade less rigorously: women are more likely to major in the humanities, whereas men are more likely to major in the sciences.


Investigators analyzing the differential predictive validity of college admissions exams, therefore, must consider gender differences in course enrollment patterns.


McCornack and McLeod (1988) and Elliot and Strenta (1988) generally found that, when differential course-taking patterns were considered, the SAT-V and SAT-M coupled with high school grades were not biased in predicting the achievement of men and women.


McCornack and McLeod (1988) considered performance in introductory-level college courses at a state university and used SAT composites with high school grade point average. They found no predictive bias when analyzing data at the course level. Elliot and Strenta (1988) considered performance in various college-level courses at a private university, utilizing SAT composites with scores from a college placement examination and high school rank.










They found that no bias existed. These studies were flawed, however, in that they combined the various predictors; had they studied the predictors, such as the SAT-M and high school grades, separately, they might have arrived at a different interpretation.


Bridgeman and Wendler (1991) and Wainer and Steinberg (1992) conducted more extensive studies and concluded that, in equivalent mathematics courses, the SAT-M tends to underpredict the college performance of women. Bridgeman and Wendler (1991) studied the SAT-M as a predictor of college mathematics course performance at nine colleges and universities. They divided mathematics courses into three categories and found that, in algebra and pre-calculus courses, women's achievement was underpredicted and, in calculus courses, no underprediction occurred.


The most extensive study to date concerning the predictive validity of the SAT was conducted by Wainer and Steinberg (1992). Analyzing nearly 47,000 students at 51 colleges and universities, they concluded that, for students taking the same relative course and receiving the same letter grade, the SAT-M underpredicted women's achievement. Using a backward regression model, they estimated that women, earning the same grades in similar courses, tended to score roughly 25-30 points less on the SAT-M.
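Underprediction of this kind can be illustrated numerically: if one group earns systematically higher grades at every test score, a single regression line fitted to the pooled data underpredicts that group's grades. The scores, grades, and offsets below are invented solely for illustration.

```python
# Illustration of underprediction with invented data: one group earns
# grades 0.2 points higher at every test score, but a single pooled
# regression line splits the difference between the groups.

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx

sat_m = [450, 500, 550, 600]
men_gpa = [2.4, 2.6, 2.8, 3.0]             # invented
women_gpa = [g + 0.2 for g in men_gpa]     # same slope, higher grades

b, a = fit_line(sat_m + sat_m, men_gpa + women_gpa)

# Mean residual (actual minus predicted grade) for each group:
# women's is positive (underpredicted), men's is negative.
resid_w = sum(g - (a + b * s) for s, g in zip(sat_m, women_gpa)) / len(sat_m)
resid_m = sum(g - (a + b * s) for s, g in zip(sat_m, men_gpa)) / len(sat_m)
print(round(resid_w, 2), round(resid_m, 2))
```

Equivalently, holding grades fixed, the higher-graded group appears to "need" a lower test score, which is the sense in which Wainer and Steinberg describe a 25-30 point SAT-M gap at equal grades.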










Men tend to score higher on quantitative admission exams, although women generally outperform men in high school and college courses. The principal explanation offered for this paradox is gender differences in course taking. Researchers investigating the relationship of quantitative admission tests and subsequent achievement, controlling for course-taking patterns and course performance, have concluded that, in equivalent mathematics courses, the tests underpredict women's achievement. Although the underprediction is not as large as the mean score differences, quantitative admission tests appear to be biased in underpredicting women's college achievement. It is recognized that predictive bias and DIF are fundamentally distinct; however, the determination of predictive bias in quantitative admission tests makes them an evocative instrument for DIF analysis.


Potential Explanations of DIF


This study will approach DIF from the perspective of examinee characteristics. When analyzing DIF explanations from this perspective, theoretical explanations of predictive bias offer a reasonable point of theoretical departure. Kimball (1989) presented three theoretical explanations for the paradoxical relationship between gender differences in admissions test scores and college grades: men have stronger mathematics course backgrounds, men and women tend to develop different learning styles, and men tend to prefer novel tasks whereas women tend to prefer familiar tasks. To these three theoretical explanations, I would submit a fourth explanation related to test-taking behavior--differences between men and women in test anxiety.


Differences in Mathematics Background


It is well documented that as students enter high school and proceed toward graduation, boys tend to take more mathematics courses than girls (Fennema & Sherman, 1977; Pallas & Alexander, 1983). During the 1980s, high school boys averaged 2.92 Carnegie units of mathematics, whereas high school girls averaged fewer Carnegie units (National Center for Education Statistics, 1993). Although girls entered the upper-track ninth-grade mathematics curriculum in slightly greater numbers than boys, by graduation, boys outnumbered girls in advanced courses such as calculus and trigonometry. High school boys were also more likely than girls to study computer science and physics (National Center for Education Statistics, 1993). These trends continue as students enter college. During the 1980s, men slightly outnumbered women in achieving undergraduate mathematics degrees, and overwhelmingly outnumbered women in attaining undergraduate degrees in computer science and physics (National Center for Education Statistics, 1993).


Researchers investigating the relationship between mathematics background and test scores have found that, when enrollment differences are controlled, gender differences on mathematical reasoning tests are reduced (Fennema & Sherman, 1977; Pallas & Alexander, 1983; Ethington & Wolfle, 1984). Gender score differences on the SAT-M, when high school course taking was controlled, were reduced by approximately two-thirds (Pallas & Alexander, 1983) and one-third (Ethington & Wolfle, 1984). These studies analyzed total score differences controlling for course background. Miller and Linn (1988) and Doolittle (1984, 1985) analyzed item differences controlling for instructional differences, but their results were contradictory. Background differences offer a plausible explanation that invites additional investigation.


Rote Versus Autonomous Learning Styles


Boys tend to develop a more autonomous learning style, which facilitates performance on mathematics reasoning problems, and girls tend to develop a rote learning style, which facilitates classroom performance (Fennema & Petersen, 1985). Students displaying autonomous learning behavior tend to work independently, are more motivated, and are more likely to persevere on difficult tasks presented in a novel format. Students displaying rote learning behavior tend to do well applying memorized algorithms learned under teacher direction. Often, these students tend to choose less challenging class tasks and are heavily dependent upon the teacher when given an option. This dichotomy is congruent with the finding that girls tend to perform better on computational problems and boys tend to perform better on application and reasoning problems (Doolittle & Cleary, 1987; Harris & Carlton, 1993).


The autonomous versus rote learning style theory is consistent with the literature addressing gender socialization patterns and standardized test performances. Before it can be further applied, however, it must be more completely operationalized (Kimball, 1989). To validate this theory, researchers must demonstrate that boys and girls approach the study of mathematics differently, then relate learning styles to achievement on classroom assessments and standardized tests (Kimball, 1989).


Novelty Versus Familiarity


Kimball (1989) hypothesized that girls tend to be more motivated to do well and are more confident when working on familiar classroom assessments, and boys tend to demonstrate higher achievement on novel standardized tests. This theory is based on the work of Dweck and her colleagues (Dweck, 1986; Elliot & Dweck, 1987; Licht & Dweck, 1983), who related attributions to learning and achievement. Students with a performance orientation and low confidence tend to avoid difficult and threatening tasks. They prefer familiar, non-threatening tasks and seek to avoid failure. Students with a performance orientation and high confidence are more likely to select moderately challenging tasks. Consistent findings demonstrate that girls tend to have less confidence in their mathematical abilities than boys (Eccles, Adler, & Meece, 1984; Licht & Dweck, 1983). Girls are also more likely on standardized tests to leave items unanswered or to mark "I don't know" when given this option (Linn, DeBenedictis, Delucchi, Harris, & Stage, 1987). Girls, more than boys, attribute their successes in mathematics to effort rather than ability and their failures to lack of ability (Fennema, 1985; Ryckman & Peckham, 1987). Therefore, due to less confidence in their abilities, girls generally are less motivated by novel mathematical tasks, find them more threatening, and perform less well.










achievement tests. High test anxiety individuals tend to score lower than low test anxiety individuals of comparable ability (Hembree, 1988; Sarason, 1980). Because aptitude and achievement tests are not intended to include test anxiety as a component of the total score, and because millions of elementary and secondary pupils are estimated to have substantial test anxiety (Hill & Wigfield, 1984), test anxiety exemplifies a nuisance factor influencing item responses.


Test anxiety has been theorized in both cognitive and behavioral terms (Hembree, 1988; Sarason, 1984; Spielberger, Gonzales, Taylor, Algaze, & Anton, 1978; Wine, 1980). Liebert and Morris (1967) proposed a two-dimensional theory of test anxiety, consisting of worry and emotionality. Worry includes expressions of concern about one's performance and the consequences stemming from inadequate performance. Emotionality refers to the autonomic reactions to test situations (e.g., increased heart rate, stomach pains, and perspiration). Hembree (1988) used meta-analysis of test anxiety studies and found that, although both dimensions related significantly to performance, worry was the more strongly correlated with test scores; the mean correlation of emotionality with aptitude/achievement test scores was -0.15.











Wine (1980) proposed a cognitive-attentional interpretation of test anxiety in which examinees who are high or low on test anxiety experience different thoughts when confronted with test situations. The low test anxious individual experiences relevant thoughts and attends to the task. The high test anxious individual experiences self-preoccupation and is absorbed in thoughts of failure. These task-irrelevant cognitions not only create unpleasant experiences, but serve as major distractions.


Sarason (1984) proposed the Reactions to Tests (RTT) scale, based upon a cognitive, emotional, and behavioral model. The 40-item Likert-scaled questionnaire operationalized a four-dimensional test anxiety model: worry, tension, bodily symptoms, and test-relevant thinking. In a confirmatory cross-validation, Benson and Bandalos found a large number of items problematic. Through item deletion, they found substantial support for the four-factor structure, and they speculated that similarly worded items resulted in model misfit. To further validate the structure of test anxiety, Benson, Moulin-Julian, Schwarzer, Seipp, and El Zahhar (1991) combined the TAI and RTT to formulate a new scale, the Revised Test Anxiety scale (RTA).










The cognitive and emotional structure of math anxiety is closely related to test anxiety. Richardson and Woolfolk (1980) demonstrated that math anxiety and test anxiety were highly related, and that mathematics testing provided a superb context for studying test anxiety. They reported correlations between inventories of test anxiety and math anxiety ranging near 0.65. They commented that taking a mathematics test "with a time limit under instructions to do as well as possible appears to be nearly as threatening as a real-life test to most mathematics-anxious individuals" (p. 271).


Children in first and second grade indicate inconsequential test anxiety, but test anxiety emerges in severity in third grade and increases until sixth grade. Female students tend to possess higher test anxiety levels than male students at all grade levels (Everson, Millsap, & Rodriguez, 1991; Hembree, 1988). Some behavioral and cognitive-behavioral treatments have been demonstrated to effectively reduce test anxiety and lead to increases in performance (Hembree, 1988). This finding supports the causal direction of test anxiety producing lower performance, and test anxiety's multidimensional structure.










In cases of model misfit, such forces as test anxiety might unduly influence performance at the test item level. High test anxiety individuals may find some items differentially more difficult than other test items.


Summary

Several different methods of identifying DIF have been reviewed. In large part because of its computational efficiency, MH has emerged as the most widely used method. Because it is limited in terms of flexibility, as researchers continue to search for underlying explanations of DIF, its limitations will become more apparent. Logistic regression models (Swaminathan & Rogers, 1990) provide an efficient method that has greater flexibility than MH and potentially models theoretical causes of DIF. Raju's (1988) signed and unsigned area measures supply a theoretically sound method of contrasting item response patterns. Shealy and Stout's SIBTEST (1993a, 1993b) conceptualizes DIF as a multidimensional phenomenon and defines a validity sector as the conditioning variable. A sound theoretical foundation coupled with computational efficiency and explanatory potential makes SIBTEST perhaps the most comprehensive DIF procedure. These five approaches were employed in the study. Linear structural
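The Mantel-Haenszel procedure summarized above pools 2x2 (group-by-correct) tables across total-score levels into a common odds ratio, often reported on the ETS delta scale. A minimal sketch, with data and variable names that are illustrative rather than taken from this study:

```python
# Sketch of the Mantel-Haenszel common odds-ratio DIF index
# (Holland & Thayer, 1988). Data here are hypothetical.
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """item: 0/1 responses; total: matching scores; group: 0=reference, 1=focal.
    Returns the MH common odds ratio and its ETS delta transform."""
    item = np.asarray(item); total = np.asarray(total); group = np.asarray(group)
    num = den = 0.0
    for k in np.unique(total):                       # condition on score level
        m = total == k
        a = np.sum(item[m] & (group[m] == 0))        # reference correct
        b = np.sum((1 - item[m]) & (group[m] == 0))  # reference incorrect
        c = np.sum(item[m] & (group[m] == 1))        # focal correct
        d = np.sum((1 - item[m]) & (group[m] == 1))  # focal incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha = num / den                                # MH common odds ratio
    delta = -2.35 * np.log(alpha)                    # ETS delta scale
    return alpha, delta
```

An odds ratio near 1 (delta near 0) indicates little uniform DIF for the item.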










findings of the validation study. Thus, the significance of validation for the consistency of DIF estimation was considered. The context of gender DIF on quantitative test items was chosen because of the paradoxical finding that men tend to score higher on standardized tests of math reasoning, although women tend to achieve equivalent or higher course grades. Gender, the common categorical variable in DIF studies, will be supplemented by dichotomizing examinees into substantial and weak mathematics background and into high and low test anxiety. This study is based on the premise that gender differences serve as a surrogate for differences in background and test anxiety. The two variables were selected in an effort to explain DIF in terms consistent with theoretical explanations of gender differences in mathematics test scores and course achievement. Mathematics background has been applied in other DIF studies with inconsistent interpretations. Test anxiety is of interest to both educators and cognitive psychologists and is highly related to performance. This study is an attempt to determine the consistency of these indices across detection methods and subpopulation definitions.

















CHAPTER
METHODOLOGY


The present study was designed to investigate the inter-method consistency of five separate differential item functioning (DIF) indices and associated statistical tests when defining subpopulations by educationally significant variables as well as by the commonly used demographic variable of gender. The study was conducted in the context of college admission quantitative examinations and gender issues. The study was designed to evaluate the effect on DIF indices of defining subpopulations by gender, mathematics background, and test anxiety. Factor analytic procedures were used to define a structurally valid subtest of items. Following the identification of a valid subtest, the DIF analysis was repeated. The findings of the DIF analysis before validation were contrasted with the DIF analysis based on the valid subset. A description of examinees, instruments, and data analysis methods is presented in this chapter.











Examinees

The data pool to be analyzed consisted of test scores and item responses from 1263 undergraduate college students. The sample consisted of both women and men. Help was solicited from various instructors in the colleges of education and business, and in most cases students participated in the study during their class time. Of the total sample of examinees, some individuals were tested in classes in the college of education, some were tested in classes in the college of business, and some were tested at other sites on campus. Women with little mathematics background were the largest group of examinees in the college of education classes, and men with substantial mathematics background were the largest group of examinees in the college of business classes (see the table in the Appendix for examinee frequencies by test setting, gender, and mathematics background). The majority of students received class credit for participating. No remuneration was provided to any participant. All students had previously taken a college admission examination, and some of the students had taken the Graduate Record Examination-Quantitative Test (GRE-Q).











Instruments

The operational definition of a collegiate-level quantitative aptitude test was a released form of the GRE-Q. Test anxiety was operationally defined by a widely used, standardized measure, the Revised Test Anxiety Scale (RTA). The mathematics background variable was measured using the dichotomous response to an item concerning whether or not the student had completed a particular advanced mathematics class at the college level (i.e., calculus). In the following sections, a more detailed description of each of these instruments is presented, accompanied by particular technical information that supports the purpose and use of the instruments in the study.


Released GRE

Each examinee completed a released form of the GRE. The 30-item test, contained in a sample test supplied by Educational Testing Service, was a 30-minute timed examination. The sample test included "many kinds of questions that are included in currently used forms" (ETS, 1993). The test was designed to measure basic mathematical skills and concepts required to solve problems in quantitative settings. It was divided into two sections. The first section required the examinee to reason about the relative size of two quantities, accurately comparing the two or recognizing when insufficient information had been provided to make such a comparison. The format of the second section, employing multiple choice items, assessed the ability to perform computations and manipulations of quantitative symbols and to solve word problems in applied or abstract contexts. The instructional background required to answer the items was described as "arithmetic, algebra, geometry, and data analysis" and "content areas usually studied in high school" (ETS, 1993, p. 18). The internal consistency of the test for the 1263 participants was relatively good as measured by the KR-20 coefficient. In a pilot study, correlations of the sample test with examinees' GRE scores and with examinees' Scholastic Aptitude Test-Mathematics (SAT-M) scores were 0.67 and 0.79, respectively; the scores on the released GRE were similar to scores examinees earned on other college admission quantitative examinations.
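The KR-20 coefficient used above to summarize internal consistency is computable directly from the 0/1 item-response matrix. A sketch, with hypothetical data:

```python
# Sketch of the Kuder-Richardson 20 (KR-20) internal-consistency coefficient
# for a matrix of 0/1 item responses (rows = examinees, columns = items).
import numpy as np

def kr20(responses):
    X = np.asarray(responses, dtype=float)
    n_items = X.shape[1]
    p = X.mean(axis=0)                      # item difficulties (prop. correct)
    q = 1 - p
    total_var = X.sum(axis=1).var(ddof=1)   # variance of examinee total scores
    return (n_items / (n_items - 1)) * (1 - np.sum(p * q) / total_var)
```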


Revised Test Anxiety Scale (RTA)

The RTA scale (Benson, Moulin-Julian, Schwarzer, Seipp, & El-Zahhar, 1991) was formed by combining within one theoretical framework two recognized measures of test anxiety, the Test Anxiety Inventory (TAI) and the Reactions to Tests (RTT) (Sarason, 1984). The TAI was based upon a two-factor theoretical conception of test anxiety--worry and emotionality (Liebert & Morris, 1967). Sarason (1984) augmented this conceptualization with a four-factor model of test anxiety--worry, tension, bodily symptoms, and test irrelevant thinking. To capture the best qualities of both scales, Benson et al. (1991) combined the instruments to form the RTA scale. They intended that the combined scale would capture Sarason's four proposed factors. From the original combined items, using a sample of college students from three countries, they eliminated items on the basis of not loading on a single factor, having low item/factor correlations, or having low reliability. They retained items each loading on the intended factor and containing high item reliability. The bodily symptoms subscale, containing only a few items, was problematic due to low internal reliability. Consequently, Benson and El-Zahhar (1994) further refined the RTA scale and developed a 20-item scale with four factors and relatively high scale internal reliability. With a sample of 562 college students from two countries, randomly split into two subsamples, they reported factor loadings, correlations, and item uniquenesses. Descriptive statistics for each subscale of the RTA from the Benson and El-Zahhar (1994) American sample and the present study sample are reported in the table below. The instrument was selected because its evidence of reliability and construct validity compared favorably with that of other leading test anxiety scales used with college students.

Table
Descriptive Statistics for the RTA Scale

Scale                       Benson-El Zahhar            Study Sample
                            American Sample (N = 202)   (N = 1263)
Total Scale                 38.31 (10.40)               39.17 (9.37)
Worry                       11.61                       12.03 (3.50)
Tension                      3.85
Test Irrelevant Thinking     6.79
Bodily Symptoms              7.54 (2.79)                 7.35

Note. The first entry in each column is the subscale mean; the second entry, in parentheses, is the standard deviation.
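The item-retention rule described above, keeping items that load most strongly on their intended factor with an adequately high loading, can be sketched as follows; the 0.40 threshold is illustrative, not a value reported in the source:

```python
# Hypothetical sketch of a factor-based item-retention rule: keep an item
# only if it loads highest on its intended factor and that loading exceeds
# a minimum magnitude.
import numpy as np

def retain_items(loadings, intended, min_loading=0.40):
    """loadings: (items x factors) matrix; intended: factor index per item."""
    L = np.asarray(loadings, float)
    keep = []
    for i, f in enumerate(intended):
        loads_on_intended = np.argmax(np.abs(L[i])) == f   # single dominant factor
        strong_enough = abs(L[i, f]) >= min_loading        # adequate loading
        keep.append(loads_on_intended and strong_enough)
    return np.array(keep)
```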











Mathematics Background

Researchers have experienced problems selecting the best approach to measure subjects' mathematics background (Doolittle, 1984). Typically, methods of classifying subjects' background include asking subjects to report the number of mathematics credits earned or semesters studied (Doolittle, 1984, 1985; Hackett & Betz, 1989; Pajares & Miller, 1994) or asking subjects a series of questions related to specific courses studied (Chipman, Marshall, & Scott, 1991). Asking subjects questions concerning their course background implies that one or two "watershed" mathematics courses qualitatively capture subjects' instructional background. To decide which of these two options to employ in this study, a pilot study was conducted to ascertain whether measuring examinees' mathematics background quantitatively, by counting mathematics credits earned, or qualitatively, by identifying completion of a watershed mathematics course, was more useful. In the pilot study, undergraduates were asked to answer the five questions posed by Chipman et al. (1991) and to report the number of college credits earned in mathematics (see the Appendix for the questions and the scoring scheme used, from Chipman et al., 1991).


The subjects were then divided using their responses to the single question about successful completion of a college calculus course. The two methods of dividing subjects into two background groups had an 84% agreement rate; however, correlations of these two predictors with performance on the GRE and SAT-M indicated that the dichotomous calculus completion question was more valid for students in this study. The pattern of relationships between these tests, the calculus question, and the number of mathematics credits earned for these college students indicated that calculus completion had a stronger relationship with the test scores (.51) than the number of mathematics credits earned (.40) (see the table below). In a continuation of the pilot study, examinees reported whether they had successfully taken a college calculus course.

Table
Correlations of Calculus Completion, SAT-M, GRE-Q, and College Mathematics Credits

                      SAT-M      GRE-Q      Credits
Calculus Completion   .51(58)    .50(55)    .49(141)
Total Credits         .08(58)    .40(55)

Note. Ns are in parentheses.
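The correlation between a dichotomous predictor such as calculus completion and a continuous test score is the point-biserial correlation, which is simply an ordinary Pearson r computed with a 0/1 variable. A sketch, with hypothetical data:

```python
# Sketch of the point-biserial correlation: Pearson r between a 0/1
# indicator (e.g., calculus completion) and a continuous score.
import numpy as np

def pearson_r(x, y):
    x = np.asarray(x, float); y = np.asarray(y, float)
    x = x - x.mean(); y = y - y.mean()              # center both variables
    return float(np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2)))
```

With x restricted to 0/1, pearson_r returns the point-biserial coefficient used in comparisons like those in the table above.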











The examinees reporting successful completion of a college calculus course had earned an average of 13.3 college mathematics credits; the students reporting they had not successfully completed such a course had earned an average of 5.7 college mathematics credits.


Therefore, in this sample there was substantial evidence that calculus courses serve as a watershed to other more advanced mathematics courses, and that completion of a calculus course could be used to differentiate students in terms of mathematics background.


Subsequently, mathematics background was operationalized by having each examinee answer the following question: "Have you successfully completed a college-level calculus course?" Examinees responding yes were classified as having a substantial background, and examinees responding no were classified as having little background. Utilizing examinee responses to the calculus completion question was justified because of the high degree of agreement between calculus completion and students' college course backgrounds, the higher correlation of calculus completion with students' SAT-M and GRE-Q scores, and the clear separation of the sample by mathematics background for applying DIF procedures.


Testing Procedures

Subpopulation Definitions


Prior to taking the released GRE, examinees answered the Differential Item Functioning Questionnaire (see the Appendix) and the RTA scale. The questionnaire contained demographic questions and provided information regarding the examinees' gender, mathematics background, and test anxiety. Examinees were classified as having substantial or little mathematics background by answering the question concerning completion of a college calculus course. Of the 1263 participants, some reported that they had completed a college calculus course and the remainder reported that they had not. Frequency counts and percentages by mathematics background and gender are presented in the table below. Men and women did not possess similar mathematics backgrounds: in this sample, a larger share of the men than of the women reported completing a college calculus class. High and low test anxious groups were formed in the following manner. Examinees scoring in approximately the highest portion of the distribution were defined as possessing a high level of test anxiety.











Table
Frequencies and Percentages by Gender and Mathematics Background

               Mathematics Background
           Substantial    Little    Total
Women Pct.
Men   Pct.    14.6         40.3
Total Pct.                           50.4


Examinees scoring in the middle of the distribution were defined as possessing a moderate level of test anxiety. Examinees scoring in approximately the lowest portion of the distribution were defined as possessing a low level of test anxiety. For the analysis, examinees classified as possessing a moderate level of test anxiety were not treated as high or low test anxiety examinees. Women tended to be classified as having high test anxiety at greater rates than men.
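The percentile-based grouping described above can be sketched as follows; the 25%/75% cutpoints are illustrative, since the exact percentages are not legible in the source:

```python
# Sketch of forming low / moderate / high test-anxiety groups by cutting
# the RTA total-score distribution at lower and upper quantiles.
# The quantile values used here are hypothetical.
import numpy as np

def anxiety_groups(scores, low_q=0.25, high_q=0.75):
    scores = np.asarray(scores, float)
    lo, hi = np.quantile(scores, [low_q, high_q])   # distribution cutpoints
    return np.where(scores >= hi, "high",
           np.where(scores <= lo, "low", "moderate"))
```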


Following completion of the questionnaire, examinees answered the 30-item GRE. Examinees received standard instructions and were told they had 30 minutes to complete the test. Examinees were requested to provide identifying information so that, following the test, if they desired, they could learn their results.


DIF Estimation

The five different methods of estimating DIF were Mantel-Haenszel (MH) (Holland & Thayer, 1988), Item Response Theory-Signed Area (IRT-SA) and Item Response Theory-Unsigned Area (IRT-UA) (Raju, 1988, 1990), the Simultaneous Item Bias Test (SIBTEST) (Shealy & Stout, 1993b), and logistic regression (Swaminathan & Rogers, 1990). A distinction was made between uniform and alternate measures of DIF. Uniform and nonuniform methods estimate DIF in fundamentally different ways; when nonuniform DIF exists, the two approaches produce unique findings (Shepard, Camilli, & Williams, 1984). Consequently, the five methods were divided into two groups. Mantel-Haenszel, IRT-SA, and SIBTEST formed the uniform measures of DIF; logistic regression and IRT-UA, designed to measure nonuniform DIF, formed the alternate measures. The alternate measures have not been used extensively in actual testing circumstances, indicating that test practitioners assume nonuniform DIF is either trivial or a statistical artifact. By examining the relationship between the DIF indices estimated by the uniform methods and those estimated by IRT-UA and logistic regression, researchers will be able to determine whether important information is lost when only uniform methods are used.
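The logistic regression DIF model of Swaminathan and Rogers tests a group main effect (uniform DIF) and a score-by-group interaction (nonuniform DIF). A sketch using a plain Newton-Raphson fit, assuming no particular statistical package; the coefficient names are illustrative:

```python
# Sketch of the logistic-regression DIF model (Swaminathan & Rogers, 1990):
# P(correct) = logistic(b0 + b1*score + b2*group + b3*score*group),
# where b2 reflects uniform DIF and b3 nonuniform DIF.
import numpy as np

def dif_design(score, group):
    score = np.asarray(score, float); group = np.asarray(group, float)
    return np.column_stack([np.ones_like(score), score, group, score * group])

def fit_logistic(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])                 # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))  # Newton-Raphson step
    return beta
```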


Mantel-Haenszel indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Item Response Theory signed and unsigned area indices and tests of significance were estimated using PC-BILOG (Mislevy & Bock, 1990) in combination with SAS 6.03 (SAS Institute, Inc., 1988). SIBTEST indices and tests of significance were estimated using SIBTEST (Stout & Roussos, 1992). Logistic regression indices and tests of significance were estimated through SAS 6.03 (SAS Institute, Inc., 1988). Thus, each of the test items was analyzed with three different subpopulation definitions and five different procedures, producing for each item fifteen distinct indices and tests of significance.


Structural Validation

The structural component of construct validation (Messick, 1988) is appraised by analyzing the interrelationships of the test items. The released GRE-Q was structurally validated through factor analysis of the matrix of tetrachoric coefficients among the items of the test for a subsample of examinees. Initially, the sample of 1263 examinees was randomly split into two subsamples. The first subsample was used for the exploratory study, and the second subsample was used to cross-validate findings derived from the exploratory analysis. The tetrachoric coefficient matrix was generated with PRELIS (Joreskog & Sorbom, 1989a), and unweighted least squares solutions were estimated through LISREL (Joreskog & Sorbom, 1989b). Factor analytic models were used to assess item dimensionality and potential nuisance determinants.
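The random split into exploratory and cross-validation subsamples can be sketched as follows; the seed is illustrative:

```python
# Sketch of randomly splitting a sample into exploratory and
# cross-validation halves, as described above.
import numpy as np

def split_sample(n, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # random ordering of examinee indices
    half = n // 2
    return idx[:half], idx[half:]     # (exploratory, cross-validation)
```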


Research Design

Prior to validation, the consistency of each combination of the five DIF methods and three subpopulation definitions was assessed. The inter-method consistency of DIF indices was assessed through a multitrait-multimethod (MTMM) matrix. The inter-method consistency of the DIF significance tests was assessed by comparing percent-of-agreement rates between DIF methods. A subset of unidimensional items was identified by applying factor analytic procedures. Problematic items and items contaminated by nuisance determinants were identified. Following structural validation, the DIF analysis was repeated. Utilizing each combination of DIF methods and subpopulation definitions, DIF indices and significance tests were generated for the subset of items. The consistency of the indices and associated inferential statistics was assessed. The findings after assimilating validation were compared with the preceding findings to appraise the effect of structural validation on DIF analyses.
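The percent-of-agreement rate between two methods' significance decisions over the same items can be sketched as:

```python
# Sketch of the percent-of-agreement rate between two DIF methods'
# flagging decisions (1 = item flagged as DIF, 0 = not flagged).
import numpy as np

def percent_agreement(flags_a, flags_b):
    a = np.asarray(flags_a); b = np.asarray(flags_b)
    return float(np.mean(a == b)) * 100.0   # % of items with matching decisions
```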


Research Questions


Research questions one through four addressed the consistency of DIF indices through two MTMM matrices of correlation coefficients. Research questions one through four were first applied to the analysis of the uniform DIF procedures and the MTMM matrix derived from these coefficients (see the MTMM table presented earlier). The same set of questions was then applied to the alternate DIF procedures and the MTMM matrix derived from these coefficients (see the table on page 10). The first question applied to the uniform DIF procedures addressed the correlation between indices when the subgroup trait is gender and the methods are MH and IRT-SA: Were the convergent coefficients based upon the subpopulations of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations? Specific statistical hypotheses were formulated to provide criteria for addressing the research questions. Let ρMI(G) represent the correlation between the MH and IRT-SA DIF indices for the items when examinee subpopulations are defined by gender. Let ρMS(G) represent the correlation between the MH and SIBTEST indices for the items when examinees are defined by gender. Let ρIS(G) represent the correlation between the IRT-SA and SIBTEST indices for the items when examinees are defined by gender. Comparable notation will represent examinee subpopulations defined by mathematics background (M) and test anxiety (TA). Three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the uniform methods. They were as follows:

H1a: ρMI(M) > ρMI(G)
H1b: ρMI(TA) > ρMI(G)
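One simple way to contrast two such correlation coefficients, a simplification that treats them as independent (which the dissertation's actual MTMM tests need not do), is Fisher's z transform:

```python
# Simplified sketch of comparing two convergent correlations
# (e.g., rho_MI(M) vs. rho_MI(G)) via Fisher's z transform,
# assuming independent samples of sizes n1 and n2.
import numpy as np

def fisher_z_compare(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)     # Fisher z transforms
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # SE of the difference
    return float((z1 - z2) / se)                # approx. N(0,1) under rho1 = rho2
```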











H3a: ρIS(M) > ρIS(G)
H3b: ρIS(TA) > ρIS(G)

The first question applied to the alternate DIF procedures also addressed convergent, or monotrait-heteromethod, coefficients: Were the convergent coefficients based upon the subgroups of mathematics background and test anxiety greater than the convergent coefficients based upon gender subpopulations? Similarly, for the alternate procedures, let corresponding coefficients represent the correlation between the MH and IRT-UA DIF indices for the items when examinee subpopulations are defined by gender, and the correlation between the MH and logistic regression indices for the items when examinees are defined by gender. Let ρIL(G) represent the correlation between the IRT-UA and logistic regression indices for the items when examinees are defined by gender. Comparable notation will represent examinee subpopulations defined by mathematics background and test anxiety (TA). In a similar manner, three families of statistical tests, each with two a priori hypotheses, were defined to answer the first research question for the alternate methods. They were as follows:

H1a: ρMI(M) > ρMI(G),