Citation
Scale Stability

Material Information

Title:
Scale Stability: Optimal Choices for Anchor Test Length and Equating Links in a Non-Equivalent Anchor Test Design
Creator:
Akande, Christiana Aikenosi
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
2015
Language:
english
Physical Description:
1 online resource (44 p.)

Thesis/Dissertation Information

Degree:
Master's (M.A.E.)
Degree Grantor:
University of Florida
Degree Disciplines:
Research and Evaluation Methodology
Human Development and Organizational Studies in Education
Committee Chair:
MILLER,DAVID
Committee Co-Chair:
MANLEY,ANNE CORINNE
Graduation Date:
5/2/2015

Subjects

Subjects / Keywords:
Educational evaluation ( jstor )
Estimation methods ( jstor )
High school students ( jstor )
Item response theory ( jstor )
Mathematical independent variables ( jstor )
Maximum likelihood estimations ( jstor )
Parametric models ( jstor )
Population estimates ( jstor )
Standard error ( jstor )
Test scores ( jstor )
Human Development and Organizational Studies in Education -- Dissertations, Academic -- UF
anchor-test-length -- chained-equating -- item-parameter-drift -- neat-design -- scale-stability
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Research and Evaluation Methodology thesis, M.A.E.

Notes

Abstract:
This study explores optimal choices for the number of anchor items and equating links that yield satisfactorily stable equating results in a Non-Equivalent Anchor Test (NEAT) design. Despite diverse research in this area, little is known about the influence of standard errors of equating on scale stability. The objective of this study is to investigate if standard errors associated with equating tests with (a) longer anchor items and (b) longer equating links would be significantly different from those having shorter anchor items and equating links respectively. A bank of 30 anchor items was simulated to mimic the difficulty and discrimination parameters on the 2006/2007 Canadian Provincial Mathematics Assessment which was used to show the equating relationships between two forms. Person parameters were simulated from a normal distribution with varying mean ability levels to depict an operational testing situation. Dichotomously scored item responses were generated using the two parameter logistic (2PL) function. The results indicate that, for conditions without IPD, the standard errors of equating were significantly different for test forms with longer anchor as compared to those with shorter anchor items. On the other hand, the standard errors of the chained equating coefficients were virtually the same when two links and three links, respectively, were used. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (M.A.E.)--University of Florida, 2015.
Local:
Adviser: MILLER,DAVID.
Local:
Co-adviser: MANLEY,ANNE CORINNE.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2017-05-31
Statement of Responsibility:
by Christiana Aikenosi Akande.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
5/31/2017
Classification:
LD1780 2015 ( lcc )

Downloads

This item has the following downloads:


Full Text

SCALE STABILITY: OPTIMAL CHOICES FOR ANCHOR TEST LENGTH AND EQUATING LINKS IN A NON-EQUIVALENT ANCHOR TEST DESIGN

By

CHRISTIANA AIKENOSI AKANDE

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2015

© 2015 Christiana Aikenosi Akande

To my family

ACKNOWLEDGMENTS

This thesis would not have been successfully completed without the grace of God and the diverse contributions of some individuals to whom I am indebted. Foremost, I would like to express my deepest gratitude to almighty God for the grace to successfully complete my master's degree. I would also like to thank Dr. David Miller, my advisor and director of Collaborative Assessment and Program Evaluation Services (CAPES), for his guidance, patience, financial support, and for providing me with an excellent atmosphere for doing research. I would not have enrolled in the master's program without the financial support from CAPES. I also want to thank my committee member, Dr. Corinne Huggins-Manley, for guiding me through the computer coding tasks, reading through my drafts, and making comments that helped to refine this study. I would like to thank my colleagues who supported me in the process of completing this study; Halil Sari rendered some help when I had trouble with my computer code, and Su Yu gave some intelligent suggestions on how to organize my write-up.

I am highly indebted to my beloved husband, who made so many sacrifices throughout the pursuit of my master's degree; I could not have done this but for his support and sacrifices. He edited my drafts, pushed me to get things done on time, took care of our children when I was away at school, and stood by me through the good times and bad. I would also like to express my sincere gratitude to my daughter, Oluwatise, and my son, Oluwatosin, for their understanding and sacrifices in the course of this research. They saw less of me during this period, and I know that they always prayed for me. Finally, I would like to express my unreserved gratitude to everyone who has contributed directly or indirectly toward the successful completion of this master's thesis, even if their names are not mentioned.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 LITERATURE REVIEW
   Item Parameter Drift
   Mean-Mean Method of Equating
   High-Stakes Testing and Scale Stability
   Standard Errors of Equating
   Item Response Theory
   Anchor Test Length and Scale Stability
   Chained Equating/Linking

3 METHODOLOGY
   Study 1
      Data Generation
      Data Analysis
   Study 2
      Data Generation
      Data Analysis
   Evaluation Criteria

4 RESULTS
   Study 1
   Study 2

5 DISCUSSION AND CONCLUSION
   Conclusion
   Limitation and Recommendation for Future Research

APPENDIX: OTHER FIGURES
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Item Parameters for the Anchor Items
3-2 Distribution of Items according to Condition for Study 1
3-3 Distribution of Items according to Condition for Study 2
4-1 Equating Coefficients and Corresponding Standard Errors
4-2 ANOVA Results for Anchor Test Length and Length of Equating Links

LIST OF FIGURES

4-1 Depiction of Chain Equating Link 125
4-2 Depiction of Chain Equating Link 1265
A-1 Graphical depiction of the Non-Equivalent Test Design used in this Study

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

SCALE STABILITY: OPTIMAL CHOICES FOR ANCHOR TEST LENGTH AND EQUATING LINKS IN A NON-EQUIVALENT ANCHOR TEST DESIGN

By

Christiana Aikenosi Akande

May 2015

Chair: David Miller
Major: Research and Evaluation Methodology

This study explores optimal choices for the number of anchor items and equating links that yield satisfactorily stable equating results in a Non-Equivalent Anchor Test (NEAT) design. Despite diverse research in this area, little is known about the influence of standard errors of equating on scale stability. The objective of this study is to investigate whether standard errors associated with equating tests with (a) longer anchor tests and (b) longer equating links would be significantly different from those having shorter anchor tests and equating links, respectively. A bank of 30 anchor items was simulated to mimic the difficulty and discrimination parameters of the 2006/2007 Canadian Provincial Mathematics Assessment and was used to show the equating relationships between two forms. Person parameters were simulated from a normal distribution with varying mean ability levels to depict an operational testing situation. Dichotomously scored item responses were generated using the two-parameter logistic (2PL) function. The results indicate that, for conditions without IPD, the standard errors of equating were significantly different for test forms with longer anchors as compared to those with shorter anchors. On the other hand, the standard errors of the chained equating coefficients were virtually the same when two links and three links, respectively, were used.

CHAPTER 1
INTRODUCTION

In this era of standardized testing, it is of utmost importance to ensure fairness in the meaning and interpretations attached to test scores, such that these scores are comparable from one administration to another. Despite conscientious efforts by test developers to ensure the equivalency of test forms, this seems practically a Herculean task to achieve. Therefore, linking scales and equating scores is necessary for assessing changes in student performance over time and helps to ensure score comparability across years, administrations, and test forms.

Score comparability, as noted above, can be achieved through anchor item linking designs. Anchor item linking designs are used for non-equivalent groups such that one test form is equated to another by utilizing certain items that are common to both, which must be representative of the entire test. These linking designs mostly adopt Item Response Theory (IRT) methods, in which item parameters are estimated and are assumed to be invariant under linear transformation (Lord, 1980). This invariance property of IRT carries the assumption that the parameters estimated from different samples of the same population are the same up to a linear transformation. In practice, however, this assumption does not hold true in all cases. When an item performs differently for examinees of similar ability, this is defined as differential item functioning (DIF) (Holland & Wainer, 1993), whereas changes in the statistical properties (difficulty and discrimination) of items across testing occasions are referred to as item parameter drift (IPD) (Goldstein, 1983; Bock, Muraki & Pfeiffenberger, 1988). IPD involves changes in examinees' trait estimates and item parameter estimates across groups or over time, when the parameters would remain unchanged if they were on the same scale (Chen, 2013). Nevertheless, there could be changes in item parameters as a result of changes in curriculum, frequent exposure of items, content of items, and other reasons. Such changes can threaten the validity of test scores by introducing error into trait estimates, in which case scores may not even be comparable. Therefore, it is of great importance and necessity to monitor scale stability over time.

Also, in order to maintain a stable scale over time and ensure score comparability, the number of anchor items on a test as well as the number of equating links to be utilized have a role to play. Test developers need to know the number of anchor items on a test and/or the number of equating links that produce satisfactorily stable equating results under different conditions.

Although research on DIF, IPD, and scale stability has been on the increase, it has produced diverse results. Some of these studies found that IPD could compound over multiple testing occasions and affect ability estimates (Wollack, Sung & Kang, 2005, 2006). In addition, previous research on anchor test length (Ricker and von Davier, 2007) considered only one scenario, when IPD is not present, and therefore concluded that longer anchors produce better equating results. Also, existing research on chain linking (Battauz, 2011) compared various methods of conducting chained equating and found that the results from equating are method dependent; the simulation study and operational example used in that study produced contrasting results due to the equating designs utilized. Despite diverse research on the subject matter, nothing is known about the influence of anchor test length in the presence of IPD on the magnitude of standard errors of equating.

Also, little is known about the impact of the length of equating links on the magnitude of standard errors of equating, which in turn affects scale stability. The objective of this study is to assess optimal choices for anchor test length in two scenarios: when IPD is present and when IPD is not present. The study also seeks to assess the number of equating links that produce satisfactorily stable equating results using a 2PL model. Specifically, two research questions were put forward:

1. What are optimal choices for anchor test length that produce satisfactorily stable equating results when IPD is present and when IPD is not present? (Study 1)

2. What are optimal choices for the number of equating links that yield satisfactorily stable equating results when IPD is not present? (Study 2)

To answer the research questions, two simulation studies were conducted to examine the relationship between anchor test length and standard errors of equating in the two scenarios stated in the first research question, as well as the influence of the number of equating links on standard errors of equating. Data were generated through two simulation studies using R. This thesis is organized as follows: Chapter 1 is the introduction, Chapter 2 is the literature review, and Chapter 3 is the methodology. Chapter 4 explains the results of the analysis, and Chapter 5 is the summary and conclusion.

CHAPTER 2
LITERATURE REVIEW

Item Parameter Drift

Item parameter drift is a phenomenon in which the statistical properties (difficulty and discrimination) of items change across different testing occasions (Goldstein, 1983; Bock, Muraki & Pfeiffenberger, 1988). This makes test scores incomparable. Due to the growing concern about the impact of IPD on the meanings attached to reported test scores, there have been several studies on IPD and scale stability. For example, Miller and Fitzpatrick (2008) examined equating error resulting from incorrect handling of item parameter drift and found that the error resulting from incorrect handling of IPD is mainly attributable to the magnitude of item b-parameter drift and the proportion of items with b-parameter drift. Also, Ye and Xin (2013) conducted a study on IPD in which achievement estimates got worse conditional upon the pair of test levels between which IPD occurred, and the effect of grade-to-grade growth and the effect size were distorted for the grade pair corresponding to the test pair that involved IPD.

In addition, Tate (2003) proposed an IRT-based equating method for the long-term scale maintenance of a mixed-format test consisting of constructed-response items and multiple-choice items, a modification of the traditional common-item nonequivalent group design. Rather than the collection of equated item parameters for each year as required by the traditional approaches, the author proposed that the collection should consist of actual item responses from a representative sample of examinees for all items in the test for each year, which would include the original 1/0 coding for the multiple-choice items and the original examinee responses to the constructed-response items. The author provided an illustration of equating over five years based on simulated data, and application of the proposed equating method resulted in estimated trends of equating coefficients and average student ability that followed the assumed true values. However, the author recommended that optimal choices for the number of common items to be included in the test, among other things, should be the focus of further research.

Furthermore, Guo et al. (2011) used data from 44 SAT administrations to investigate the effect of accumulated equating error in equating conversions and the effect of the use of multiple links in equating. The authors used the data to produce multiple- and single-link equating results, and their findings indicate that the single-link equating results showed increased variability in the applied conversions and drifted away from the multiple-link equating results over time. Meanwhile, according to the authors, the multiple-link conversions were more stable and had less variation across years of administration. They recommended, however, that further research investigate the number of equating links that is enough to produce a satisfactorily stable equating result.

Equating and Scale Stability

Equating is the process of transforming scores from one scale to a different scale while maintaining the comparability of scores across those scales to a degree that the scores can be treated as interchangeable (Dorans & Holland, 2000; Holland, 2007; Kolen & Brennan, 2004). For example, large-scale testing companies must create numerous forms of a particular assessment to ensure test security, and an equating process is used to put scores from each of the different forms onto the same scale.

Because the form to which an examinee is exposed should not perform differently from the others, interchangeability of scores across the forms is required in order to prevent bias that emanates from a group receiving a relatively easier form. In order to produce interchangeable scores from an equating process, Dorans and Holland (2000) proposed five conditions that must be met: (a) the tests or forms being equated must measure the same construct; (b) the tests or forms being equated must have equal reliability; (c) the equating function used to transform scores from form 1 to form 0 must be the inverse of the equating function used to transform scores from form 0 to form 1; (d) the examinees must have the same expected score and expected distribution of performance on both of the scales being equated; and (e) the equating function must be invariant to subpopulations of test takers. It is always possible to use a function to place scores from one scale onto another, but having that process result in scores that are interchangeable across the two scales depends on how well the test construction, equating design, and equating method meet the five requirements mentioned above. In the case of large-scale standardized tests used for high-stakes purposes, it is very important that the equating of scores be confirmed to meet the five requirements and thus produce interchangeable test scores. Meanwhile, even when these five requirements are met, different equating methods produce different magnitudes of standard errors of equating.

Morrison and Fitzpatrick (1992) proposed that evaluating equating and scale stability could be achieved by comparing the results of direct and indirect equating functions. While the direct equating function is derived from equating new forms directly back to base forms, the indirect equating function is obtained from equating a new form back to the base form through an intermediate channel. The differences between the two functions indicate whether equated scores are behaving normally or whether scale drift actually exists that makes the reported scores unreliable and unstable.

Mean-Mean Method of Equating

The mean-mean method of transforming scores from one form to another, utilized in this study, is relatively straightforward. It uses the means of the a-parameter estimates of the anchor items on both forms to estimate the A constant, and the means of the b-parameter estimates of the anchor items on both forms to estimate the B constant, as seen in Equations 2-1 and 2-2 below:

A = \bar{a}_I / \bar{a}_J    (2-1)

B = \bar{b}_J - A \bar{b}_I    (2-2)

where A and B are constants (otherwise called equating coefficients), \bar{a}_I and \bar{a}_J are the means of the a parameters of forms I and J, respectively, and \bar{b}_J and \bar{b}_I are the means of the b parameters of form J and form I, respectively. These constants are, in turn, substituted into the following function to place the ability estimates and item parameters from one form onto the scale of the other:

\theta_{Ji} = A \theta_{Ii} + B    (2-3)

where A and B are the constants of the linear equation and \theta_{Ji} and \theta_{Ii} are the ability values for individual i on scales J and I. Also, the item parameters on both scales are related as follows:

a_{Jj} = a_{Ij} / A    (2-4)

b_{Jj} = A b_{Ij} + B    (2-5)

where a_{Jj} and b_{Jj} are the item parameters for item j on scale J, while a_{Ij} and b_{Ij} are the item parameters for item j on scale I.
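To make the arithmetic concrete, the following is a minimal sketch in R of the mean-mean transformation in Equations 2-1 through 2-5. The anchor-item parameter vectors (a_formI, b_formI, a_formJ, b_formJ) and their values are hypothetical placeholders standing in for separate calibrations of two forms; they are not the parameters used in this study.

```r
# Hypothetical anchor-item estimates from separate calibrations of forms I and J
a_formI <- c(0.55, 0.62, 0.48, 0.70, 0.51, 0.66, 0.58, 0.60)
b_formI <- c(-1.20, -0.40, 0.10, 0.65, 1.10, -0.85, 0.30, 0.90)
a_formJ <- c(0.57, 0.65, 0.50, 0.72, 0.49, 0.63, 0.61, 0.58)
b_formJ <- c(-1.05, -0.30, 0.22, 0.74, 1.21, -0.70, 0.41, 1.02)

# Mean-mean equating coefficients (Equations 2-1 and 2-2)
A <- mean(a_formI) / mean(a_formJ)
B <- mean(b_formJ) - A * mean(b_formI)

# Place scale-I quantities onto scale J (Equations 2-3 to 2-5)
rescale_theta <- function(theta_I) A * theta_I + B
rescale_a     <- function(a_I)     a_I / A
rescale_b     <- function(b_I)     A * b_I + B

c(A = A, B = B)
rescale_b(b_formI)   # form-I difficulties expressed on the form-J scale
```

In the thesis itself the coefficients were obtained with the equateIRT package rather than computed by hand, but the arithmetic underlying the mean-mean option is the same as in this sketch.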

High-Stakes Testing and Scale Stability

The advent of standardized and high-stakes testing has brought huge concern, not only about the comparability of test scores but also about the reliability and validity of test scores (Haladyna & Downing, 2005; Brennan, 2001). The focus of school curricula has shifted from learning to high performance on high-stakes tests. Considering the effects that high-stakes testing has on the test development process, the question of reliability comes into play. For this reason, it is important to discuss these issues in some detail in order to move forward on the subject matter.

Out of concern for the quality of learning in the American public school system, high stakes have been attached to tests. This is enacted by policy makers with the hope that setting high standards of achievement will propel much effort on the part of students, teachers, and educational administrators. But a concern that emanates from high-stakes testing is the question of the reliability and validity of these test scores. Proponents of high-stakes testing argue that it has improved student achievement and consequently narrowed the achievement gap by race and income, but unfortunately low-income earners, Latinos, and Blacks, according to McNeil (2000), have been negatively affected. Other prominent authors worthy of mention are Orfield and Wald (2000); studying high school students, the authors discovered that twice as many Black and Latino students failed a high-stakes test as compared to their White counterparts, and they concluded that high-stakes standardized tests will continue to hinder the ability of African American and Latino students to graduate from high school. For instance, since the introduction of high-stakes testing there has been an increase in the proportion of middle-grades dropouts, and these are predominantly African Americans and Latinos.

If the implementation of high-stakes testing programs occurs in circumstances where educational resources are inadequate or where tests lack sufficient reliability and validity for their intended purposes, there is potential for serious harm (AERA, 2000). In fact, policy makers and the public could be misled by bogus test score increases that are completely unrelated to basic educational improvements. Consequently, students are exposed to increased risk of failure, which may eventually lead them to drop out of school. In addition, teachers may be blamed, and even punished, for inequitable resources that may be beyond their control. According to French (2003), the test errors made by the major testing companies are too many to detail, and many have had catastrophic effects on students. It has also been noted that high-stakes test scores can be used as a measure for evaluating teachers in a way that leaves roughly a 25 percent chance that a rating based on test-driven performance is no better than a coin toss.

In classical test theory, reliability is closely linked to stability. This can also be applied to IRT, such that reported scores for individuals and schools are expected to be sufficiently precise to support the meanings and interpretations attached to them. Reliability refers to the accuracy or precision of test scores (AERA, 2000; Haladyna & Downing, 2005; Brennan, 2001; Green & Yang, 2009). More broadly, reliability refers to the proportion of variance in observed scores that is due to variance in true scores. According to AERA (2000), a test score is reliable if:

- Scores reported for individuals or for schools sufficiently and accurately support each intended interpretation of student achievement. Since test validity is relative, separate assessments are needed for each purpose of the test.
- Accuracy is maintained and adequately examined for the scores that are actually used. In this way, information regarding the reliability of raw scores may not adequately describe the accuracy of percentiles, and information about the reliability of school means may be insufficient if scores for subgroups are also used in reaching decisions about schools.
- A testing program establishes specific scores to determine passing or proficient achievement, and the program is able to demonstrate the validity of those scores as a true representation of student attainment.

Because of the high stakes attached to test scores today and the standardization of tests in American public schools, several equating methods have been proposed to equate tests between different forms and/or administrations. Some of these methods employ the Non-Equivalent Anchor Test (NEAT) design discussed above and utilize anchor items to depict equating relationships. Meanwhile, this process of equating test scores is never free from error in practical terms. In fact, the magnitude of the errors associated with this process varies across methods and designs and can be contained if certain methods are used. Unfortunately, standard errors of equating are rarely reported by test developers, and it has become a major concern of test developers and test users to determine the extent to which standard errors of equating influence test performance. These errors can make a scale unstable. In fact, when the standard errors of equating are significantly high, they mar the reliability of the test scores, thereby making the scaled scores less stable; it is therefore recommended that test developers pick anchor items with desirable attributes, such as stability throughout the period of item exposure (Yu & Osborn, 2005). In the same vein, the more reliable a test score is, the more stable the scale becomes.

Standard Errors of Equating

Standard errors of equating are useful in estimating the amount of error in test form equating that results from sampling. They can also be defined as the standard deviation of equating errors over theoretical repeated samples. Standard errors of equating are relevant for estimating equating error when scores are reported and as a basis for choosing among equating methods and designs. It is not unusual for equating functions to have some statistical error, since all equating functions are defined by statistical estimates of parameters, but these errors should be thoroughly examined as a major part of equating so that test developers know when the errors become negligible or relevant.

As discussed by Kolen and Brennan (2004, pp. 231-265), two kinds of methods for estimating standard errors of equating can be examined for various equating designs and methods: the bootstrap method and analytical methods. The bootstrap method can be applied to any equating design and method, while an analytical method provides standard errors for the estimation procedure on which it is based. For the purpose of this study, the delta-method approach is applicable. The delta method is useful for deriving the approximate standard error of a function of statistics for which an expression of the standard error exists. Kolen and Brennan (2004, p. 245) derive delta-method standard errors of equating, using a Taylor series expansion, by defining the population equating function eq_Y as a function of the test scores and moment parameters \mu_1, \ldots, \mu_t. The approximation for the sampling variance under the delta method becomes

var[\hat{eq}_Y(x)] \approx \sum_i \sum_j \frac{\partial eq_Y}{\partial \mu_i} \frac{\partial eq_Y}{\partial \mu_j} \, cov(\hat{\mu}_i, \hat{\mu}_j)    (2-6)

where \hat{\mu}_i is a sample estimate of \mu_i and \partial eq_Y / \partial \mu_i is the partial derivative of eq_Y with respect to \mu_i, evaluated at the population values. The standard error is the square root of Equation 2-6, which is equivalent to

se[\hat{eq}_Y(x)] = \sqrt{ var[\hat{eq}_Y(x)] }    (2-7)

The error indexed in Equations 2-6 and 2-7 is random error that results from the sampling of test takers used to estimate the population function eq_Y. The equations require that expressions for the sampling variances and sampling covariances be available. The estimated standard errors can be considered for specific data collection designs. In a random groups design, for instance, a single population of test takers is considered such that a random sample is drawn from the population and administered Form X and another random sample is drawn and administered Form Y, and equating is then conducted on these data. The process is repeated a number of times, and the variability at each score point is tabulated to obtain standard errors for the design.

Certain analytic methods for estimating standard errors of equating have also been made available for IRT equating methods. For instance, Lord (1982) and Ogasawara (2010) developed delta methods for deriving standard errors of equating, and Ogasawara (2003) developed a delta method for deriving standard errors of IRT observed-score equating. Battauz (2011), in turn, derived the asymptotic standard errors of IRT equating coefficients when several forms are equated and the linkage plan involves chains and double or multiple linking. A hands-on application of this method is available in the equateIRT package (Battauz, 2014), which was utilized in this study.
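As an illustration of the computation in Equations 2-6 and 2-7, the sketch below propagates a covariance matrix of parameter estimates through a function of those parameters using a numerical gradient. The function, parameter values, and covariance matrix are hypothetical placeholders, not estimates from this study.

```r
# Delta-method standard error of a function of estimated parameters:
# se(f(mu)) ~ sqrt( grad(f)' %*% Cov(mu) %*% grad(f) )

delta_se <- function(f, mu_hat, cov_mu, eps = 1e-6) {
  k <- length(mu_hat)
  grad <- numeric(k)
  # Numerical partial derivatives of f evaluated at the estimates
  for (i in seq_len(k)) {
    step <- rep(0, k)
    step[i] <- eps
    grad[i] <- (f(mu_hat + step) - f(mu_hat - step)) / (2 * eps)
  }
  sqrt(drop(t(grad) %*% cov_mu %*% grad))
}

# Hypothetical example: the B coefficient of the mean-mean method,
# B = mean(b_J) - A * mean(b_I), treated as a function of (A, mean_bI, mean_bJ)
f_B    <- function(mu) mu[3] - mu[1] * mu[2]
mu_hat <- c(A = 0.96, mean_bI = 0.12, mean_bJ = 0.20)   # placeholder estimates
cov_mu <- diag(c(0.0012, 0.0030, 0.0030))               # placeholder covariance matrix

delta_se(f_B, mu_hat, cov_mu)
```

The equateIRT package applies the same idea analytically, using the variance-covariance matrices of the calibrated item parameters, which is why those matrices are exported during the data analysis described in Chapter 3.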

Item Response Theory

It is important to have a brief overview of Item Response Theory (IRT) because the methods used in this study are based on it. IRT is a model-based, item-level, latent trait measurement theory (Lord, 1980), in which item characteristic curves are used to model the relationship between an examinee's latent ability and the probability of a correct response to an item. It is also a modern measurement theory that provides a test of item equivalence across groups: the theory can test whether an item is behaving differently across groups or classes. Test characteristic curves can be estimated from the item characteristic curves of all items on a test, and they model the expected true score on a test given a person's ability. There are different IRT-based equating methods and designs, among which is the NEAT design. This design utilizes anchor items (items common to both test forms) to show the equating relationship between two tests. The three-parameter logistic (3PL) model, according to Birnbaum (1968), is given by

P_j(\theta) = c_j + (1 - c_j) \frac{1}{1 + \exp[-D a_j (\theta - b_j)]}    (2-8)

Although the three-parameter logistic model is popular and commonly used, because of the inherent convergence problems of this model, and for the purpose of this research, the model was adjusted to the two-parameter logistic model. The adjusted model is obtained by setting the c parameter of the three-parameter logistic model to zero; therefore, the probability of a correct response to item j in form g for a person with ability \theta_i becomes

P(x_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D a_{gj} (\theta_i - b_{gj})]}    (2-9)

The parameters included in Equation 2-9 are the item discrimination and item difficulty parameters: a_{gj} is the item discrimination parameter, b_{gj} is the item difficulty parameter, and D is a constant that is set at 1.7. The person parameter \theta is a random variable with a standard normal distribution of mean zero and variance one, and the item parameters are estimated using the marginal maximum likelihood method (Bock and Aitkin, 1981). The parameter vector of a form collects the discrimination and difficulty parameters of all items on that form.
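The 2PL response function in Equation 2-9 is straightforward to express in R. The snippet below is a small sketch (the function name and parameter values are illustrative, not taken from the study's item bank) that computes the probability of a correct response and draws a single dichotomous response from it.

```r
# Two-parameter logistic item response function (Equation 2-9), with D = 1.7
p_2pl <- function(theta, a, b, D = 1.7) {
  1 / (1 + exp(-D * a * (theta - b)))
}

# Illustrative values: an examinee of average ability and a moderately easy item
theta <- 0.0
a <- 0.56    # discrimination
b <- -0.25   # difficulty

p <- p_2pl(theta, a, b)             # probability of a correct response
x <- rbinom(1, size = 1, prob = p)  # simulated dichotomous (0/1) response
c(probability = p, response = x)
```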

Anchor Test Length and Scale Stability

Anchor tests refer to the items that are common to two or more tests or test forms and are used to establish equating relationships between non-equivalent groups of examinees. Research has shown that the length of the anchor test has a role to play in the stability of equating results. For example, Ricker and von Davier (2007) conducted a study on the impact of anchor test length on equating results and found that longer anchor tests produce more stable equating results. Yang and Houang (1996) conducted a similar study using different equating methods and found that, irrespective of the methods used, longer anchor tests produced more stable equating results. Furthermore, Fitzpatrick (2008) conducted a study on the impact of anchor test configuration on student proficiency rates and found that longer anchor tests (those with more than 15 anchor items) produce more stable equating results. Therefore, it is clear that, in order to maintain scale stability, longer anchor tests are preferred.

Chained Equating/Linking

Chained equating is used to show indirect equating relationships between two test forms. For example, the equating relationship between forms X and Y could be determined through different links such that form X is equated to form X1, form X1 to form X2, and form X2 to form Y. This could be achieved using equipercentile methods or linear methods. Previous studies (Harris & Kolen, 1990; Livingston, Dorans, & Wright, 1990; Puhan, 2010; Sinharay & Holland, 2007; Wang, Lee, Brennan, & Kolen, 2008) have compared chained equating to post-stratification methods of equating and found that chained equating produces more stable equating results. However, some studies on chained equating (Battauz, 2013) indicate that longer equating chains produce less stable equating results, and other studies comparing linear chained equating methods to other methods have produced different results. The function for calculating linear chained equating coefficients, which is programmed into the equateIRT package that was utilized in this study, is presented below. For a chain that links form g to form l through intermediate forms, so that the chain is g = l_0, l_1, \ldots, l_m = l, the indirect coefficients are

A_{gl} = \prod_{k=1}^{m} A_{l_{k-1} l_k}    (2-10)

and

B_{gl} = \sum_{k=1}^{m} B_{l_{k-1} l_k} \prod_{h=k+1}^{m} A_{l_{h-1} l_h}    (2-11)

where A_{gl} and B_{gl} are the coefficients that link form g to form l, referred to as indirect equating coefficients, and A_{l_{k-1} l_k} and B_{l_{k-1} l_k} are the direct coefficients for each link in the chain.
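The composition in Equations 2-10 and 2-11 can be sketched in a few lines of R. The helper below takes the direct A and B coefficients of successive links (hypothetical values shown) and accumulates them into the indirect coefficients for the whole chain; for two links it reproduces B_{13} = A_{23} B_{12} + B_{23}.

```r
# Compose direct equating coefficients along a chain (Equations 2-10 and 2-11).
# A_links and B_links hold the direct coefficients of links 1->2, 2->3, ... in order.
chain_coefficients <- function(A_links, B_links) {
  A_chain <- 1
  B_chain <- 0
  for (k in seq_along(A_links)) {
    # Applying the next link to the already-composed transformation:
    # theta_new = A_k * (A_chain * theta + B_chain) + B_k
    B_chain <- A_links[k] * B_chain + B_links[k]
    A_chain <- A_links[k] * A_chain
  }
  c(A = A_chain, B = B_chain)
}

# Hypothetical direct coefficients for the links form1 -> form2 and form2 -> form3
chain_coefficients(A_links = c(0.98, 1.02), B_links = c(0.10, -0.05))
```

In the study itself, this composition and its delta-method standard errors were obtained from the chainec function of the equateIRT package rather than from a hand-written helper like this one.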

CHAPTER 3
METHODOLOGY

The 2PL model that was utilized in generating item responses is the adjusted 3PL model captured in Equation 2-9, that is,

P(x_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D a_{gj} (\theta_i - b_{gj})]}

The R package equateIRT automatically calculates the standard errors, and the 2PL model was utilized in generating the item responses.

Study 1

Data Generation

A simulation study was conducted to assess optimal choices for anchor test length that produce satisfactorily stable equating results. All aspects of data generation and analysis were done in the R environment (R Development Core Team, 2014). Data were generated for two testing administrations such that form 1 and form 2 were administered one year apart. The test forms were created using the Non-Equivalent Anchor Test design, each having a total of 40 items. These 40 items comprise varying numbers of unique and common (anchor) items according to condition. The number of anchor items was manipulated under four conditions (Table 3-2). In condition 1, both forms have eight items in common, with no item showing any form of IPD; condition 2 reflects a situation where both forms have twelve items in common, with none of these items showing IPD; condition 3 depicts a situation where both forms have eight items in common, with 10% of the test (i.e., 4 items) showing a unidirectional IPD of .2 on the b parameters (.2 being the difference between the form 1 and form 2 b parameters); and condition 4 depicts a situation where both forms have twelve items in common, with 10% (i.e., 4 items) showing a unidirectional IPD of .2 on the b parameters.

The proportion of items (i.e., 10%) selected to drift was chosen to represent a realistic scenario with respect to the typical maximum number of items on a large-scale test that are detected as drifting (Kirkpatrick & Meng, 2014). Only anchor items were manipulated to have IPD, and all anchor items used in this study were internal, meaning that they count toward scores ranging from 0 to 40. Also, the magnitude of .2 IPD was chosen to illustrate a situation of moderate IPD (Chen, 2013).

A bank of 30 anchor items was simulated to mimic the difficulty and discrimination parameters of the 2006/2007 Canadian provincial mathematics assessment, and the guessing parameters were set equal to zero (Table 3-1). These parameters were used in a previous study on IPD (Chen, 2013). Person parameters were generated to differ in each form according to condition, with means of .35 and .30; .25 and .25; .35 and .30; and .25 and .25, respectively, and standard deviations of 1 for each form. The differences in the mean ability levels (person parameters) depict an operational testing situation in which the examinees who take a test at each point in time are of different ability levels, and the anchor items were generated to provide a means of estimating the equating relationship between two or more forms. In order to align item difficulty with examinee ability, the difficulty parameters of the unique items were drawn from a normal distribution whose mean was set equal to the mean of the person parameters in each form. Although this may not be the case in operational testing, it was done to distinguish the unique items from the anchor items; the remaining parameters for the unique items were generated from a uniform distribution, using the same discrimination parameters as those of the anchor items. All anchor items maintained the same position in all test forms (that is, the first 8 or first 12 items, as the case may be). In this study, item responses were simulated for 1,000 examinees on each test form by using the 2PL function given by Equation 2-9, and 100 trials were deemed adequate to obtain relatively stable results. Specific details of the make-up of each test form are presented in Table 3-2, and the equating design is depicted in Figures A-1 and A-2.
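A condensed sketch of the data generation just described is given below. It builds two 40-item forms that share eight anchor items, applies a unidirectional drift of .2 to the b parameters of four anchors on the second form, and simulates dichotomous responses for 1,000 examinees per form with the 2PL function. The parameter values are drawn from generic distributions here rather than from the Chen (2013) item bank, so the sketch illustrates the structure of the design rather than reproducing the study's data.

```r
set.seed(1)
n_items <- 40; n_anchor <- 8; n_persons <- 1000; D <- 1.7

# Anchor parameters shared by both forms (illustrative distributions)
a_anchor <- runif(n_anchor, 0.3, 1.1)
b_anchor <- rnorm(n_anchor, 0, 1)

# Build a form: anchors first, then unique items centered on the group's mean ability
make_form <- function(theta_mean) {
  a <- c(a_anchor, runif(n_items - n_anchor, 0.3, 1.1))
  b <- c(b_anchor, rnorm(n_items - n_anchor, theta_mean, 1))
  list(a = a, b = b)
}
form1 <- make_form(theta_mean = 0.35)
form2 <- make_form(theta_mean = 0.30)

# Unidirectional IPD of .2 on the b parameters of four anchor items in form 2
drift_items <- 1:4
form2$b[drift_items] <- form2$b[drift_items] + 0.2

# Simulate dichotomous 2PL responses for one form
simulate_responses <- function(form, theta) {
  p <- outer(theta, seq_along(form$a),
             function(th, j) 1 / (1 + exp(-D * form$a[j] * (th - form$b[j]))))
  matrix(rbinom(length(p), 1, p), nrow = length(theta))
}
resp1 <- simulate_responses(form1, rnorm(n_persons, 0.35, 1))
resp2 <- simulate_responses(form2, rnorm(n_persons, 0.30, 1))
dim(resp1)   # 1000 x 40 response matrix for form 1
```

In the study itself, the simulated response matrices were then calibrated separately with the ltm package and the equating coefficients computed with equateIRT, as described in the Data Analysis subsections.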

Data Analysis

The generated data were analyzed in the R language environment. Generated data from all test forms were read in and named accordingly. All item calibrations for each form were done separately through the marginal maximum likelihood approach using the ltm package (Rizopoulos, 2006). Variance-covariance matrices of the item parameter estimates were also generated using the vcov function in ltm. All equating coefficients were computed using the equateIRT package (Battauz, 2014).

Study 1 seeks to investigate optimal choices for the number of common items to be included in a test in order to obtain satisfactorily stable equating results under two conditions: (i) when none of the linking items shows any form of IPD, and (ii) when 10% of the entire test (that is, 4 items) shows IPD of magnitude .2 (.2 being the difference between the b parameters of the two forms to be equated). To achieve this, the process was decomposed into four parts: (a) equating form 1 to form 2, which have 20% (that is, 8) anchor items, with none of the items showing IPD; (b) equating form 3 to form 4, which have 30% (that is, 12) anchor items embedded in them, with none of these items showing any form of IPD; (c) equating form 1 to form 7, which have 20% (that is, 8) anchor items embedded in them, of which 4 anchor items showed IPD of magnitude .2; and (d) equating form 3 to form 9, which have 30% (that is, 12) anchor items embedded in them, of which 4 anchor items showed IPD of magnitude .2. The equating coefficients were estimated by using the mean-mean method. This method was selected because it is more stable than the mean-sigma method (Baker & Al-Karni, 1991). Results are discussed in Chapter 4.

Study 2

Data Generation

Study 2 seeks to investigate the number of equating links that yield satisfactorily stable equating results. The data generation for this study is similar to that of Study 1. However, only four test forms (form 1, form 2, form 3, and form 4) were generated for this study, and all forms have eight items in common (Table 3-3). Also, all anchor items were taken from the anchor bank, and they maintain the same position in all forms, with no item showing IPD. Furthermore, item responses were simulated for 1,000 examinees on each test form by using the 2PL function given by Equation 2-9, and 100 trials were deemed adequate to obtain relatively stable results.

Data Analysis

Similar to Study 1, the generated data were analyzed in the R language environment. Generated data from all test forms were read in and named accordingly. All item calibrations for each form were done separately through the marginal maximum likelihood approach using the ltm package (Rizopoulos, 2006). Variance-covariance matrices of the item parameter estimates were also generated using the vcov function in ltm. All equating coefficients were computed using the equateIRT package (Battauz, 2014).

To assess the optimal choice for the number of links that produces satisfactorily stable equating results, chain equating was conducted using the chainec function in the equateIRT package, which performs chained equating and produces indirect equating coefficients with their corresponding standard errors. Forms 1, 2, 3, and 4 were used to conduct the chain linking, using the mean-mean method. The equating process was conducted under two conditions: (a) form 1 was linked to form 3 through the path 1, 2, 3; and (b) form 1 was linked to form 3 through the path 1, 2, 4, 3. As mentioned earlier, these forms all have eight items in common, with none of these items showing IPD. Other than the number of links, there is no difference between these two conditions. Figures 4-1 and 4-2 depict the chains considered in this study.

Evaluation Criteria

The stability of the equating results is evaluated by comparing the standard errors of equating produced in the different conditions; smaller standard errors are considered better. However, differences in the standard errors that have little or no effect on the equating results are considered satisfactory. Therefore, an analysis of variance (ANOVA) was run using the Statistical Package for the Social Sciences (SPSS) to assess the statistical significance of the mean differences in the equating errors across the different conditions, as well as the corresponding effect sizes. Results are presented in Chapter 4.
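The significance tests themselves were run in SPSS; an equivalent check can be sketched in R, the environment used for the simulations. The data frame below is a hypothetical stand-in for the per-replication standard errors (one row per simulated trial and condition), not the study's actual output.

```r
set.seed(2)

# Hypothetical per-replication standard errors for two anchor-length conditions
sim_results <- data.frame(
  condition = rep(c("8 anchors", "12 anchors"), each = 100),
  se_A      = c(rnorm(100, mean = 0.036, sd = 0.002),
                rnorm(100, mean = 0.035, sd = 0.002))
)

# One-way ANOVA on the A-coefficient standard errors across conditions
fit <- aov(se_A ~ condition, data = sim_results)
summary(fit)

# Eta-squared as a simple effect-size measure
ss     <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
eta_sq
```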

Table 3-1. Item Parameters for the Anchor Items

Parameter   N    Minimum   Maximum   Mean     Std. Deviation
A           30   0.2621    1.1495    0.5624   0.1885
B           30   -2.1495   1.7984    0.1199   1.1225
C           30   0.0000    0.0000    0.0000   0.0000

Source: Chen (2013)

Table 3-2. Distribution of Items according to Condition for Study 1

Condition   Form   Number of Items   Number of Anchor Items   Number of Unique Items   IPD   Frequency of IPD   Magnitude of IPD   Person Parameters
1           1      40                8                        32                       No    NA                 NA                 .35
1           2      40                8                        32                       No    NA                 NA                 .30
2           1      40                12                       28                       No    NA                 NA                 .25
2           2      40                12                       28                       No    NA                 NA                 .25
3           1      40                8                        32                       No    NA                 NA                 .35
3           2      40                8                        32                       Yes   4                  .2                 .30
4           1      40                12                       28                       No    NA                 NA                 .25
4           2      40                12                       28                       Yes   4                  .2                 .25

Note: IPD = item parameter drift

Table 3-3. Distribution of Items according to Condition for Study 2

Form   Number of Items   Number of Anchor Items   Number of Unique Items   IPD   Frequency of IPD   Magnitude of IPD   Person Parameters
1      40                8                        32                       No    NA                 NA                 .35
2      40                8                        32                       No    NA                 NA                 .30
3      40                8                        32                       No    NA                 NA                 .30
4      40                8                        32                       No    NA                 NA                 .35

Note: IPD = item parameter drift

CHAPTER 4
RESULTS

Study 1

Equating coefficients and their corresponding standard errors were estimated using the equateIRT package. The results are displayed in Table 4-1. For the conditions without IPD, the results indicate that test forms with the shorter anchor produced a larger standard error of equating for the A constant (0.036) when compared with test forms with the longer anchor (0.035). Meanwhile, the reverse was the case for the standard errors of equating associated with the B constant of the aforementioned test forms: 0.062 for the shorter anchor test and 0.063 for the longer anchor test. Furthermore, an ANOVA was run to determine the significance of the mean differences between the standard errors associated with the shorter anchor test and the longer anchor test. The ANOVA results indicate that the mean difference between the A-constant standard errors of equating of the two tests (0.00065) is statistically significant (p = .024), with an effect size of zero. On the other hand, the mean difference between the B-constant standard errors of equating of the two tests (-0.00183) is not statistically significant (p = .269), with an effect size of zero.

For the forms with IPD, on the other hand, the forms with the shorter anchor produced a higher standard error of equating for the A constant (0.0363) than the test forms with the longer anchor (0.0357). In the same vein, the standard error of equating associated with the B constant of the shorter anchor test (0.0650) was higher than that of the longer anchor test (0.0639). The mean difference between the A-constant standard errors of equating of the two tests (0.00060) is not statistically significant (p = .05), with an effect size of zero. On the other hand, the mean difference between the B-constant standard errors of equating of the two tests (0.00102) is not statistically significant (p = .838), with an effect size of zero. The results are displayed in Table 4-2.

Study 2

The chained equating coefficients were estimated using the equateIRT package. The mean-mean method was utilized for estimating the equating coefficients, while the delta method was used to estimate the corresponding standard errors. The results indicate that the estimated equating coefficients and the corresponding standard errors for both the two links (path 125) and the three links (path 1265) used in this study were not significantly different from each other. In fact, the two-link and three-link chains produced exactly the same estimates and corresponding standard errors (A = 1.0038, B = 0.3252, S.E.(A) = 0.0376, S.E.(B) = 0.0668). The results are displayed in Table 4-1.

Table 4-1. Equating Coefficients and Corresponding Standard Errors

Link   Study   Mean of A   Mean of B   Mean of S.E.(A)   Mean of S.E.(B)
12     1       0.9611      0.1772      .0359             .0618
34     1       0.9668      0.2024      .0353             .0637
17     1       0.9632      0.0392      .0363             .0650
39     1       0.9639      0.0205      .0357             .0639
125    2       1.0039      0.3252      .0376             .0668
1265   2       1.0039      0.3252      .0376             .0668

Table 4-2. ANOVA Results for Anchor Test Length and Length of Equating Links

Condition         Study   IPD Magnitude   Mean Difference S.E.(A)   Mean Difference S.E.(B)   Effect Size
8 vs 12 anchors   1       0.0000          0.00065*                  -0.00183                  0.0000
8 vs 12 anchors   1       0.2000          0.00060                   0.00102                   0.0000
2 vs 3 links      2       0.0000          0.0000                    0.0000                    0.0000

Note: *The mean difference is significant at the 0.05 level.

Figure 4-1. Depiction of Chain Equating Link 125

Figure 4-2. Depiction of Chain Equating Link 1265

CHAPTER 5
DISCUSSION AND CONCLUSION

This study investigated optimal choices for the number of anchor items and equating links that produce satisfactorily stable equating results in a NEAT design. A simulation was conducted using the R software to investigate the impact of anchor test length, with and without drift, as well as the impact of the number of equating links, on the standard error of equating. A bank of 30 anchor items was simulated to mimic the difficulty and discrimination parameters of the 2006/2007 Canadian provincial mathematics assessment, and this set of items was used to show the equating relationships between two or more forms. Ten test forms were generated with varying numbers of anchor items according to condition. The length of the anchor test was manipulated in two scenarios: when IPD is present and when it is not. Each scenario had two pairs of test forms: one pair of linkings (form 1 with form 2, and form 3 with form 4) having 8 and 12 anchor items, respectively, and the other pair (form 1 with form 7, and form 3 with form 9) also having 8 and 12 anchor items, respectively. Equating coefficients were estimated using the mean-mean method, and the corresponding standard errors were estimated using the delta method embedded in the equateIRT package. The standard errors of equating were used to assess the stability of the results; the equating relationship with the smallest standard errors is considered the better choice, simply because less error produces a narrower confidence interval and more stable equating results.

The results indicate that, for forms without IPD, test forms with the shorter anchor produced larger A-constant standard errors than test forms with the longer anchor. In addition, the standard errors of equating associated with the B constant of the shorter anchor test are smaller than those of the longer anchor test. The mean difference of the standard errors of equating associated with the A constant in this condition was found to be statistically significant, while the mean difference associated with the B constant was not. However, these mean differences do not have a significant effect on the stability of the equating results. On the other hand, the results from the condition showing IPD indicate that the shorter anchor test produced higher standard errors for both the A and B constants. However, the differences between the means of the standard errors for both the A and B constants are not statistically significant and have no significant effect on the estimates.

Furthermore, results from assessing the stability of the results for different equating links indicate that when two and three links are considered, with all test forms having an equal number of anchor items, the estimated equating coefficients and their corresponding standard errors are the same. This means that choosing between two and three equating links, as used in this study, will not significantly affect the stability of the equating results.

Conclusion

Study 1 reveals that longer anchor tests produce more stable equating results with and without IPD, since they yielded smaller standard errors of equating. This confirms the results of Ricker and von Davier (2007) and other studies on anchor test length, despite the fact that a different method was utilized. On the other hand, the results from Study 2 reveal that both two and three links produce satisfactorily stable results. This result actually contrasts with Battauz (2013), in that the number of links did not affect the equating results in this study, whereas it did in Battauz (2013). This could be explained by the relationship of the forms being equated: in that study, the forms being equated had no items in common, while in this study, they have eight items in common.

Recent studies on equating focus on maintaining scale stability, and the process of equating has two major sources of error: random error and systematic error. Systematic errors, in turn, cause scale drift, which makes scores incomparable across test forms, time, and administrations. Despite this concern for scale stability, test developers are also concerned about test security, which they consistently make conscious efforts to protect. That said, there should be a balance between the maintenance of scale stability and test security. The implication of these results depends on how critical the decisions made on the basis of test scores are. For example, if the test is low stakes, more equating error (as well as measurement error) may be considered acceptable, whereas the reverse would be the case for high-stakes tests.

Limitation and Recommendation for Future Research

As is typical of every research study, this study is limited in that IPD was considered only when assessing the impact of anchor test length on equating results. Therefore, the impact of IPD on chained equating results should be considered in future research.

APPENDIX
OTHER FIGURES

Figure A-1. Graphical depiction of the Non-Equivalent Test Design used in this study: form 1 and form 2 share anchor items 1-8, which are used in estimating the equating functions; items 9a-40a and 9b-40b are unique to their respective forms.

Figure A-2. Graphical depiction of the Non-Equivalent Test Design used in this study: form 3 and form 4 share anchor items 1-12, which are used in estimating the equating functions; items 13a-40a and 13b-40b are unique to their respective forms.

LIST OF REFERENCES

AERA (2000). AERA position statement on high-stakes testing in pre-K-12 education. Retrieved from http://www.aera.net/AboutAERA/AERARulesPolicies/AERAPolicyStatements/PositionStatementonHighStakesTesting/tabid/11083/Default.aspx

Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28, 147-162.

Battauz, M. (2011). IRT methods for chain and multiple equating. Retrieved from www.dies.uniud.it/wpdies.en.html?download=74.File

Battauz, M. (2013). IRT test equating in complex linkage plans. Psychometrika, 78, 464-480.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-449.

Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25(4), 275-285.

Brennan, R. L. (2001). Some problems, pitfalls, and paradoxes in educational measurement. Educational Measurement: Issues and Practice, 20(4), 6-18.

Chen, Q. (2013). Remove or keep: Linking items showing item parameter drift (3560290, Michigan State University). ProQuest Dissertations and Theses, 225. Retrieved from http://search.proquest.com/docview/1355760693?accountid=10920. (1355760693)

Dorans, N. J., & Holland, P. W. (2000). Population invariance and the equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306. In Huggins (2012).

French, D. (2003). A new vision of authentic assessment to overcome the flaws in high-stakes testing. Middle School Journal, 5(1), 2-13.

Fitzpatrick, A. R. (2008). NCME 2008 presidential address: The impact of anchor test configuration on student proficiency rates. Educational Measurement: Issues and Practice, 27(4), 34-40.

Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20(4), 369-377.

Green, S. A., & Yang, Y. (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika. doi:10.1007/s11336-008-9099-3

Gulliksen, H. (1950). Theory of mental tests. Hillsdale, NJ: Lawrence Erlbaum. In French, D. (2003).

Guo, H., Liu, J., Dorans, N., & Feigenbaum, M. (2011). Multiple linking in equating and random scale drift (Research Report ETS RR-11-46).

Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.

Haney, W. (1999). Study of Texas Education Agency statistics on cohorts of Texas high school students, 1978-1998. Unpublished manuscript. Boston College, Center for the Study of Testing, Evaluation, and Educational Policy.

Harris, D. J., & Kolen, M. J. (1990). A comparison of two equipercentile equating methods for common item equating. Educational and Psychological Measurement, 50, 61-71.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Linden, W. J. van der, & Hambleton, R. K. (1997). Handbook of modern item response theory. Springer-Verlag.

Livingston, S. A., Dorans, N. J., & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1), 73-95.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

McNeil, L. (2000, June). Creating new inequalities: Contradictions of reform. Phi Delta Kappan, 729-734.

Measurement, 32, 62-80. In French, D. (2003).

Morrison, C. A., & Fitzpatrick, S. J. (1992). Direct and indirect equating: A comparison of four methods using the Rasch model. Austin: Measurement and Evaluation Center, The University of Texas at Austin. (ERIC Document Reproduction Services No. ED 375 152)

Ogasawara, H. (2000). Asymptotic standard errors of IRT equating coefficients using moments. Economic Review (Otaru University of Commerce), 51, 1-23.

Ogasawara, H. (2001). Item response theory true score equatings and their standard errors. Journal of Educational and Behavioral Statistics, 26, 31-50.

Ogasawara, H. (2003). Asymptotic standard errors of IRT observed score equating methods. Psychometrika, 68, 193-211.

Orfield, G., & Wald, J. (2000). Testing, testing. The Nation, 38-40. In French, D. (2003).

Puhan, G. (2010). A comparison of chained linear and post-stratification linear equating under different testing conditions. Journal of Educational Measurement, 47, 54-75.

R Development Core Team (2014). R: A language and environment for statistical computing, reference index version 3.3.1. R Foundation for Statistical Computing. Vienna, Austria. URL http://www.R-project.org

Ricker, K. L., & von Davier, A. A. (2007). The impact of anchor test length on equating results in a nonequivalent groups design. ETS Research Report Series, 2007(2), i-19.

Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17, 1-25.

subgroups of examinees taking a science achievement test. Applied Psychological

Sinharay, S., & Holland, P. W. (2007). Is it necessary to make anchor tests mini-versions of the tests being equated or can some restrictions be relaxed? Journal of Educational Measurement, 44, 249-275.

Tate, R. L. (2003). Equating for long-term scale maintenance of mixed format tests containing multiple choice and constructed response items. Educational and Psychological Measurement, 63, 839-853.

Wang, T., Lee, W., Brennan, R. L., & Kolen, M. J. (2008). A comparison of frequency estimation and chained equipercentile methods under the common-item non-equivalent groups design. Applied Psychological Measurement, 32, 632-651.

Wollack, J. A., Sung, H. J., & Kang, T. (2005, April). Longitudinal effects of item parameter drift. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

Wollack, J. A., Sung, H. J., & Kang, T. (2006, April). The impact of compounding item parameter drift on ability estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Yang, W. L., & Houang, R. T. (1996). The effect of anchor length and equating method on the accuracy of test equating: Comparisons of linear and IRT-based equating using an anchor item design. Paper presented at the annual meeting of the American Educational Research Association, New York.

Yu, C. H., & Osborn Popp, S. E. (2005). Test equating by common items and common subjects: Concepts and applications. Practical Assessment, Research and Evaluation, 10(4).

BIOGRAPHICAL SKETCH

Christiana Aikenosi Akande attended the University of Benin, Nigeria, and earned her Master of Arts in Education in research and evaluation methodology from the University of Florida in the spring of 2015. She continued to pursue a doctoral degree in the same program at the same institution, with an expected graduation date of summer 2018.