title year volume issue pages authors abstract
Examination of Indices of High School Performance Based on the Graded Response Model. 2019 38 2 41-52 Allen, Jeff and Mattern, Krista We examined summary indices of high school performance (coursework, grades, and test scores) based on the graded response model (GRM). The indices varied by inclusion of ACT test scores and whether high school courses were constrained to have the same difficulty and discrimination across groups of schools. The indices were examined with respect to skewness, incremental prediction of college degree attainment, and differences across racial/ethnic and socioeconomic subgroups. The most difficult high school courses to earn an "A" grade included calculus, chemistry, trigonometry, other advanced math, physics, algebra 2, and geometry. The GRM-based indices were less skewed than simple high school grade point average (HSGPA) and had higher correlations with ACT Composite score. The index that included ACT test scores and allowed item parameters to vary by school group was most predictive of college degree attainment, but had larger subgroup differences. Implications for implementing multiple measure models for college readiness are discussed.
An Empirically Derived Index of High School Academic Rigor. 2019 38 1 6-15 Allen, Jeff and Mattern, Krista and Ndum, Edwin We derived an index of high school academic rigor (HSAR) by optimizing the prediction of first-year college GPA (FYGPA) based on high school courses taken, grades, and indicators of advanced coursework. Using a large data set and nominal parameterization of high school course outcomes, the HSAR index capitalizes on differential contributions across courses and nonlinear relationships between course grades and FYGPA. Test scores from eighth grade were incorporated in the model to isolate the contribution of HSAR. High school courses with the largest relationships with FYGPA were English 11, English 12, Chemistry, English 10, Calculus, and Algebra 2. Participation in Advanced Placement, accelerated, or honors courses increased HSAR. The correlation of the HSAR index and FYGPA was .52 and the HSAR index led to modest improvement in overall prediction when combined with high school GPA and ACT Composite score. HSAR index subgroup differences were smaller than subgroup differences in ACT Composite score. Implications for high school counselors, researchers, and postsecondary student service personnel are discussed.
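A minimal sketch of how a GRM-based performance index like the one in the Allen and Mattern entry above might be computed from course grades: each course is treated as a polytomous item under the graded response model and the student's index is an expected a posteriori (EAP) estimate of the latent trait. The course names, discrimination and threshold values, and grade coding below are hypothetical, not the parameters estimated in the study.

import numpy as np

def grm_category_probs(theta, a, b):
    # Graded response model: P(grade category k | theta) for ordered categories 0..len(b).
    # a is the course's discrimination; b is the increasing list of grade thresholds.
    cum = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return np.array([cum[k] - cum[k + 1] for k in range(len(b) + 1)])

def eap_index(grades, items, grid=np.linspace(-4, 4, 81)):
    # Expected a posteriori estimate of the latent performance index from course grades,
    # assuming a standard normal prior over the index.
    prior = np.exp(-grid ** 2 / 2)
    likelihood = np.ones_like(grid)
    for grade, (a, b) in zip(grades, items):
        likelihood *= np.array([grm_category_probs(t, a, b)[grade] for t in grid])
    posterior = prior * likelihood
    return float(np.sum(grid * posterior) / np.sum(posterior))

# Hypothetical course "items" with grade categories D/C/B/A coded 0..3
items = [(1.4, [-2.0, -0.8, 0.6]),   # e.g., algebra 2
         (1.1, [-1.6, -0.3, 1.2]),   # e.g., chemistry
         (1.7, [-1.2, 0.1, 1.5])]    # e.g., calculus
print(eap_index([3, 2, 2], items))   # index for a student earning A, B, B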
Examining Estimates of Intervention Effectiveness Using Sensitivity Analysis. 2018 37 2 45-53 An, Chen and Braun, Henry and Walsh, Mary E. Making causal inferences from a quasi-experiment is difficult. Sensitivity analysis approaches to address hidden selection bias thus have gained popularity. This study serves as an introduction to a simple but practical form of sensitivity analysis using Monte Carlo simulation procedures. We examine estimated treatment effects for a school-based support intervention designed to address student strengths and needs in academic and nonacademic areas by leveraging partnerships with community agencies. Middle school (Grades 6–8) statewide standardized test scores in mathematics and English language arts (ELA) were examined for students in a large urban district who participated in City Connects during elementary school. Results showed that the estimated treatment effects in both subjects were reduced slightly with the inclusion of U, a hypothesized unobserved binary variable. However, simulated effects fell within one-sided 90% confidence intervals for original treatment effects, suggesting only a mild sensitivity to hidden bias. Moreover, almost identical estimated treatment effects were observed when the magnitude of the mathematical difference between each pair of the conditional probabilities of U given the treatment indicator Z was the same.
Gauging Item Alignment Through Online Systems While Controlling for Rater Effects. 2015 34 1 22-33 Anderson, Daniel and Irvin, Shawn and Alonzo, Julie and Tindal, Gerald A. The alignment of test items to content standards is critical to the validity of decisions made from standards-based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content-matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between-rater severity, evaluate intrarater consistency, and provide item-level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim-formative mathematics test items. Implications for the field and limitations of this approach are discussed.
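A minimal sketch of the kind of Monte Carlo sensitivity analysis described in the An, Braun, and Walsh entry above: a hypothesized unobserved binary variable U is repeatedly simulated from assumed conditional probabilities given treatment status and outcome level, the treatment effect is re-estimated with U included as a covariate, and the average over replications is compared with the naive estimate. The data, probabilities, and regression model below are invented; the exact simulation protocol in the article may differ.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
z = rng.integers(0, 2, n)                      # treatment indicator
y = 50 + 3.0 * z + rng.normal(0, 10, n)        # outcome with a true effect of 3
naive = sm.OLS(y, sm.add_constant(z)).fit().params[1]

def simulated_confounder_effect(p, reps=200):
    # Average treatment-effect estimate after adjusting for a simulated binary
    # confounder U, drawn with P(U = 1 | Z = i, Y above/below median) = p[i][j].
    y_hi = (y > np.median(y)).astype(int)
    effects = []
    for _ in range(reps):
        pr = np.array([p[int(zi)][int(yi)] for zi, yi in zip(z, y_hi)])
        u = rng.binomial(1, pr)
        X = sm.add_constant(np.column_stack([z, u]))
        effects.append(sm.OLS(y, X).fit().params[1])   # coefficient on Z
    return float(np.mean(effects))

# A confounder more common among treated, high-scoring students pulls the estimate down slightly
p = {0: {0: 0.3, 1: 0.4}, 1: {0: 0.4, 1: 0.6}}
print(round(naive, 2), round(simulated_confounder_effect(p), 2))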
Putting Rubrics to the Test: The Effect of a Model, Criteria Generation, and Rubric-Referenced Self-Assessment on Elementary School Students' Writing. 2008 27 2 3-13 Andrade, Heidi L. and Ying Du and Xiaolei Wang The purpose of this study was to investigate the effect of reading a model written assignment, generating a list of criteria for the assignment, and self-assessing according to a rubric, as well as gender, time spent writing, prior rubric use, and previous achievement on elementary school students' scores for a written assignment (N = 116). Participants were in grades 3 and 4. The treatment involved using a model paper to scaffold the process of generating a list of criteria for an effective story or essay, receiving a written rubric, and using the rubric to self-assess first drafts. The comparison condition involved generating a list of criteria for an effective story or essay, and reviewing first drafts. Findings include a main effect of treatment and of previous achievement on total writing scores, as well as main effects on scores for the individual criteria on the rubric. The results suggest that using a model to generate criteria for an assignment and using a rubric for self-assessment can help elementary school students produce more effective writing.
Components of Variance of Scales With a Bifactor Subscale Structure From Two Calculations of α. 2016 35 4 25-30 Andrich, David Since Cronbach's (1951) elaboration of α from its introduction by Guttman (1945), this coefficient has become ubiquitous in characterizing assessment instruments in education, psychology, and other social sciences. Also ubiquitous are caveats on the calculation and interpretation of this coefficient. This article summarizes a recent contribution (Andrich, 2015) on the use of coefficient α which complements these many caveats. It shows that in the presence of a simple bifactor structure of a scale where unique components of variance are homogeneous in magnitude, three components of variance and the common latent correlation among the subscales can be calculated from the ratio of two calculations of α, one at the level of the items, the other at the level of the subscales. It was suggested that these two ready calculations and their interpretation, and the reporting of all four indices in the analysis of scales with a subscale structure, would reduce the misinterpretation of this coefficient. An illustrative example of the application of the calculations is also shown.
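A minimal sketch of the two calculations of α referred to in the Andrich entry above: coefficient α computed once with items as the parts and once with subscale totals as the parts, with their ratio as the quantity of interest. The item data and subscale assignment below are simulated; the decomposition of that ratio into variance components and the common latent correlation follows Andrich (2015) and is not reproduced here.

import numpy as np

def cronbach_alpha(parts):
    # Coefficient alpha for a matrix of part scores (rows = persons, columns = parts).
    k = parts.shape[1]
    part_var = parts.var(axis=0, ddof=1).sum()
    total_var = parts.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - part_var / total_var)

rng = np.random.default_rng(1)
n, n_sub, items_per_sub = 500, 3, 5
general = rng.normal(size=(n, 1))
items = np.hstack([
    general + 0.8 * rng.normal(size=(n, 1)) + rng.normal(size=(n, items_per_sub))
    for _ in range(n_sub)
])                                                # simple bifactor-like structure

alpha_items = cronbach_alpha(items)               # alpha at the level of the items
subscales = items.reshape(n, n_sub, items_per_sub).sum(axis=2)
alpha_subscales = cronbach_alpha(subscales)       # alpha at the level of the subscales
print(alpha_items, alpha_subscales, alpha_items / alpha_subscales)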
Rater Certification Tests: A Psychometric Approach. 2019 38 2 6-13 Attali, Yigal Rater training is an important part of developing and conducting large-scale constructed-response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high-stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.
Teaching Introductory Measurement: Suggestions for What to Include and How to Motivate Students. 2012 31 2 8-13 Bandalos, Deborah L. and Kopp, Jason P. In this article, we discuss the importance of measurement literacy and some issues encountered in teaching introductory measurement courses. We present results from a survey of introductory measurement instructors, including information about the topics included in such courses and the amount of time spent on each. Topics that were included by the largest percentages of respondents were: validity, reliability, item analysis, item development, norms, standardized scores, classical test theory, instrument interpretation, and the history of testing, each of which was covered by at least 50% of respondents. Respondents were also asked the number of class sessions spent on each topic, and were asked to rate the importance of each. Responses to these questions closely paralleled those regarding the percentages of respondents who included these topics in their courses. We also report suggestions for class activities, arguing that those teaching introductory measurement courses should emphasize the relevance of measurement concepts to students' lives and future careers. To this end, we provide suggestions for activities that might help to accomplish this goal.
The Reliability of College Grades. 2015 34 4 31-40 Beatty, Adam S. and Walmsley, Philip T. and Sackett, Paul R. and Kuncel, Nathan R. and Koch, Amanda J. Little is known about the reliability of college grades relative to how prominently they are used in educational research, and the results to date tend to be based on small sample studies or are decades old. This study uses two large databases (N > 800,000) from over 200 educational institutions spanning 13 years and finds that both first-year and overall college GPA can be expected to be highly reliable measures of academic performance, with reliability estimated at .86 for first-year GPA and .93 for overall GPA. Additionally, reliabilities vary moderately by academic discipline, and within-school grade intercorrelations are highly stable over time. These findings are consistent with a hierarchical structure of academic ability. Practical implications for decision making and measurement using GPA are discussed.
Diagnosing Teachers' Understandings of Rational Numbers: Building a Multidimensional Test Within the Diagnostic Classification Framework. 2014 33 1 2-14 Bradshaw, Laine and Izsák, Andrew and Templin, Jonathan and Jacobson, Erik We report a multidimensional test that examines middle grades teachers' understanding of fraction arithmetic, especially multiplication and division. The test is based on four attributes identified through an analysis of the extensive mathematics education research literature on teachers' and students' reasoning in this content area. We administered the test to a national sample of 990 in-service middle grades teachers and analyzed the item responses using the log-linear cognitive diagnosis model. We report the diagnostic quality of the test at the item level, mastery classifications for teachers, and attribute relationships. Our results demonstrate that, when a test is grounded in research on cognition and is designed to be multidimensional from the onset, it is possible to use diagnostic classification models to detect distinct patterns of attribute mastery.
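A minimal sketch of the item response function used by the log-linear cognitive diagnosis model (LCDM) mentioned in the Bradshaw et al. entry above, for a hypothetical item measuring two attributes; the intercept, main-effect, and interaction parameters are made up for illustration and are not the estimates reported in the study.

import numpy as np
from itertools import product

def lcdm_prob(alpha, lam0, lam_main, lam_int):
    # P(correct | attribute profile alpha) for an LCDM item measuring two attributes:
    # logit = intercept + attribute main effects + two-way interaction.
    logit = lam0 + np.dot(lam_main, alpha) + lam_int * alpha[0] * alpha[1]
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical parameters: nonmasters succeed ~18% of the time, each attribute helps,
# and mastering both attributes helps the most.
lam0, lam_main, lam_int = -1.5, np.array([1.0, 0.8]), 1.2
for alpha in product([0, 1], repeat=2):
    print(alpha, round(lcdm_prob(np.array(alpha), lam0, lam_main, lam_int), 3))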
The Quality of Local District Assessments Used in Nebraska's School-Based Teacher-Led Assessment and Reporting System (STARS). 2005 24 2 14-21 Brookhart, Susan M. A sample of 293 local district assessments used in the Nebraska STARS (School-based Teacher-led Assessment and Reporting System), 147 from 2004 district mathematics assessment portfolios and 146 from 2003 reading assessment portfolios, was scored with a rubric evaluating their quality. Scorers were Nebraska educators with background and training in assessment. Raters reached an agreement criterion during a training session; however, analysis of a set of 30 assessments double-scored during the main scoring session indicated that the math ratings remained reliable during scoring, while the reading ratings did not. Therefore, this article presents results for the 147 mathematics assessments only. The quality of local mathematics assessments used in the Nebraska STARS was good overall. The majority were of high quality on characteristics that go to validity (alignment with standards, clarity to students, appropriateness of content). Professional development for Nebraska teachers is recommended on aspects of assessment related to reliability (sufficiency of information and scoring procedures).
Connotative Meanings of Student Performance Labels Used in Standard Setting. 2010 29 4 28-38 Burt, Winona M. and Stapleton, Laura M. The purpose of this study was to investigate the connotation of performance labels used in standard setting. For example, do the performance labels basic, proficient, and advanced hold different connotations than limited knowledge, satisfactory, and distinguished? If these terms hold different connotations, such differences may play a role in the standard-setting process. A nationally representative sample of participants (n = 167) provided connotation ratings to an online instrument containing an experimental manipulation. Results suggested that the selected terms themselves do hold different connotations. After definitions were provided with the terms, the differences in the evaluative nature of the labels were mitigated. However, some differences remained; the term limited knowledge was persistently perceived as less favorable than basic and apprentice, and satisfactory was persistently perceived as less favorable than proficient.
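The Brookhart entry above hinges on a check of whether ratings of double-scored assessments remained reliable during operational scoring. A minimal sketch of such a check on a hypothetical set of 30 double-scored assessments follows, using exact agreement, adjacent agreement, and the correlation between the two raters' rubric scores; the abstract does not state which statistics were used, so these are illustrative choices on simulated data.

import numpy as np

rng = np.random.default_rng(2)
true_quality = rng.integers(1, 6, 30)                        # 30 double-scored assessments, 1-5 rubric
rater_a = np.clip(true_quality + rng.integers(-1, 2, 30), 1, 5)
rater_b = np.clip(true_quality + rng.integers(-1, 2, 30), 1, 5)

exact_agreement = np.mean(rater_a == rater_b)                # proportion of identical scores
adjacent_agreement = np.mean(np.abs(rater_a - rater_b) <= 1) # within one rubric point
score_correlation = np.corrcoef(rater_a, rater_b)[0, 1]
print(exact_agreement, adjacent_agreement, score_correlation)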
A Meta-Analysis of Research on the Read Aloud Accommodation. 2014 33 3 17-30 Buzick, Heather and Stone, Elizabeth Read aloud is a testing accommodation that has been studied by many researchers, and its use on K-12 assessments continues to be debated because of its potential to change the measured construct or unfairly increase test scores. This study is a summary of quantitative research on the read aloud accommodation. Previous studies contributed information to compute average effect sizes for students with disabilities, students without disabilities, and the difference between groups for reading and mathematics using a random effects meta-analytic approach. Results suggest that (1) effect sizes are larger for reading than for math for both student groups, (2) the read aloud accommodation increases reading test scores for both groups, but more so for students with disabilities, and (3) mathematics score gains due to the read aloud accommodation are small for both students with and without disabilities, on average. There was some evidence to suggest larger effects in elementary school relative to middle and high school and possible mode effects, but more studies are needed within levels of the moderator variables to conduct statistical tests.
Using Test Scores From Students With Disabilities in Teacher Evaluation. 2015 34 3 28-38 Buzick, Heather M. and Jones, Nathan D. Much of the recent focus of educational policymakers has been on improving the measurement of teacher effectiveness. Linking student growth to teacher effects has been a large part of reform efforts. To date, neither researchers nor practitioners have arrived at a consensus on how to treat test scores from students with disabilities in growth-based teacher effectiveness indicators, despite the fact that these students make up approximately 13% of the K-12 student population. In this study, we leverage longitudinal data from the population of teachers in one state to explore practical questions related to including general assessment scores from students with disabilities in teacher evaluation. Findings suggest that including test scores from students with disabilities allows more teachers to be evaluated and does not substantially affect teachers' scores. Moreover, including disability-related covariates can allow for fairer evaluations for teachers with many students with disabilities in their class.
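A minimal sketch of the random-effects pooling used in meta-analyses like the Buzick and Stone entry above, here with DerSimonian-Laird estimation of the between-study variance; the study effect sizes and sampling variances are invented for illustration.

import numpy as np

def random_effects_pool(effects, variances):
    # DerSimonian-Laird random-effects pooled effect size and its standard error.
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances                                   # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)                # Cochran's Q
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    w_star = 1.0 / (variances + tau2)
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se

# Hypothetical read-aloud effect sizes (standardized mean differences) and sampling variances
print(random_effects_pool([0.45, 0.30, 0.62, 0.15, 0.38], [0.02, 0.05, 0.04, 0.03, 0.06]))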
Longitudinal Analysis of Early Mathematics Learning. 2018 37 3 4-10 Camilli, Gregory and Kim, Sunhee The trend in mathematics achievement from preschool to kindergarten is studied with a longitudinal growth item response theory model. The three measurement occasions included the spring of preschool and the spring and fall of kindergarten. The growth trend was nonlinear, with a steep drop between spring of preschool and fall of kindergarten. The modeling results provide validation for the argument that a classroom assessment in mathematics can be used to assess developmental skill levels that are consistent with a theory of early mathematics acquisition. The statistical model employed enables an effective illustration of overall gains and individual variability. Implications of the summer loss are discussed as well as model limitations.
Identifying Essential Topics in General and Special Education Introductory Assessment Textbooks. 2007 26 1 9-18 Campbell, Cynthia and Collins, Vicki L. We reviewed the five top-selling introductory assessment textbooks in both general and special education to identify topics contained in textbooks and to determine the extent of agreement among authors regarding the essentialness of topics within and across disciplines. Content analysis across the 10 assessment textbooks yielded 73 topics related to 13 categories: Decisions, Law, Technical Adequacy, Plan Assessment, Create Assessment, Score Assessment, Assessment Target, Assessment Type, Assessment Method, Interpret Assessment, Communicate Assessment Results, Assessment Population, and Computer-Assisted Assessment. Many of the topics identified were consistent with traditional assessment expectations of general and special education environments, while other, arguably important, topics were not identified as essential. The idea of core assessment topics for all teachers is introduced.
Accommodations for Students Who Are Deaf or Hard of Hearing in Large-Scale, Standardized Assessments: Surveying the Landscape and Charting a New Direction. 2009 28 2 41-49 Cawthon, Stephanie W. Students who are deaf or hard of hearing (SDHH) often use test accommodations when they participate in large-scale, standardized assessments.
The purpose of this article is to present findings from a survey of accommodations use for SDHH. The “big five” accommodations were reported by at least two-thirds of the 389 participants: extended time, small group/individual administration, test directions interpreted, test items read aloud, and test items interpreted. In a regression analysis, language used in instruction showed the most significant effects on accommodations use. The article considers these findings in light of a more proactive role in providing evidence for the effectiveness of accommodations with SDHH.
Examining the Role of Advanced Placement® Exam Participation in 4-Year College Enrollment. 2011 30 4 16-27 Chajewski, Michael and Mattern, Krista D. and Shaw, Emily J. The purpose of the current study was to examine the relationship between Advanced Placement (AP) exam participation and enrollment in a 4-year postsecondary institution. A positive relationship was expected given that the primary purpose of offering AP courses is to allow students to engage in college-level academic work while in high school, and potentially receive college credit by earning qualifying scores on the corresponding AP exam. Therefore, college preparation and planning is an implicit and explicit part of AP participation. Analyzing a national sample of over 1.5 million students, the current study found that AP participation was related to college enrollment, even after controlling for student demographic and ability characteristics and high school level predictors. For example, the odds of attending a 4-year postsecondary institution increased by at least 171% for all three AP participation groups (taking either one AP exam, two or three AP exams, or four or more AP exams) as compared to students who took no AP exams. Given the current political environment and the renewed interest in readying high school students for college, these results may help inform and shape educational initiatives targeted at the school, district, state, or even national level.
Children Left Behind in AYP and Non-AYP Schools: Using Student Progress and the Distribution of Student Gains to Validate AYP. 2007 26 3 21-32 Choi, Kilchan and Seltzer, Michael and Herman, Joan and Yamashiro, Kyo The No Child Left Behind Act (NCLB, 2002) establishes ambitious goals for increasing student learning and attaining equity in the distribution of student performance.
Schools must assure that all students, including all significant subgroups, show adequate yearly progress (AYP) toward the goal of 100% proficiency by the year 2014. In this paper, we illustrate an alternative way of evaluating AYP that both emphasizes individual student growth over time and focuses on the distribution of student growth between performance subgroups. We do so through analyses of a longitudinal data set from an urban school district in the state of Washington. We also examine what these patterns tell us about schools that have been designated as meeting their AYP targets and those that have not. This alternative way of measuring AYP helps bring to light potentially important aspects of school performance that might be masked if we limit our focus to classifying schools based only on current AYP criteria. In particular, we are able to identify some schools meeting Washington state's AYP criteria in which above-average students are making substantial progress but below-average students are making little to no progress. In contrast, other schools making AYP have below-average students making adequate progress but above-average students showing little gain. These contrasts raise questions about the meaning of “adequate” progress and to whom the notion of progress refers. We believe that closely examining the distribution of student progress may provide an important supplementary or alternative measure of AYP.
Can High School Achievement Tests Serve to Select College Students? 2010 29 2 3-12 Cimetta, Adriana D. and D'Agostino, Jerome V. and Levin, Joel R. Postsecondary schools have traditionally relied on admissions tests such as the SAT and ACT to select students. With high school achievement assessments in place in many states, it is important to ascertain whether scores from those exams can either supplement or supplant conventional admissions tests. In this study we examined whether the Arizona Instrument to Measure Standards (AIMS) high school tests could serve as a useful predictor of college performance. Stepwise regression analyses with a predetermined order of variable entry revealed that AIMS generally did not account for additional performance variation when added to high school grade-point average (HSGPA) and SAT. However, in a cohort of students that took the test for graduation purposes, AIMS did account for about the same proportion of variance as SAT when added to a model that included HSGPA. The predictive value of both SAT and AIMS was generally the same for Caucasian, Hispanic, and Asian American students. The ramifications of universities using high school achievement exams as predictors of college success, in addition to or in lieu of traditional measures, are discussed.
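A minimal sketch of the incremental-prediction logic used in the Cimetta, D'Agostino, and Levin entry above (and in the Allen and Mattern indices at the top of this list): compare the R-squared of nested regression models with and without the additional predictor. The variables here are simulated stand-ins for HSGPA, an admissions test, a state achievement test, and first-year college GPA, not the study's data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
ability = rng.normal(size=n)
hsgpa = 0.7 * ability + rng.normal(scale=0.7, size=n)
sat = 0.7 * ability + rng.normal(scale=0.7, size=n)
aims = 0.7 * ability + rng.normal(scale=0.7, size=n)     # stand-in for the state achievement test
fygpa = 0.6 * ability + rng.normal(scale=0.8, size=n)    # stand-in for first-year college GPA

def r_squared(*predictors):
    X = sm.add_constant(np.column_stack(predictors))
    return sm.OLS(fygpa, X).fit().rsquared

base = r_squared(hsgpa, sat)
augmented = r_squared(hsgpa, sat, aims)
print(base, augmented, augmented - base)                  # incremental variance explained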
An Investigation of Rater Cognition in the Assessment of Projects. 2012 31 3 10-20 Crisp, Victoria In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school-based project work for high-stakes purposes. Thirteen teachers across three subjects were asked to 'think aloud' whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.
Validating Student Score Inferences With Person-Fit Statistic and Verbal Reports: A Person-Fit Study for Cognitive Diagnostic Assessment. 2013 32 1 34-42 Cui, Ying and Roduta Roberts, Mary The goal of this study was to investigate the usefulness of person-fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two-stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person-fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify the misfitting student item-score vectors. In the second stage, students' verbal reports were collected to provide additional information about students' response processes so as to reveal the actual causes of misfits.
This two-stage procedure helped to identify the misfits of item-score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for misfits so that students' problem-solving strategies were better understood and their performances were interpreted in a more meaningful way.
Disaggregated Effects of Device on Score Comparability. 2017 36 3 35-45 Davis, Laurie and Morrison, Kristin and Kong, Xiaojing and McBride, Yuanyuan The use of tablets for large-scale testing programs has transitioned from concept to reality for many state testing programs. This study extended previous research on score comparability between tablets and computers with high school students to compare score distributions across devices for reading, math, and science and to evaluate device effects for gender and ethnicity subgroups. Results indicated no significant differences between tablets and computers for math and science. For reading, a small device effect favoring tablets was found for the middle to lower part of the score distribution. This effect seemed to be driven by increases in performance for male students when testing on tablets. No interactions of device with ethnicity were observed. Consistent with previous research, this study provides additional evidence for a relatively high degree of comparability between tablets and computers.
Examining Contextual Effects in a Practice Analysis: An Application of Dual Scaling. 2007 26 3 3-10 De Champlain, André F. and Cuddy, Monica M. and LaDuca, Tony Practice analyses are routinely used in support of the development of occupational and professional certification and licensure examinations. These analyses usually survey incumbents to obtain importance ratings of (1) specific tasks and (2) knowledge, skill, and ability (KSA) statements deemed by subject matter experts as essential to safe and effective practice. Several researchers have made important criticisms of traditional practice analysis procedures, particularly the lack of attention to contextual constructs and the resulting problematic interpretation of mean importance ratings. The present study provides a framework for assessing the impact of context in practice analysis studies. It focuses on a practice analysis of a health profession that sought to enhance the meaning of incumbents' importance ratings by embedding the statements in the context of patient acuities.
Results indicate that incumbents' importance ratings varied as a function of patient acuity. Dual scaling analysis was used to obtain a multidimensional visual representation of the associations between importance ratings and contextual content. The implications of the contextual component of the study design for future practice analysis studies are discussed as well as possible applications of this approach to professions in education.
Developing a Teacher Evaluation Instrument to Provide Formative Feedback Using Student Ratings of Teaching Acts. 2015 34 3 18-27 van der Lans, Rikkert M. and van de Grift, Wim J.C.M. and van Veen, Klaas This study reports on the development of a teacher evaluation instrument, based on students' observations, which exhibits cumulative ordering in terms of the complexity of teaching acts. The study integrates theory on teacher development with theory on teacher effectiveness and applies a cross-validation procedure to verify whether teaching acts have a cumulative order. The resulting teacher evaluation instrument comprises 32 teaching acts with cumulative ordering in terms of complexity. This ordering aligns with prior teacher development research. It also represents a valuable extension in that the instrument can provide feedback about a teacher's current phase of development and advice for improvement.
Holding Schools Accountable for the Growth of Nonproficient Students: Coordinating Measurement and Accountability. 2009 28 4 27-41 Dunn, Jennifer L. and Allen, Jessica A key intent of the NCLB growth pilot is to reward low-status schools who are closing the gap to proficiency. In this article, we demonstrate that the capability of proposed models to identify those schools depends on how the growth model is incorporated into accountability decisions. Six pilot-approved growth models were applied to vertically scaled mathematics assessment data from a single state collected over 2 years. Student and school classifications were compared across models. Accountability classifications using status and growth to proficiency as defined by each model were considered from two perspectives. The first involved adding the number of students moving toward proficiency to the count of proficient students, while the second involved a multitier accountability system where each school was first held accountable for status and then held accountable for the growth of their nonproficient students.
Our findings emphasize the importance of evaluating status and growth independently when attempting to identify low-status schools with insufficient growth among nonproficient students.
Building Validity Evidence for Scores on a State-Wide Alternate Assessment: A Contrasting Groups, Multimethod Approach. 2007 26 2 30-43 Elliott, Stephen N. and Compton, Elizabeth and Roach, Andrew T. The relationships between ratings on the Idaho Alternate Assessment (IAA) for 116 students with significant disabilities and corresponding ratings for the same students on two norm-referenced teacher rating scales were examined to gain evidence about the validity of resulting IAA scores. To contextualize these findings, another group of 54 students who had disabilities, but were not officially eligible for the alternate assessment also was assessed. Evidence to support the validity of the inferences about IAA scores was mixed, yet promising. Specifically, the relationship among the reading, language arts, and mathematics achievement level ratings on the IAA and the concurrent scores on the ACES-Academic Skills scales for the eligible students varied across grade clusters, but in general were moderate. These findings provided evidence that IAA scales measure skills indicative of the state's content standards. This point was further reinforced by moderate to high correlations between the IAA and Idaho State Achievement Test (ISAT) for the not eligible students. Additional evidence concerning the valid use of the IAA was provided by logistic regression results indicating that the scores do an excellent job of differentiating students who were eligible from those not eligible to participate in an alternate assessment. The collective evidence for the validity of the IAA scores suggests it is a promising assessment for NCLB accountability of students with significant disabilities. The methods of establishing this evidence have the potential to advance validation efforts of other states' alternate assessments.
Application of Think Aloud Protocols for Examining and Confirming Sources of Differential Item Functioning Identified by Expert Reviews. 2010 29 2 24-35 Ercikan, Kadriye and Arim, Rubab and Law, Danielle and Domene, Jose and Gagnon, France and Lacroix, Serge This paper demonstrates and discusses the use of think aloud protocols (TAPs) as an approach for examining and confirming sources of differential item functioning (DIF).
The TAPs are used to investigate to what extent surface characteristics of the items that are identified by expert reviews as sources of DIF are supported by empirical evidence from examinee thinking processes in the English and French versions of a Canadian national assessment. In this research, the TAPs confirmed sources of DIF identified by expert reviews for 10 out of 20 DIF items. The moderate agreement between TAPs and expert reviews indicates that evidence from expert reviews cannot be considered sufficient in deciding whether DIF items are biased and such judgments need to include evidence from examinee thinking processes.
Comparability in Balanced Assessment Systems for State Accountability. 2017 36 3 24-34 Evans, Carla M. and Lyons, Susan The purpose of this study was to test methods that strengthen the comparability claims about annual determinations of student proficiency in English language arts, math, and science (Grades 3-12) in the New Hampshire Performance Assessment of Competency Education (NH PACE) pilot project. First, we examined the literature in order to define comparability outside the bounds of strict score interchangeability and explored methods for estimating comparability that support a balanced assessment system for state accountability such as the NH PACE pilot. Second, we applied two strategies (consensus scoring and a rank-ordering method) to estimate comparability in Year 1 of the NH PACE pilot based upon the expert judgment of 85 teachers using 396 student work samples. We found the methods were effective for providing evidence of comparability and also detecting when threats to comparability were present. The evidence did not indicate meaningful differences in district average scoring and therefore did not support adjustments to district-level cut scores used to create annual determinations. The article concludes with a discussion of the technical challenges and opportunities associated with innovative, balanced assessment systems in an accountability context.
Guidelines for Interpreting and Reporting Subscores. 2017 36 1 5-13 Feinberg, Richard A. and Jurich, Daniel P. Recent research has proposed a criterion to evaluate the reportability of subscores. This criterion is a value-added ratio (VAR), where values greater than 1 suggest that the true subscore is better approximated by the observed subscore than by the total score.
This research extends the existing literature by quantifying statistical significance and effect size for using VAR to provide practical guidelines for subscore interpretation and reporting. Findings indicate that subscores with VAR ≥ 1.1 are a minimum requirement for a meaningful contribution to a user's score interpretation; subscores with .9 < VAR < 1.1 are redundant with the total score, and subscores with VAR ≤ .9 would be misleading to report. Additionally, we discuss what to do when subscores do not add value, yet must be reported, as well as when VAR ≥ 1.1 may be undesirable.
Repeat Testing Effects on Credentialing Exams: Are Repeaters Misinformed or Uninformed? 2015 34 1 34-39 Feinberg, Richard A. and Raymond, Mark R. and Haist, Steven A. To mitigate security concerns and unfair score gains, credentialing programs routinely administer new test material to examinees retesting after an initial failing attempt. Counterintuitively, a small but growing body of recent research suggests that repeating the identical form does not create an unfair advantage. This study builds upon and extends this research by investigating changes in responses to specific items encountered on both the first and repeat attempts. Results indicate that score gains for repeat examinees who were assigned an identical form were not different from repeat examinees who received a different, but parallel, form. Analyses of responses to individual items answered incorrectly on the initial attempt found that examinees selected the same incorrect option on their second attempt 68% of the time, suggesting repeaters are misinformed rather than uninformed. Implications for feedback, remediation, and retesting policies are discussed.
A Simple Equation to Predict a Subscore's Value. 2014 33 3 55-56 Feinberg, Richard A. and Wainer, Howard Subscores are often used to indicate test-takers' relative strengths and weaknesses and so help focus remediation. But a subscore is not worth reporting if it is too unreliable to believe or if it contains no information that is not already contained in the total score. It is possible, through the use of a simple linear equation provided in this note, to determine if a particular subscore adds enough value to be worth reporting.
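A minimal sketch of how the VAR guidelines in the Feinberg and Jurich entry above could be applied once a value-added ratio has been computed for each subscore. VAR is shown here as the ratio of the proportional reduction in mean squared error (PRMSE) achieved by the observed subscore to that achieved by the total score, one common operationalization following Haberman; that definition is assumed rather than taken from the article, and the subscore names and PRMSE values are invented.

def var_recommendation(prmse_subscore, prmse_total):
    # Classify a subscore using the VAR thresholds reported in the abstract above.
    var = prmse_subscore / prmse_total
    if var >= 1.1:
        return var, "may add value; candidate for reporting"
    if var <= 0.9:
        return var, "misleading to report"
    return var, "redundant with the total score"

# Hypothetical subscores with assumed PRMSE values
for name, prmse_s, prmse_x in [("algebra", 0.82, 0.70), ("geometry", 0.66, 0.70), ("data", 0.71, 0.70)]:
    print(name, var_recommendation(prmse_s, prmse_x))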
Test Development with Performance Standards and Achievement Growth in Mind. 2011 30 4 3-15 Ferrara, Steve and Svetina, Dubravka and Skucha, Sylvia and Davidson, Anne H. Items on test score scales located at and below the Proficient cut score define the content area knowledge and skills required to achieve proficiency. Alternately, examinees who perform at the Proficient level on a test can be expected to be able to demonstrate that they have mastered most of the knowledge and skills represented by the items at and below the Proficient cut score. It is important that these items define intended knowledge and skills, especially increasing levels of knowledge and skills, on tests that are intended to portray achievement growth across grade levels. Previous studies show that coherent definitions of growth occur often as a result of good fortune rather than by design. In this paper, we use grades 3, 4, and 5 mathematics tests from a state assessment program to examine how well (a) descriptors for Proficient performance define achievement growth across grades, and (b) the knowledge and skill demands of test items that define Proficient performance at each grade level may or may not define achievement growth coherently. Our purpose is to demonstrate (a) the results of one state assessment program's first attempt to train item writers to hit assigned proficiency level targets, and (b) how those efforts support and undermine coherent inferences about what it means to achieve Proficient performance from one grade to the next. Item writers' accuracy in hitting proficiency level targets and resulting inferences about achievement growth are mixed but promising.
A Comparison of Two Alternate Scaling Approaches Employed for Task Analyses in Credentialing Examination Development. 2019 38 1 78-86 Fidler, James R. and Risk, Nicole M. Credentialing examination developers rely on task (job) analyses for establishing inventories of task and knowledge areas in which competency is required for safe and successful practice in target occupations. There are many ways in which task-related information may be gathered from practitioner ratings, each with its own advantage and limitation. Two of the myriad alternative task analysis rating approaches are compared in situ: one establishing relative task saliency through a single scale of rated importance and another employing a composite of several independent scales.
Outcomes regarding tasks ranked by two practitioner groups are compared. A relatively high degree of association is observed between tasks ranked through each approach, yielding comparable, though not identical, examination blueprints. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Methodologies for Investigating and Interpreting Student–Teacher Rating Incongruence in Noncognitive Assessment. 2019 38 1 63-77 Flake, Jessica Kay and Petway, Kevin Terrance Numerous studies merely note divergence in students' and teachers' ratings of student noncognitive constructs. However, given the increased attention and use of these constructs in educational research and practice, an in-depth study focused on this issue was needed. Using a variety of quantitative methodologies, we thoroughly investigate student–teacher incongruence with two commonly assessed noncognitive constructs: intrinsic motivation and time management. We present ways to describe, visualize, and predict differences between student and teacher ratings and discuss implications for interpretation. We show how descriptive and predictive analyses that consider the nesting of students within teachers expand our understanding of the incongruence. We demonstrate the importance of considering ancillary variables in predictive analysis, and latent variable methods for comparing measurement models. We found that student and teacher factors exhibited only small-to-moderate correlations, reinforcing the need for more measurement research in this area. Further, we report that teachers tended to rate students more favorably than students rated themselves, and teachers' ratings were more related to student performance. We discuss how these methodologies can be used to better understand the incongruence between students and teachers and how they can be incorporated into construct validation studies. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Exploring the Utility of Sequential Analysis in Studying Informal Formative Assessment Practices. 2017 36 1 28-38 Furtak, Erin Marie and Ruiz-Primo, Maria Araceli and Bakeman, Roger Formative assessment is a classroom practice that has received much attention in recent years for its established potential at increasing student learning.
A frequent analytic approach for determining the quality of formative assessment practices is to develop a coding scheme and determine frequencies with which the codes are observed; however, these frequencies do not necessarily reflect the temporal and sequential nature of teacher-student interactions. In this article, we explore the utility of sequential analysis as an alternative strategy to capture the nature of informal formative assessment interactions that take place in whole-classroom conversations as compared to frequencies alone. We coded transcriptions of video recordings of four middle school science teachers' whole-class discussions about density for different types of teacher statements associated with effective approaches to formative assessment, as well as the quality of the ideas students shared. Using sequential analysis, we then calculated transitional probabilities and odds ratios for those sequences. Results indicate that sequential analysis revealed differences across the four classrooms analyzed, particularly with respect to the way teachers responded to different kinds of student ideas. Recommendations are framed for the future use of sequential analysis in studying formative assessment classroom practice. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Performance, Perseverance, and the Full Picture of College Readiness. 2015 34 2 20-33 Gaertner, Matthew N. and McClarty, Katie Larsen Although college readiness is a centerpiece of major educational initiatives such as the Common Core State Standards, few systems have been implemented to track children's progress toward this goal. Instead, college-readiness information is typically conveyed late in a student's high-school career, and tends to focus solely on academic accomplishments: grades and admissions test scores. Late-stage feedback can be problematic for students who need to correct course, so the purpose of this research is to develop a system for communicating more comprehensive college-readiness diagnoses earlier in a child's K-12 career. This article introduces college-readiness indicators for middle-school students, drawing on the National Education Longitudinal Study of 1988 (NELS), a nationally representative longitudinal survey of educational inputs, contexts, and outcomes. A diversity of middle-school variables was synthesized into six factors: achievement, behavior, motivation, social engagement, family circumstances, and school characteristics. Middle-school factors explain 69% of the variance in college readiness, and results suggest a variety of factors beyond academic achievement, most notably motivation and behavior, contribute substantially to preparedness for postsecondary study. The article concludes with limitations and future directions, including the development of college-readiness categories to support straightforward communication of middle-school indicators to parents, teachers, and students.
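The sequential-analysis abstract earlier in this list (Furtak, Ruiz-Primo, and Bakeman) mentions computing transitional probabilities and odds ratios from coded classroom talk. The sketch below is a generic lag-1 illustration with made-up codes and a made-up sequence; it is not the authors' coding scheme or analysis, only a plausible way such quantities can be computed.

```python
from collections import Counter

# Hypothetical lag-1 sequence of coded turns:
# T = teacher probe, S = student idea, E = teacher evaluation (codes are made up)
sequence = ["T", "S", "E", "T", "E", "T", "S", "S", "E", "T", "S"]

pairs = list(zip(sequence, sequence[1:]))          # consecutive (current, next) code pairs
transitions = Counter(pairs)                       # counts of each observed transition
starts = Counter(current for current, _ in pairs)  # how often each code appears as "current"

def transitional_probability(current: str, nxt: str) -> float:
    """P(next code | current code), estimated from the observed transitions."""
    return transitions[(current, nxt)] / starts[current] if starts[current] else 0.0

# How often is a teacher probe followed by a student idea?
print(transitional_probability("T", "S"))

# Odds ratio for a 2x2 question: does "S" follow "T" more often than it follows other codes?
a = transitions[("T", "S")]                                                       # T -> S
b = starts["T"] - a                                                               # T -> not S
c = sum(n for (cur, nxt), n in transitions.items() if cur != "T" and nxt == "S")  # not T -> S
d = sum(n for (cur, nxt), n in transitions.items() if cur != "T" and nxt != "S")  # not T -> not S
odds_ratio = (a * d) / (b * c) if b * c else float("inf")
print(odds_ratio)
```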
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Characterizing and Diagnosing Complex Professional Competencies—An Example of Intrapreneurship. 2019 38 2 89-100 George, Ann Cathrice and Bley, Sandra and Pellegrino, James We describe an approach to characterizing and diagnosing complex professional competencies (CPCs) for the field of Intrapreneurship, i.e., activities of an entrepreneurial nature engaged in by employees within their existing organizations. Our approach draws upon prior conceptual, empirical, and analytical efforts by researchers in Germany. Results are presented from an application of a cognitive diagnostic modeling approach to the performance of late-stage apprentices on tasks derived from a previously developed competence model of Intrapreneurship. The results are discussed in terms of the type of cognitive diagnosis model (CDM) most appropriate for the domain and task battery, and patterns of performance are presented for seven diagnosable Intrapreneurship skills. By interpreting the assessment task response data in terms of a CDM, diagnostic, skill-based information is obtained which verifies the strengths and weaknesses of the apprentices at a late stage in their training and has the potential to provide feedback to training programs, triggering the improvement of individual apprentice learning and subsequent work-related performance. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) How Well Does the Sum Score Summarize the Test? Summability as a Measure of Internal Consistency. 2018 37 2 54-63 Goeman, J. J. and De Jong, N. H. Abstract: Many researchers use Cronbach's alpha to demonstrate internal consistency, even though it has been shown numerous times that Cronbach's alpha is not suitable for this. Because the intention of questionnaire and test constructors is to summarize the test by its overall sum score, we advocate summability, which we define as the proportion of total test variation that is explained by the sum score. This measure is closely related to Loevinger's H. The mathematical derivation of summability as a measure of explained variation is given for both scale and dichotomously scored items. Using computer simulations, we show that summability performs adequately and we apply it to an existing productive vocabulary test. An open-source tool to easily calculate summability is provided online (https://sites.google.com/view/summability).
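The summability abstract above defines the statistic verbally as the proportion of total test variation explained by the sum score; the authors provide their own tool at the URL above. The Python sketch below is only one plausible reading of that verbal definition (regress each item on the sum score and pool the explained variation), offered under that assumption and not claimed to match the published estimator.

```python
import numpy as np

def summability(items: np.ndarray) -> float:
    """Rough sketch: proportion of total test variation explained by the sum score.

    `items` is a persons-by-items score matrix. Each item is regressed on the sum
    score by least squares; explained sums of squares are pooled over items and
    divided by the total sum of squares. This is an assumed operationalization of
    the verbal definition above, not the authors' published estimator.
    """
    sum_score = items.sum(axis=1)
    total_ss = ((items - items.mean(axis=0)) ** 2).sum()
    explained_ss = 0.0
    for j in range(items.shape[1]):
        y = items[:, j]
        slope, intercept = np.polyfit(sum_score, y, 1)
        fitted = intercept + slope * sum_score
        explained_ss += ((fitted - y.mean()) ** 2).sum()
    return explained_ss / total_ss

# Hypothetical dichotomous responses: 200 simulated persons, 5 items
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
responses = (rng.normal(size=(200, 5)) < ability).astype(float)  # crude correlated items
print(round(summability(responses), 3))
```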
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) A Systematic Review of Assessment Literacy Measures. 2014 33 2 14-18 Gotch, Chad M. and French, Brian F. This work systematically reviews teacher assessment literacy measures within the context of contemporary teacher evaluation policy. In this study, the researchers collected objective tests of assessment knowledge, teacher self-reports, and rubrics to evaluate teachers' work in assessment literacy studies from 1991 to 2012. Then they evaluated the psychometric work from these measures against a set of claims related to score interpretation and use. Across the 36 measures reviewed, they found support for these claims was weak. This outcome highlights the need for increased work on assessment literacy measures in the educational measurement field. The authors conclude with recommendations and a resource to inform a research agenda focused on assessment literacy measurement to inform policy and practice. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) A Review of Recent Research on Individual-Level Score Reports. 2018 37 3 46-54 Gotch, Chad M. and Roduta Roberts, Mary Abstract: As the primary interface between test developers and multiple educational stakeholders, score reports are a critical component to the success (or failure) of any assessment program. The purpose of this review is to document recent research on individual-level score reporting to advance the research and practice of score reporting. We conducted a search for research studies published or presented between 2005 and 2015, examining 60 scholarly works for (1) the research focus, (2) stated or implied theoretical frameworks of communication, and (3) the characteristics of data sets employed in the studies. Results show that research on score properties, especially subscores, and score report design/layout are well-represented in the literature base. The predominant approach to score reporting has been through a cybernetics tradition of communication. Data sets were often small or localized to a single context. We present example research questions from novel communication frameworks, and encourage our colleagues to adopt new roles in their relationships to stakeholders to advance score reporting research and practice. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission.
However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Multiple Sources of Evidence: An Analysis of Stakeholders' Perceptions of Various Indicators of Student Learning. 2007 26 1 19-27 Guskey, Thomas R. This study compared different stakeholders' perceived validity of various indicators of student learning used to judge the quality of students' academic performance. Data were gathered from the questionnaire responses of 314 educators in three states that have implemented comprehensive state-wide assessment programs with high-stakes consequences both for educators and for students. MANOVA results showed that while educators generally hold similar perceptions, significant differences exist between school administrators and teachers. Administrators perceived the results from nationally normed standardized assessments, state assessments, and district assessments to be more valid indicators of student achievement than did teachers. In contrast, teachers granted more validity to classroom observations and homework completion and quality than did administrators. The implications of these differences for reform initiatives are discussed, particularly with regard to teachers' motivation to improve results. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Estimating High School GPA Weighting Parameters With a Graded Response Model. 2019 38 1 16-24 Hansen, John and Sadler, Philip and Sonnert, Gerhard The high school grade point average (GPA) is often adjusted to account for nominal indicators of course rigor, such as "honors" or "advanced placement." Adjusted GPAs—also known as weighted GPAs—are frequently used for computing students' rank in class and in the college admission process. Despite the high stakes attached to GPA, weighting policies vary considerably across states and high schools. Previous methods of estimating weighting parameters have used regression models with college course performance as the dependent variable. We discuss and demonstrate the suitability of the graded response model for estimating GPA weighting parameters and evaluating traditional weighting schemes. In our sample, which was limited to self-reported performance in high school mathematics courses, we found that commonly used policies award more than twice the bonus points necessary to create parity for standard and advanced courses.
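The graded response model mentioned in the Hansen, Sadler, and Sonnert abstract above treats letter grades as ordered categories. As background only, the sketch below computes Samejima-style GRM category probabilities from a discrimination parameter and ordered thresholds; the parameter values are invented for illustration and are not estimates from that study.

```python
import math

def grm_category_probabilities(theta: float, a: float, thresholds: list[float]) -> list[float]:
    """Samejima's graded response model for one graded item (e.g., a course grade).

    theta: student ability; a: discrimination; thresholds: increasing category
    thresholds b_1 < ... < b_K. Returns probabilities for the K + 1 ordered
    categories (lowest to highest grade). All values here are illustrative.
    """
    def p_at_least(b: float) -> float:
        # Probability of responding in category k or higher
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    cumulative = [1.0] + [p_at_least(b) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# Hypothetical course with discrimination 1.2 and thresholds for earning at least a D, C, B, A
probs = grm_category_probabilities(theta=0.5, a=1.2, thresholds=[-2.0, -1.0, 0.0, 1.2])
print([round(p, 3) for p in probs], round(sum(probs), 3))  # category probabilities sum to 1
```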
Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Conceptualizing the Classroom of Target Students: A Qualitative Investigation of Panelists' Experiences during Standard Setting. 2010 29 2 36-44 Hein, Serge F. and Skaggs, Gary Increasingly, research has focused on the cognitive processes associated with various standard-setting activities. This qualitative study involved an examination of 16 third-grade reading teachers' experiences with the cognitive task of conceptualizing an entire classroom of hypothetical target students when the single-passage bookmark method or the yes/no method was used during a one-day mock panel meeting. Data were collected using in-depth focus group interviews with eight participants from each of the panel meetings, and a whole-text analysis revealed three categories. Most participants experienced difficulty in attempting to conceive of an entire classroom of target students. Most of them were ultimately unable to do so and made use of alternative cognitive strategies. More fundamental issues also contributed to the difficulties experienced in attempting to complete the required cognitive task. Implications of the findings for standard setting and for further research are also discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) From Evidence to Action: A Seamless Process in Formative Assessment? 2009 28 3 24-31 Heritage, Margaret and Kim, Jinok and Vendlinski, Terry and Herman, Joan Based on the results of a generalizability study of measures of teacher knowledge for teaching mathematics developed at the National Center for Research on Evaluation, Standards, and Student Testing at the University of California, Los Angeles, this article provides evidence that teachers are better at drawing reasonable inferences about student levels of understanding from assessment information than they are at deciding the next instructional steps. We discuss the implications of the results for effective formative assessment and end with considerations of how teachers can be supported to know what to teach next. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Role of Socioeconomic Status in SAT-Freshman Grade Relationships Across Gender and Racial Subgroups. 2016 35 1 21-28 Higdem, Jana L. and Kostal, Jack W. and Kuncel, Nathan R. and Sackett, Paul R. and Shen, Winny and Beatty, Adam S. and Kiger, Thomas B. 
Recent research has shown that admissions tests retain the vast majority of their predictive power after controlling for socioeconomic status (SES), and that SES provides only a slight increment over SAT and high school grades (high school grade point average [HSGPA]) in predicting academic performance. To address the possibility that these overall analyses obscure differences by race/ethnicity or gender, we examine the role of SES in the test-grade relationship for men and women as well as for various racial/ethnic subgroups within the United States. For each subgroup, the test-grade relationship is only slightly diminished when controlling for SES. Further, SES is a substantially less powerful predictor of academic performance than both SAT and HSGPA. Among the indicators of SES (i.e., father's education, mother's education, and parental income), father's education appears to be the strongest predictor of freshman grades across subgroups, with the exception of the Asian subgroup. In general, SES appears to behave similarly across subgroups in the prediction of freshman grades with SAT scores and HSGPA. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Managing What We Can Measure: Quantifying the Susceptibility of Automated Scoring Systems to Gaming Behavior. 2014 33 3 36-46 Higgins, Derrick and Heilman, Michael As methods for automated scoring of constructed-response items become more widely adopted in state assessments, and are used in more consequential operational configurations, it is critical that their susceptibility to gaming behavior be investigated and managed. This article provides a review of research relevant to how construct-irrelevant response behavior may affect automated constructed-response scoring, and aims to address a gap in that literature: the need to assess the degree of risk before operational launch. A general framework is proposed for evaluating susceptibility to gaming, and an initial empirical demonstration is presented using the open-source short-answer scoring engines from the Automated Student Assessment Prize (ASAP) Challenge. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Reporting the Percentage of Students above a Cut Score: The Effect of Group Size. 2011 30 1 36-43 Hollingshead, Lynne and Childs, Ruth A.
Large-scale assessment results for schools, school boards/districts, and entire provinces or states are commonly reported as the percentage of students achieving a standard; that is, the percentage of students scoring above the cut score that defines the standard on the assessment scale. Recent research has shown that this method of reporting is sensitive to small changes in the cut score, especially when comparing results across years or between groups. This study builds on that work, investigating the effects of reporting group size on the stability of results. In Part 1 of this study, Grade 6 students' results on Ontario's 2008 and 2009 Junior Assessments of Reading, Writing and Mathematics were compared, by school, for different sizes of schools. In Part 2, samples of students' results on the 2009 assessment were randomly drawn and compared, for 10 group sizes, to estimate the variability in results due to sampling error. The results showed that the percentage of students above a cut score (PAC) was unstable for small schools and small randomly drawn groups. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Characterizing Mathematics Classroom Practice: Impact of Observation and Coding Choices. 2012 31 1 14-26 Ing, Marsha and Webb, Noreen M. Large-scale observational measures of classroom practice increasingly focus on opportunities for student participation as an indicator of instructional quality. Each observational measure necessitates making design and coding choices on how to best measure student participation. This study investigated variations of coding approaches that may be feasible in large-scale studies, and the ramifications of these variations for drawing inferences about instructional quality. Using data from classroom videos, we found that decisions about whether to keep track of individual students in the coding, observe multiple contexts in the classroom (e.g., whole-class and small-group discussions), and capture nuances of student participation changed the resulting characterizations of classroom practice. Most importantly, simplifying the coding approach did not fully capture and even misrepresented the level and nature of student participation in many classrooms. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) How Robust Are Cross-Country Comparisons of PISA Scores to the Scaling Model Used?
2018 37 4 28-39 Jerrim, John and Parker, Philip and Choi, Alvaro and Chmielewski, Anna Katyn and Sälzer, Christine and Shure, Nikki The Programme for International Student Assessment (PISA) is an important international study of 15-year-olds' knowledge and skills. New results are released every 3 years, and have a substantial impact upon education policy. Yet, despite its influence, the methodology underpinning PISA has received significant criticism. Much of this criticism has focused upon the psychometric scaling model used to create the proficiency scores. The aim of this article is therefore to investigate the robustness of cross-country comparisons of PISA scores to subtle changes to the underlying scaling model used. This includes the specification of the item-response model, whether the difficulty and discrimination of items are allowed to vary across countries (item-by-country interactions) and how test questions not reached by pupils are treated. Our key finding is that these technical choices make little substantive difference to the overall country-level results. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Using Evidence-Centered Design to Create a Special Educator Observation System. 2018 37 2 35-44 Johnson, Evelyn S. and Crawford, Angela and Moylan, Laura A. and Zheng, Yuzhu Abstract: The evidence-centered design framework was used to create a special education teacher observation system, Recognizing Effective Special Education Teachers. Extensive reviews of research informed the domain analysis and modeling stages, and led to the conceptual framework in which effective special education teaching is operationalized as the ability to effectively implement evidence-based practices for students with disabilities. In the assessment implementation stage, four raters evaluated 40 videos and provided evidence to support the scores assigned to teacher performances. An inductive approach was used to analyze the data and to create empirically derived, item-level performance descriptors. In the assessment delivery stage, four different raters evaluated the same videos using the fully developed rubric. Many-facet Rasch measurement analyses showed that the item, teacher, lesson, and rater facets achieved high psychometric quality. This process can be applied to other content areas to develop teacher observation systems that provide accurate evaluations and feedback to improve instructional practice. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Universal Design and Multimethod Approaches to Item Review.
2008 27 1 25-36 Johnstone, Christopher J. and Thompson, Sandra J. and Bottsford-Miller, Nicole A. and Thurlow, Martha L. Test items undergo multiple iterations of review before states and vendors deem them acceptable to be placed in a live statewide assessment. This article reviews three approaches that can add validity evidence to states' item review processes. The first process is a structured sensitivity review process that focuses on universal design considerations for items. The second method is a series of statistical analyses intended to increase the limited amount of information that can be derived from analyses on low-incidence populations (such as students who are blind, deaf, or have cognitive disabilities). Finally, think aloud methods are described as a method for understanding why particular items might be problematic for students. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Differentials of a State Reading Assessment: Item Functioning, Distractor Functioning, and Omission Frequency for Disability Categories. 2009 28 2 28-40 Kato, Kentaro and Moen, Ross E. and Thurlow, Martha L. Large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning (DIF), differential distractor functioning (DDF), and differential omission frequency (DOF) between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities. Multinomial logistic regression was employed to compare response characteristic curves (RCCs) of individual test items. Although no evidence for serious test bias was found for the state assessment examined in this study, the results indicated that students in different disability categories showed different patterns of DIF, DDF, and DOF, and that the use of RCCs helps clarify the implications of DIF and DDF. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Formative Assessment: A Meta-Analysis and a Call for Research. 2011 30 4 28-37 Kingston, Neal and Nash, Brooke An effect size of about .70 (or .40-.70) is often claimed for the efficacy of formative assessment, but is not supported by the existing research base. More than 300 studies that appeared to address the efficacy of formative assessment in grades K-12 were reviewed. Many of the studies had severely flawed research designs yielding uninterpretable results. Only 13 of the studies provided sufficient information to calculate relevant effect sizes. 
A total of 42 independent effect sizes were available. The median observed effect size was .25. Using a random effects model, a weighted mean effect size of .20 was calculated. Moderator analyses suggested that formative assessment might be more effective in English language arts (ELA) than in mathematics or science, with estimated effect sizes of .32, .17, and .09, respectively. Two types of implementation of formative assessment, one based on professional development and the other on the use of computer-based formative systems, appeared to be more effective than other approaches, yielding mean effect sizes of .30 and .28, respectively. Given the wide use and potential efficacy of good formative assessment practices, the paucity of the current research base is problematic. A call for more high-quality studies is issued. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Using State Assessments for Predicting Student Success in Dual-Enrollment College Classes. 2013 32 3 3-10 Kingston, Neal M. and Anderson, Gretchen Scores on state standards-based assessments are readily available and may be an appropriate alternative to traditional placement tests for assigning or accepting students into particular courses. Many community colleges do not require test scores for admissions purposes but do require some kind of placement scores for first-year English and math courses. In this study, we examine the efficacy of using the reading and math portions of the Kansas State Assessment (KSA) for predicting the success of high school students taking College Algebra and College English I at a Kansas community college. Results showed that in this sample KSA scores predicted as well as or better than more traditional placement tests and with no extra cost to the institution. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Multiple-Use of Accountability Assessments: Implications for the Process of Validation. 2013 32 4 2-15 Koch, Martha J. Implications of the multiple-use of accountability assessments for the process of validation are examined. Multiple-use refers to the simultaneous use of results from a single administration of an assessment for its intended use and for one or more additional uses. A theoretical discussion of the issues for validation which emerge from multiple-use is provided, focusing on the increased stakes that result from multiple-use and the need to consider the interactions that may take place between multiple-uses.
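The Kingston and Nash meta-analysis summarized above reports a weighted mean effect size from a random-effects model. As a generic illustration with hypothetical numbers (not the study's data, and not necessarily the estimator the authors used), the sketch below computes a random-effects weighted mean via the common DerSimonian-Laird approach.

```python
import numpy as np

def random_effects_mean(effects: np.ndarray, variances: np.ndarray) -> tuple[float, float]:
    """Random-effects weighted mean effect size using the DerSimonian-Laird
    estimate of between-study variance (tau^2). Inputs are per-study effect
    sizes and their sampling variances; the numbers below are hypothetical.
    """
    w = 1.0 / variances                                   # inverse-variance (fixed-effect) weights
    fixed_mean = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed_mean) ** 2)           # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)         # between-study variance
    w_star = 1.0 / (variances + tau2)                     # random-effects weights
    return float(np.sum(w_star * effects) / np.sum(w_star)), float(tau2)

effects = np.array([0.32, 0.17, 0.09, 0.28, 0.25])    # hypothetical study effect sizes
variances = np.array([0.02, 0.03, 0.01, 0.04, 0.02])  # hypothetical sampling variances
print(random_effects_mean(effects, variances))
```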
To further explore this practice, an empirical study of the multiple-use of the Education Quality and Accountability Office Grade 9 Assessment of Mathematics, a mandatory assessment administered in Ontario, Canada, is presented. Drawing on data gathered in an in-depth case study, practices associated with two of the multiple-uses of this assessment are considered and evidence of ways these two uses interact is presented. Given these interactions, the limitations of an argument-based approach to validation for this instance of multiple-use are demonstrated. Some ways that the process of validation might better address the practice of multiple-use are suggested and areas for further investigation of this frequently occurring practice are discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Do Proper Accommodation Assignments Make a Difference? Examining the Impact of Improved Decision Making on Scores for English Language Learners. 2007 26 3 11-20 Kopriva, Rebecca J. and Emick, Jessica E. and Hipolito-Delgado, Carlos Porfirio and Cameron, Catherine A. Does it matter if students are appropriately assigned to test accommodations? Using a randomized method, this study found that accommodations keyed to individual students' particular needs were significantly more efficacious for English language learners (ELLs) and that little difference was reported between students receiving incomplete or not-recommended accommodations and those receiving no accommodations whatsoever. A sample of third and fourth grade ELLs in South Carolina (N = 272) was randomly assigned to various types of test accommodations on a mathematics assessment. Results indicated that those students who received the appropriate test accommodations, as recommended by a version of a computerized accommodation taxonomy for ELLs (the selection taxonomy for English language learners accommodations; STELLA), had significantly higher test scores than ELLs who received no accommodations or those who received incomplete or not-recommended accommodation packages. Additionally, students who were given no test accommodations scored no differently than those students who received accommodation packages that were incomplete or not recommended, given the students' particular needs and challenges. These findings are important in light of research and anecdotal reports that suggest a general lack of systematicity in the current system of assigning accommodations and a tendency to give all available accommodations regardless of individual child characteristics. The results also have important implications for how future accommodation research should be structured to determine the benefits of particular accommodations and accommodation packages. This study would suggest that control and treatment groups should be assembled based on specific student needs in order for direct comparisons to be made.
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Predicting Freshman Grade-Point Average from Test Scores: Effects of Variation Within and Between High Schools. 2018 37 2 9-19 Koretz, D. and Langi, M. Abstract: Most studies predicting college performance from high-school grade point average (HSGPA) and college admissions test scores use single-level regression models that conflate relationships within and between high schools. Because grading standards vary among high schools, these relationships are likely to differ within and between schools. We used two-level regression models to predict freshman grade point average from HSGPA and scores on both college admissions and state tests. When HSGPA and scores are considered together, HSGPA predicts more strongly within high schools than between, as expected in the light of variations in grading standards. In contrast, test scores, particularly mathematics scores, predict more strongly between schools than within. Within-school variation in mathematics scores has no net predictive value, but between-school variation is substantially predictive. Whereas other studies have shown that adding test scores to HSGPA yields only a minor improvement in aggregate prediction, our findings suggest that a potentially more important effect of admissions tests is statistical moderation, that is, partially offsetting differences in grading standards across high schools. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Automated Scoring of Students’ Small-Group Discussions to Assess Reading Ability. 2018 37 2 20-34 Kosh, Audra E. and Greene, Jeffrey A. and Murphy, P. Karen and Burdick, Hal and Firetto, Carla M. and Elmore, Jeff Abstract: We explored the feasibility of using automated scoring to assess upper-elementary students’ reading ability through analysis of transcripts of students’ small-group discussions about texts. Participants included 35 fourth-grade students across two classrooms that engaged in a literacy intervention called Quality Talk. During the course of one school year, data were collected at 10 time points for a total of 327 student-text encounters, with a different text discussed at each time point. To explore the possibility of automated scoring, we considered which quantitative discourse variables (e.g., variables to measure language sophistication and latent semantic analysis variables) were the strongest predictors of scores on a multiple-choice and constructed-response reading comprehension test.
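The Koretz and Langi abstract above contrasts within- and between-high-school prediction using two-level regression models. The sketch below shows one generic way to separate those two components (group-mean centering plus a random-intercept model); the column names and the specification are assumptions for illustration, not the study's actual variables or model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level data frame; the column names are assumed, not the study's:
# df = pd.read_csv("students.csv")  # columns: school, fygpa, hsgpa, test_score

def fit_within_between(df: pd.DataFrame):
    """Separate within- and between-school slopes by group-mean centering the
    predictors, then fit a random-intercept model with schools as groups.
    This is a generic two-level specification, not necessarily the authors' model.
    """
    for col in ("hsgpa", "test_score"):
        school_mean = df.groupby("school")[col].transform("mean")
        df[f"{col}_between"] = school_mean            # school mean (between-school component)
        df[f"{col}_within"] = df[col] - school_mean   # deviation from school mean (within-school component)

    model = smf.mixedlm(
        "fygpa ~ hsgpa_within + hsgpa_between + test_score_within + test_score_between",
        data=df,
        groups=df["school"],
    )
    return model.fit()

# Example usage once a data frame with those columns exists:
# result = fit_within_between(df); print(result.summary())
```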
Convergent validity evidence was collected by comparing automatically calculated quantitative discourse features to scores on a reading fluency test. After examining a variety of discourse features using multilevel modeling, results showed that measures of word rareness and word diversity were the most promising variables to use in automated scoring of students’ discussions. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Grade Inflation Marches On: Grade Increases from the 1990s to 2000s. 2016 35 1 11-20 Kostal, Jack W. and Kuncel, Nathan R. and Sackett, Paul R. Grade inflation threatens the integrity of college grades as indicators of academic achievement. In this study, we contribute to the literature on grade inflation by providing the first estimate of the size of grade increases at the student level between the mid-1990s and mid-2000s. By controlling for student characteristics and course-taking patterns, we are able to eliminate alternative explanations for grade increases. Our results suggest that grade inflation has occurred across decades, at a small yet non-negligible rate. Suggestions for future research are discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Within-High-School Versus Across-High-School Scaling of Admissions Assessments: Implications for Validity and Diversity Effects. 2017 36 1 39-46 Kostal, Jack W. and Sackett, Paul R. and Kuncel, Nathan R. and Walmsley, Philip T. and Stemig, Melissa S. Previous research has established that SAT scores and high school grade point average (HSGPA) differ in their predictive power and in the size of mean differences across racial/ethnic groups. However, the SAT is scaled nationally across all test takers while HSGPA is scaled locally within a school. In this study, the researchers propose that this difference in how SAT scores and HSGPA are scaled partially explains differences in validity and subgroup differences. Using a large data set consisting of 170,390 students each of whom matriculated at one of 114 separate colleges, the researchers find that awarding SAT scores by ranking SAT within a high school generally results in substantial reduction in the size of subgroup mean differences for this predictor. However, validity for predicting first-year GPA is also reduced by a small amount. Conversely, placing HSGPA onto a nationally normed metric through the use of multiple regression procedures results in a moderate increase in the size of subgroup mean differences, while also producing a small increase in validity. 
Taken together, these findings suggest that differences in predictor scaling can partially explain differences in the size of subgroup mean differences between HSGPA and SAT scores and have implications for predictive power. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Test Preparation: Examining Teacher Perceptions and Practices. 2008 27 2 28-45 Lai, Emily R. and Waltman, Kris This study analyzed questionnaire and interview data on teachers' practices and perceptions with respect to test preparation. Questionnaire respondents were asked to rate the ethicality of various test-preparation practices and indicate the extent to which they utilize these practices in their instruction. On the basis of questionnaire results, interviews were conducted with a smaller sample of teachers to determine their views on the appropriateness of particular test-preparation practices, and to determine the factors affecting teacher perceptions about a given activity. Contrary to previous empirical work, questionnaire results indicated that neither use of a given practice nor teacher perceptions of the ethicality of the practice vary across levels of student achievement. On the other hand, consistent with previous empirical work, both use and perceptions varied across grade-level configuration. Estimates of the prevalence of particular teacher practices and perceptions were obtained and compared with those from the literature. In addition, dimensions of teacher reasoning were explored, indicating that when considering the appropriateness of a given practice, teachers consider the following factors: score meaning, learning, the potential for raising student scores, professional ethics, equity, and external perceptions. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Evaluating Growth for ELL Students: Implications for Accountability Policies. 2013 32 3 11-26 Lakin, Joni M. and Young, John W. In recent years, many U.S. states have introduced growth models as part of their educational accountability systems. Although the validity of growth-based accountability models has been evaluated for the general population, the impact of those models for English language learner (ELL) students, a growing segment of the student population, has not received sufficient attention. We evaluated three commonly used growth models: value tables or transition matrices, projection models, and student growth percentiles (SGP). The value table model identified more ELL students as on track to proficiency, but with lower accuracy for ELL students. 
The projection and SGP models were more accurate overall, but classified the fewest ELL students as on track and were less likely to identify ELL students who would later be proficient. We found that each model had significant trade-offs in terms of the decisions made for ELL students. These findings should be replicated in additional state contexts and considered in the development of future growth-based accountability policies. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Effects of Read-Aloud Accommodations for Students With and Without Disabilities: A Meta-Analysis. 2014 33 3 3-16 Li, Hongli Read-aloud accommodations have been proposed as a way to help remove barriers faced by students with disabilities in reading comprehension. Many empirical studies have examined the effects of read-aloud accommodations; however, the results are mixed. With a variance-known hierarchical linear modeling approach, based on 114 effect sizes from 23 studies, a meta-analysis was conducted to examine the effects of read-aloud accommodations for students with and without disabilities. In general, both students with disabilities and students without disabilities benefited from the read-aloud accommodations, and the accommodation effect size for students with disabilities was significantly larger than the effect size for students without disabilities. Further, this meta-analysis reveals important factors that influence the effects of read-aloud accommodations. For instance, the accommodation effect was significantly stronger when the subject area was reading than when the subject area was math. The effect of read-aloud accommodations was also significantly stronger when the test was read by human proctors than when it was read by video/audio players or computers. Finally, the implications, limitations, and directions for future research are discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Consistency of Standard Setting in an Augmented State Testing System. 2008 27 2 46-55 Lissitz, Robert W. and Wei, Hua In this article we address the issue of consistency in standard setting in the context of an augmented state testing program. Information gained from the external NRT scores is used to help make an informed decision on the determination of cut scores on the state test. The consistency of cut scores on the CRT across grades is maintained by forcing a consistency model based on the NRT scores and translating that information back to the CRT scores. 
The inconsistency of standards and the application of this model are illustrated using data from the Maryland MSA large state testing program involving cut points for basic, proficient and advanced in mathematics and reading across years and across grades. The model is discussed in some detail and shown to be a promising approach, although not without assumptions that must be made and issues that might be raised. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. 2014 33 2 19-28 Liu, Ou Lydia and Brew, Chris and Blackmore, John and Gerard, Libby and Madhok, Jacquie and Linn, Marcia C. Content-based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept-based scoring tool for content-based scoring, c-rater™, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed-response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept-based automated scoring. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Building and Supporting a Validity Argument for a Standards-Based Classroom Assessment of English Proficiency Based on Teacher Judgments. 2008 27 3 32-42 Llosa, Lorena Using an argument-based approach to validation, this study examines the quality of teacher judgments in the context of a standards-based classroom assessment of English proficiency. Using Bachman's (2005) assessment use argument (AUA) as a framework for the investigation, this paper first articulates the claims, warrants, rebuttals, and backing needed to justify the link between teachers' scores on the English Language Development (ELD) Classroom Assessment and the interpretations made about students' language ability. Then the paper summarizes the findings of two studies—one quantitative and one qualitative—conducted to gather the necessary backing to support the warrants and, in particular, address the rebuttals about teacher judgments in the argument. 
The quantitative study examined the assessment in relation to another measure of the same ability—the California English Language Development Test—using confirmatory factor analysis of multitrait-multimethod data and provided evidence in support of the warrant that states that the ELD Classroom Assessment measures English proficiency as defined by the California ELD Standards. The qualitative study examined the processes teachers engaged in while scoring the classroom assessment using verbal protocol analysis. The findings of this study serve to support the rebuttals in the validity argument that state that there are inconsistencies in teachers' scoring. The paper concludes by providing an explanation for these seemingly contradictory findings using the AUA as a framework and discusses the implications of the findings for the use of standards-based classroom assessments based on teacher judgments. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Reliably Assessing Growth with Longitudinal Diagnostic Classification Models. 2019 38 2 68-78 Madison, Matthew J. Recent advances have enabled diagnostic classification models (DCMs) to accommodate longitudinal data. These longitudinal DCMs were developed to study how examinees change, or transition, between different attribute mastery statuses over time. This study examines using longitudinal DCMs as an approach to assessing growth and serves three purposes: (1) to define and evaluate two reliability measures to be used in the application of longitudinal DCMs; (2) to demonstrate, through simulation, that longitudinal DCM growth estimates have increased reliability compared to longitudinal item response theory models; and (3) to illustrate, through an empirical analysis, the practical and interpretive benefits of longitudinal DCMs. A discussion describes how longitudinal DCMs can be used as practical and reliable psychometric models when categorical and criterion-referenced interpretations of growth are desired. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Impact of Examinee Performance Information on Judges' Cut Scores in Modified Angoff Standard-Setting Exercises. 2014 33 1 15-22 Margolis, Melissa J. and Clauser, Brian E. This research evaluated the impact of a common modification to Angoff standard-setting exercises: the provision of examinee performance data.
Data from 18 independent standard-setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations, pre- and post-data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard-setting exercises. This study is the first to provide a large-scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Effect of Content Knowledge on Angoff-Style Standard Setting Judgments. 2016 35 1 29-37 Margolis, Melissa J. and Mee, Janet and Clauser, Brian E. and Winward, Marcia and Clauser, Jerome C. Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges' content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff-style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly had a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in an effect similar to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and for judge recruitment specifically. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy.
Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) How Should Colleges Treat Multiple Admissions Test Scores? 2018 37 3 11-23 Mattern, Krista and Radunzel, Justine and Bertling, Maria and Ho, Andrew D. Abstract: The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first-year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar (r ≈ .40). Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction—although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Why Do Achievement Measures Underpredict Female Academic Performance? 2017 36 1 47-57 Mattern, Krista and Sanchez, Edgar and Ndum, Edwin In the context of college admissions, the current study examined whether differential prediction of first-year grade point average (FYGPA) by gender could be explained by an omitted variable problem, namely academic discipline, or the amount of effort a student puts into schoolwork and the degree to which a student sees him/herself as hardworking and conscientious. Based on nearly 10,000 college students, the current study found that differences in intercepts by gender were reduced by 45% with the inclusion of academic discipline in a model that already included high school grade point average (HSGPA) and ACT Composite score. Moreover, academic discipline resulted in an additional 4% of variance accounted for in FYGPA. Gender differences in slopes were not statistically significant (p > .001) regardless of whether academic discipline was included in the model. The findings highlight the utility of taking a more holistic approach when making college admission decisions. Namely, the inclusion of noncognitive measures has benefits that are twofold: increased predictive validity and reduced differential prediction.
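Differential prediction analyses of the kind reported in the Mattern, Sanchez, and Ndum entry above are commonly run as regressions with a group indicator and, for slope tests, group-by-predictor interactions. The sketch below is only illustrative: the column names (fygpa, hsgpa, act, discipline, female) and the data file are hypothetical, and this is not the authors' code.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical student-level data with first-year GPA, predictors, and a 0/1 gender indicator.
    df = pd.read_csv("students.csv")

    # Intercept differences: the coefficient on 'female' captures over-/under-prediction by gender.
    base = smf.ols("fygpa ~ hsgpa + act + female", data=df).fit()

    # Adding the noncognitive measure; a smaller 'female' coefficient means reduced intercept
    # differences, and the R-squared gain is its incremental variance explained.
    full = smf.ols("fygpa ~ hsgpa + act + discipline + female", data=df).fit()

    # Slope differences by gender are tested with interaction terms.
    slopes = smf.ols("fygpa ~ (hsgpa + act) * female", data=df).fit()

    print(base.params["female"], full.params["female"], full.rsquared - base.rsquared)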
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Uncovering Multivariate Structure in Classroom Observations in the Presence of Rater Errors. 2015 34 2 34-46 McCaffrey, Daniel F. and Yuan, Kun and Savitsky, Terrance D. and Lockwood, J. R. and Edelen, Maria O. We examine the factor structure of scores from the CLASS-S protocol obtained from observations of middle school classroom teaching. Factor analysis has been used to support both interpretations of scores from classroom observation protocols, like CLASS-S, and the theories about teaching that underlie them. However, classroom observations contain multiple sources of error, most predominantly rater errors. We demonstrate that errors in scores made by two raters on the same lesson have a factor structure that is distinct from the factor structure at the teacher level. Consequently, the 'standard' approach of analyzing teacher-level average dimension scores can yield incorrect inferences about the factor structure at the teacher level and possibly misleading evidence about the validity of scores and theories of teaching. We consider alternative hierarchical estimation approaches designed to prevent the contamination of estimated teacher-level factors. These alternative approaches find a teacher-level factor structure for CLASS-S that consists of strongly correlated support and classroom management factors. Our results have implications for future studies using factor analysis on classroom observation data to develop validity evidence and test theories of teaching and for practitioners who rely on the results of such studies to support their use and interpretation of the classroom observation scores. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Impact of Process Instructions on Judges' Use of Examinee Performance Data in Angoff Standard Setting Exercises. 2013 32 3 27-35 Mee, Janet and Clauser, Brian E. and Margolis, Melissa J. Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision-making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content-based judgments into norm-referenced judgments.
This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items these were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would affect how heavily they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Cross-Validated Prediction of Academic Performance of First-Year University Students: Identifying Risk Factors in a Nonselective Environment. 2019 38 1 36-47 Meijer, Eline and Cleiren, Marc P. H. D. and Dusseldorp, Elise and Buurman, Vincent J. C. and Hogervorst, Roel M. and Heiser, Willem J. Early prediction of academic performance is important for student support. The authors explored, in a multivariate approach, whether pre-entry data (e.g., high school study results, preparative activities, expectations, capabilities, motivation, and attitude) could predict university students' first-year academic performance. Preregistered applicants for a bachelor's program filled out an intake questionnaire before study entry. Outcome data (first-year grade point average, course credits, and attrition) were obtained 1 year later. Prediction accuracy was assessed by cross-validation. Students who performed better in preparatory education, followed a conventional educational path before entering, and expected to spend more time on a program-related organization performed better during their first year at university. Concrete preuniversity behaviors were more predictive than psychological attributions such as self-efficacy. Students with a "love of learning" performed better than leisure-oriented students. The intake questionnaire may be used for identifying up front who may need additional support, but is not suitable for student selection. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Examining the Reliability of Student Growth Percentiles Using Multidimensional IRT. 2015 34 4 21-30 Monroe, Scott and Cai, Li Student growth percentiles (SGPs, Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores.
Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate the SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Affordances of Item Formats and Their Effects on Test-Taker Cognition under Uncertainty. 2019 38 1 54-62 Moon, Jung Aa and Keehner, Madeleine and Katz, Irvin R. The current study investigated how item formats and their inherent affordances influence test-takers' cognition under uncertainty. Adult participants solved content-equivalent math items in multiple-selection multiple-choice and four alternative grid formats. The results indicated that participants' affirmative response tendency (i.e., judging the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test-taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large-scale educational assessments. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Evaluating the Predictive Value of Growth Prediction Models. 2014 33 2 5-13 Murphy, Daniel L. and Gaertner, Matthew N. This study evaluates four growth prediction models (projection, student growth percentile, trajectory, and transition table) commonly used to forecast (and give schools credit for) middle school students' future proficiency.
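The conditional-percentile idea behind SGPs in the Monroe and Cai entry above can be sketched with a single prior-year score and linear quantile regression; operational SGPs follow Betebenner (2009) and use B-spline quantile regression over multiple prior years, so the simulated scores and the linear specification below are simplifying assumptions, not the operational procedure.

    import numpy as np
    from statsmodels.regression.quantile_regression import QuantReg

    rng = np.random.default_rng(1)
    prior = rng.normal(500, 50, 2000)                   # simulated prior-year scale scores
    current = 0.8 * prior + rng.normal(100, 30, 2000)   # simulated current-year scale scores

    X = np.column_stack([np.ones_like(prior), prior])   # intercept + prior score

    def sgp(prior_score, current_score):
        """Highest percentile whose fitted conditional quantile still lies at or below the observed score."""
        pct = 0
        for q in range(1, 100):
            fit = QuantReg(current, X).fit(q=q / 100)
            if current_score >= fit.params[0] + fit.params[1] * prior_score:
                pct = q
        return pct

    # Roughly: "this student grew as much as or more than about this percentage of academic peers."
    print(sgp(prior_score=520, current_score=540))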
Analyses focused on vertically scaled summative mathematics assessments, and two performance standards conditions (high rigor and low rigor) were examined. Results suggest that, when 'status plus growth' is the accountability metric a state uses to reward or sanction schools, growth prediction models offer value above and beyond status-only accountability systems in most, but not all, circumstances. Predictive growth models offer little value beyond status-only systems if the future target proficiency cut score is rigorous. Conversely, certain models (e.g., projection) provide substantial additional value when the future target cut score is relatively low. In general, growth prediction models' predictive value is limited by a lack of power to detect students who are truly on-track. Limitations and policy implications are discussed, including the utility of growth projection models in assessment and accountability systems organized around ambitious college-readiness goals. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Determining Sufficient Measurement Opportunities When Using Multiple Cut Scores. 2008 27 1 37-46 Norman, Rebecca L. and Buckendahl, Chad W. Many educational testing programs report examinee performance at more than two levels of proficiency. Whether these assessments have the capacity to support these multiple inferences, though, is a topic that has not been widely discussed. This study proposes a method for evaluating the minimum number of measurement opportunities for reporting students’ performance at multiple achievement levels and describes an application of the method for reading and mathematics assessments that are used by some school districts in Nebraska. Analyses were based on judgments collected from 110 teachers about characteristics of items and tasks from multiple assessments in reading and mathematics at grades 4 and 8, and in high school. Results suggested that there were generally enough items on the mathematics assessments to classify students into two or three performance levels, but rarely enough to make the four classifications that the state reported. Items on the reading assessments were generally distributed across the proficiency levels and tended to allow reporting for all four classification levels. These findings have implications for both practitioners and policymakers in how scores are interpreted. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) A Snapshot of Industry and Academic Professional Activities, Compensation, and Engagement in Educational Measurement. 
2010 29 3 15-24 Packman, Sheryl and Camara, Wayne J. and Huff, Kristen This paper provides a snapshot of educational measurement professionals: their educational, professional, and demographic backgrounds, as well as their workplace settings, job tasks, professional involvement, and compensation practices. Two previous studies have surveyed employers, but this is the first attempt to conduct a comprehensive survey of measurement professionals. Five hundred and forty-two (31.5% response rate) measurement professionals, the vast majority of whom held a doctoral degree, responded to the survey from January to April 2007. Overall, these individuals were primarily employed in academic settings, research and testing organizations, and educational or governmental agencies. Results were reported across and within work setting, degree, and other demographic and background factors that may influence work, behavior, and compensation in educational measurement and assessment. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Mean Effects of Test Accommodations for ELLs and Non-ELLs: A Meta-Analysis of Experimental Studies. 2011 30 3 10-28 Pennock-Roman, Maria and Rivera, Charlene The objective was to examine the impact of different types of accommodations on performance in content tests such as mathematics. The meta-analysis included 14 U.S. studies that randomly assigned school-aged English language learners (ELLs) to test accommodation versus control conditions or used repeated measures in counter-balanced order. Individual effect sizes (Glass's d) were calculated for 50 groups of ELLs and 32 groups of non-ELLs. Individual effect sizes for English language and native language accommodations were classified into groups according to type of accommodation and timing conditions. Means and standard errors were calculated for each category. The findings suggest that accommodations that require extra printed materials need generous time limits for both the accommodated and unaccommodated groups to ensure that they are effective, equivalent in scale to the original test, and therefore more valid owing to reduced construct-irrelevant variance. Computer-administered glossaries were effective even when time limits were restricted. Although the Plain English accommodation had very small average effect sizes, inspection of individual effect sizes suggests that it may be much more effective for ELLs at intermediate levels of English language proficiency. For Spanish-speaking students with low proficiency in English, the Spanish test version had the highest individual effect size (+1.45). [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy.
Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Reliability and Validity of Bookmark-Based Methods for Standard Setting: Comparisons to Angoff-Based Methods in the National Assessment of Educational Progress. 2011 30 2 3-14 Peterson, Christina Hamme and Schulz, E. Matthew and Engelhard Jr., George Historically, Angoff-based methods were used to establish cut scores on the National Assessment of Educational Progress (NAEP). In 2005, the National Assessment Governing Board oversaw multiple studies aimed at evaluating the reliability and validity of Bookmark-based methods via a comparison to Angoff-based methods. As the Board considered adoption of Bookmark-based methods, it considered several criteria, including reliability of the cut scores, validity of the cut scores as evidenced by comparability of results to those from Angoff, and procedural validity as evidenced by panelist understanding of the method tasks and instructions and confidence in the results. As a result of their review, a Bookmark-based method was adopted for NAEP, and has been used since that time. This article goes beyond the Governing Board's initial evaluations to conduct a systematic review of 27 studies in NAEP research conducted over 15 years. This research is used to evaluate Bookmark-based methods on key criteria originally considered by the Governing Board. Findings suggest that Bookmark-based methods have comparable reliability, resulting cut scores, and panelist evaluations to Angoff. Given that Bookmark-based methods are shorter in duration and less costly, Bookmark-based methods may be preferable to Angoff for NAEP standard setting. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Methodological Choices in the Content Analysis of Textbooks for Measuring Alignment With Standards. 2015 34 3 10-17 Polikoff, Morgan S. and Zhou, Nan and Campbell, Shauna E. With the recent adoption of the Common Core standards in many states, there is a need for quality information about textbook alignment to standards. While there are many existing content analysis procedures, these generally have little, if any, validity or reliability evidence. One exception is the Surveys of Enacted Curriculum (SEC), which has been widely used to analyze the alignment among standards, assessments, and teachers' instruction. However, the SEC can be time-consuming and expensive when used for this purpose. This study extends the SEC to the analysis of entire mathematics textbooks and investigates whether the results of SEC alignment analyses are affected if the content analysis procedure is simplified. The results indicate that analyzing only every fifth item produces nearly identical alignment results with no effect on the reliability of content analyses. 
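The SEC-style alignment analyses in the Polikoff, Zhou, and Campbell entry above reduce two content matrices to a single index. The sketch below computes Porter's alignment index with made-up cell proportions; the matrices are hypothetical and are not data from the study.

    import numpy as np

    # Hypothetical content matrices (rows = topics, columns = cognitive demand levels),
    # each holding the proportion of textbook items or standards objectives in that cell.
    textbook = np.array([[0.10, 0.05, 0.00],
                         [0.20, 0.15, 0.05],
                         [0.25, 0.15, 0.05]])
    standards = np.array([[0.05, 0.10, 0.05],
                          [0.15, 0.20, 0.05],
                          [0.20, 0.10, 0.10]])

    # Porter's alignment index: 1 minus half the total absolute discrepancy between
    # the two proportion distributions (1 = identical emphasis, 0 = no overlap).
    alignment = 1.0 - 0.5 * np.abs(textbook - standards).sum()
    print(round(alignment, 3))   # 0.8 for these made-up matrices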
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Quality of Content Analyses of State Student Achievement Tests and Content Standards. 2008 27 4 2-14 Porter, Andrew C. and Polikoff, Morgan S. and Zeidner, Tim and Smithson, John This article examines the reliability of content analyses of state student achievement tests and state content standards. We use data from two states in three grades in mathematics and English language arts and reading to explore differences by state, content area, grade level, and document type. Using a generalizability framework, we find that reliabilities for four coders are generally greater than .80. The two problematic reliabilities are partly explained by an odd rater out. We conclude that the content analysis procedures, when used with at least five raters, provide reliable information to researchers, policymakers, and practitioners about the content of assessments and standards. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Equating Subscores under the Nonequivalent Anchor Test (NEAT) Design. 2011 30 1 23-35 Puhan, Gautam and Liang, Longjuan The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study and results showed that when more internal common items were available (i.e., 10-12 items), using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable as it resulted in a considerable amount of bias. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.
(Copyright applies to all Abstracts.) Using Response Time to Detect Item Preknowledge in Computer-Based Licensure Examinations. 2016 35 1 38-47 Qian, Hong and Staniewska, Dorota and Reckase, Mark and Woo, Ada This article addresses the issue of how to detect item preknowledge using item response time data in two computer-based large-scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Evaluating the Comparability of Paper- and Computer-Based Science Tests Across Sex and SES Subgroups. 2012 31 4 2-12 Randall, Jennifer and Sireci, Stephen and Li, Xueming and Kaira, Leah As access to and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer-based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper-based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we looked at the invariance of test structure and item functioning across test administration modes and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial-level invariance across SES subgroups, and moderate support for invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.
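Mode-comparability work like the Randall et al. entry above often supplements factor-analytic checks with differential item functioning statistics. The sketch below uses the Mantel-Haenszel common odds ratio, a simpler alternative to the Rasch-based DIF analyses the authors report; the group labels, column names, and data frame are hypothetical.

    import numpy as np
    import pandas as pd

    def mantel_haenszel_dif(correct, group, matching_score):
        """Mantel-Haenszel DIF for one dichotomous item, stratifying on a matching score."""
        data = pd.DataFrame({"correct": correct, "group": group, "score": matching_score})
        num, den = 0.0, 0.0
        for _, s in data.groupby("score"):
            a = ((s.group == "ref") & (s.correct == 1)).sum()     # reference group, correct
            b = ((s.group == "ref") & (s.correct == 0)).sum()     # reference group, incorrect
            c = ((s.group == "focal") & (s.correct == 1)).sum()   # focal group, correct
            d = ((s.group == "focal") & (s.correct == 0)).sum()   # focal group, incorrect
            n = len(s)
            num += a * d / n
            den += b * c / n
        alpha = num / den                 # common odds ratio across score strata
        return -2.35 * np.log(alpha)      # ETS delta scale; large absolute values flag DIF

    # Hypothetical usage: paper-based examinees as the reference group, computer-based as the focal group.
    # delta = mantel_haenszel_dif(df["item07"], df["mode_group"], df["rest_score"])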
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Same-Form Retest Effects on Credentialing Examinations. 2009 28 2 19-27 Raymond, Mark R. and Neustel, Sandra and Anderson, Dan Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Aligning an Early Childhood Assessment to State Kindergarten Content Standards: Application of a Nationally Recognized Alignment Framework. 2010 29 1 25-37 Roach, Andrew T. and McGrath, Dawn and Wixson, Corinne and Talapatra, Devadrita This article describes an alignment study conducted to evaluate the alignment between Indiana's Kindergarten content standards and items on the Indiana Standards Tool for Alternate Reporting. Alignment is the extent to which standards and assessments are in agreement, working together to guide educators' efforts to support children's learning and development. The alignment process in this study represented a modification of Webb's nationally recognized method of alignment analysis to early childhood assessments and standards. The alignment panel (N = 13) in this study consisted of early childhood educators and educational leaders from all geographic regions of the state. 
Panel members were asked to rate the depth of knowledge (DOK) stage of each objective in Kindergarten standards; rate the DOK stage for each item on the ISTAR rating scale; and identify the one or two objectives from the standards to which each ISTAR item corresponded. Analysis of the panel's responses suggested that the ISTAR inconsistently conformed to Webb's DOK consistency and ROK correspondence criteria for alignment. A promising finding was the strong alignment of the ISTAR Level F1 and F2 scales to the Kindergarten standards. This result provided evidence of the developmental continuum of skills and knowledge that are assessed by the ISTAR items. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Three Options Are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research. 2005 24 2 3-13 Rodriguez, Michael C. Multiple-choice items are a mainstay of achievement testing. The need to adequately cover the content domain to certify achievement proficiency by producing meaningful precise scores requires many high-quality items. More 3-option items can be administered than 4- or 5-option items per testing time while improving content coverage, without detrimental effects on psychometric quality of test scores. Researchers have endorsed 3-option items for over 80 years with empirical evidence—the results of which have been synthesized in an effort to unify this endorsement and encourage its adoption. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Is Teaching Experience Necessary for Reliable Scoring of Extended English Questions? 2009 28 2 2-8 Royal-Dawson, Lucy and Baird, Jo-Anne Hundreds of thousands of raters are recruited internationally to score examinations, but little research has been conducted on the selection criteria for these raters. Many countries insist upon teaching experience as a selection criterion and this has frequently become embedded in the cultural expectations surrounding the tests. Shortages of raters for some of England's national examinations have led to non-teachers being hired to score a small minority of items and changes in technology have fostered this approach. For a National Curriculum test in English taken at age 14, this study investigated whether teaching experience was a necessary selection criterion for all aspects of the examination. Fifty-seven raters with different backgrounds were trained in the normal manner and scored the same 97 students' work.
Accuracy was investigated using a cross-classified multilevel model of absolute score differences with accuracy measures at level 1 and raters crossed with candidates at level 2. By comparing the scoring accuracy of graduates with a degree in English, teacher trainees, experienced teachers, and experienced raters, this study found that teaching experience was not a necessary selection criterion. A rudimentary model for allocation of raters to different question types is proposed and further research to investigate the limits of necessary qualifications for scoring is suggested. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Impact of Both Local Item Dependencies and Cut-Point Locations on Examinee Classifications. 2018 37 3 40-45 Rubright, Jonathan D. Abstract: Performance assessments, scenario-based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low-ability examinees to have their scores overestimated and high-ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether the location of the cut-point is important for correct classifications remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system's cut-points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Measuring Widening Proficiency Differences in International Assessments: Are Current Approaches Enough? 2018 37 4 40-48 Rutkowski, David and Rutkowski, Leslie and Liaw, Yuan-Ling Participation in international large-scale assessments has grown over time with the largest, the Programme for International Student Assessment (PISA), including more than 70 education systems that are economically and educationally diverse.
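The classification-accuracy language in the Rubright entry above (sensitivity and specificity of pass/fail decisions at a cut point) can be made concrete with a small simulation; the abilities, the amount of estimation error, and the cut score below are all invented for illustration and do not reproduce the study's conditions.

    import numpy as np

    rng = np.random.default_rng(7)
    true_theta = rng.normal(0, 1, 5000)
    # Estimated ability = shrunken true ability plus error; ignoring local item
    # dependencies would typically add error of this kind.
    est_theta = 0.85 * true_theta + rng.normal(0, 0.5, 5000)

    cut = 0.0
    truly_pass = true_theta >= cut
    classified_pass = est_theta >= cut

    sensitivity = (truly_pass & classified_pass).sum() / truly_pass.sum()
    specificity = (~truly_pass & ~classified_pass).sum() / (~truly_pass).sum()
    print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")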
To help accommodate large achievement differences among participants, in 2009 PISA offered low-performing systems the option of including an easier set of items in the assessment with the aim of providing improved achievement estimates. However, there remains a lack of evidence on the performance of this design innovation. As such, we simulate a design that closely mirrors the PISA 2015 math assessment in order to empirically examine the benefits of including easy items for low-performing countries. We extend the PISA design to include increased numbers of easy items and items that are easier than currently implemented. Findings show that the current PISA approach provides little advantage compared to a common test for all participants. Our study also demonstrates persistent bias, low coverage rates, and low correlations between generating and estimated proficiency under current designs. Through our simulation, we also show that, to improve achievement estimation for low performers, about half of the items would need to be made significantly easier. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Resistance to Confounding Style and Content in Scoring Constructed-Response Items. 2005 24 2 22-28 Schafer, William D. and Gagné, Phill and Lissitz, Robert W. An assumption that is fundamental to the scoring of student-constructed responses (e.g., essays) is the ability of raters to focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates independent of the quality with which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and scores that resulted from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small in mathematics, reading, science, and social studies items. They were relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful and from the first four that trained scorers are reasonably well able to differentiate writing quality from other achievement constructs in rating student responses. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.
(Copyright applies to all Abstracts.) Covariate Measurement Error Correction for Student Growth Percentiles Using the SIMEX Method. 2015 34 1 4-14 Shang, Yi and VanIwaarden, Adam and Betebenner, Damian W. In this study, we examined the impact of covariate measurement error (ME) on the estimation of quantile regression and student growth percentiles (SGPs), and found that SGPs tend to be overestimated among students with higher prior achievement and underestimated among those with lower prior achievement, a problem we describe as ME endogeneity in this article. We proceeded to assess the effect of covariate ME correction on SGP estimation at two levels: the individual (student) and the aggregate (classroom). Our ME correction approach was limited to the simulation-extrapolation method known as SIMEX. For both the individual and aggregate SGP, we find SIMEX effective in bias reduction. Further, because SIMEX is especially effective in reducing SGP bias for students with very high or very low prior achievement, it significantly weakens the ME endogeneity. SIMEX is also effective in reducing the MSE of aggregate SGP, provided that the students are sorted to some extent on their latent prior achievement. Our empirical study confirms the pattern of the simulation results: SIMEX mainly affects the mean SGP of classes in the highest and lowest quintiles of the prior score distribution, and significantly lowers the correlation between class SGP and prior achievement. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Bunny Hill or Black Diamond: Differences in Advanced Course-Taking in College as a Function of Cognitive Ability and High School GPA. 2019 38 1 25-35 Shewach, Oren R. and McNeal, Kyle D. and Kuncel, Nathan R. and Sackett, Paul R. College students commonly have considerable course choice, and they can differ substantially in the proportion of their coursework taken at an advanced level. While advanced coursework is generally viewed as a desirable component of a student's education, research has rarely explored differences in student course-taking patterns as a measure of academic success in college. We examined the relationship between the SAT, high school grade point average (HSGPA), and the amount of advanced coursework taken in a sample of 62 colleges and 188,985 students. We found that both the SAT and HSGPA predict enrollment in advanced courses, even after controlling for advanced placement (AP) credits and demographic variables. The SAT subtests of Critical Reading, Writing, and Math displayed differential relationships with advanced course-taking depending on student major. Gender and race/ethnicity were also related to advanced course-taking, with women taking more advanced courses in all major categories except for science, technology, engineering, and mathematics (STEM), where they took fewer, even after controlling for other variables. Socioeconomic status had a negligible relationship with advanced course-taking.
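The simulation-extrapolation (SIMEX) idea in the Shang, VanIwaarden, and Betebenner entry above can be sketched for a single error-prone regression slope: add extra measurement error at several multiples of the known error variance, watch the naive estimate degrade, and extrapolate the trend back to the no-error case (lambda = -1). The data, the error variance, and the quadratic extrapolant below are illustrative assumptions, not the study's quantile-regression setup.

    import numpy as np

    rng = np.random.default_rng(3)
    n, sigma_u2 = 5000, 0.25                                # known covariate measurement-error variance
    x_true = rng.normal(0, 1, n)
    y = 1.0 * x_true + rng.normal(0, 1, n)
    x_obs = x_true + rng.normal(0, np.sqrt(sigma_u2), n)    # error-prone covariate

    def slope(x, y):
        return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

    lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    est = []
    for lam in lambdas:
        # Average the naive slope over pseudo data sets with extra error of variance lam * sigma_u2.
        sims = [slope(x_obs + rng.normal(0, np.sqrt(lam * sigma_u2), n), y) for _ in range(50)]
        est.append(np.mean(sims))

    # Fit a quadratic in lambda and extrapolate to lambda = -1, i.e., no measurement error.
    coeffs = np.polyfit(lambdas, est, deg=2)
    print(f"naive slope = {est[0]:.3f}, SIMEX-corrected slope = {np.polyval(coeffs, -1.0):.3f}")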
This research broadens our understanding of academic achievement in college and the goals of admissions in higher education. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Differential Prediction in the Use of the SAT and High School Grades in Predicting College Performance: Joint Effects of Race and Language. 2017 36 3 46-57 Shewach, Oren R. and Shen, Winny and Sackett, Paul R. and Kuncel, Nathan R. The literature on differential prediction of college performance of racial/ethnic minority students for standardized tests and high school grades indicates the use of these predictors often results in overprediction of minority student performance. However, these studies typically involve native English-speaking students. In contrast, a smaller literature on language proficiency suggests academic performance of those with more limited English language proficiency may be underpredicted by standardized tests. These two literatures have not been well integrated, despite the fact that a number of racial/ethnic minority groups within the United States contain recent immigrant populations or heritage language speakers. This study investigates the joint role of race/ethnicity and language proficiency in Hispanic, Asian, and White ethnic groups across three educational admissions systems (SAT, HSGPA, and their composite) in predicting freshman grades. Our results indicate that language may differentially affect academic outcomes for different racial/ethnic subgroups. The SAT loses predictive power for Asian and White students who speak another best language, whereas it does not for Hispanic students who speak another best language. The differential prediction of college grades of linguistic minorities within racial/ethnic minority subgroups appears to be driven by the verbally loaded subtests of standardized tests but is largely unrelated to quantitative tests. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Predicting College Performance of American Indians: A Large-Sample Examination of the SAT. 2017 36 2 24-33 Shu, Siwen and Kuncel, Nathan R. and Sackett, Paul R. Extensive research has examined the validity and fairness of standardized tests in academic admissions. However, due to their underrepresentation in higher education, American Indians have gained much less attention in this research. 
In the present study, we examined for American Indian students (1) group differences on SAT scores, (2) the predictive and incremental validity of SAT over high school grades, (3) the effect of socioeconomic status on SAT validity, (4) differential prediction in the use of SAT scores, and (5) potential omitted variables that could explain differential prediction for American Indian students. Results provided evidence of predictive and incremental validity of SAT scores, and the validity of SAT scores was largely independent of socioeconomic status. Overprediction was found when using SAT scores to predict college performance and it was reduced when including high school grades as an additional predictor. This study provides substantial evidence of the validity and fairness of SAT scores for American Indians. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Subscores Based on Classical Test Theory: To Report or Not to Report. 2007 26 4 21-28 Sinharay, Sandip and Haberman, Shelby and Puhan, Gautam There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) How Often Is the Misfit of Item Response Theory Models Practically Significant? 2014 33 1 23-35 Sinharay, Sandip and Haberman, Shelby J. Standard 3.9 of the Standards for Educational and Psychological Testing demands evidence of model fit when item response theory (IRT) models are fit to data from tests. Hambleton and Han, as well as Sinharay, recommended the assessment of practical significance of misfit of IRT models, but few examples of such assessment can be found in the literature concerning IRT model fit.
In this article, practical significance of misfit of IRT models was assessed using data from several tests that employ IRT models to report scores. The IRT model did not fit any data set considered in this article. However, the extent of practical significance of misfit varied over the data sets. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Generalizability of Cognitive Interview-Based Measures Across Cultural Groups. 2009 28 2 9-18 Solano-Flores, Guillermo and Min Li We addressed the challenge of scoring cognitive interviews in research involving multiple cultural groups. We interviewed 123 fourth- and fifth-grade students from three cultural groups to probe how they related a mathematics item to their personal lives. Item meaningfulness—the tendency of students to relate the content and/or context of an item to activities in which they are actors—was scored from interview transcriptions with a procedure similar to the scoring of constructed-response tasks. Generalizability theory analyses revealed a small amount of score variation due to the main and interaction effect of rater but a sizeable magnitude of measurement error due to the interaction of person and question (context). Students from different groups tended to draw on different sets of contexts of their personal lives to make sense of the item. In spite of individual and potential cultural communication style differences, cognitive interviews can be reliably scored by well-trained raters with the same kind of rigor used in the scoring of constructed-response tasks. However, to make valid generalizations of cognitive interview-based measures, a considerable number of interview questions may be needed. Information obtained with cognitive interviews for a given cultural group may not be generalizable to other groups. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Use of Generalizability (G) Theory in the Testing of Linguistic Minorities. 2006 25 1 13-22 Solano-Flores, Guillermo and Min Li We contend that generalizability (G) theory allows the design of psychometric approaches to testing English-language learners (ELLs) that are consistent with current thinking in linguistics. We used G theory to estimate the amount of measurement error due to code (language or dialect). 
Fourth- and fifth-grade ELLs, native speakers of Haitian-Creole from two speech communities, were given the same set of mathematics items in the standard English and standard Haitian-Creole dialects (Sample 1) or in the standard and local dialects of Haitian-Creole (Samples 2 and 3). The largest measurement error observed was produced by the interaction of student, item, and code. Our results indicate that the reliability and dependability of ELL achievement measures is affected by two facts that operate in combination: Each test item poses a unique set of linguistic challenges and each student has a unique set of linguistic strengths and weaknesses. This sensitivity to language appears to take place at the level of dialect. Also, students from different speech communities within the same broad linguistic group may differ considerably in the number of items needed to obtain dependable measures of their academic achievement. Whether students are tested in English or in their first language, dialect variation needs to be considered if language as a source of measurement error is to be effectively addressed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Effects of Inattentive Responding on Construct Validity Evidence When Measuring Social–Emotional Learning Competencies. 2019 38 2 101-111 Steedle, Jeffrey T. and Hong, Maxwell and Cheng, Ying Self-report inventories are commonly administered to measure social-emotional learning competencies related to college and career readiness. Inattentive responding can negatively impact the validity of interpreting individual results and the accuracy of construct validity evidence. This study applied nine methods of detecting insufficient effort responding (IER) to a social-emotional learning assessment. Individual methods identified between 0.9% and 20.3% of respondents as potentially exhibiting IER. Removing flagged respondents from the data resulted in negligible or small improvements in criterion-related validity, coefficient alpha, concurrent validity, and confirmatory factor analysis model-data fit. Implications for future validity studies and the operational use of IER detection for social–emotional learning assessments are discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Effects of Assigning Raters to Items. 2008 27 1 47-55 Sykes, Robert C.
and Ito, Kyoko and Wang, Zhen Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers, though fewer than the number of items. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Setting Standards for English Foreign Language Assessment: Methodology, Validation, and a Degree of Arbitrariness. 2013 32 2 15-25 Tiffin-Richards, Simon P. and Anand Pant, Hans and Köller, Olaf Cut-scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard-setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut-score recommendations, as well as significant cut-score judgment revision over cut-score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut-score recommendations using the widely employed bookmark method. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Rater Agreement in Test-to-Curriculum Alignment Reviews. 2018 37 3 55-64 Traynor, A. and Merzdorf, H. E.
Abstract: During the development of large-scale curricular achievement tests, recruited panels of independent subject-matter experts use systematic judgmental methods—often collectively labeled "alignment" methods—to rate the correspondence between a given test's items and the objective statements in a particular curricular standards document. High disagreement among the expert panelists may indicate problems with training, feedback, or other steps of the alignment procedure. Existing procedural recommendations for alignment reviews have been derived largely from single-panel research studies; support for their use during operational large-scale test development may be limited. Synthesizing data from more than 1,000 alignment reviews of state achievement tests, this study identifies features of test–standards alignment review procedures that impact agreement about test item content. The researchers then use their meta-regression results to propose some practical suggestions for alignment review implementation. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Motivation for Educational Attainment in Grade 9 Predicts High School Completion. 2019 38 2 27-40 West, Stephen G. and Hughes, Jan N. and Kim, Han Joe and Bauer, Shelby S. The Motivation for Educational Attainment (MEA) questionnaire, developed to assess facets related to early adolescents' motivation to complete high school, has a bifactor structure with a large general factor and three smaller orthogonal specific factors (teacher expectations, peer aspirations, value of education). This prospective validity study investigated the utility of each factor in predicting high school dropout or completion of a General Educational Development (GED) certificate versus completion of a high school degree. Participants were 474 (55.1% male) ethnically diverse students who were originally recruited into a larger longitudinal study in Grade 1 on the basis of academic risk. Fourteen years later, 373 had obtained a high school diploma, 15 had obtained a GED, and 86 had dropped out of high school. During their first year of Grade 9, participants were administered the MEA. Using multinomial logistic regression with high school graduation as the reference outcome and controlling for Grade 9 letter grades, reading and math test scores, gender, and ethnic/racial group status, scores on the latent general factor and the latent peer aspirations factor predicted high school dropout versus high school graduation status. Neither the general factor nor any of the three specific factors predicted GED completion versus high school graduation. Ethnicity, but not gender, moderated the associations between scores on the general factor and high school graduation versus dropout. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission.
However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Measurement Invariance in Confirmatory Factor Analysis: An Illustration Using IQ Test Performance of Minorities. 2010 29 3 39-47 Wicherts, Jelte M. and Dolan, Conor V. Measurement invariance with respect to groups is an essential aspect of the fair use of scores of intelligence tests and other psychological measurements. It is widely believed that equal factor loadings are sufficient to establish measurement invariance in confirmatory factor analysis. Here, it is shown why establishing measurement invariance with confirmatory factor analysis requires a statistical test of the equality over groups of measurement intercepts. Without this essential test, measurement bias may be overlooked. A re-analysis of a study of ethnic differences on the RAKIT IQ test illustrates that ignoring intercept differences may lead to the conclusion that bias of IQ tests with respect to minorities is small, while in reality bias is quite severe. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Detecting Measurement Disturbances in Rater-Mediated Assessments. 2017 36 4 44-51 Wind, Stefanie A. and Schumacker, Randall E. The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start-up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model-data fit statistics.
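The point made in the Wicherts and Dolan abstract above, that equal factor loadings alone do not establish measurement invariance, can be illustrated with a small simulation. The sketch below is hypothetical (all parameter values are invented): two groups share the same latent mean and the same loadings, yet a shift in one measurement intercept produces an observed-score gap that would be misread as a latent difference if intercept equality were never tested.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Both groups have identical latent ability distributions.
theta_a = rng.normal(0.0, 1.0, n)
theta_b = rng.normal(0.0, 1.0, n)

loadings = np.array([0.8, 0.7, 0.6])       # equal across groups (metric invariance)
intercepts_a = np.array([0.0, 0.0, 0.0])
intercepts_b = np.array([0.0, -0.4, 0.0])  # one biased intercept in group B

def subtest_scores(theta, intercepts):
    """Generate three subtest scores from a one-factor model plus noise."""
    noise = rng.normal(0.0, 0.5, (theta.size, loadings.size))
    return intercepts + np.outer(theta, loadings) + noise

total_a = subtest_scores(theta_a, intercepts_a).sum(axis=1)
total_b = subtest_scores(theta_b, intercepts_b).sum(axis=1)

# Despite equal loadings and equal latent means, the composites differ:
print(f"mean total, group A: {total_a.mean():.2f}")
print(f"mean total, group B: {total_b.mean():.2f}")  # lower purely from intercept bias
```

Testing equality of the intercepts across groups (scalar invariance) in a multi-group confirmatory factor analysis is what would expose the biased indicator in an example like this one.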
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Taking the Time to Improve the Validity of Low-Stakes Tests: The Effort-Monitoring CBT. 2006 25 2 21-30 Wise, Steven L. and Bhola, Dennison S. and Sheng-Ta Yang The attractiveness of computer-based tests (CBTs) is due largely to their capability to expand the ways we conduct testing. A relatively unexplored application, however, is actively using the computer to reduce construct-irrelevant variance while a test is being administered. This investigation introduces the effort-monitoring CBT, in which the computer monitors examinee effort (based on item response time) in a low-stakes test and displays warning messages to those exhibiting rapid-guessing behavior. The results of an experimental study are presented, which showed that an effort-monitoring CBT increased examinee effort and yielded more valid test scores than a conventional CBT. Thus, unlike previous research that has focused on identifying rapid-guessing behavior after it has occurred, the effort-monitoring CBT proactively attempts to suppress rapid-guessing behavior. This innovative testing procedure extends the capabilities of measurement practitioners to manage the psychometric challenges posed by unmotivated examinees. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Validating English Language Proficiency Assessment Uses for English Learners: Academic Language Proficiency and Content Assessment Performance. 2016 35 2 6-18 Wolf, Mikyung Kim and Faulkner-Bond, Molly States use standards-based English language proficiency (ELP) assessments to inform relatively high-stakes decisions for English learner (EL) students. Results from these assessments are one of the primary criteria used to determine EL students' level of ELP and readiness for reclassification. The results are also used to evaluate the effectiveness of and funding allocation to district or school programs that serve EL students. In an effort to provide empirical validity evidence for such important uses of ELP assessments, this study focused on examining the constructs of ELP assessments as a fundamental validity issue. Particularly, the study examined the types of language proficiency measured in three sample states' ELP assessments and the relationship between each type of language proficiency and content assessment performance. The results revealed notable variation in the presence of academic and social language in the three ELP assessments.
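The effort-monitoring CBT described in the Wise, Bhola, and Yang abstract above is essentially a response-time filter applied while the test runs. The sketch below is a hypothetical, simplified version of that logic; the thresholds, message text, class name, and warning rule are assumptions rather than the authors' implementation. Responses faster than an item-level rapid-guessing threshold are flagged, and a warning is shown once flags accumulate.

```python
from dataclasses import dataclass, field

@dataclass
class EffortMonitor:
    """Flag rapid-guessing behavior from item response times (seconds)."""
    thresholds: dict          # item_id -> rapid-guess threshold, e.g., from RT distributions
    warn_after: int = 3       # show a warning after this many flagged responses
    flags: list = field(default_factory=list)

    def record(self, item_id: str, response_time: float) -> str | None:
        if response_time < self.thresholds.get(item_id, 2.0):
            self.flags.append(item_id)
            if len(self.flags) % self.warn_after == 0:
                return ("Please slow down and give each question your best effort; "
                        "your answers are being submitted very quickly.")
        return None

monitor = EffortMonitor(thresholds={"item01": 3.0, "item02": 2.5, "item03": 4.0})
for item, rt in [("item01", 1.2), ("item02", 0.9), ("item03", 1.5), ("item01", 10.4)]:
    message = monitor.record(item, rt)
    if message:
        print(f"After {item}: {message}")
```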
A series of hierarchical linear modeling (HLM) analyses also revealed varied relationships among social language proficiency, academic language proficiency, and content assessment performance. The findings highlight the importance of examining the constructs of ELP assessments for making appropriate interpretations and decisions based on the assessment scores for EL students. Implications for policy and practice are discussed. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Application of Latent Trait Models to Identifying Substantively Interesting Raters. 2012 31 3 31-37 Wolfe, Edward W. and McVay, Aaron Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings has been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration: simulated data and data from a large-scale state-wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) The Issue of Range Restriction in Bookmark Standard Setting. 2015 34 2 47-54 Wyse, Adam E. This article uses data from a large-scale assessment program to illustrate the potential issue of range restriction with the Bookmark method in the context of trying to set cut scores to closely align with a set of college and career readiness benchmarks. Analyses indicated that range restriction issues existed across different response probability (RP) values and item response theory (IRT) models if one were to apply the Bookmark procedure using intact test forms. Results also suggested that range restriction may still be present if one had access to additional data from an item bank. This demonstration critically highlights challenges that may exist in some practical applications of the Bookmark method due to items not being designed to cover the full range of examinee abilities.
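The range-restriction problem in the Wyse abstract above comes from the Bookmark method's mapping of items onto the ability scale: each item is located at the ability where the probability of success equals the chosen response probability (RP), so attainable cut scores are bounded by the items actually in the ordered booklet. The sketch below is an illustrative computation under a 2PL model with invented item parameters; it simply shows how an intact form whose items cluster in difficulty leaves parts of the scale unreachable.

```python
import numpy as np

def bookmark_location(a: float, b: float, rp: float) -> float:
    """Ability at which a 2PL item has P(correct) = rp: theta = b + ln(rp/(1-rp))/a."""
    return b + np.log(rp / (1.0 - rp)) / a

# Invented item parameters for an intact form: difficulties cluster near the middle.
discrimination = np.array([1.2, 0.9, 1.1, 1.0, 1.3, 0.8])
difficulty     = np.array([-0.4, -0.2, 0.0, 0.1, 0.3, 0.5])

for rp in (0.50, 0.67, 0.80):
    locs = np.sort(bookmark_location(discrimination, difficulty, rp))
    print(f"RP = {rp:.2f}: bookmark locations span "
          f"[{locs.min():.2f}, {locs.max():.2f}] on the theta scale")

# Any cut score a panel can set by placing a bookmark in this booklet is confined to
# the printed span for the chosen RP; benchmarks outside that span cannot be matched.
```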
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) An Investigation of Undefined Cut Scores With the Hofstee Standard-Setting Method. 2017 36 4 28-34 Wyse, Adam E. and Babcock, Ben This article provides an overview of the Hofstee standard-setting method and illustrates several situations where the Hofstee method will produce undefined cut scores. The situations where the cut scores will be undefined involve cases where the line segment derived from the Hofstee ratings does not intersect the score distribution curve based on actual exam performance data. Data from 15 standard settings performed by a credentialing organization are used to investigate how common undefined cut scores are with the Hofstee method and to compare cut scores derived from the Hofstee method with those from the Beuk method. Results suggest that when Hofstee cut scores exist, the Hofstee and Beuk methods often yield fairly similar results. However, it is shown that undefined Hofstee cut scores did occur in a few situations. When Hofstee cut scores are undefined, it is suggested that one extend the Hofstee line segment so that it intersects the score distribution curve to estimate cut scores. Analyses show that extending the line segment to estimate cut scores often yields similar results to the Beuk method. The article concludes with a discussion of what these results may imply for people who want to employ the Hofstee method. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Validity Evidence of an Electronic Portfolio for Preservice Teachers. 2008 27 1 10-24 Yao, Yuankun and Thomas, Matt and Nickens, Nicole and Downing, Joyce Anderson and Burkett, Ruth S. and Lamson, Sharon This study applied Messick's unified, multifaceted concept of construct validity to an electronic portfolio system used in a teacher education program. The subjects included 128 preservice teachers who recently completed their final portfolio reviews and student teaching experiences. Four of Messick's six facets of validity were investigated for the portfolio in this study, along with a discussion of the remaining facets examined in two previous studies. The evidence provided support for the substantive and generalizability aspects of validity, and limited support for the content, structural, external, and consequential aspects of validity. It was suggested that the electronic portfolio may be used as one requirement for certification purposes, but may not be valid for the purpose of assessing teacher competencies.
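The undefined-cut-score situation in the Wyse and Babcock abstract above has a direct geometric reading: the Hofstee line runs from (minimum acceptable cut score, maximum acceptable failure rate) to (maximum acceptable cut score, minimum acceptable failure rate), and the cut score is where that segment crosses the observed failure-rate curve. The sketch below is a hypothetical illustration with invented ratings and scores; when the segment and the curve never cross, it reports the cut as undefined, which is the case the article proposes handling by extending the segment.

```python
import numpy as np

def hofstee_cut(scores, k_min, k_max, f_min, f_max):
    """Return the percent-correct cut where the Hofstee line meets the failure-rate
    curve, or None if the segment never crosses it (an 'undefined' cut score)."""
    cuts = np.linspace(k_min, k_max, 501)
    # Observed failure rate at each candidate cut: proportion scoring below the cut.
    fail_curve = np.array([(scores < c).mean() for c in cuts])
    # Hofstee line: failure rate falls linearly from f_max at k_min to f_min at k_max.
    line = f_max + (f_min - f_max) * (cuts - k_min) / (k_max - k_min)
    diff = fail_curve - line
    crossing = np.where(np.sign(diff[:-1]) != np.sign(diff[1:]))[0]
    return None if crossing.size == 0 else float(cuts[crossing[0]])

rng = np.random.default_rng(7)
scores = rng.normal(72, 10, 2000)   # invented percent-correct scores

print(hofstee_cut(scores, k_min=60, k_max=80, f_min=0.05, f_max=0.30))  # a defined cut
print(hofstee_cut(scores, k_min=40, k_max=55, f_min=0.10, f_max=0.30))  # no crossing: None
```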
[ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Systematic Comparison of Decision Accuracy of Complex Compensatory Decision Rules Combining Multiple Tests in a Higher Education Context. 2018 37 3 24-39 Yocarini, Iris E. and Bouwmeester, Samantha and Smeets, Guus and Arends, Lidia R. Abstract: This real-data-guided simulation study systematically evaluated the decision accuracy of complex decision rules combining multiple tests within different realistic curricula. Specifically, complex decision rules combining conjunctive aspects and compensatory aspects were evaluated. A conjunctive aspect requires a minimum level of performance, whereas a compensatory aspect requires an average level of performance. Simulations were performed to obtain students' true and observed score distributions and to manipulate several factors relevant to a higher education curriculum in practice. The results showed that the decision accuracy depends on the conjunctive (required minimum grade) and compensatory (required grade point average) aspects and their combination. Overall, within a complex compensatory decision rule the false negative rate is lower and the false positive rate higher compared to a conjunctive decision rule. For a conjunctive decision rule the reverse is true. Which rule is more accurate also depends on the average test reliability, average test correlation, and the number of reexaminations. This comparison highlights the importance of evaluating decision accuracy in high-stakes decisions, considering both the specific rule and the selected measures. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Grade-Level Invariance of a Theoretical Causal Structure Predicting Reading Comprehension With Vocabulary and Oral Reading Fluency. 2005 24 3 4-12 Yovanoff, Paul and Duesbery, Luke and Alonzo, Julie and Tindal, Gerald This research investigates the relative importance of vocabulary and oral reading fluency as measurement dimensions of reading comprehension as the student passes from elementary to high school. Invariance of this model over grades 4 through 8 is tested using two independent student samples reading grade-level appropriate passages. Results from structural equation modeling indicate that the model is not invariant across grade levels. Vocabulary knowledge is a significant and constant predictor of overall reading comprehension irrespective of grade level. While significant, fluency effects diminish over grades, especially in the later grades.
Lack of grade level invariance was obtained with both samples. Results are discussed in light of vertically linked reading assessments, adequate yearly progress, and instruction. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Predicting College Performance of Homeschooled Versus Traditional Students. 2016 35 4 31-39 Yu, Martin C. and Sackett, Paul R. and Kuncel, Nathan R. The prevalence of homeschooling in the United States is increasing. Yet little is known about how commonly used predictors of postsecondary academic performance (SAT, high school grade point average [HSGPA]) perform for homeschooled students. Postsecondary performance at 140 colleges and universities was analyzed comparing a sample of traditional students matched to a sample of 732 homeschooled students on four demographic variables, HSGPA, and SAT scores. The matched sample was drawn from 824,940 traditional students attending the same institutions as the homeschooled students, which permitted a very precise level of matching. This comparison did not show a difference in first-year college GPA (FGPA) or retention between homeschooled and traditional students. SAT scores predicted FGPA and retention equally well for both groups, but HSGPA was a weaker predictor for the homeschooled group. These results suggest that, among college students, those who were homeschooled perform similarly to traditionally educated students matched on demographics and academic preparedness, but there are practical implications for college admissions in the use of HSGPA versus standardized test scores for homeschooled students. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) Are There Gender Differences in How Students Write Their Essays? An Analysis of Writing Processes. 2019 38 2 14-26 Zhang, Mo and Bennett, Randy E. and Deane, Paul and Rijn, Peter W. This study compared gender groups on the processes used in writing essays in an online assessment. Middle-school students from four grades responded to essays in two persuasive subgenres, argumentation and policy recommendation. Writing processes were inferred from four indicators extracted from students' keystroke logs. In comparison to males, on average females not only obtained higher essay scores but also differed from males in their writing processes. Females entered text more fluently, engaged in more macro and local editing, and showed less need to pause at locations associated with planning (e.g., between bursts of text, at sentence boundaries).
That these differences were detected after controlling for essay scores suggests that they cannot be attributed solely to disparities in group writing skill. [ABSTRACT FROM AUTHOR] Copyright of Educational Measurement: Issues & Practice is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
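The Zhang et al. abstract above infers writing processes from keystroke logs. The sketch below shows, with an invented log format and invented thresholds, the general flavor of such indicators (typing fluency, editing events, and long pauses at sentence boundaries); it is illustrative only and is not the set of four indicators used in the study.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    time: float     # seconds since the start of the essay session
    key: str        # the character produced, or "BACKSPACE"

def process_indicators(log: list[KeyEvent], pause_threshold: float = 2.0) -> dict:
    """Compute simple writing-process indicators from a keystroke log."""
    chars_typed = sum(1 for e in log if e.key != "BACKSPACE")
    edits = sum(1 for e in log if e.key == "BACKSPACE")
    duration = log[-1].time - log[0].time if len(log) > 1 else 0.0

    sentence_pauses = 0
    for prev, nxt in zip(log, log[1:]):
        gap = nxt.time - prev.time
        if gap >= pause_threshold and prev.key in ".!?":
            sentence_pauses += 1          # long pause right after ending a sentence

    return {
        "typing_rate_cps": chars_typed / duration if duration else 0.0,
        "edit_proportion": edits / max(len(log), 1),
        "sentence_boundary_pauses": sentence_pauses,
    }

log = [KeyEvent(0.0, "T"), KeyEvent(0.2, "h"), KeyEvent(0.4, "e"), KeyEvent(0.6, "."),
       KeyEvent(4.1, "I"), KeyEvent(4.3, "BACKSPACE"), KeyEvent(4.6, "W"), KeyEvent(4.8, "e")]
print(process_indicators(log))
```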