Browsing by Author "Levine, Michael V."
Appropriateness measurement for some multidimensional test batteries (1991)
Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.

Model-based methods for the detection of individuals inadequately measured by a test have generally been limited to unidimensional tests. Extensions of unidimensional appropriateness indices are developed here for multi-unidimensional tests (i.e., multidimensional tests composed of unidimensional subtests). Simulated and real data were used to evaluate the effectiveness of the multitest appropriateness indices. Very high rates of detection of spuriously high and spuriously low response patterns were obtained with the simulated data. These detection rates were comparable to rates obtained for long unidimensional tests (both simulated and real) with approximately the same number of items. For real data, similarly high detection rates were obtained in the spuriously high condition; slightly lower detection rates were observed for the spuriously low condition. Several directions for future research are described. Index terms: appropriateness measurement, item response theory, multidimensional tests, optimal appropriateness measurement, polychotomous measurement.

Detecting inappropriate test scores with optimal and practical appropriateness indices (1987)
Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.

Several statistics have been proposed as quantitative indices of the appropriateness of a test score as a measure of ability. Two criteria have been used to evaluate such indices in previous research. The first criterion, standardization, refers to the extent to which the conditional distributions of an index, given ability, are invariant across ability levels. The second criterion, relative power, refers to indices’ relative effectiveness for detecting inappropriate test scores. In this paper the effectiveness of nine appropriateness indices is determined in an absolute sense by comparing them to optimal indices; an optimal index is the most powerful index for a particular form of aberrance that can be computed from item responses. Three indices were found to provide nearly optimal rates of detection of very low ability response patterns modified to simulate cheating, as well as very high ability response patterns modified to simulate spuriously low responding. Optimal indices had detection rates from 50% to 200% higher than any other index when average ability response vectors were manipulated to appear spuriously high and spuriously low.
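For readers who want the mechanics of a practical (non-optimal) index of this kind, the sketch below computes the standardized log-likelihood index, commonly written lz (the "standardized l0" index also discussed in the 1986 article below). It assumes a three-parameter logistic model with known item parameters and an externally estimated ability; the item parameters, ability value, and response pattern are illustrative inventions, not values from the paper.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lz_index(u, p):
    """Standardized log-likelihood index lz = (l0 - E[l0]) / sqrt(Var[l0]).
    Large negative values flag response patterns that fit the model poorly."""
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e) / np.sqrt(v)

rng = np.random.default_rng(0)
n_items = 20
a = rng.uniform(0.8, 2.0, n_items)      # illustrative discriminations
b = np.linspace(-2.0, 2.0, n_items)     # difficulties from easy to hard
c = np.full(n_items, 0.2)               # guessing parameters

theta_hat = 0.0                         # externally estimated ability
p = p_3pl(theta_hat, a, b, c)
u = (b > 0.5).astype(int)               # odd pattern: correct only on hard items
print(f"lz = {lz_index(u, p):.2f}")     # strongly negative -> flagged as aberrant
```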
Fitting polytomous item response theory models to multiple-choice tests (1995)
Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D.

This study examined how well current software implementations of four polytomous item response theory models fit several multiple-choice tests. The models were Bock’s (1972) nominal model, Samejima’s (1979) multiple-choice Model C, Thissen & Steinberg’s (1984) multiple-choice model, and Levine’s (1993) maximum-likelihood formula scoring model. The parameters of the first three of these models were estimated with Thissen’s (1986) MULTILOG computer program; Williams & Levine’s (1993) FORSCORE program was used for Levine’s model. Tests from the Armed Services Vocational Aptitude Battery, the Scholastic Aptitude Test, and the American College Test Assessment were analyzed. The models were fit in estimation samples of approximately 3,000; cross-validation samples of approximately 3,000 were used to evaluate goodness of fit. Both fit plots and χ² statistics were used to determine the adequacy of fit. Bock’s model provided surprisingly good fit; adding parameters to the nominal model did not yield improvements in fit. FORSCORE provided generally good fit for Levine’s nonparametric model across all tests. Index terms: Bock’s nominal model, FORSCORE, maximum likelihood formula scoring, MULTILOG, polytomous IRT.

Item bias in a test of reading comprehension (1981)
Linn, Robert L.; Levine, Michael V.; Hastings, C. Nicholas; Wardrop, James L.

The possibility that certain features of items on a reading comprehension test may lead to biased estimates of the reading achievement of particular subgroups of students was investigated. Eight nonoverlapping subgroups of students were defined by the combinations of three factors: student grade level (fifth or sixth), income level of the neighborhood in which the school was located (low versus middle or above), and race of the student (black or white). Estimates of student ability and item parameters were obtained separately for each of the eight subgroups using the three-parameter logistic model. Bias indices were computed based on differences in item characteristic curves for pairs of subgroups. A criterion for labeling an item as biased was developed using the distribution of bias indices for subgroups of the same race that differed only in income level or grade level. Using this criterion, three items were consistently identified as biased in four independent comparisons of subgroups of black and white students. Comparisons of content and format characteristics of items that were identified as biased with those that were not, or between items biased in different directions, did not lead to the identification of any systematic content differences. The study did provide strong support for the viability of the estimation procedure; item characteristics estimated with samples from different populations were very similar. Some suggestions for improvements in methodology are offered.
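The bias indices in this study are built from differences between subgroup item characteristic curves. The sketch below computes one simple member of that family, the mean absolute ICC difference over an ability grid under the three-parameter logistic model; both the exact index definition and all parameter values here are illustrative assumptions, not the study's formulas or estimates.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """Item characteristic curve under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def icc_difference_index(item_params_1, item_params_2, lo=-3.0, hi=3.0, n=601):
    """Mean absolute difference between two subgroups' estimated ICCs for a
    single item, averaged over an evenly spaced ability grid."""
    theta = np.linspace(lo, hi, n)
    p1 = icc_3pl(theta, *item_params_1)
    p2 = icc_3pl(theta, *item_params_2)
    return float(np.mean(np.abs(p1 - p2)))

# Illustrative (a, b, c) estimates for the same item in two subgroups.
group_1 = (1.1, 0.00, 0.18)
group_2 = (1.1, 0.45, 0.18)   # same shape, but the item is harder for subgroup 2
print(f"bias index = {icc_difference_index(group_1, group_2):.3f}")
```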
Measuring the difference between two models (1992)
Levine, Michael V.; Drasgow, Fritz; Williams, Bruce; McCusker, Christopher; Thomasson, Gary L.

Two psychometric models with very different parametric formulas and item response functions can make virtually the same predictions in all applications. By applying some basic results from the theory of hypothesis testing and from signal detection theory, the power of the most powerful test for distinguishing the models can be computed. Measuring model misspecification by computing the power of the most powerful test is proposed. If the power of the most powerful test is low, then the two models will make nearly the same prediction in every application. If the power is high, there will be applications in which the models will make different predictions. This measure, the power of the most powerful test, places various types of model misspecification (item parameter estimation error, multidimensionality, local independence failure, and learning or fatigue during testing) on a common scale. The theory supporting the method is presented and illustrated with a systematic study of misspecification due to item response function estimation error. In these studies, two joint maximum likelihood estimation methods (LOGIST 2B and LOGIST 5) and two marginal maximum likelihood estimation methods (BILOG and ForScore) were contrasted by measuring the difference between a simulation model and a model obtained by applying an estimation method to simulation data. Marginal estimation was found generally to be superior to joint estimation. The parametric marginal method (BILOG) was superior to the nonparametric method only for three-parameter logistic models. The nonparametric marginal method (ForScore) excelled for more general models. Of the two joint maximum likelihood methods studied, LOGIST 5 appeared to be more accurate than LOGIST 2B. Index terms: BILOG; forced-choice experiment; ForScore; ideal observer method; item response theory, estimation, models; LOGIST; multilinear formula score theory.
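A minimal Monte Carlo sketch of the core idea follows: it distinguishes a "true" simulation model from an "estimated" model with the Neyman-Pearson log likelihood ratio and reads off power at a fixed false-alarm rate. It assumes 2PL models, a standard normal ability distribution, and simple grid quadrature; the paper's ideal-observer computations are more refined, and all parameter values below are invented for illustration.

```python
import numpy as np

D = 1.7
THETA = np.linspace(-4.0, 4.0, 81)                 # ability quadrature grid
W = np.exp(-0.5 * THETA**2)
W /= W.sum()                                       # N(0,1) quadrature weights

def grid_probs(a, b):
    """2PL correct-response probabilities: items x grid points."""
    return 1.0 / (1.0 + np.exp(-D * a[:, None] * (THETA[None, :] - b[:, None])))

def marginal_logprob(U, P):
    """Log marginal probability of each response pattern in U (n x items),
    integrating ability out against the N(0,1) weights."""
    L = U @ np.log(P) + (1 - U) @ np.log(1.0 - P)  # n x grid
    m = L.max(axis=1, keepdims=True)
    return m.ravel() + np.log(np.exp(L - m) @ W)

def simulate(n, a, b, rng):
    """Draw n response patterns with theta ~ N(0,1)."""
    theta = rng.standard_normal(n)
    p = 1.0 / (1.0 + np.exp(-D * a[None, :] * (theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(1)
n_items = 30
a1 = rng.uniform(0.8, 2.0, n_items)
b1 = rng.uniform(-2.0, 2.0, n_items)
a2 = a1 * rng.normal(1.0, 0.15, n_items)           # "estimated" parameters:
b2 = b1 + rng.normal(0.0, 0.15, n_items)           # true values plus error

P1, P2 = grid_probs(a1, b1), grid_probs(a2, b2)
u1 = simulate(5000, a1, b1, rng)                   # data from the true model
u2 = simulate(5000, a2, b2, rng)                   # data from the estimated model

# Most powerful test statistic (Neyman-Pearson): the log likelihood ratio.
llr1 = marginal_logprob(u1, P2) - marginal_logprob(u1, P1)
llr2 = marginal_logprob(u2, P2) - marginal_logprob(u2, P1)
cut = np.quantile(llr1, 0.95)                      # 5% false-alarm rate
print(f"power of the most powerful test at alpha = .05: {np.mean(llr2 > cut):.3f}")
```

Low power here means the estimation error is practically harmless: no test can reliably tell the two models apart from item responses alone.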
Modeling incorrect responses to multiple-choice items with multilinear formula score theory (1989)
Drasgow, Fritz; Levine, Michael V.; Williams, Bruce; McLaughlin, Mary E.; Candell, Gregory L.

Multilinear formula score theory (Levine, 1984, 1985, 1989a, 1989b) provides powerful methods for addressing important psychological measurement problems. In this paper, a brief review of multilinear formula scoring (MFS) is given, with specific emphasis on estimating option characteristic curves (OCCs). MFS was used to estimate OCCs for the Arithmetic Reasoning subtest of the Armed Services Vocational Aptitude Battery. A close match was obtained between empirical proportions of option selection for examinees in 25 ability intervals and the modeled probabilities of option selection. In a second analysis, accurately estimated OCCs were obtained for simulated data. To evaluate the utility of modeling incorrect responses to the Arithmetic Reasoning test, the amounts of statistical information about ability were computed for dichotomous and polychotomous scorings of the items. Consistent with earlier studies, moderate gains in information were obtained for low to slightly above average abilities. Index terms: item response theory, marginal maximum likelihood estimation, maximum likelihood estimation, multilinear formula scoring, option characteristic curves, polychotomous measurement, test information function.

Optimal detection of certain forms of inappropriate test scores (1986)
Drasgow, Fritz; Levine, Michael V.

Optimal appropriateness indices, recently introduced by Levine and Drasgow (1984), provide the highest rates of detection of aberrant response patterns that can be obtained from item responses. In this article they are used to study three important problems in appropriateness measurement. First, the maximum detection rates of two particular forms of aberrance are determined for a long unidimensional test. These detection rates are shown to be moderately high. Second, two versions of the standardized l0 appropriateness index are compared to optimal indices. At low false alarm rates, one standardized l0 index has detection rates that are about 65% as large as optimal for spuriously high (cheating) test scores. However, for the spuriously low scores expected from persons with ill-advised testing strategies or reading problems, both standardized l0 indices are far from optimal. Finally, detection rates for polychotomous and dichotomous scorings of the item responses are compared. It is shown that dichotomous scoring causes serious decreases in the detectability of some aberrant response patterns. Consequently, appropriateness measurement constitutes one practical testing problem in which significant gains result from the use of a polychotomous item response model.
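To make the notion of an optimal index concrete, the toy sketch below forms the log likelihood ratio of an aberrance model to a normal model (the most powerful index for that pair of hypotheses, by the Neyman-Pearson lemma) and measures detection of simulated spuriously high patterns at a fixed false-alarm rate. The 2PL normal model and the "random copying" aberrance model are simplifying assumptions for illustration; they are not the models analyzed by Levine and Drasgow.

```python
import numpy as np

D = 1.7
THETA = np.linspace(-4.0, 4.0, 81)
W = np.exp(-0.5 * THETA**2)
W /= W.sum()                                        # N(0,1) quadrature weights

def marginal_logprob(U, P):
    """Log marginal probability of each pattern in U (n x items), where P holds
    item x grid response probabilities and ability is integrated out."""
    L = U @ np.log(P) + (1 - U) @ np.log(1.0 - P)
    m = L.max(axis=1, keepdims=True)
    return m.ravel() + np.log(np.exp(L - m) @ W)

rng = np.random.default_rng(2)
n_items = 40
a = rng.uniform(0.8, 2.0, n_items)
b = rng.uniform(-2.0, 2.0, n_items)
P_normal = 1.0 / (1.0 + np.exp(-D * a[:, None] * (THETA[None, :] - b[:, None])))
gamma = 0.3                                         # chance an answer is copied
P_cheat = gamma + (1.0 - gamma) * P_normal          # toy spurious-high model

def optimal_index(U):
    """Log likelihood ratio of the aberrance model to the normal model."""
    return marginal_logprob(U, P_cheat) - marginal_logprob(U, P_normal)

def simulate(n, cheat):
    theta = rng.standard_normal(n)
    p = 1.0 / (1.0 + np.exp(-D * a[None, :] * (theta[:, None] - b[None, :])))
    if cheat:
        p = gamma + (1.0 - gamma) * p               # copying inflates success
    return (rng.random(p.shape) < p).astype(int)

u_norm, u_cheat = simulate(4000, False), simulate(4000, True)
cut = np.quantile(optimal_index(u_norm), 0.95)      # 5% false-alarm rate
print(f"detection rate: {np.mean(optimal_index(u_cheat) > cut):.3f}")
```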