Browsing by Author "Drasgow, Fritz"
Now showing 1 - 18 of 18
Item
Application of unidimensional item response theory models to multidimensional data (1983)
Drasgow, Fritz; Parsons, Charles K.
A simulation model was developed for generating item responses from a multidimensional latent trait space. The model permits the prepotency of a general latent trait underlying responses to all simulated items to be varied systematically. Five levels of prepotency were used to generate data sets; the levels ranged from a truly unidimensional latent trait space to a very weak general latent trait. Simulated item pools with guessing and without guessing were analyzed by the LOGIST computer program. The general latent trait was recovered in data sets where the prepotency of the general latent trait was only moderate. Consequently, it appears that item response theory models can be applied to moderately heterogeneous item pools under the conditions simulated here.

Item
Appropriateness measurement for some multidimensional test batteries (1991)
Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.
Model-based methods for the detection of individuals inadequately measured by a test have generally been limited to unidimensional tests. Extensions of unidimensional appropriateness indices are developed here for multi-unidimensional tests (i.e., multidimensional tests composed of unidimensional subtests). Simulated and real data were used to evaluate the effectiveness of the multitest appropriateness indices. Very high rates of detection of spuriously high and spuriously low response patterns were obtained with the simulated data. These detection rates were comparable to rates obtained for long unidimensional tests (both simulated and real) with approximately the same number of items. For real data, similarly high detection rates were obtained in the spuriously high condition; slightly lower detection rates were observed for the spuriously low condition. Several directions for future research are described. Index terms: appropriateness measurement, item response theory, multidimensional tests, optimal appropriateness measurement, polychotomous measurement.

Item
Choice of test model for appropriateness measurement (1982)
Drasgow, Fritz
Several theoretical and empirical issues that must be addressed before appropriateness measurement can be used by practitioners are investigated in this paper. These issues include selection of a latent trait model for multiple-choice tests, selection of a particular appropriateness index, and the sample size required for parameter estimation. The three-parameter logistic model is found to provide better detection of simulated spuriously low examinees than the Rasch model for the Graduate Record Examination, Verbal Section. All three appropriateness indices proposed by Levine and Rubin (1979) provide good detection of simulated spuriously low examinees but poor detection of simulated spuriously high examinees. A reason for this discrepancy is provided.

Item
Detecting faking on a personality instrument using appropriateness measurement (1996)
Zickar, Michael J.; Drasgow, Fritz
Research has demonstrated that people can and often do consciously manipulate scores on personality tests. Test constructors have responded by using social desirability and lying scales to identify dishonest respondents. Unfortunately, these approaches have had limited success. This study evaluated the use of appropriateness measurement for identifying dishonest respondents. A dataset was analyzed in which respondents were instructed either to answer honestly or to fake good. The item response theory approach correctly classified more faking respondents at low rates of misclassification of honest respondents (false positives) than did a social desirability scale. At higher false positive rates, the social desirability approach did slightly better. Implications for operational testing and suggestions for further research are provided. Index terms: appropriateness measurement, detecting faking, item response theory, lying scales, person fit, personality measurement.
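Several of the entries above (appropriateness measurement, person fit, detecting faking) rest on the same core computation: how unlikely is a response pattern given the model and the examinee's ability? Below is a minimal sketch of the standardized log-likelihood person-fit index, often written lz, assuming the two-parameter logistic model; the item parameters and the response pattern are invented for illustration, and the specific indices compared in these papers vary.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def lz_index(responses, theta, a, b):
    """Standardized log-likelihood person-fit index (lz).

    Large negative values flag response patterns that are unlikely
    given the ability estimate (e.g., spuriously high or low scores).
    """
    p = p_2pl(theta, a, b)
    q = 1.0 - p
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(q))
    expected = np.sum(p * np.log(p) + q * np.log(q))
    variance = np.sum(p * q * np.log(p / q) ** 2)
    return (l0 - expected) / np.sqrt(variance)

# Hypothetical 10-item example: an examinee of average ability who
# misses the easiest items but answers the hardest items correctly.
a = np.ones(10)                        # discriminations
b = np.linspace(-2, 2, 10)             # difficulties, easy to hard
aberrant = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(lz_index(aberrant, theta=0.0, a=a, b=b))   # strongly negative
```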
Item
Detecting inappropriate test scores with optimal and practical appropriateness indices (1987)
Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.
Several statistics have been proposed as quantitative indices of the appropriateness of a test score as a measure of ability. Two criteria have been used to evaluate such indices in previous research. The first criterion, standardization, refers to the extent to which the conditional distributions of an index, given ability, are invariant across ability levels. The second criterion, relative power, refers to indices' relative effectiveness for detecting inappropriate test scores. In this paper the effectiveness of nine appropriateness indices is determined in an absolute sense by comparing them to optimal indices; an optimal index is the most powerful index for a particular form of aberrance that can be computed from item responses. Three indices were found to provide nearly optimal rates of detection of very low ability response patterns modified to simulate cheating, as well as very high ability response patterns modified to simulate spuriously low responding. Optimal indices had detection rates from 50% to 200% higher than any other index when average ability response vectors were manipulated to appear spuriously high and spuriously low.

Item
Distinguishing among parametric item response models for polychotomous ordered data (1994)
Maydeu-Olivares, Albert; Drasgow, Fritz; Mead, Alan D.
Several item response models have been proposed for fitting Likert-type data. Thissen & Steinberg (1986) classified most of these models into difference models and divide-by-total models. Although they have different mathematical forms, divide-by-total and difference models with the same number of parameters seem to provide very similar fit to the data. The ideal observer method was used to compare two models with the same number of parameters, Samejima's (1969) graded response model (a difference model) and Thissen & Steinberg's (1986) extension of Masters' (1982) partial credit model (a divide-by-total model), to investigate whether difference models or divide-by-total models should be preferred for fitting Likert-type data. The models were found to be very similar under the conditions investigated, which included scale lengths from 5 to 25 items (five-option items were used) and calibration samples of 250 to 3,000. The results suggest that both models fit approximately equally well in most practical applications. Index terms: graded response model, IRT, Likert scales, partial credit model, polychotomous models, psychometrics.

Item
Estimators of the squared cross-validity coefficient: A Monte Carlo investigation (1979)
Drasgow, Fritz; Dorans, Neil J.; Tucker, Ledyard R.
A Monte Carlo experiment was used to evaluate four procedures for estimating the population squared cross-validity of a sample least squares regression equation. Four levels of population squared multiple correlation (Rp²) and three levels of number of predictors (n) were factorially crossed to produce 12 population covariance matrices. Random samples at four levels of sample size (N) were drawn from each population. The levels of N, n, and Rp² were carefully selected to ensure the relevance of the simulation results for much applied research. The least squares regression equation from each sample was applied in its respective population to obtain the actual population squared cross-validity (Rcv²). Estimates of Rcv² were computed using three formula estimators and the double cross-validation procedure. The results of the experiment demonstrate that two estimators that had previously been advocated in the literature were negatively biased and exhibited poor accuracy. The negative bias of these two estimators increased as Rp² decreased and as the ratio of N to n decreased. As a consequence, their biases were most evident in small samples, where cross-validation is imperative. In contrast, the third estimator was quite accurate and virtually unbiased within the scope of this simulation. This third estimator is recommended for applied settings that are adequately approximated by the correlation model.
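The abstract above does not name its three formula estimators, so the sketch below is illustrative only: it shows two formulas commonly discussed in this literature, a Wherry-style adjustment of the sample R² and Browne's (1975) estimator of the squared cross-validity, and how shrinkage grows as the ratio of sample size N to number of predictors n falls.

```python
def wherry_adjusted_r2(r2, N, n):
    """Wherry-style adjustment of the sample squared multiple
    correlation for sample size N and number of predictors n."""
    return 1.0 - (N - 1) / (N - n - 1) * (1.0 - r2)

def browne_cross_validity(r2, N, n):
    """Browne's (1975) formula estimator of the population squared
    cross-validity, computed from an adjusted estimate of Rp^2."""
    rho2 = max(wherry_adjusted_r2(r2, N, n), 0.0)
    return ((N - n - 3) * rho2**2 + rho2) / ((N - 2 * n - 2) * rho2 + n)

# Shrinkage is most severe when N is small relative to n:
print(browne_cross_validity(r2=0.30, N=60, n=10))   # well below 0.30
print(browne_cross_validity(r2=0.30, N=600, n=10))  # much closer to 0.30
```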
Item
An evaluation of marginal maximum likelihood estimation for the two-parameter logistic model (1989)
Drasgow, Fritz
The accuracy of marginal maximum likelihood estimates of the item parameters of the two-parameter logistic model was investigated. Estimates were obtained for four sample sizes and four test lengths; joint maximum likelihood estimates were also computed for the two longer test lengths. Each condition was replicated 10 times, which allowed evaluation of the accuracy of estimated item characteristic curves, item parameter estimates, and estimated standard errors of item parameter estimates for individual items. Items typical of a widely used job satisfaction scale and of moderately easy tests had satisfactory marginal estimates for all sample sizes and test lengths. Larger samples were required for items with extreme difficulty or discrimination parameters. Marginal estimation was substantially better than joint maximum likelihood estimation. Index terms: Fletcher-Powell algorithm, item parameter estimation, item response theory, joint maximum likelihood estimation, marginal maximum likelihood estimation, two-parameter logistic model.

Item
Fitting polytomous item response theory models to multiple-choice tests (1995)
Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D.
This study examined how well current software implementations of four polytomous item response theory models fit several multiple-choice tests. The models were Bock's (1972) nominal model, Samejima's (1979) multiple-choice Model C, Thissen & Steinberg's (1984) multiple-choice model, and Levine's (1993) maximum-likelihood formula scoring model. The parameters of the first three models were estimated with Thissen's (1986) MULTILOG computer program; Williams & Levine's (1993) FORSCORE program was used for Levine's model. Tests from the Armed Services Vocational Aptitude Battery, the Scholastic Aptitude Test, and the American College Test Assessment were analyzed. The models were fit in estimation samples of approximately 3,000; cross-validation samples of approximately 3,000 were used to evaluate goodness of fit. Both fit plots and χ² statistics were used to determine the adequacy of fit. Bock's model provided surprisingly good fit; adding parameters to the nominal model did not yield improvements in fit. FORSCORE provided generally good fit for Levine's nonparametric model across all tests. Index terms: Bock's nominal model, FORSCORE, maximum likelihood formula scoring, MULTILOG, polytomous IRT.
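Bock's (1972) nominal model, which the entry above found to fit surprisingly well, gives each response option its own slope and intercept and turns them into choice probabilities with a softmax. A minimal sketch; the option parameters below are invented for illustration (they satisfy the usual sum-to-zero identification constraint).

```python
import numpy as np

def nominal_model_probs(theta, a, c):
    """Bock's (1972) nominal model: P_k(theta) is a softmax over
    option-specific slopes a_k and intercepts c_k."""
    z = a * theta + c
    ez = np.exp(z - z.max())          # subtract max for numerical stability
    return ez / ez.sum()

# Hypothetical 4-option item; slopes and intercepts each sum to zero.
a = np.array([1.2, 0.2, -0.5, -0.9])  # one slope per option
c = np.array([0.5, 0.0, 0.3, -0.8])   # one intercept per option
for theta in (-2.0, 0.0, 2.0):
    print(theta, nominal_model_probs(theta, a, c).round(3))
```

At high θ the option with the largest slope dominates; at low θ the options with negative slopes absorb most of the probability, which is how the model captures the differential attractiveness of wrong answers.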
Item
Introduction to the Polytomous IRT Special Issue (1995)
Drasgow, Fritz

Item
An iterative procedure for linking metrics and assessing item bias in item response theory (1988)
Candell, Gregory L.; Drasgow, Fritz
The presence of biased items may seriously affect methods used to link metrics in item response theory. An iterative procedure designed to minimize this methodological problem was examined in a Monte Carlo investigation using the two-parameter item response model. The iterative procedure links the scales of independently calibrated parameter estimates using only those items identified as unbiased. Two methods for transforming parameter estimates to a common metric were incorporated into the iterative procedure. The first method links scales by equating the first two moments of the distributions of estimated item difficulties. The second method determines the linking transformation by minimizing differences between estimated item characteristic curves. Results indicate that iterative linking provides a substantial improvement in item bias detection over the noniterative approach. Index terms: item bias, item response theory, iterative method, linking, metric linking, two-parameter item response model.

Item
Lord's chi-square test of item bias with estimated and with known person parameters (1987)
McLaughlin, Mary E.; Drasgow, Fritz
Properties of Lord's chi-square test of item bias were studied in a computer simulation. θ parameters were drawn from a standard normal distribution, and responses to a 50-item test were generated using SAT-V item parameters estimated by Lord. One hundred independent samples were generated under each of the four combinations of two sample sizes (N = 1,000 and N = 250) and two logistic models (two- and three-parameter). LOGIST was used to estimate item and person parameters simultaneously. For each of the 50 items, 50 independent chi-square tests of the equality of item parameters were calculated. Proportions of significant chi-squares were calculated over items and samples at alpha levels of .0005, .001, .005, .01, .05, and .10. The overall proportions significant were as high as 11 times the nominal alpha level. The proportion significant for some items was as high as .32 when the nominal alpha level was .05. When person parameters were held fixed at their true values and only item parameters were estimated, the actual rejection rates were close to the nominal rates.

Item
Modeling incorrect responses to multiple-choice items with multilinear formula score theory (1989)
Drasgow, Fritz; Levine, Michael V.; Williams, Bruce; McLaughlin, Mary E.; Candell, Gregory L.
Multilinear formula score theory (Levine, 1984, 1985, 1989a, 1989b) provides powerful methods for addressing important psychological measurement problems. In this paper, a brief review of multilinear formula scoring (MFS) is given, with specific emphasis on estimating option characteristic curves (OCCs). MFS was used to estimate OCCs for the Arithmetic Reasoning subtest of the Armed Services Vocational Aptitude Battery. A close match was obtained between the empirical proportions of option selection for examinees in 25 ability intervals and the modeled probabilities of option selection. In a second analysis, accurately estimated OCCs were obtained for simulated data. To evaluate the utility of modeling incorrect responses to the Arithmetic Reasoning test, the amounts of statistical information about ability were computed for dichotomous and polychotomous scorings of the items. Consistent with earlier studies, moderate gains in information were obtained for low to slightly above average abilities. Index terms: item response theory, marginal maximum likelihood estimation, maximum likelihood estimation, multilinear formula scoring, option characteristic curves, polychotomous measurement, test information function.
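The iterative procedure described above can be sketched directly: link the two difficulty metrics, flag items that still disagree after linking, and re-link from the remaining items until membership stabilizes. The sketch below assumes the mean/sigma version of the first linking method; the fixed threshold is a simple stand-in for the significance tests an actual item-bias analysis would use.

```python
import numpy as np

def mean_sigma_link(b_ref, b_foc):
    """Mean/sigma linking: slope A and intercept B that put the focal
    group's difficulty estimates on the reference group's metric."""
    A = b_ref.std() / b_foc.std()
    B = b_ref.mean() - A * b_foc.mean()
    return A, B

def iterative_link(b_ref, b_foc, threshold=0.5, max_iter=10):
    """Iteratively re-link using only items whose linked difficulties
    stay within `threshold` of the reference values (a stand-in for a
    formal item-bias test)."""
    keep = np.ones(len(b_ref), dtype=bool)
    A = B = None
    for _ in range(max_iter):
        A, B = mean_sigma_link(b_ref[keep], b_foc[keep])
        new_keep = np.abs(A * b_foc + B - b_ref) < threshold
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return A, B, keep        # ~keep marks suspected biased items

# Example with one deliberately biased item (the last one):
rng = np.random.default_rng(0)
b_ref = rng.normal(0.0, 1.0, 20)
b_foc = (b_ref - 0.3) / 1.2           # focal metric: shifted and rescaled
b_foc[-1] += 1.0                      # bias injected into item 20
A, B, keep = iterative_link(b_ref, b_foc)
print(A, B, np.where(~keep)[0])       # recovers A ~ 1.2, B ~ 0.3; flags the biased item
```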
Item
Optimal detection of certain forms of inappropriate test scores (1986)
Drasgow, Fritz; Levine, Michael V.
Optimal appropriateness indices, recently introduced by Levine and Drasgow (1984), provide the highest rates of detection of aberrant response patterns that can be obtained from item responses. In this article they are used to study three important problems in appropriateness measurement. First, the maximum detection rates of two particular forms of aberrance are determined for a long unidimensional test. These detection rates are shown to be moderately high. Second, two versions of the standardized l0 appropriateness index are compared to optimal indices. At low false alarm rates, one standardized l0 index has detection rates that are about 65% as large as optimal for spuriously high (cheating) test scores. However, for the spuriously low scores expected from persons with ill-advised testing strategies or reading problems, both standardized l0 indices are far from optimal. Finally, detection rates for polychotomous and dichotomous scorings of the item responses are compared. It is shown that dichotomous scoring causes serious decreases in the detectability of some aberrant response patterns. Consequently, appropriateness measurement constitutes one practical testing problem in which significant gains result from the use of a polychotomous item response model.

Item
Paradoxes, contradictions, and illusions (1989)
Humphreys, Lloyd G.; Drasgow, Fritz
There is no contradiction between a powerful significance test based on a difference score and the necessity for reliable measurement of the dependent measure in a controlled experiment. In fact, the former requires the latter. In this paper we review the conclusions that were drawn by Humphreys and Drasgow (1989) and show that Overall's (1989) "contradiction" is an illusion derived from imprecise language. Index terms: analysis of covariance, baseline correction, control of individual differences, difference scores, measurement of change, reliability of the marginal distribution, statistical power, within-group reliabilities.

Item
Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study (1982)
Hulin, Charles L.; Lissak, Robin I.; Drasgow, Fritz
This Monte Carlo study assessed the accuracy of simultaneous estimation of item and person parameters in item response theory. Item responses were simulated using the two- and three-parameter logistic models. Samples of 200, 500, 1,000, and 2,000 simulated examinees and tests of 15, 30, and 60 items were generated. Item and person parameters were then estimated using the appropriate model. The root mean squared error between recovered and actual item characteristic curves served as the principal measure of estimation accuracy for items. The accuracy of ability estimates was assessed by both correlation and root mean squared error. The results indicate that minimum sample sizes and test lengths depend on the response model and the purposes of an investigation. With item responses generated by the two-parameter model, tests of 30 items and samples of 500 appear adequate for some purposes. Estimates of ability and item parameters were less accurate in small samples when item responses were generated by the three-parameter logistic model. Here, samples of 1,000 examinees with tests of 60 items seem to be required for highly accurate estimation. Tradeoffs between sample size and test length are apparent, however.
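The recovery criterion used above, root mean squared error between true and recovered item characteristic curves, is easy to make concrete. A minimal sketch under the three-parameter logistic model; the parameter values, the θ grid, and the scaling constant D = 1.7 are illustrative choices, and studies of this kind often weight the grid by the ability distribution.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC with guessing parameter c."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def icc_rmse(theta_grid, true_params, est_params):
    """Root mean squared difference between the true and recovered
    ICCs, averaged over a grid of ability values."""
    true_curve = icc_3pl(theta_grid, *true_params)
    est_curve = icc_3pl(theta_grid, *est_params)
    return np.sqrt(np.mean((true_curve - est_curve) ** 2))

theta_grid = np.linspace(-3, 3, 61)
true_item = (1.2, 0.5, 0.20)          # a, b, c (illustrative values)
est_item = (1.0, 0.6, 0.25)           # a hypothetical calibration result
print(icc_rmse(theta_grid, true_item, est_item))
```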
Item
Robustness of estimators of the squared multiple correlation and squared cross-validity coefficient to violations of multivariate normality (1982)
Drasgow, Fritz; Dorans, Neil J.
A Monte Carlo experiment was conducted to evaluate the robustness of two estimators of the population squared multiple correlation (Rp²) and one estimator of the population squared cross-validity coefficient (Rcv²) to a common violation of multivariate normality. Previous research has shown that these estimators are approximately unbiased when independent and dependent variables follow a joint multivariate normal distribution. The particular violation of multivariate normality studied here consisted of a dependent variable that may assume only a few discrete values. The discrete dependent variable was simulated by categorizing an underlying continuous variable that did satisfy the multivariate normality condition. Results illustrate the attenuating effects of categorization on Rp² and Rcv². In addition, the distributions of sample squared multiple correlations and sample squared cross-validity coefficients are affected by categorization mainly through the attenuation of Rp² and Rcv². Consequently, the formula estimators of Rp² and Rcv² were found to be as accurate and unbiased with discrete dependent variables as they were with continuous dependent variables. Substantive researchers who use categorical dependent variables, perhaps obtained from rating scale judgments, can justifiably employ any of the three estimators examined here.

Item
Some comments on the relation between reliability and statistical power (1989)
Humphreys, Lloyd G.; Drasgow, Fritz
Several articles have discussed the curious fact that a difference score with zero reliability can nonetheless allow a powerful test of change. This statistical legerdemain should not be overemphasized, for three reasons. First, although the reliability of the difference score may be unrelated to power, the reliabilities of the variables used to create the difference scores are directly related to the power of the test. Second, with what some will regard as additional legerdemain, it is possible to define reliability in the context of a difference score in such a way that power is a direct function of reliability. The third and most serious objection to the conclusion that the reliability of a difference score is unimportant is that the underlying statistical model used in its derivation is rarely appropriate for psychological data. Index terms: control of individual differences, difference scores, reliability, reliability of the marginal distribution, statistical power, within-group reliabilities.
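The "curious fact" in the last entry can be reproduced from the classical formula for the reliability of a difference of two equally reliable, equal-variance measures. The numbers below are illustrative only.

```python
def difference_reliability(rxx, ryy, rxy):
    """Reliability of the difference X - Y for two equal-variance
    measures with reliabilities rxx, ryy and intercorrelation rxy."""
    return (rxx + ryy - 2 * rxy) / (2 * (1 - rxy))

# Highly reliable components whose correlation equals their reliability:
print(difference_reliability(0.80, 0.80, 0.80))  # 0.0: worthless difference?
print(difference_reliability(0.80, 0.80, 0.40))  # ~0.67: same components
```

With rxx = ryy = .80 the error variance of X − Y is the same in both cases; only the true-score variance of the difference changes. That is why a test of change can remain powerful while the difference score's reliability drops to zero, and it is the first of the three points made in the abstract above.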