Applied Psychological Measurement, Volume 19, 1995
Persistent link for this collection: https://hdl.handle.net/11299/114835
Now showing items 1-20 of 27, sorted by issue date.
Item: An alternative approach for IRT observed-score equating of number-correct scores (1995)
Zeng, Lingjia; Kolen, Michael J.
An alternative approach for item response theory observed-score equating is described. The number-correct score distributions needed in equating are found by numerical integration over the theoretical or empirical distributions of examinees’ traits. The item response theory true-score equating method and the observed-score equating method described by Lord, in which the number-correct score distributions are summed over a sample of trait estimates, are compared in a real test example. In a computer simulation, the observed-score equating methods based on numerical integration and summation were compared using data generated from standard normal and skewed populations. The method based on numerical integration was found to be less biased, especially at the two ends of the score distribution. This method can be implemented without the need to estimate trait levels for individual examinees, and it is less computationally intensive than the method based on summation. Index terms: equating, item response theory, numerical integration, observed-score equating.

Item: Fitting polytomous item response theory models to multiple-choice tests (1995)
Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D.
This study examined how well current software implementations of four polytomous item response theory models fit several multiple-choice tests. The models were Bock’s (1972) nominal model, Samejima’s (1979) multiple-choice Model C, Thissen and Steinberg’s (1984) multiple-choice model, and Levine’s (1993) maximum-likelihood formula scoring model. The parameters of the first three of these models were estimated with Thissen’s (1986) MULTILOG computer program; Williams and Levine’s (1993) FORSCORE program was used for Levine’s model.
Tests from the Armed Services Vocational Aptitude Battery, the Scholastic Aptitude Test, and the American College Test Assessment were analyzed. The models were fit in estimation samples of approximately 3,000; cross-validation samples of approximately 3,000 were used to evaluate goodness of fit. Both fit plots and χ² statistics were used to determine the adequacy of fit. Bock’s model provided surprisingly good fit; adding parameters to the nominal model did not yield improvements in fit. FORSCORE provided generally good fit for Levine’s nonparametric model across all tests. Index terms: Bock’s nominal model, FORSCORE, maximum likelihood formula scoring, MULTILOG, polytomous IRT.

Item: Complex composites: Issues that arise in combining different modes of assessment (1995)
Wilson, Mark; Wang, Wen-chung
Data from the California Learning Assessment System are used to examine certain characteristics of tests designed as composites of items of different modes. The characteristics include rater severity, test information, and definition of the latent variable. Three different assessment modes (multiple-choice, open-ended, and investigation items; the latter two are referred to as performance-based modes) were combined in a test across three different test forms. Rater severity was investigated by incorporating a rater parameter for each rater in an item response model that then was used to analyze the data. Some rater severities were found to be quite extreme, and the impact of this variation in rater severities on both total scores and trait level estimates was examined. Within-rater variation in rater severity also was examined and was found to be significant. The information contribution of the three modes was compared. Performance-based items provided more information than multiple-choice items and also provided the greatest precision at higher levels of the latent variable.
A projection-like method was applied to investigate the effects of assessment mode on the definition of the latent variable. The multiple-choice items added information to the performance-based variable. The results of the projection-like method did not differ practically from the results obtained when the latent trait was defined jointly by both the multiple-choice and the performance-based items. Index terms: equating, linking, multiple assessment modes, polytomous item response models, rater effects.

Item: Distinctive and incompatible properties of two common classes of IRT models for graded responses (1995)
Andrich, David
Two classes of models for graded responses, the first based on the work of Thurstone and the second based on the work of Rasch, are juxtaposed and shown to satisfy important, but mutually incompatible, criteria and to reflect different response processes. Specifically, in the Thurstone models, if adjacent categories are joined to form a new category, either before or after the data are collected, then the probability of a response in the new category is the sum of the probabilities of the responses in the original categories. However, the model does not have the explicit property that if the categories are so joined, the estimate of the location of the entity or object being measured is invariant before and after the joining. For the Rasch models, if a pair of adjacent categories is joined and then the data are collected, the estimate of the location of the entity is the same before and after the joining, but the probability of a response in the new category is not the sum of the probabilities of the responses in the original categories. Furthermore, if data satisfy the model and the categories are joined after the data are collected, then the data no longer satisfy the same Rasch model with the smaller number of categories.
These differences imply that the choice between these two classes of models for graded responses is not simply a matter of preference; they also permit a better understanding of the choice of models for graded response data as a function of the underlying processes they are intended to represent. Index terms: graded responses, joining assumption, polytomous IRT models, Rasch model, Thurstone model.

Item: Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model (1995)
Hemker, Bas T.; Sijtsma, Klaas; Molenaar, Ivo W.
An automated item selection procedure for selecting unidimensional scales of polytomous items from multidimensional datasets is developed for use in the context of the Mokken item response theory model of monotone homogeneity (Mokken & Lewis, 1982). The selection procedure is directly based on the selection procedure proposed by Mokken (1971, p. 187) and relies heavily on the scalability coefficient H (Loevinger, 1948; Molenaar, 1991). New theoretical results relating the latent model structure to H are provided. The item selection procedure requires selection of a lower bound for H. A simulation study determined ranges of H for which the unidimensional item sets were retrieved from multidimensional datasets. If multidimensionality is suspected in an empirical dataset, well-chosen lower bound values can be used effectively to detect the unidimensional scales. Index terms: item response theory, Mokken model, multidimensional item banks, nonparametric item response models, scalability coefficient H, test construction, unidimensional scales.

Item: Analyzing homogeneity and heterogeneity of change using Rasch and latent class models: A comparative and integrative approach (1995)
Meiser, Thorsten; Hein-Eggers, Monika; Rompe, Pamela; Rudinger, Georg
The application of unidimensional Rasch models to longitudinal data assumes homogeneity of change over persons.
Using latent class models, several classes with qualitatively distinct patterns of development can be taken into account; thus, heterogeneity of change is assumed. The mixed Rasch model integrates both the Rasch and the latent class approach by dividing the population of persons into classes that conform to Rasch models with class-specific parameters. Thus, qualitatively different patterns of change can be modeled, with the homogeneity assumption retained within each class but not between classes. In contrast to the usual latent class approach, the mixed Rasch model includes a quantitative differentiation among persons in the same class. Thus, quantitative differences in the level of the latent attribute are disentangled from the qualitative shape of development. A theoretical comparison of the formal approaches is presented here, as well as an application to empirical longitudinal data. In the context of personality development in childhood and early adolescence, the existence of different developmental trajectories is demonstrated for two aspects of personality. Relations between the latent trajectories and discrete exogenous variables are investigated. Index terms: latent class analysis, latent structure analysis, measurement of change, mixture distribution models, Rasch model, rating scale model.

Item: A minimum χ² method for equating tests under the graded response model (1995)
Kim, Seock-Ho; Cohen, Allan S.
The minimum χ² method for computing equating coefficients for tests with dichotomously scored items was extended to the case of Samejima’s graded response items. The minimum χ² method was compared with the test response function method (also referred to as the test characteristic curve method), in which the equating coefficients were obtained by matching the test response functions of the two tests.
The minimum χ² method was much less demanding computationally and yielded equating coefficients that differed little from those obtained using the test response function approach. Index terms: equating, graded response model, item response theory, minimum χ² method, test response function method.

Item: The effects of correlated errors on generalizability and dependability coefficients (1995)
Bost, James E.
This study investigated the effects of correlated errors on the person × occasion design, in which the confounding effect of equal time intervals results in correlated error terms in the linear model. Two specific error correlation structures were examined: the first-order stationary autoregressive (SAR1), and the first-order nonstationary autoregressive (NAR1) with increasing variance parameters. The effects of correlated errors on the existing generalizability and dependability coefficients were assessed by simulating data with known variances (six different combinations of person, occasion, and error variances), occasion sizes, person sizes, correlation parameters, and increasing variance parameters. Estimates derived from the simulated data were compared to their true values. The traditional estimates were acceptable when the error terms were not correlated and the error variances were equal. The coefficients were underestimated when the errors were uncorrelated with increasing error variances. However, when the errors were correlated with equal variances, the traditional formulas overestimated both coefficients. When the errors were correlated with increasing variances, the traditional formulas both overestimated and underestimated the coefficients. Finally, increasing the number of occasions sampled improved the generalizability coefficient estimates more than the dependability coefficient estimates.
Index terms: changing error variances, computer simulation, correlated errors, dependability coefficients, generalizability coefficients.

Item: Full-information factor analysis for polytomous item responses (1995)
Muraki, Eiji; Carlson, James E.
A full-information item factor analysis model for multidimensional, polytomously scored item response data is developed as an extension of previous work by several authors. The model is expressed in both factor-analytic and item response theory parameters. Reckase’s multidimensional parameters for the model are discussed, as well as the related geometry. An EM algorithm for estimation of the model parameters is presented, and results of the analysis of item response data by a computer program incorporating this algorithm are presented. Index terms: EM algorithm, full-information item factor analysis, multidimensional item response theory, polytomous response data.

Item: Introduction to the Polytomous IRT Special Issue (1995)
Drasgow, Fritz

Item: IRT-based internal measures of differential functioning of items and tests (1995)
Raju, Nambury S.; Van der Linden, Wim J.; Fleer, Paul F.
Internal measures of differential functioning of items and tests (DFIT) based on item response theory (IRT) are proposed. Within the DFIT context, the new differential test functioning (DTF) index leads to two new measures of differential item functioning (DIF) with the following properties: (1) the compensatory DIF (CDIF) indexes for all items in a test sum to the DTF index for that test and, unlike current DIF procedures, the CDIF index for an item does not assume that the other items in the test are unbiased; (2) the noncompensatory DIF (NCDIF) index, which assumes that the other items in the test are unbiased, is comparable to some of the IRT-based DIF indexes; and (3) CDIF and NCDIF, as well as DTF, are equally valid for polytomous and multidimensional IRT models.
Monte Carlo study results, comparing these indexes with Lord’s χ² test, the signed area measure, and the unsigned area measure, demonstrate that the DFIT framework is accurate in assessing DTF, CDIF, and NCDIF. Index terms: area measures of DIF, compensatory DIF, differential functioning of items and tests (DFIT), differential item functioning, differential test functioning, Lord’s χ², noncompensatory DIF, nonuniform DIF, uniform DIF.

Item: Hyperbolic cosine latent trait models for unfolding direct responses and pairwise preferences (1995)
Andrich, David
The hyperbolic cosine unfolding model for direct responses of persons to individual stimuli is elaborated in three ways. First, the parameter of the stimulus, which reflects a region within which people located there are more likely to respond positively than negatively, is shown to be a property of the data and not arbitrary as first supposed. Second, the model is used to construct a related model for pairwise preferences. This model, for which joint maximum likelihood estimates are derived, satisfies strong stochastic transitivity. Third, the role of substantive theory in evaluating the fit between the data and the models, in which unique solutions for the estimates are not guaranteed, is explored by analyzing responses of one group of persons to a single set of stimuli obtained both as direct responses and as pairwise preferences. Index terms: direct responses, hyperbolic cosine model, item response theory, latent trait models, pair comparisons, pairwise preferences, unfolding models.

Item: Item response theory for scores on tests including polytomous items with ordered responses (1995)
Thissen, David; Pommerich, Mary; Billeaud, Kathleen; Williams, Valerie S. L.
Item response theory (IRT) provides procedures for scoring tests including any combination of rated constructed-response and keyed multiple-choice items, in that each response pattern is associated with some modal or expected a posteriori estimate of trait level.
However, various considerations that frequently arise in large-scale testing make response-pattern scoring an undesirable solution. Methods based on IRT are described that provide scaled scores, or estimates of trait level, for each summed score for rated responses, or for combinations of rated responses and multiple-choice items. These methods may be used to combine the useful scale properties of IRT-based scores with the practical virtues of a scale based on a summed score for each examinee. Index terms: graded response model, item response theory, ordered responses, polytomous models, scaled scores.

Item: The optimal degree of smoothing in equipercentile equating with postsmoothing (1995)
Zeng, Lingjia
The effects of different degrees of smoothing on the results of equipercentile equating in the random groups design were investigated using a postsmoothing method based on cubic splines. A computer-based procedure was introduced for selecting a desirable degree of smoothing. The procedure was based on two criteria: (1) that the equating function is reasonably smooth, as evaluated by the second derivatives of the cubic spline functions, and (2) that the equated score distributions are close to that of the old form. The equating functions obtained from smoothing the equipercentile equivalents by a fixed smoothing degree and by a degree selected by the computer-based procedure were evaluated in computer simulations for four tests. The results suggest that no particular fixed degree of smoothing always led to an optimal degree of smoothing. The degrees of smoothing selected by the computer-based procedure were better than the best fixed degrees of smoothing for two of the four tests studied; for one of the other two tests, the degrees selected by the computer procedure performed better than or nearly as well as the best fixed degrees.
Index terms: computer simulation, cubic spline, equating, equipercentile equating, smoothing.

Item: DIF assessment for polytomously scored items: A framework for classification and evaluation (1995)
Potenza, Maria T.; Dorans, Neil J.
Increased use of alternatives to the traditional dichotomously scored multiple-choice item yields complex responses that require complex scoring rules. Some of these new item types can be polytomously scored. DIF methodology is well defined for traditional dichotomously scored multiple-choice items. This paper provides a classification scheme of DIF procedures for dichotomously scored items that is applicable to new DIF procedures for polytomously scored items. In the process, a formal development of a polytomous version of a dichotomous DIF technique is presented. Several polytomous DIF techniques are evaluated in terms of statistical and practical criteria. Index terms: DIF methodology, differential item functioning, item bias, polytomous scoring, statistical criteria for differential item functioning.

Item: Polychotomous or Polytomous? (1995)
Weiss, David J.

Item: Using subject-matter experts to assess content representation: An MDS analysis (1995)
Sireci, Stephen G.; Geisinger, Kurt F.
Demonstration of content domain representation is of central importance in test validation. An expanded version of the method of content evaluation proposed by Sireci & Geisinger (1992) was evaluated with respect to a national licensure examination and a nationally standardized social studies achievement test. Two groups of 15 subject-matter experts (SMEs) rated the similarity of all item pairs comprising a test, and then rated the relevance of the items to the content domains listed in the test blueprints. The similarity ratings were analyzed using multidimensional scaling (MDS); the item relevance ratings were analyzed using procedures proposed by Hambleton (1984) and Aiken (1980).
The SMEs’ perceptions of the underlying content structures of the tests emerged in the MDS solutions. All dimensions were germane to the content domains measured by the tests. Some of these dimensions were consistent with the content structure specified in the test blueprint; others were not. Correlation and regression analyses of the MDS item coordinates and item relevance ratings indicated that using both item similarity and item relevance data provided more information about content representation than did using either approach alone. The implications of the procedure for test validity are discussed, and suggestions for future research are provided. Index terms: construct validity, content validity, cluster analysis, multidimensional scaling, subject-matter experts, test construction.

Item: Scoring method and the detection of person misfit in a personality assessment context (1995)
Reise, Steven P.
The purpose of this research was to explore psychometric issues pertinent to the application of an IRT-based person-fit (response aberrancy) detection statistic in the personality measurement domain. Monte Carlo data analyses were conducted to address issues regarding the person-fit statistic lz. The major issues explored were the characteristics of the null distribution of lz and its power to identify nonfitting response patterns under different scoring strategies. There were two main results. First, the lz index null distribution was not well standardized when item parameters of personality scales were used; the lz null distribution variance was significantly less than the hypothesized value of 1.0 under several conditions. Second, the power of lz to detect response misfit was affected by the scoring method. Detection power was optimal when a biweight estimator of θ was used. Recommendations are made regarding proper implementation of person-fit statistics in personality measurement.
Index terms: appropriateness measurement, item response theory, lz statistic, person fit, personality assessment, response aberrancy, scoring methods, two-parameter model.

Item: Pairwise parameter estimation in Rasch models (1995)
Zwinderman, Aeilko H.
Rasch model item parameters can be estimated consistently with a pseudo-likelihood method based on comparing responses to pairs of items, irrespective of other items. The pseudo-likelihood method is comparable to Fischer’s (1974) Minchi method. A simulation study found that the pseudo-likelihood estimates and their (estimated) standard errors were comparable to conditional and marginal maximum likelihood estimates. The method is extended to estimate parameters of the linear logistic test model, allowing the design matrix to vary between persons. Index terms: item parameter estimation, linear logistic test model, Minchi estimation, pseudo-likelihood, Rasch model.

Item: Computerized adaptive testing with polytomous items (1995)
Dodd, Barbara G.; De Ayala, R. J.; Koch, William R.
Polytomous item response theory models and the research that has been conducted to investigate a variety of possible operational procedures for polytomous model-based computerized adaptive testing (CAT) are reviewed. Studies that compared polytomous CAT systems based on competing item response theory models appropriate for the same measurement objective, as well as applications of polytomous CAT in marketing and educational psychology, also are reviewed. Directions for future research using polytomous model-based CAT are suggested. Index terms: computerized adaptive testing, polytomous item response theory, polytomous scoring.
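The pairwise idea behind Zwinderman's abstract can be illustrated with a minimal sketch. For a Rasch model, conditional on exactly one of two items i and j being answered correctly, the probability that it was item i is logistic(b_j - b_i), which does not involve the person parameter θ; a pseudo-likelihood built from all item pairs can therefore be maximized for the difficulties alone. The code below is an assumed, simplified implementation (simulated data, plain gradient ascent, invented function names and tuning constants), not the estimator or software from the paper.

```python
import math
import random

def simulate_rasch(n_persons, difficulties, seed=1):
    """Simulate dichotomous Rasch responses: P(correct) = logistic(theta - b)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
                     for b in difficulties])
    return data

def pairwise_estimate(data, n_iter=2000, lr=1e-3):
    """Estimate Rasch item difficulties by gradient ascent on the pairwise
    pseudo-log-likelihood: given that exactly one of items i, j is correct,
    P(item i is the correct one) = logistic(b_j - b_i), free of theta."""
    k = len(data[0])
    # n[i][j] = number of persons with item i correct and item j incorrect
    n = [[0] * k for _ in range(k)]
    for row in data:
        for i in range(k):
            for j in range(k):
                if i != j and row[i] == 1 and row[j] == 0:
                    n[i][j] += 1
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    b = [0.0] * k
    for _ in range(n_iter):
        # d/db_i of sum_{j} [ n_ij*log sig(b_j-b_i) + n_ji*log sig(b_i-b_j) ]
        grad = [sum(n[j][i] * sig(b[j] - b[i]) - n[i][j] * sig(b[i] - b[j])
                    for j in range(k) if j != i)
                for i in range(k)]
        b = [bi + lr * g for bi, g in zip(b, grad)]
        m = sum(b) / k  # fix the origin of the scale by centering
        b = [bi - m for bi in b]
    return b

TRUE_B = [-1.0, -0.5, 0.0, 0.5, 1.0]  # true difficulties for the demo
est = pairwise_estimate(simulate_rasch(2000, TRUE_B))
```

Because θ cancels in the conditional probability, each pair of items contributes information only about the difference b_j - b_i, which is why the scale must be identified by a constraint such as centering the estimates.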