# Applied Psychological Measurement, Volume 19, 1995

Showing 20 of the 27 items in this volume, sorted by issue date.

## The Rasch Poisson counts model for incomplete data: An application of the EM algorithm (1995)

Jansen, Margo G. H.

Rasch’s Poisson counts model is a latent trait model for the situation in which K tests are administered to N examinees and the test score is a count [e.g., the repeated occurrence of some event, such as the number of items completed or the number of items answered (in)correctly]. The Rasch Poisson counts model assumes that the test scores are Poisson-distributed random variables. In the approach presented here, the Poisson parameter is assumed to be a product of a fixed test difficulty and a gamma-distributed random examinee latent trait parameter. From these assumptions, marginal maximum likelihood estimators can be derived for the test difficulties and the parameters of the prior gamma distribution. For the examinee parameters, there are a number of options. The model can be applied in a situation in which observations result from an incomplete design. When examinees are assigned to different subsets of tests using background information, this information must be taken into account when using marginal maximum likelihood estimation. If the focus is on test calibration and there is no interest in the characteristics of the latent traits in relation to the background information, conditional maximum likelihood methods may be preferred because they are easier to implement and are justified for incomplete data for test parameter estimation. Index terms: EM algorithm, incomplete designs, latent trait models, marginal maximum likelihood estimation, Rasch Poisson counts model.

## Polychotomous or Polytomous? (1995)

Weiss, David J.

## Using subject-matter experts to assess content representation: An MDS analysis (1995)

Sireci, Stephen G.; Geisinger, Kurt F.

Demonstration of content domain representation is of central importance in test validation.
An expanded version of the method of content evaluation proposed by Sireci & Geisinger (1992) was evaluated with respect to a national licensure examination and a nationally standardized social studies achievement test. Two groups of 15 subject-matter experts (SMEs) rated the similarity of all item pairs comprising a test, and then rated the relevance of the items to the content domains listed in the test blueprints. The similarity ratings were analyzed using multidimensional scaling (MDS); the item relevance ratings were analyzed using procedures proposed by Hambleton (1984) and Aiken (1980). The SMEs’ perceptions of the underlying content structures of the tests emerged in the MDS solutions. All dimensions were germane to the content domains measured by the tests. Some of these dimensions were consistent with the content structure specified in the test blueprint; others were not. Correlation and regression analyses of the MDS item coordinates and item relevance ratings indicated that using both item similarity and item relevance data provided more information about content representation than did using either approach alone. The implications of the procedure for test validity are discussed and suggestions for future research are provided. Index terms: construct validity, content validity, cluster analysis, multidimensional scaling, subject-matter experts, test construction.

## Scoring method and the detection of person misfit in a personality assessment context (1995)

Reise, Steven P.

The purpose of this research was to explore psychometric issues pertinent to the application of an IRT-based person-fit (response aberrancy) detection statistic in the personality measurement domain. Monte Carlo data analyses were conducted to address issues regarding the person-fit statistic lz. The major issues explored were the characteristics of the null distribution of lz and its power to identify nonfitting response patterns under different scoring strategies.
There were two main results. First, the lz index null distribution was not well standardized when item parameters of personality scales were used; the lz null distribution variance was significantly less than the hypothesized value of 1.0 under several conditions. Second, the power of lz to detect response misfit was affected by the scoring method. Detection power was optimal when a biweight estimator of θ was used. Recommendations are made regarding proper implementation of person-fit statistics in personality measurement. Index terms: appropriateness measurement, item response theory, lz statistic, person fit, personality assessment, response aberrancy, scoring methods, two-parameter model.

## Pairwise parameter estimation in Rasch models (1995)

Zwinderman, Aeilko H.

Rasch model item parameters can be estimated consistently with a pseudo-likelihood method based on comparing responses to pairs of items irrespective of the other items. The pseudo-likelihood method is comparable to Fischer’s (1974) Minchi method. A simulation study found that the pseudo-likelihood estimates and their (estimated) standard errors were comparable to conditional and marginal maximum likelihood estimates. The method is extended to estimate parameters of the linear logistic test model, allowing the design matrix to vary between persons. Index terms: item parameter estimation, linear logistic test model, Minchi estimation, pseudo-likelihood, Rasch model.

## Computerized adaptive testing with polytomous items (1995)

Dodd, Barbara G.; De Ayala, R. J.; Koch, William R.

Polytomous item response theory models and the research that has been conducted to investigate a variety of possible operational procedures for polytomous model-based computerized adaptive testing (CAT) are reviewed.
Studies that compared polytomous CAT systems based on competing item response theory models that are appropriate for the same measurement objective, as well as applications of polytomous CAT in marketing and educational psychology, also are reviewed. Directions for future research using polytomous model-based CAT are suggested. Index terms: computerized adaptive testing, polytomous item response theory, polytomous scoring.

## Conceptual notes on models for discrete polytomous item responses (1995)

Mellenbergh, Gideon J.

The following types of discrete item responses are distinguished: nominal-dichotomous, ordinal-dichotomous, nominal-polytomous, and ordinal-polytomous. Bock (1972) presented a model for nominal-polytomous item responses that, when applied to dichotomous responses, yields Birnbaum’s (1968) two-parameter logistic model. Applying Bock’s model to ordinal-polytomous items leads to a conceptual problem. The ordinal nature of the response variable must be preserved; this can be achieved using three different methods. A number of existing models are derived using these three methods. The structure of these models is similar, but they differ in the interpretation and qualities of their parameters. Information, parameter invariance, log-odds differences invariance, and model violation also are discussed. Information and parameter invariance of dichotomous item response theory (IRT) also apply to polytomous IRT. Specific objectivity of the Rasch model for dichotomous items is a special case of log-odds differences invariance of polytomous items. Differential item functioning of dichotomous IRT is a special case of measurement model violation of polytomous IRT.
Index terms: adjacent categories, continuation ratios, cumulative probabilities, differential item functioning, log-odds differences invariance, measurement model violation, parameter invariance, polytomous IRT models.

## Reliability estimation for single dichotomous items based on Mokken's IRT model (1995)

Meijer, Rob R.; Sijtsma, Klaas; Molenaar, Ivo W.

Item reliability is of special interest for Mokken’s nonparametric item response theory, and is useful for the evaluation of item quality in nonparametric test construction research. It is also of interest for nonparametric person-fit analysis. Three methods for the estimation of the reliability of single dichotomous items are discussed. All methods are based on the assumptions of nondecreasing and nonintersecting item response functions. Based on analytical and Monte Carlo studies, it is concluded that one method is superior to the other two because it has a smaller bias and a smaller sampling variance. This method also demonstrated some robustness under violation of the condition of nonintersecting item response functions. Index terms: item reliability, item response theory, Mokken model, nonparametric item response models, test construction.

## An alternative approach for IRT observed-score equating of number-correct scores (1995)

Zeng, Lingjia; Kolen, Michael J.

An alternative approach for item response theory observed-score equating is described. The number-correct score distributions needed in equating are found by numerical integration over the theoretical or empirical distributions of examinees’ traits. The item response theory true-score equating method and the observed-score equating method described by Lord, in which the number-correct score distributions are summed over a sample of trait estimates, are compared in a real test example.
In a computer simulation, the observed-score equating methods based on numerical integration and summation were compared using data generated from standard normal and skewed populations. The method based on numerical integration was found to be less biased, especially at the two ends of the score distribution. This method can be implemented without the need to estimate trait level for individual examinees, and it is less computationally intensive than the method based on summation. Index terms: equating, item response theory, numerical integration, observed-score equating.

## Analyzing homogeneity and heterogeneity of change using Rasch and latent class models: A comparative and integrative approach (1995)

Meiser, Thorsten; Hein-Eggers, Monika; Rompe, Pamela; Rudinger, Georg

The application of unidimensional Rasch models to longitudinal data assumes homogeneity of change over persons. Using latent class models, several classes with qualitatively distinct patterns of development can be taken into account; thus, heterogeneity of change is assumed. The mixed Rasch model integrates both the Rasch and the latent class approach by dividing the population of persons into classes that conform to Rasch models with class-specific parameters. Thus, qualitatively different patterns of change can be modeled with the homogeneity assumption retained within each class, but not between classes. In contrast to the usual latent class approach, the mixed Rasch model includes a quantitative differentiation among persons in the same class. Thus, quantitative differences in the level of the latent attribute are disentangled from the qualitative shape of development. A theoretical comparison of the formal approaches is presented here, as well as an application to empirical longitudinal data. In the context of personality development in childhood and early adolescence, the existence of different developmental trajectories is demonstrated for two aspects of personality.
Relations between the latent trajectories and discrete exogenous variables are investigated. Index terms: latent class analysis, latent structure analysis, measurement of change, mixture distribution models, Rasch model, rating scale model.

## DIF assessment for polytomously scored items: A framework for classification and evaluation (1995)

Potenza, Maria T.; Dorans, Neil J.

Increased use of alternatives to the traditional dichotomously scored multiple-choice item yields complex responses that require complex scoring rules. Some of these new item types can be polytomously scored. DIF methodology is well defined for traditional dichotomously scored multiple-choice items. This paper provides a classification scheme of DIF procedures for dichotomously scored items that is applicable to new DIF procedures for polytomously scored items. In the process, a formal development of a polytomous version of a dichotomous DIF technique is presented. Several polytomous DIF techniques are evaluated in terms of statistical and practical criteria. Index terms: DIF methodology, differential item functioning, item bias, polytomous scoring, statistical criteria for differential item functioning.

## Fitting polytomous item response theory models to multiple-choice tests (1995)

Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D.

This study examined how well current software implementations of four polytomous item response theory models fit several multiple-choice tests. The models were Bock’s (1972) nominal model, Samejima’s (1979) multiple-choice Model C, Thissen & Steinberg’s (1984) multiple-choice model, and Levine’s (1993) maximum-likelihood formula scoring model. The parameters of the first three of these models were estimated with Thissen’s (1986) MULTILOG computer program; Williams & Levine’s (1993) FORSCORE program was used for Levine’s model.
Tests from the Armed Services Vocational Aptitude Battery, the Scholastic Aptitude Test, and the American College Test Assessment were analyzed. The models were fit in estimation samples of approximately 3,000; cross-validation samples of approximately 3,000 were used to evaluate goodness of fit. Both fit plots and χ² statistics were used to determine the adequacy of fit. Bock’s model provided surprisingly good fit; adding parameters to the nominal model did not yield improvements in fit. FORSCORE provided generally good fit for Levine’s nonparametric model across all tests. Index terms: Bock’s nominal model, FORSCORE, maximum likelihood formula scoring, MULTILOG, polytomous IRT.

## Item response theory for scores on tests including polytomous items with ordered responses (1995)

Thissen, David; Pommerich, Mary; Billeaud, Kathleen; Williams, Valerie S. L.

Item response theory (IRT) provides procedures for scoring tests including any combination of rated constructed-response and keyed multiple-choice items, in that each response pattern is associated with some modal or expected a posteriori estimate of trait level. However, various considerations that frequently arise in large-scale testing make response-pattern scoring an undesirable solution. Methods are described based on IRT that provide scaled scores, or estimates of trait level, for each summed score for rated responses, or for combinations of rated responses and multiple-choice items. These methods may be used to combine the useful scale properties of IRT-based scores with the practical virtues of a scale based on a summed score for each examinee.
Index terms: graded response model, item response theory, ordered responses, polytomous models, scaled scores.

## The optimal degree of smoothing in equipercentile equating with postsmoothing (1995)

Zeng, Lingjia

The effects of different degrees of smoothing on the results of equipercentile equating in the random groups design were investigated using a postsmoothing method based on cubic splines. A computer-based procedure was introduced for selecting a desirable degree of smoothing. The procedure was based on two criteria: (1) that the equating function is reasonably smooth, as evaluated by the second derivatives of the cubic spline functions, and (2) that the equated score distributions are close to that of the old form. The equating functions obtained from smoothing the equipercentile equivalents by a fixed degree and by a degree selected by the computer-based procedure were evaluated in computer simulations for four tests. The results suggest that no single fixed degree of smoothing was optimal for all tests. The degrees of smoothing selected by the computer-based procedure were better than the best fixed degrees of smoothing for two of the four tests studied; for one of the other two tests, the degrees selected by the computer procedure performed better than or nearly as well as the best fixed degrees. Index terms: computer simulation, cubic spline, equating, equipercentile equating, smoothing.

## Distinctive and incompatible properties of two common classes of IRT models for graded responses (1995)

Andrich, David

Two classes of models for graded responses, the first based on the work of Thurstone and the second based on the work of Rasch, are juxtaposed and shown to satisfy important, but mutually incompatible, criteria and to reflect different response processes.
Specifically, in the Thurstone models, if adjacent categories are joined to form a new category, either before or after the data are collected, then the probability of a response in the new category is the sum of the probabilities of the responses in the original categories. However, the model does not have the explicit property that if the categories are so joined, then the estimate of the location of the entity or object being measured is invariant before and after the joining. For the Rasch models, if a pair of adjacent categories are joined and then the data are collected, the estimate of the location of the entity is the same before and after the joining, but the probability of a response in the new category is not the sum of the probabilities of the responses in the original categories. Furthermore, if data satisfy the model and the categories are joined after the data are collected, then they no longer satisfy the same Rasch model with the smaller number of categories. These differences imply that the choice between these two classes of models for graded responses is not simply a matter of preference; they also permit a better understanding of the choice of models for graded response data as a function of the underlying processes they are intended to represent. Index terms: graded responses, joining assumption, polytomous IRT models, Rasch model, Thurstone model.

## Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model (1995)

Hemker, Bas T.; Sijtsma, Klaas; Molenaar, Ivo W.

An automated item selection procedure for selecting unidimensional scales of polytomous items from multidimensional datasets is developed for use in the context of the Mokken item response theory model of monotone homogeneity (Mokken & Lewis, 1982). The selection procedure is directly based on the selection procedure proposed by Mokken (1971, p. 187) and relies heavily on the scalability coefficient H (Loevinger, 1948; Molenaar, 1991). New theoretical results relating the latent model structure to H are provided. The item selection procedure requires selection of a lower bound for H. A simulation study determined ranges of H for which the unidimensional item sets were retrieved from multidimensional datasets. If multidimensionality is suspected in an empirical dataset, well-chosen lower bound values can be used effectively to detect the unidimensional scales. Index terms: item response theory, Mokken model, multidimensional item banks, nonparametric item response models, scalability coefficient H, test construction, unidimensional scales.

## IRT-based internal measures of differential functioning of items and tests (1995)

Raju, Nambury S.; Van der Linden, Wim J.; Fleer, Paul F.

Internal measures of differential functioning of items and tests (DFIT) based on item response theory (IRT) are proposed. Within the DFIT context, the new differential test functioning (DTF) index leads to two new measures of differential item functioning (DIF) with the following properties: (1) the compensatory DIF (CDIF) indexes for all items in a test sum to the DTF index for that test and, unlike current DIF procedures, the CDIF index for an item does not assume that the other items in the test are unbiased; (2) the noncompensatory DIF (NCDIF) index, which assumes that the other items in the test are unbiased, is comparable to some of the IRT-based DIF indexes; and (3) CDIF and NCDIF, as well as DTF, are equally valid for polytomous and multidimensional IRT models. Monte Carlo study results, comparing these indexes with Lord’s χ² test, the signed area measure, and the unsigned area measure, demonstrate that the DFIT framework is accurate in assessing DTF, CDIF, and NCDIF.
Index terms: area measures of DIF, compensatory DIF, differential functioning of items and tests (DFIT), differential item functioning, differential test functioning, Lord’s χ², noncompensatory DIF, nonuniform DIF, uniform DIF.

## A minimum χ² method for equating tests under the graded response model (1995)

Kim, Seock-Ho; Cohen, Allan S.

The minimum χ² method for computing equating coefficients for tests with dichotomously scored items was extended to the case of Samejima’s graded response items. The minimum χ² method was compared with the test response function method (also referred to as the test characteristic curve method), in which the equating coefficients are obtained by matching the test response functions of the two tests. The minimum χ² method was much less demanding computationally and yielded equating coefficients that differed little from those obtained using the test response function approach. Index terms: equating, graded response model, item response theory, minimum χ² method, test response function method.

## Complex composites: Issues that arise in combining different modes of assessment (1995)

Wilson, Mark; Wang, Wen-chung

Data from the California Learning Assessment System are used to examine certain characteristics of tests designed as composites of items of different modes. The characteristics include rater severity, test information, and definition of the latent variable. Three different assessment modes (multiple-choice, open-ended, and investigation items; the latter two are referred to as performance-based modes) were combined in a test across three different test forms. Rater severity was investigated by incorporating a rater parameter for each rater in an item response model that then was used to analyze the data. Some rater severities were found to be quite extreme, and the impact of this variation in rater severities on both total scores and trait level estimates was examined.
Within-rater variation in rater severity also was examined and was found to be significant. The information contribution of the three modes was compared. Performance-based items provided more information than multiple-choice items and also provided the greatest precision at higher levels of the latent variable. A projection-like method was applied to investigate the effects of assessment mode on the definition of the latent variable. The multiple-choice items added information to the performance-based variable. The results of the analysis also showed that the projection-like method did not differ practically from defining the latent trait jointly by both the multiple-choice and the performance-based items. Index terms: equating, linking, multiple assessment modes, polytomous item response models, rater effects.

## The distribution of person fit using true and estimated person parameters (1995)

Nering, Michael L.

A variety of methods have been developed to determine the extent to which a person’s response vector fits an item response theory model. These person-fit methods are statistical methods that allow researchers to identify nonfitting response vectors. The most promising method has been the lz statistic, which is a standardized person-fit index. Reise & Due (1991) concluded that under the null condition (i.e., when data were simulated to fit the model) lz performed reasonably well. The present study extended the findings of past researchers (e.g., Drasgow, Levine, & McLaughlin, 1987; Molenaar & Hoijtink, 1990; Reise & Due, 1991). Results show that lz may not perform as expected when estimated person parameters (θ̂) are used rather than true θ. This study also examined the influence of the pseudo-guessing parameter, the method used to identify nonfitting response vectors, and the method used to estimate θ.
When θ was better estimated, lz was more normally distributed, and the false positive rate for a single cut score did not characterize the distribution of lz. Changing the c parameter from .20 to 0.0 did not improve the normality of the lz distribution. Index terms: appropriateness measurement, Bayesian estimation, item response theory, maximum likelihood estimation, person fit.
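Several entries in this volume (Reise; Nering) center on the standardized person-fit statistic lz. As a minimal sketch of what such a statistic computes for a dichotomous response vector under a two-parameter logistic model: lz standardizes the log-likelihood l0 of the observed responses at a given θ by its model-implied mean and variance. The formula is the standard one, but the item parameters and response vectors below are invented purely for illustration.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lz_statistic(responses, theta, a_params, b_params):
    """Standardized person-fit statistic lz for dichotomous responses.

    lz = (l0 - E[l0]) / sqrt(Var[l0]), where l0 is the log-likelihood of the
    observed response vector at theta. Large negative values flag response
    vectors that misfit the model (response aberrancy).
    """
    l0 = exp_l0 = var_l0 = 0.0
    for u, a, b in zip(responses, a_params, b_params):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)       # observed log-likelihood
        exp_l0 += p * math.log(p) + q * math.log(q)          # its expectation
        var_l0 += p * q * math.log(p / q) ** 2               # its variance
    return (l0 - exp_l0) / math.sqrt(var_l0)

# Hypothetical 5-item test: equal discriminations, spread difficulties.
a = [1.0] * 5
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = lz_statistic([1, 1, 1, 0, 0], 0.0, a, b)  # Guttman-like pattern
aberrant = lz_statistic([0, 0, 0, 1, 1], 0.0, a, b)    # reversed pattern
```

A response pattern consistent with the model at the given θ yields lz near zero, while the reversed pattern yields a large negative value; it is this misfit signal whose null distribution and detection power the studies above evaluate.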