Applied Psychological Measurement, Volume 19, 1995
Item: The distribution of person fit using true and estimated person parameters (1995). Nering, Michael L.
A variety of methods have been developed to determine the extent to which a person's response vector fits an item response theory model. These person-fit methods are statistical methods that allow researchers to identify nonfitting response vectors. The most promising method has been the lz statistic, a standardized person-fit index. Reise & Due (1991) concluded that under the null condition (i.e., when data were simulated to fit the model) lz performed reasonably well. The present study extended the findings of past researchers (e.g., Drasgow, Levine, & McLaughlin, 1987; Molenaar & Hoijtink, 1990; Reise & Due, 1991). Results show that lz may not perform as expected when estimated person parameters (θ̂) are used rather than true θ. This study also examined the influence of the pseudo-guessing parameter, the method used to identify nonfitting response vectors, and the method used to estimate θ. When θ was better estimated, lz was more normally distributed, and the false positive rate for a single cut score did not characterize the distribution of lz. Changing the c parameter from .20 to 0.0 did not improve the normality of the lz distribution.
Index terms: appropriateness measurement, Bayesian estimation, item response theory, maximum likelihood estimation, person fit.

Item: Pairwise parameter estimation in Rasch models (1995). Zwinderman, Aeilko H.
Rasch model item parameters can be estimated consistently with a pseudo-likelihood method based on comparing responses to pairs of items, irrespective of the other items. The pseudo-likelihood method is comparable to Fischer's (1974) Minchi method. A simulation study found that the pseudo-likelihood estimates and their (estimated) standard errors were comparable to conditional and marginal maximum likelihood estimates.
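For the dichotomous Rasch model, the pairwise idea rests on a simple fact: for a person who answers exactly one of items i and j correctly, the probability that item i was the correct one is sigmoid(b_j − b_i), which does not involve the person parameter. A minimal numpy sketch along these lines (simulated data and plain gradient ascent on the pairwise log pseudo-likelihood; the variable names and the optimizer are illustrative, not the article's implementation or its standard errors):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rasch(theta, b):
    """Simulate a 0/1 response matrix under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def pairwise_estimate(X, n_iter=2000):
    """Maximize the pairwise log pseudo-likelihood by gradient ascent.

    n[i, j] counts persons with item i correct and item j incorrect;
    conditional on exactly one of the pair being correct, that event
    has probability sigmoid(b_j - b_i), free of theta.
    """
    n_items = X.shape[1]
    n = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i != j:
                n[i, j] = np.sum((X[:, i] == 1) & (X[:, j] == 0))
    b = np.zeros(n_items)
    lr = 1.0 / max(1.0, n.max())       # small step keeps the ascent stable
    for _ in range(n_iter):
        s = 1.0 / (1.0 + np.exp(-(b[None, :] - b[:, None])))  # s[i,j] = sigmoid(b_j - b_i)
        grad = (n.T * s).sum(axis=1) - (n * (1.0 - s)).sum(axis=1)
        b += lr * grad
        b -= b.mean()                  # identification: difficulties sum to zero
    return b

b_true = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
theta = rng.normal(size=3000)
X = simulate_rasch(theta, b_true)
b_hat = pairwise_estimate(X)           # close to b_true for large samples
```

With a few thousand simulated examinees, the recovered difficulties track the generating values closely, illustrating the consistency the abstract describes.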
The method is extended to estimate parameters of the linear logistic test model, allowing the design matrix to vary between persons.
Index terms: item parameter estimation, linear logistic test model, Minchi estimation, pseudo-likelihood, Rasch model.

Item: Analyzing homogeneity and heterogeneity of change using Rasch and latent class models: A comparative and integrative approach (1995). Meiser, Thorsten; Hein-Eggers, Monika; Rompe, Pamela; Rudinger, Georg
The application of unidimensional Rasch models to longitudinal data assumes homogeneity of change over persons. Using latent class models, several classes with qualitatively distinct patterns of development can be taken into account; thus, heterogeneity of change is assumed. The mixed Rasch model integrates both the Rasch and the latent class approach by dividing the population of persons into classes that conform to Rasch models with class-specific parameters. Thus, qualitatively different patterns of change can be modeled with the homogeneity assumption retained within each class, but not between classes. In contrast to the usual latent class approach, the mixed Rasch model includes a quantitative differentiation among persons in the same class. Thus, quantitative differences in the level of the latent attribute are disentangled from the qualitative shape of development. A theoretical comparison of the formal approaches is presented here, as well as an application to empirical longitudinal data. In the context of personality development in childhood and early adolescence, the existence of different developmental trajectories is demonstrated for two aspects of personality. Relations between the latent trajectories and discrete exogenous variables are investigated.
Index terms: latent class analysis, latent structure analysis, measurement of change, mixture distribution models, Rasch model, rating scale model.

Item: IRT-based internal measures of differential functioning of items and tests (1995). Raju, Nambury S.; Van der Linden, Wim J.; Fleer, Paul F.
Internal measures of differential functioning of items and tests (DFIT) based on item response theory (IRT) are proposed. Within the DFIT context, the new differential test functioning (DTF) index leads to two new measures of differential item functioning (DIF) with the following properties: (1) the compensatory DIF (CDIF) indexes for all items in a test sum to the DTF index for that test and, unlike current DIF procedures, the CDIF index for an item does not assume that the other items in the test are unbiased; (2) the noncompensatory DIF (NCDIF) index, which assumes that the other items in the test are unbiased, is comparable to some of the IRT-based DIF indexes; and (3) CDIF and NCDIF, as well as DTF, are equally valid for polytomous and multidimensional IRT models. Monte Carlo study results, comparing these indexes with Lord's χ² test, the signed area measure, and the unsigned area measure, demonstrate that the DFIT framework is accurate in assessing DTF, CDIF, and NCDIF.
Index terms: area measures of DIF, compensatory DIF, differential functioning of items and tests (DFIT), differential item functioning, differential test functioning, Lord's χ², noncompensatory DIF, nonuniform DIF, uniform DIF.

Item: Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model (1995). Hemker, Bas T.; Sijtsma, Klaas; Molenaar, Ivo W.
An automated item selection procedure for selecting unidimensional scales of polytomous items from multidimensional datasets is developed for use in the context of the Mokken item response theory model of monotone homogeneity (Mokken & Lewis, 1982).
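Mokken-style item selection of this kind is driven by the scalability coefficient H. For the simpler dichotomous case, H can be computed directly from Guttman-error counts, as in this minimal sketch with simulated data (the article itself works with the polytomous generalization of H, which is not reproduced here):

```python
import numpy as np

def loevinger_H(X):
    """Scalability coefficient H for dichotomous items:
    H = 1 - (observed Guttman errors) / (errors expected under independence)."""
    n, k = X.shape
    p = X.mean(axis=0)
    F = 0.0   # observed errors: harder item passed while easier item failed
    E = 0.0   # errors expected if the items were statistically independent
    for i in range(k):
        for j in range(i + 1, k):
            easy, hard = (i, j) if p[i] >= p[j] else (j, i)
            F += np.sum((X[:, hard] == 1) & (X[:, easy] == 0))
            E += n * p[hard] * (1.0 - p[easy])
    return 1.0 - F / E

rng = np.random.default_rng(0)
theta = rng.normal(size=4000)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
# strongly discriminating unidimensional items vs. pure coin-flip noise
p_scale = 1.0 / (1.0 + np.exp(-2.0 * (theta[:, None] - b[None, :])))
H_scale = loevinger_H((rng.random(p_scale.shape) < p_scale).astype(int))
H_noise = loevinger_H((rng.random((4000, 5)) < 0.5).astype(int))
```

Here H_scale lands well above a conventional lower bound such as .30, while H_noise is near 0, which is the contrast a lower-bound selection rule exploits.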
The selection procedure is directly based on the selection procedure proposed by Mokken (1971, p. 187) and relies heavily on the scalability coefficient H (Loevinger, 1948; Molenaar, 1991). New theoretical results relating the latent model structure to H are provided. The item selection procedure requires the choice of a lower bound for H. A simulation study determined ranges of H for which the unidimensional item sets were retrieved from multidimensional datasets. If multidimensionality is suspected in an empirical dataset, well-chosen lower bound values can be used effectively to detect the unidimensional scales.
Index terms: item response theory, Mokken model, multidimensional item banks, nonparametric item response models, scalability coefficient H, test construction, unidimensional scales.

Item: Reliability estimation for single dichotomous items based on Mokken's IRT model (1995). Meijer, Rob R.; Sijtsma, Klaas; Molenaar, Ivo W.
Item reliability is of special interest for Mokken's nonparametric item response theory, and is useful for the evaluation of item quality in nonparametric test construction research. It is also of interest for nonparametric person-fit analysis. Three methods for estimating the reliability of single dichotomous items are discussed. All methods are based on the assumptions of nondecreasing and nonintersecting item response functions. Based on analytical and Monte Carlo studies, it is concluded that one method is superior to the other two because it has a smaller bias and a smaller sampling variance. This method also demonstrated some robustness under violation of the condition of nonintersecting item response functions.
Index terms: item reliability, item response theory, Mokken model, nonparametric item response models, test construction.

Item: Analysis of differential item functioning in translated assessment instruments (1995). Budgell, Glen R.; Raju, Nambury S.; Quartetti, Douglas A.
The usefulness of three IRT-based methods and the Mantel-Haenszel technique in evaluating the measurement equivalence of translated assessment instruments was investigated. A 15-item numerical test and an 18-item reasoning test that were originally developed in English and then translated to French were used. The analyses were based on four groups, each containing 1,000 examinees. Two groups of English-speaking examinees were administered the English version of the tests; the other two groups were French-speaking examinees who were administered the French version. The percentage of items identified with significant differential item functioning (DIF) in this study was similar to findings in previous large-sample studies. The four DIF methods showed substantial consistency in identifying items with significant DIF when replicated. Suggestions for future research are provided.
Index terms: area measures, differential item functioning, item response theory, language translations, Lord's χ², Mantel-Haenszel procedure.

Item: The Rasch Poisson counts model for incomplete data: An application of the EM algorithm (1995). Jansen, Margo G. H.
Rasch's Poisson counts model is a latent trait model for the situation in which K tests are administered to N examinees and the test score is a count [e.g., the repeated occurrence of some event, such as the number of items completed or the number of items answered (in)correctly]. The Rasch Poisson counts model assumes that the test scores are Poisson-distributed random variables. In the approach presented here, the Poisson parameter is assumed to be a product of a fixed test difficulty and a gamma-distributed random examinee latent trait parameter.
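As a quick numerical check of this Poisson-gamma structure (parameter values are illustrative, not taken from the article): mixing a Poisson rate ε·τ over a Gamma(α, β) trait τ yields a negative binomial marginal whose variance exceeds its mean by ε²α/β².

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values only (not from the article)
alpha, beta = 4.0, 2.0          # gamma prior: shape and rate, mean alpha/beta
eps = 3.0                       # fixed test parameter
n = 200_000

tau = rng.gamma(shape=alpha, scale=1.0 / beta, size=n)  # examinee traits
scores = rng.poisson(eps * tau)                         # observed count scores

# The Poisson-gamma mixture is negative binomial, so the marginal
# variance exceeds the marginal mean by eps**2 * alpha / beta**2:
mean_theory = eps * alpha / beta                        # here 6.0
var_theory = mean_theory + eps**2 * alpha / beta**2     # here 15.0
```

The simulated mean and variance of `scores` match these closed-form values, which is the overdispersion that motivates the gamma prior.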
From these assumptions, marginal maximum likelihood estimators can be derived for the test difficulties and the parameters of the prior gamma distribution. For the examinee parameters, there are a number of options. The model can be applied in a situation in which observations result from an incomplete design. When examinees are assigned to different subsets of tests using background information, this information must be taken into account when using marginal maximum likelihood estimation. If the focus is on test calibration and there is no interest in the characteristics of the latent traits in relation to the background information, conditional maximum likelihood methods may be preferred because they are easier to implement and are justified for incomplete data for test parameter estimation.
Index terms: EM algorithm, incomplete designs, latent trait models, marginal maximum likelihood estimation, Rasch Poisson counts model.

Item: Hyperbolic cosine latent trait models for unfolding direct responses and pairwise preferences (1995). Andrich, David
The hyperbolic cosine unfolding model for direct responses of persons to individual stimuli is elaborated in three ways. First, the parameter of the stimulus, which reflects a region within which people located there are more likely to respond positively than negatively, is shown to be a property of the data and not arbitrary as first supposed. Second, the model is used to construct a related model for pairwise preferences. This model, for which joint maximum likelihood estimates are derived, satisfies strong stochastic transitivity. Third, the role of substantive theory in evaluating the fit between the data and the models, in which unique solutions for the estimates are not guaranteed, is explored by analyzing responses of one group of persons to a single set of stimuli obtained both as direct responses and as pairwise preferences.
Index terms: direct responses, hyperbolic cosine model, item response theory, latent trait models, pair comparisons, pairwise preferences, unfolding models.

Item: Testing the equality of scale values and discriminal dispersions in paired comparisons (1995). Davison, Mark L.; McGuire, Dennis P.; Chen, Tsuey-Hwa; Anderson, Ronald O.
General normal ogive and logistic multiple-group models for paired comparisons data are described. In these models, scale value and discriminal dispersion parameters are allowed to vary across stimuli and respondent populations. Submodels can be fit to choice proportions by nonlinearly regressing sample estimates of choice proportions onto a complex design matrix. By fitting various submodels and by appropriate coding of parameter effects, selected hypotheses about the equality of scale value and dispersion parameters across groups can be tested. Model fitting and hypothesis testing are illustrated using health care coverage data collected in two age groups.
Index terms: Bradley-Terry-Luce model, choice models, logistic regression, paired comparisons, probit regression, Thurstone's Law of Comparative Judgment.

Item: Using subject-matter experts to assess content representation: An MDS analysis (1995). Sireci, Stephen G.; Geisinger, Kurt F.
Demonstration of content domain representation is of central importance in test validation. An expanded version of the method of content evaluation proposed by Sireci & Geisinger (1992) was evaluated with respect to a national licensure examination and a nationally standardized social studies achievement test. Two groups of 15 subject-matter experts (SMEs) rated the similarity of all item pairs comprising a test, and then rated the relevance of the items to the content domains listed in the test blueprints. The similarity ratings were analyzed using multidimensional scaling (MDS); the item relevance ratings were analyzed using procedures proposed by Hambleton (1984) and Aiken (1980). The SMEs' perceptions of the underlying content structures of the tests emerged in the MDS solutions. All dimensions were germane to the content domains measured by the tests.
Some of these dimensions were consistent with the content structure specified in the test blueprint; others were not. Correlation and regression analyses of the MDS item coordinates and item relevance ratings indicated that using both item similarity and item relevance data provided more information about content representation than did using either approach alone. The implications of the procedure for test validity are discussed, and suggestions for future research are provided.
Index terms: construct validity, content validity, cluster analysis, multidimensional scaling, subject-matter experts, test construction.

Item: An alternative approach for IRT observed-score equating of number-correct scores (1995). Zeng, Lingjia; Kolen, Michael J.
An alternative approach for item response theory observed-score equating is described. The number-correct score distributions needed in equating are found by numerical integration over the theoretical or empirical distributions of examinees' traits. The item response theory true-score equating method and the observed-score equating method described by Lord, in which the number-correct score distributions are summed over a sample of trait estimates, are compared in a real test example. In a computer simulation, the observed-score equating methods based on numerical integration and summation were compared using data generated from standard normal and skewed populations. The method based on numerical integration was found to be less biased, especially at the two ends of the score distribution. This method can be implemented without the need to estimate trait levels for individual examinees, and it is less computationally intensive than the method based on summation.
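The numerical-integration route to a number-correct score distribution can be sketched with the standard Lord-Wingersky recursion: compute the conditional number-correct distribution at each quadrature point on the trait scale, then weight by the trait density. This is a generic 2PL sketch with a normal quadrature grid, not a reproduction of the article's implementation or the populations it studied:

```python
import numpy as np

def lw_recursion(p):
    """Lord-Wingersky recursion: number-correct distribution at a fixed
    trait value, given the per-item probabilities of a correct response p."""
    f = np.array([1.0])
    for pi in p:
        # a score of x arises from x given an incorrect response,
        # or from x-1 given a correct response
        f = np.concatenate([f * (1.0 - pi), [0.0]]) \
          + np.concatenate([[0.0], f * pi])
    return f

def observed_score_dist(b, a=None, nq=41):
    """Marginal number-correct distribution for a 2PL test, integrating the
    conditional distributions over a N(0, 1) trait density by quadrature."""
    a = np.ones_like(b) if a is None else a
    theta = np.linspace(-4.0, 4.0, nq)
    w = np.exp(-0.5 * theta**2)
    w /= w.sum()                       # normalized quadrature weights
    dist = np.zeros(len(b) + 1)
    for t, wt in zip(theta, w):
        p = 1.0 / (1.0 + np.exp(-a * (t - b)))
        dist += wt * lw_recursion(p)
    return dist

dist = observed_score_dist(np.array([-1.0, 0.0, 1.0]))  # sums to 1
```

No individual trait estimates are needed, which is the computational point the abstract makes: the score distribution comes straight from the item parameters and an assumed trait density.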
Index terms: equating, item response theory, numerical integration, observed-score equating.

Item: Scoring method and the detection of person misfit in a personality assessment context (1995). Reise, Steven P.
The purpose of this research was to explore psychometric issues pertinent to the application of an IRT-based person-fit (response aberrancy) detection statistic in the personality measurement domain. Monte Carlo data analyses were conducted to address issues regarding the person-fit statistic lz. The major issues explored were the characteristics of the null distribution of lz and its power to identify nonfitting response patterns under different scoring strategies. There were two main results. First, the lz null distribution was not well standardized when item parameters of personality scales were used; the lz null distribution variance was significantly less than the hypothesized value of 1.0 under several conditions. Second, the power of lz to detect response misfit was affected by the scoring method. Detection power was optimal when a biweight estimator of θ was used. Recommendations are made regarding the proper implementation of person-fit statistics in personality measurement.
Index terms: appropriateness measurement, item response theory, lz statistic, person fit, personality assessment, response aberrancy, scoring methods, two-parameter model.

Item: The effects of correlated errors on generalizability and dependability coefficients (1995). Bost, James E.
This study investigated the effects of correlated errors on the person × occasion design, in which the confounding effect of equal time intervals results in correlated error terms in the linear model. Two specific error correlation structures were examined: the first-order stationary autoregressive (SAR1), and the first-order nonstationary autoregressive (NAR1) with increasing variance parameters.
The effects of correlated errors on the existing generalizability and dependability coefficients were assessed by simulating data with known variances (six different combinations of person, occasion, and error variances), occasion sizes, person sizes, correlation parameters, and increasing variance parameters. Estimates derived from the simulated data were compared to their true values. The traditional estimates were acceptable when the error terms were not correlated and the error variances were equal. The coefficients were underestimated when the errors were uncorrelated with increasing error variances. However, when the errors were correlated with equal variances, the traditional formulas overestimated both coefficients. When the errors were correlated with increasing variances, the traditional formulas both overestimated and underestimated the coefficients. Finally, increasing the number of occasions sampled improved the generalizability coefficient estimates more than the dependability coefficient estimates.
Index terms: changing error variances, computer simulation, correlated errors, dependability coefficients, generalizability coefficients.

Item: The optimal degree of smoothing in equipercentile equating with postsmoothing (1995). Zeng, Lingjia
The effects of different degrees of smoothing on the results of equipercentile equating in the random groups design were investigated using a postsmoothing method based on cubic splines. A computer-based procedure for selecting a desirable degree of smoothing was introduced. The procedure is based on two criteria: (1) that the equating function is reasonably smooth, as evaluated by the second derivatives of the cubic spline functions, and (2) that the equated score distributions are close to that of the old form.
The equating functions obtained by smoothing the equipercentile equivalents with a fixed degree of smoothing and with a degree selected by the computer-based procedure were evaluated in computer simulations for four tests. The results suggest that no particular fixed degree of smoothing always led to an optimal degree of smoothing. The degrees of smoothing selected by the computer-based procedure were better than the best fixed degrees of smoothing for two of the four tests studied; for one of the other two tests, the degrees selected by the computer procedure performed better than or nearly as well as the best fixed degrees.
Index terms: computer simulation, cubic spline, equating, equipercentile equating, smoothing.

Item: A minimum χ² method for equating tests under the graded response model (1995). Kim, Seock-Ho; Cohen, Allan S.
The minimum χ² method for computing equating coefficients for tests with dichotomously scored items was extended to the case of Samejima's graded response items. The minimum χ² method was compared with the test response function method (also referred to as the test characteristic curve method), in which the equating coefficients are obtained by matching the test response functions of the two tests. The minimum χ² method was much less demanding computationally and yielded equating coefficients that differed little from those obtained using the test response function approach.
Index terms: equating, graded response model, item response theory, minimum χ² method, test response function method.

Item: Fitting polytomous item response theory models to multiple-choice tests (1995). Drasgow, Fritz; Levine, Michael V.; Tsien, Sherman; Williams, Bruce; Mead, Alan D.
This study examined how well current software implementations of four polytomous item response theory models fit several multiple-choice tests.
The models were Bock's (1972) nominal model, Samejima's (1979) multiple-choice Model C, Thissen & Steinberg's (1984) multiple-choice model, and Levine's (1993) maximum-likelihood formula scoring model. The parameters of the first three models were estimated with Thissen's (1986) MULTILOG computer program; Williams & Levine's (1993) FORSCORE program was used for Levine's model. Tests from the Armed Services Vocational Aptitude Battery, the Scholastic Aptitude Test, and the American College Test Assessment were analyzed. The models were fit in estimation samples of approximately 3,000; cross-validation samples of approximately 3,000 were used to evaluate goodness of fit. Both fit plots and χ² statistics were used to determine the adequacy of fit. Bock's model provided surprisingly good fit; adding parameters to the nominal model did not yield improvements in fit. FORSCORE provided generally good fit for Levine's nonparametric model across all tests.
Index terms: Bock's nominal model, FORSCORE, maximum likelihood formula scoring, MULTILOG, polytomous IRT.

Item: Effects of differing item parameters on closed-interval DIF statistics (1995). Feinstein, Zachary S.
The closed-interval signed area (CSA) and closed-interval unsigned area (CUA) statistics were studied by Monte Carlo simulation to detect differential item functioning when the reference and focal groups had different parameter distributions. When the pseudo-guessing parameter was varied, the CSA was better able than the CUA to detect moderate to large differences between the groups. However, the effect of the pseudo-guessing parameter varied depending on the item discriminations.
Index terms: closed-interval measures, differential item functioning, item response theory, Monte Carlo simulation, signed area measures, unsigned area measures.

Item: Distinctive and incompatible properties of two common classes of IRT models for graded responses (1995). Andrich, David
Two classes of models for graded responses, the first based on the work of Thurstone and the second based on the work of Rasch, are juxtaposed and shown to satisfy important, but mutually incompatible, criteria and to reflect different response processes. Specifically, in the Thurstone models, if adjacent categories are joined to form a new category, either before or after the data are collected, then the probability of a response in the new category is the sum of the probabilities of responses in the original categories. However, the model does not have the explicit property that if the categories are so joined, the estimate of the location of the entity or object being measured is invariant before and after the joining. For the Rasch models, if a pair of adjacent categories is joined before the data are collected, the estimate of the location of the entity is the same before and after the joining, but the probability of a response in the new category is not the sum of the probabilities of responses in the original categories. Furthermore, if data satisfy the model and the categories are joined after the data are collected, then the data no longer satisfy the same Rasch model with the smaller number of categories. These differences imply that the choice between these two classes of models for graded responses is not simply a matter of preference; they also permit a better understanding of the choice of models for graded response data as a function of the underlying processes they are intended to represent.
Index terms: graded responses, joining assumption, polytomous IRT models, Rasch model, Thurstone model.

Item: Conceptual notes on models for discrete polytomous item responses (1995). Mellenbergh, Gideon J.
The following types of discrete item responses are distinguished: nominal-dichotomous, ordinal-dichotomous, nominal-polytomous, and ordinal-polytomous. Bock (1972) presented a model for nominal-polytomous item responses that, when applied to dichotomous responses, yields Birnbaum's (1968) two-parameter logistic model. Applying Bock's model to ordinal-polytomous items leads to a conceptual problem: the ordinal nature of the response variable must be preserved, which can be achieved using three different methods. A number of existing models are derived using these three methods. The structure of these models is similar, but they differ in the interpretation and qualities of their parameters. Information, parameter invariance, log-odds differences invariance, and model violation are also discussed. Information and parameter invariance of dichotomous item response theory (IRT) also apply to polytomous IRT. Specific objectivity of the Rasch model for dichotomous items is a special case of log-odds differences invariance for polytomous items. Differential item functioning in dichotomous IRT is a special case of measurement model violation in polytomous IRT.
Index terms: adjacent categories, continuation ratios, cumulative probabilities, differential item functioning, log-odds differences invariance, measurement model violation, parameter invariance, polytomous IRT models.