Applied Psychological Measurement, Volume 18, 1994

  • Item
    Testing the Equality of Two Related Intraclass Reliability Coefficients
    (1994) Alsawaimeh, Yousef M.; Feldt, Leonard S.
    An approximate statistical test of the equality of two intraclass reliability coefficients based on the same sample of people is derived. Such a test is needed when a researcher wishes to compare the reliability of two measurement procedures and both procedures can be applied to the performances or products of the same group of individuals. A numerical example is presented. Monte Carlo studies indicate that the proposed test effectively controls Type I error with as few as two or three measurements on each of 50 people. Index terms: equality of related intraclass reliability coefficients, intraclass reliability, sampling theory, Spearman-Brown extrapolation, statistical test.
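The intraclass coefficients and Spearman-Brown extrapolation named in the index terms follow standard textbook forms. As a hedged illustration of those building blocks only (this is not the approximate equality test derived in the paper, and the function names are mine):

```python
def icc_single(ms_persons, ms_error, k):
    """Intraclass reliability of a single measurement, computed from
    one-way ANOVA mean squares with k measurements per person."""
    return (ms_persons - ms_error) / (ms_persons + (k - 1) * ms_error)

def spearman_brown(rho, k):
    """Spearman-Brown extrapolation: reliability of a measurement
    lengthened (or replicated) by a factor of k."""
    return k * rho / (1 + (k - 1) * rho)
```

For example, with persons mean square 10.0, error mean square 2.0, and k = 3 measurements, `icc_single` gives the single-measurement reliability, and `spearman_brown` projects what averaging k measurements would yield.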
  • Item
    A Conditional Item-Fit Index for Rasch Models
    (1994) Rost, Jürgen; Von Davier, Matthias
    A new item-fit index is proposed that is both a descriptive measure of deviance of single items and an index for statistical inference. This index is based on the assumptions of the dichotomous and polytomous Rasch models for items with ordered categories and, in particular, is a standardization of the conditional likelihood of the item pattern that does not depend on the item parameters. This approach is compared with other methods for determining item fit. In contrast to many other item-fit indexes, this index is not based on response-score residuals. Results of a simulation study illustrating the performance of the index are provided. An asymptotically normally distributed Z statistic is derived and an empirical example demonstrates the sensitivity of the index with respect to item and person heterogeneity. Index terms: appropriateness measurement, item discrimination, item fit, partial credit model, Rasch model.
  • Item
    A General Approach to Algorithmic Design of Fixed-Form Tests, Adaptive Tests, and Testlets
    (1994) Berger, Martijn P. F.
    The selection of items from a calibrated item bank for fixed-form tests is an optimal test design problem; this problem has been handled in the literature by mathematical programming models. A similar problem, however, arises when items are selected for an adaptive test or for testlets. This paper focuses on the similarities of optimal design of fixed-form tests, adaptive tests, and testlets within the framework of the general theory of optimal designs. A sequential design procedure is proposed that uses these similarities. This procedure not only enables optimal design of fixed-form tests, adaptive tests, and testlets, but is also very flexible. The procedure is easy to apply, and consistent estimates for the trait level distribution are obtained. Index terms: adaptive tests, consistency, efficiency, optimal test design, sequential procedure, test design, testlets.
  • Item
    Why Factor Analysis Often is the Incorrect Model for Analyzing Bipolar Concepts, and What Model to Use Instead
    (1994) Van Schuur, Wijbrandt H.; Kiers, Henk A.
    Factor analysis of data that conform to the unfolding model often results in an extra factor. This artificial extra factor is particularly important when data that conform to a bipolar unidimensional unfolding scale are factor analyzed. One bipolar dimension is expected, but two factors are found and often are interpreted as two unrelated dimensions. Although this extra factor phenomenon was pointed out in the early 1960s, it still is not widely recognized. The extra factor phenomenon in the unidimensional case is reviewed here. A numerical illustration is provided, and a number of diagnostics that can be used to determine whether data conform to the unidimensional unfolding model better than to the factor model are discussed. These diagnostics then are applied to an empirical example. Index terms: factor analysis, factor interpretation problems, rating scales, unfolding diagnostics, unfolding model.
  • Item
    Influence of Test and Person Characteristics on Nonparametric Appropriateness Measurement
    (1994) Meijer, Rob R.; Molenaar, Ivo W.; Sijtsma, Klaas
    Appropriateness measurement in nonparametric item response theory modeling is affected by the reliability of the items, the test length, the type of aberrant response behavior, and the percentage of aberrant persons in the group. The percentage of simulees defined a priori as aberrant responders that were detected increased when the mean item reliability, the test length, and the ratio of aberrant to nonaberrant simulees in the group increased. Also, simulees "cheating" on the most difficult items in a test were more easily detected than those "guessing" on all items. Results were less stable across replications as item reliability or test length decreased. Results suggest that relatively short tests of at least 17 items can be used for person-fit analysis if the items are sufficiently reliable. Index terms: aberrance detection, appropriateness measurement, nonparametric item response theory, person-fit, person-fit statistic U3.
  • Item
    A Simulation Study of Methods for Assessing Differential Item Functioning in Computerized Adaptive Tests
    (1994) Zwick, Rebecca; Thayer, Dorothy T.; Wingersky, Marilyn
    Simulated data were used to investigate the performance of modified versions of the Mantel-Haenszel method of differential item functioning (DIF) analysis in computerized adaptive tests (CATs). Each simulated examinee received 25 items from a 75-item pool. A three-parameter logistic item response theory (IRT) model was assumed, and examinees were matched on expected true scores based on their CAT responses and estimated item parameters. The CAT-based DIF statistics were found to be highly correlated with DIF statistics based on nonadaptive administration of all 75 pool items and with the true magnitudes of DIF in the simulation. Average DIF statistics and average standard errors also were examined for items with various characteristics. Finally, a study was conducted of the accuracy with which the modified Mantel-Haenszel procedure could identify CAT items with substantial DIF using a classification system now implemented by some testing programs. These additional analyses provided further evidence that the CAT-based DIF procedures performed well. More generally, the results supported the use of IRT-based matching variables in DIF analysis. Index terms: adaptive testing, computerized adaptive testing, differential item functioning, item bias, item response theory.
  • Item
    A comparison of item calibration media in computerized adaptive testing
    (1994) Hetter, Rebecca D.; Segall, Daniel O.; Bloxom, Bruce M.
    A concern in computerized adaptive testing is whether data for calibrating items can be collected from either a paper-and-pencil (P&P) or a computer administration of the items. Fixed blocks of power test items were administered by computer to one group of examinees and by P&P to a second group. These data were used to obtain computer-based and P&P-based three-parameter logistic model parameters of the items. Then each set of parameters was used to estimate item response theory pseudo-adaptive scores for a third group of examinees who had received all of the items by computer. The effect of medium of administration of the calibration items was assessed by comparative analyses of the adaptive scores using structural modeling. The results support the use of item parameters calibrated from either P&P or computer administrations for use in computerized adaptive power tests. The calibration medium did not appear to alter the constructs measured by the adaptive test or the reliability of the adaptive test scores Irndex. terms: computerized adaptive testing, item calibration, item parameter estimation, item response theory, medium of administration, trait level estimation.
  • Item
    A psychometric evaluation of 4-point and 6-point Likert-type scales in relation to reliability and validity
    (1994) Chang, Lei
    Reliability and validity of 4-point and 6-point scales were assessed using a new model-based approach to fit empirical data. Different measurement models were fit by confirmatory factor analyses of a multitrait-multimethod covariance matrix. 165 graduate students responded to nine items measuring three quantitative attitudes. Separation of method from trait variance led to greater reduction of reliability and heterotrait-monomethod coefficients for the 6-point scale than for the 4-point scale. Criterion-related validity was not affected by the number of scale points. The issue of selecting 4- versus 6-point scales may not be generally resolvable, but may rather depend on the empirical setting. Response conditions theorized to influence the use of scale options are discussed to provide directions for further research. Index terms: Likert-type scales, multitrait-multimethod matrix, reliability, scale options, validity.
  • Item
    An investigation of Lord's procedure for the detection of differential item functioning
    (1994) Kim, Seock-Ho; Cohen, Allan S.; Kim, Hae-Ok
    Type I error rates of Lord's χ² test for differential item functioning were investigated using Monte Carlo simulations. Two- and three-parameter item response theory (IRT) models were used to generate 50-item tests for samples of 250 and 1,000 simulated examinees. Item parameters were estimated using two algorithms (marginal maximum likelihood estimation and marginal Bayesian estimation) for three IRT models (the three-parameter model, the three-parameter model with a fixed guessing parameter, and the two-parameter model). Proportions of significant χ²s at selected nominal α levels were compared to those from joint maximum likelihood estimation as reported by McLaughlin & Drasgow (1987). Type I error rates for the three-parameter model consistently exceeded theoretically expected values. Results for the three-parameter model with a fixed guessing parameter and for the two-parameter model were consistently lower than expected values at the α levels in this study. Index terms: differential item functioning, item response theory, Lord's χ².
  • Item
    Estimation of reliability coefficients using the test information function and its modifications
    (1994) Samejima, Fumiko
    The reliability coefficient and the standard error of measurement in classical test theory are not properties of a specific test, but are attributed to both a specific test and a specific trait distribution. In latent trait models, or item response theory, the test information function (TIF) provides more precise local measures of accuracy in trait estimation than are available from the reliability coefficient. The reliability coefficient is still widely used, however, and is popular because of its simplicity. Thus, it is worthwhile to relate it to the TIF. In this paper, the reliability coefficient is predicted from the TIF, or two modified TIF formulas, and a specific trait distribution. Examples demonstrate the variability of the reliability coefficient across different trait distributions, and the results are compared with empirical reliability coefficients. Practical suggestions are given as to how to make better use of the reliability coefficient. Index terms: adaptive testing, bias, classical test theory, item information function, latent trait models, maximum likelihood estimation, reliability coefficient, standard error of measurement, test information function, trait estimation.
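To make the TIF-to-reliability link concrete, here is a sketch under simplifying assumptions of my own (a 2PL item model and a crude variance-ratio approximation to reliability); it is not Samejima's exact formulation or her modified TIF formulas:

```python
import math

def tif(theta, items):
    """Test information at trait value theta for 2PL items.
    `items` is a list of (a, b) discrimination/difficulty pairs;
    item information is a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total

def approx_reliability(items, thetas):
    """Rough reliability prediction: trait variance divided by trait
    variance plus the mean error variance 1/I(theta), averaged over
    trait values representing the assumed trait distribution."""
    n = len(thetas)
    mean = sum(thetas) / n
    var = sum((t - mean) ** 2 for t in thetas) / n
    err = sum(1.0 / tif(t, items) for t in thetas) / n
    return var / (var + err)
```

Feeding the same item pool two different `thetas` samples (say, a narrow and a wide trait distribution) illustrates the paper's point that the reliability coefficient is a property of the test and the trait distribution jointly, while `tif` alone is a property of the test.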
  • Item
    Distinguishing among parametric item response models for polychotomous ordered data
    (1994) Maydeu-Olivares, Albert; Drasgow, Fritz; Mead, Alan D.
    Several item response models have been proposed for fitting Likert-type data. Thissen & Steinberg (1986) classified most of these models into difference models and divide-by-total models. Although they have different mathematical forms, divide-by-total and difference models with the same number of parameters seem to provide very similar fit to the data. The ideal observer method was used to compare two models with the same number of parameters, Samejima's (1969) graded response model (a difference model) and Thissen & Steinberg's (1986) extension of Masters' (1982) partial credit model (a divide-by-total model), to investigate whether difference models or divide-by-total models should be preferred for fitting Likert-type data. The models were found to be very similar under the conditions investigated, which included scale lengths from 5 to 25 items (five-option items were used) and calibration samples of 250 to 3,000. The results suggest that both models fit approximately equally well in most practical applications. Index terms: graded response model, IRT, Likert scales, partial credit model, polychotomous models, psychometrics.
  • Item
    Creating a test information profile for a two-dimensional latent space
    (1994) Ackerman, Terry A.
    In some cognitive testing situations it is believed, despite reporting only a single score, that the test items differentiate levels of multiple traits. In such situations, the reported score may represent quite disparate composites of these multiple traits. Thus, when attempting to interpret a single score from a set of multidimensional items, several concerns naturally arise. First, it is important to know what composite of traits is being measured at all levels of the reported score scale. Second, it is also necessary to discern that all examinees, no matter where they lie in the latent trait space, are being measured on the same composite of traits. Thus, the role of multidimensionality in the interpretation or meaning given to various score levels must be examined. This paper presents a method for computing multidimensional information and provides examples of how different aspects of test information can be displayed graphically to form a profile of a test in a two-dimensional latent space. Index terms: information, item response theory, multidimensional item response theory, test information.
  • Item
    State mastery learning: Dynamic models for longitudinal data
    (1994) Langeheine, Rolf; Stern, Elsbeth; Van de Pol, Frank
    Macready & Dayton (1980) showed that state mastery models are handled optimally within the general latent class framework for data from a single time point. An extension of this idea is presented here for longitudinal data obtained from repeated measurements across time. The static approach is extended using multiple-indicator Markov chain models. The approach presented here emphasizes the dynamic aspects of the process of change, such as growth, decay, and stability. The general approach is presented, and models with purely categorical and ordered categorical states and several extensions of these models are discussed. Problems of estimation, identification, assessment of model fit, and hypothesis testing associated with these models also are discussed. The applicability of these models is demonstrated using data from a longitudinal study on solving arithmetic word problems. The advantages and disadvantages of using the approach presented here are discussed. Index terms: arithmetic word problems, dynamic latent class models, latent class models, longitudinal categorical data, Markov models, state mastery models.
  • Item
    Robust dual scaling with Tukey's biweight
    (1994) Sachs, John
    Use of the method of reciprocal biweighted means (MBM) for dealing with the outlier problem in dual scaling compared favorably with other robust estimation procedures, such as the method of trimmed reciprocal averages (MTA). Like the MTA, the MBM was easy to implement and it converged to a stable point when a two-step estimation procedure was used. One advantage of the MBM over the MTA was that it afforded greater control in fine tuning the final solution. Empirical results for four datasets, some containing multiple outliers, are presented. Index terms: biweight, dual scaling, outliers, reciprocal averages, robust estimation, Tukey's biweight.
  • Item
    The number of Guttman errors as a simple and powerful person-fit statistic
    (1994) Meijer, Rob R.
    A number of studies have examined the power of several statistics that can be used to detect examinees with unexpected (nonfitting) item score patterns, or to determine person fit. This study compared the power of the U3 statistic with the power of one of the simplest person-fit statistics, the sum of the number of Guttman errors. In most cases studied, (a weighted version of) the latter statistic performed as well as the U3 statistic. Counting the number of Guttman errors seems to be a useful and simple alternative to more complex statistics for determining person fit. Index terms: aberrance detection, appropriateness measurement, Guttman errors, nonparametric item response theory, person fit.
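As a minimal sketch of the counting rule this abstract describes (the unweighted count only; the paper's weighted variant is not reproduced, and the function name is illustrative), a Guttman error is any item pair in which a harder item is answered correctly while an easier item is answered incorrectly:

```python
def guttman_errors(scores, difficulty_order):
    """Count Guttman errors in a dichotomous item-score pattern.

    `scores` holds 0/1 item scores; `difficulty_order` lists item
    indices from easiest to hardest (e.g., by decreasing proportion
    correct). A Guttman error is any pair in which an easier item is
    answered incorrectly while a harder item is answered correctly.
    """
    ordered = [scores[i] for i in difficulty_order]  # easiest -> hardest
    errors = 0
    for a in range(len(ordered)):
        for b in range(a + 1, len(ordered)):
            if ordered[a] == 0 and ordered[b] == 1:
                errors += 1
    return errors
```

A perfect Guttman pattern such as 1, 1, 1, 0, 0 (all passes on the easiest items) yields zero errors; reversed patterns yield the maximum count, which is why large counts flag nonfitting examinees.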
  • Item
    Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning
    (1994) Narayanan, Pankaja; Swaminathan, H.
    Two nonparametric procedures for detecting differential item functioning (DIF), the Mantel-Haenszel (MH) procedure and the simultaneous item bias (SIB) procedure, were compared with respect to their Type I error rates and power. Data were simulated to reflect conditions varying in sample size, ability distribution differences between the focal and reference groups, proportion of DIF items in the test, DIF effect sizes, and type of item. 1,296 conditions were studied. The SIB and MH procedures were equally powerful in detecting uniform DIF for equal ability distributions. The SIB procedure was more powerful than the MH procedure in detecting DIF for unequal ability distributions. Both procedures had sufficient power to detect DIF for a sample size of 300 in each group. Ability distribution did not have a significant effect on the SIB procedure but did affect the MH procedure. This is important because ability distribution differences between two groups often are found in practice. The Type I error rates for the MH statistic were well within the nominal limits, whereas they were slightly higher than expected for the SIB statistic. Comparisons between the detection rates of the two procedures were made with respect to the various factors. Index terms: differential item functioning, Mantel-Haenszel statistic, power, simultaneous item bias statistic, SIBTEST, Type I error rates.
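The MH procedure compared in this study rests on the Mantel-Haenszel common odds ratio pooled over matched score levels, often reported on the ETS delta scale. A bare-bones sketch of those two standard quantities (the `tables` layout and function names are illustrative, not from the paper):

```python
import math

def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across matched score levels.

    `tables` is a list of (ref_right, ref_wrong, foc_right, foc_wrong)
    tuples, one 2x2 table per score level of the matching variable.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n  # reference-right x focal-wrong
        den += b * c / n  # reference-wrong x focal-right
    return num / den

def mh_delta(alpha):
    """ETS delta-scale transform of the common odds ratio;
    negative values indicate DIF against the focal group."""
    return -2.35 * math.log(alpha)
```

With a single score level in which 30 of 40 reference examinees and 20 of 40 focal examinees answer correctly, the pooled odds ratio is 3.0; an odds ratio of 1.0 (no DIF) maps to delta 0.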
  • Item
    The influence of conditioning scores in performing DIF analyses
    (1994) Ackerman, Terry A.; Evans, John A.
    The effect of the conditioning score on the results of differential item functioning (DIF) analyses was examined. Most DIF detection procedures match examinees from two groups of interest according to the examinees’ test score (e.g., number correct) and then summarize the performance differences across trait levels. DIF has the potential to occur whenever the conditioning criterion cannot account for the multidimensional interaction between items and examinees. Response data were generated from a two-dimensional item response theory model for a 30-item test in which items were measuring uniformly spaced composites of two latent trait parameters, θ₁ and θ₂. Two different DIF detection methods, the Mantel-Haenszel and the simultaneous item bias (SIBTEST) detection procedures, were used for three different sample size conditions. When the DIF procedures were conditioned on the number-correct score or on a transformation of θ₁ or θ₂, differential group performance followed hypothesized patterns. When the conditioning criterion was a function of both θ₁ and θ₂ (i.e., when the complete latent space was identified), DIF, as theory would suggest, was eliminated for all items. Index terms: construct validity, differential item functioning, item bias, Mantel-Haenszel procedure, SIBTEST.
  • Item
    Modeling developmental processes using latent growth structural equation methodology
    (1994) Duncan, Terry E.; Duncan, Susan C.; Stoolmiller, Mike
    Recent advances in latent growth modeling allow for the testing of complex models regarding developmental trends from both an inter- and intra-individual perspective. The interpretation of model parameters for the latent growth specification is illustrated with a simple two-factor model. An example application of latent growth methodology analyzing developmental change in adolescent alcohol consumption is presented. Findings are discussed with particular reference to the utility of latent growth curve models for assessing developmental processes at both the inter- and intra-individual level across a variety of behavioral domains. Index terms: alcohol consumption, change measurement, developmental models, growth measurement, latent growth models.
  • Item
    Standard errors of a chain of linear equatings
    (1994) Zeng, Lingjia; Hanson, Bradley A.; Kolen, Michael J.
    A general delta method is described for computing the standard error (SE) of a chain of linear equatings. The general delta method derives the SEs directly from the moments of the score distributions obtained in the equating chain. The partial derivatives of the chain equating function needed for computing the SEs are derived numerically. The method can be applied to equatings using the common-items nonequivalent populations design. Computer simulations were conducted to evaluate the SEs of a chain of two equatings using the Levine and Tucker methods. The general delta method was more accurate than a method that assumes the equating processes in the chain are statistically independent. Index terms: chain equating, delta method, equating, linear equating, standard error of equating.
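A chain of linear equatings is simply a composition of the individual linear links, which is what makes the chained function amenable to the delta method. The sketch below illustrates only the chaining under assumed score means and standard deviations (the delta-method standard errors derived in the paper are not reproduced, and the function names are mine):

```python
def linear_equating(mu_x, sd_x, mu_y, sd_y):
    """Return the linear function mapping Form X scores onto the
    Form Y scale by matching means and standard deviations."""
    return lambda x: mu_y + (sd_y / sd_x) * (x - mu_x)

def chain(*links):
    """Compose a chain of equating functions, applied left to right,
    so scores pass through each intermediate form in turn."""
    def equate(x):
        for link in links:
            x = link(x)
        return x
    return equate
```

For instance, equating Form A (mean 50, SD 10) to Form B (mean 60, SD 12) and Form B to Form C (mean 100, SD 15), `chain` carries a Form A score through both links; the paper's point is that the standard error of this composed function should account for dependence between the links rather than treating them as independent.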