Applied Psychological Measurement, Volume 20, 1996
Persistent link for this collection: https://hdl.handle.net/11299/114836
Browsing this collection by title; showing items 1-20 of 27.
Item: Are simple gain scores obsolete? (1996). Williams, Richard H.; Zimmerman, Donald W.
It is widely believed that measures of gain, growth, or change, expressed as simple differences between pretest and posttest scores, are inherently unreliable. It is also believed that gain scores lack predictive validity with respect to other criteria. However, these conclusions are based on misleading assumptions about the values of parameters in familiar equations in classical test theory. The present paper examines modified equations for the validity and reliability of difference scores that describe applied testing situations more realistically and reveal that simple gain scores can be more useful in research than commonly believed.
Index terms: change scores, difference scores, gain scores, measurement of growth, reliability, test theory, validity.
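For readers who want the classical result this abstract revisits, the textbook expression for the reliability of a simple difference D = Y - X is, in generic notation not taken from the article,

\[
\rho_{DD'} \;=\; \frac{\sigma_X^{2}\rho_{XX'} + \sigma_Y^{2}\rho_{YY'} - 2\rho_{XY}\sigma_X\sigma_Y}{\sigma_X^{2} + \sigma_Y^{2} - 2\rho_{XY}\sigma_X\sigma_Y}.
\]

Under the usual simplifying assumptions of equal standard deviations and equal reliabilities (\(\sigma_X=\sigma_Y\), \(\rho_{XX'}=\rho_{YY'}=\rho\)), this reduces to \((\rho-\rho_{XY})/(1-\rho_{XY})\), which approaches zero as the pretest-posttest correlation approaches the reliability; it is assumptions of this kind that the paper argues are unrealistic in applied settings.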
Item: Assembling tests for the measurement of multiple traits (1996). Van der Linden, Wim J.
For the measurement of multiple traits, this paper proposes assembling tests based on the targets for the (asymptotic) variance functions of the estimators of each of the traits. A linear programming model is presented that can be used to computerize the assembly process. Several cases of test assembly dealing with multidimensional traits are distinguished, and versions of the model applicable to each of these cases are discussed. An empirical example of a test assembly problem from a two-dimensional mathematics item pool is provided.
Index terms: asymptotic variance functions, linear programming, multidimensional IRT, test assembly, test design.

Item: An assessment of Stout's index of essential unidimensionality (1996). Hattie, John; Krakowski, Krzysztof; Rogers, H. Jane; Swaminathan, Hariharan
A simulation study was conducted to evaluate the dependability of Stout's T index of unidimensionality as used in his DIMTEST procedure. DIMTEST was found to dependably provide indications of unidimensionality, to be reasonably robust, and to allow for a practical demarcation between one and many dimensions. The procedure was not affected by the method used to identify the initial subset of unidimensional items. It was, however, found to be sensitive to whether the multidimensional data arose from a compensatory model or a partially compensatory model. DIMTEST failed when the matrix of tetrachoric correlations was non-Gramian and hence is not appropriate in such cases.
Index terms: DIMTEST, essential unidimensionality, factor analysis, item response models, Stout's test of unidimensionality, tetrachoric correlations, unidimensionality.

Item: Commentary on the Commentaries of Collins and Humphreys (1996). Williams, Richard H.; Zimmerman, Donald W.
The critiques of Collins (1996) and Humphreys (1996) throw light on properties of gain scores and difference scores that have led to controversies in the past. Collins' examples reveal that familiar formulas for the reliability of differences do not adequately reflect the precision of measures of change, because they do not allow for intraindividual change. Some additional examples are provided here, and a similar argument is applied to the reliability of a single test. As Collins implies, these arguments disclose flaws not only in the conventional approach to the reliability of gains and differences, but also in the basic concept of reliability in classical test theory.
Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.

Item: Computing elementary symmetric functions and their derivatives: A didactic (1996). Baker, Frank B.; Harwell, Michael R.
The computation of elementary symmetric functions and their derivatives is an integral part of conditional maximum likelihood estimation of item parameters under the Rasch model. The conditional approach has the advantages of parameter estimates that are consistent (assuming the model is correct) and statistically rigorous goodness-of-fit tests. Despite these characteristics, the conditional approach has been limited by problems in computing the elementary symmetric functions. The introduction of recursive formulas for computing these functions and the availability of modern computers have largely mitigated these problems; however, detailed documentation of how these formulas work is lacking. This paper describes how various recursion formulas work and how they are used to compute elementary symmetric functions and their derivatives. This information should promote a more thorough understanding of item parameter estimation in the Rasch model among both measurement specialists and practitioners.
Index terms: algorithms, computational techniques, conditional maximum likelihood, elementary symmetric functions, Rasch model.
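Because the Baker and Harwell entry is explicitly didactic, a compact sketch of the kind of recursion it documents may help orient readers. This is a generic illustration of the standard summation algorithm for elementary symmetric functions, not code from the article, and the variable names are invented.

```python
import numpy as np

def esf(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_n of the item
    'easiness' values in eps, built with the summation recursion:
    after item i is included, gamma_r <- gamma_r + eps_i * gamma_{r-1}."""
    gamma = np.zeros(len(eps) + 1)
    gamma[0] = 1.0
    for i, e in enumerate(eps, start=1):
        for r in range(i, 0, -1):   # high to low, so gamma[r-1] is still the old value
            gamma[r] += e * gamma[r - 1]
    return gamma

def esf_derivatives(eps):
    """d gamma_r / d eps_i equals the (r-1)-th elementary symmetric function
    of all values except eps_i; here it is obtained by simply rerunning the
    recursion with item i removed, which is adequate for a didactic example."""
    n = len(eps)
    deriv = np.zeros((n + 1, n))
    for i in range(n):
        deriv[1:, i] = esf(np.delete(eps, i))
    return deriv

# Toy check with eps = (0.5, 1.0, 2.0):
# gamma_1 = 3.5, gamma_2 = 0.5*1 + 0.5*2 + 1*2 = 3.5, gamma_3 = 1.0
print(esf(np.array([0.5, 1.0, 2.0])))    # [1.  3.5 3.5 1. ]
```

In conditional maximum likelihood estimation for the Rasch model, these functions and their derivatives appear in both the conditional likelihood and its gradient, which is why stable ways of computing them matter.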
Item: Conditional covariance-based nonparametric multidimensionality assessment (1996). Stout, William; Habing, Brian; Douglas, Jeff; Kim, Hae Rim; Roussos, Louis; Zhang, Jinming
According to the weak local independence approach to defining dimensionality, the fundamental quantities for determining a test's dimensional structure are the covariances of item-pair responses conditioned on examinee trait level. This paper describes three dimensionality assessment procedures (HCA/CCPROX, DIMTEST, and DETECT) that use estimates of these conditional covariances. All three procedures are nonparametric; that is, they do not depend on the functional form of the item response functions. These procedures are applied to a dimensionality study of the LSAT, which illustrates the capacity of the approaches to assess the lack of unidimensionality, identify groups of items manifesting approximate simple structure, determine the number of dominant dimensions, and measure the amount of multidimensionality.
Index terms: approximate simple structure, conditional covariance, DETECT, dimensionality, DIMTEST, HCA/CCPROX, hierarchical cluster analysis, IRT, LSAT, local independence, multidimensionality, simple structure.

Item: Detecting faking on a personality instrument using appropriateness measurement (1996). Zickar, Michael J.; Drasgow, Fritz
Research has demonstrated that people can and often do consciously manipulate scores on personality tests. Test constructors have responded by using social desirability and lying scales to identify dishonest respondents. Unfortunately, these approaches have had limited success. This study evaluated the use of appropriateness measurement for identifying dishonest respondents. A dataset was analyzed in which respondents were instructed either to answer honestly or to fake good. The item response theory approach classified a higher number of faking respondents at low rates of misclassification of honest respondents (false positives) than did a social desirability scale. At higher false positive rates, the social desirability approach did slightly better. Implications for operational testing and suggestions for further research are provided.
Index terms: appropriateness measurement, detecting faking, item response theory, lying scales, person fit, personality measurement.

Item: Developments in Multidimensional Item Response Theory (1996). Ackerman, Terry A.

Item: An empirical link of content and construct validity evidence (1996). Deville, Craig W.
Since the 1940s, measurement specialists have called for an empirical validation technique that combines content- and construct-related evidence. This study investigated the value of such a technique. A self-assessment instrument designed to cover four traditional foreign language skills was administered to 1,404 college-level foreign language students. Four subject-matter experts were asked to provide item dissimilarity judgments, using whatever criteria they thought appropriate. The data from the students and the experts were examined separately using multidimensional scaling followed by cluster and discriminant analyses. Results showed that the structure of the data underlying both the student and expert scaling solutions corresponded closely to that specified in the instrument blueprint. In addition, a canonical correlation comparison of the two scaling solutions revealed a high degree of similarity between them.
Index terms: canonical correlation, construct validity, content validity, item dissimilarities data, multidimensional scaling.

Item: A global information approach to computerized adaptive testing (1996). Chang, Hua-Hua; Ying, Zhiliang
Most item selection in computerized adaptive testing is based on Fisher information (or item information). At each stage, an item is selected to maximize the Fisher information at the currently estimated trait level (θ). However, this application of Fisher information can be much less efficient than assumed if the estimate is not close to the true θ, especially at early stages of an adaptive test, when the test length (number of items) is too short to provide an accurate estimate of the true θ. It is argued here that selection procedures based on global information should be used, at least at early stages of a test when θ estimates are not likely to be close to the true θ. For this purpose, an item selection procedure based on average global information is proposed. Results from pilot simulation studies comparing the usual maximum item information selection with the proposed global information approach are reported, indicating that the new method reduces bias and mean squared error under many circumstances.
Index terms: computerized adaptive testing, Fisher information, global information, information surface, item information, item response theory, Kullback-Leibler information, local information, test information.
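To make the contrast between local (Fisher) and global (Kullback-Leibler) item selection concrete, here is a minimal sketch under a 2PL model. The item pool, the interval width, and its shrinkage rate are illustrative assumptions, not values from the article.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher (item) information evaluated at a single trait value."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def kl_divergence(theta, theta0, a, b):
    """Kullback-Leibler divergence between the item response distributions
    at theta0 (the provisional estimate) and at theta."""
    p0, p = p_2pl(theta0, a, b), p_2pl(theta, a, b)
    return p0 * np.log(p0 / p) + (1.0 - p0) * np.log((1.0 - p0) / (1.0 - p))

def global_info(theta0, a, b, delta, n_grid=81):
    """'Average global information': KL divergence averaged over an interval
    around the provisional estimate instead of evaluated at a single point."""
    grid = np.linspace(theta0 - delta, theta0 + delta, n_grid)
    return np.mean(kl_divergence(grid, theta0, a, b))

# Compare the two selection rules on a small, made-up 2PL pool.
rng = np.random.default_rng(0)
a_pool = rng.uniform(0.8, 2.0, 50)
b_pool = rng.uniform(-2.5, 2.5, 50)

theta_hat, n_given = 0.3, 4          # unstable early-test estimate
delta = 3.0 / np.sqrt(n_given)       # interval narrows as the test lengthens (assumed rate)

pick_local = int(np.argmax(fisher_info(theta_hat, a_pool, b_pool)))
pick_global = int(np.argmax([global_info(theta_hat, a, b, delta)
                             for a, b in zip(a_pool, b_pool)]))
print(pick_local, pick_global)       # the two rules need not agree early in the test
```

The point of a global index is that it credits an item for being informative over a neighborhood of the provisional estimate rather than only at the point estimate, which matters most early in the test when that estimate is still unstable.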
Item: Graphical representation of multidimensional item response theory analyses (1996). Ackerman, Terry A.
This paper illustrates how graphical analyses can enhance the interpretation and understanding of multidimensional item response theory (IRT) analyses. Many unidimensional IRT concepts, such as item response functions and information functions, can be extended to multiple dimensions; however, as dimensionality increases, new problems and issues arise, most notably how to represent these features within a multidimensional framework. Examples are provided of several different graphical representations, including item response surfaces, information vectors, and centroid plots of conditional two-dimensional trait distributions. All graphs are intended to supplement quantitative and substantive analyses and thereby help the test practitioner determine more precisely such information as the construct validity of a test, the degree of measurement precision, and the consistency of interpretation of the number-correct score scale.
Index terms: dimensionality, graphical analysis, multidimensional item response theory, test analysis.

Item: Identification of items that show nonuniform DIF (1996). Narayanan, Pankaja; Swaminathan, H.
This study compared three procedures, the Mantel-Haenszel (MH), the simultaneous item bias (SIB), and the logistic regression (LR) procedures, with respect to their Type I error rates and power to detect nonuniform differential item functioning (DIF). Data were simulated to reflect a variety of conditions; the factors manipulated included sample size, ability distribution differences between the focal and reference groups, proportion of DIF items in the test, DIF effect sizes, and type of item, for a total of 384 conditions. The SIB and LR procedures were equally powerful in detecting nonuniform DIF under most conditions. The MH procedure was not very effective in identifying nonuniform DIF items that had disordinal interactions. The Type I error rates were within the expected limits for the MH procedure and were higher than expected for the SIB and LR procedures; the SIB results showed an overall increase of approximately 1% over the LR results.
Index terms: differential item functioning, logistic regression statistic, Mantel-Haenszel statistic, nondirectional DIF, simultaneous item bias statistic, SIBTEST, Type I error rate, unidirectional DIF.
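For orientation, the logistic regression procedure compared in this study is usually written, in generic notation, as a model for the probability of a correct response given a matching score \(X_j\) (for example, the total score) and a group indicator \(G_j\):

\[
\operatorname{logit} P(u_j = 1 \mid X_j, G_j) \;=\; \beta_0 + \beta_1 X_j + \beta_2 G_j + \beta_3 X_j G_j .
\]

Uniform DIF corresponds to \(\beta_2 \neq 0\) with \(\beta_3 = 0\); nonuniform DIF corresponds to \(\beta_3 \neq 0\), because the interaction term lets the group difference change in size or sign across the ability range. This is also why the MH procedure, which in effect tests for a constant odds ratio across score levels, struggles with the disordinal interactions mentioned in the abstract.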
Item: The influence of the presence of deviant item score patterns on the power of a person-fit statistic (1996). Meijer, Rob R.
Studies investigating the power of person-fit statistics often assume that the item parameters used to calculate the statistics are estimated in a sample without misfitting item score patterns. However, in practical test applications calibration samples are likely to contain such patterns. In the present study, the influence of the type and the number of misfitting patterns in the calibration sample on the detection rate of the ZU3 statistic was investigated by means of simulated data. An increase in the number of misfitting simulees resulted in a decrease in the power of ZU3. Furthermore, the type of misfit and the test length influenced the power of ZU3. The use of an iterative procedure to remove the misfitting patterns from the dataset was also investigated; results suggested that this method can be used to improve the power of ZU3.
Index terms: aberrance detection, appropriateness measurement, nonparametric item response theory, person fit, person-fit statistic ZU3.

Item: An investigation of the likelihood ratio test for detection of differential item functioning (1996). Cohen, Allan S.; Kim, Seock-Ho; Wollack, James A.
Type I error rates for the likelihood ratio test for detecting differential item functioning (DIF) were investigated using Monte Carlo simulations. Two- and three-parameter item response theory (IRT) models were used to generate 100 datasets of a 50-item test for samples of 250 and 1,000 simulated examinees for each IRT model. Item parameters were estimated by marginal maximum likelihood for three IRT models: the three-parameter model, the three-parameter model with a fixed guessing parameter, and the two-parameter model. All DIF comparisons were simulated by randomly pairing two samples from each sample size and IRT model condition so that, for each sample size and IRT model condition, there were 50 pairs of reference and focal groups. Type I error rates for the two-parameter model were within theoretically expected values at each of the α levels considered. Type I error rates for the three-parameter model and for the three-parameter model with a fixed guessing parameter, however, differed from the theoretically expected values at the α levels considered.
Index terms: bias, differential item functioning, item bias, item response theory, likelihood ratio test for DIF.
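The likelihood ratio statistic evaluated in this kind of study is conventionally formed by fitting a compact model, in which the studied item's parameters are constrained to be equal in the reference and focal groups, and an augmented model, in which they are free to differ (generic notation; the article's own notation may vary):

\[
G^{2} \;=\; -2\bigl[\ln L_{\text{compact}} - \ln L_{\text{augmented}}\bigr],
\]

which is referred to a chi-square distribution with degrees of freedom equal to the number of freed parameters. A simulated Type I error rate is then simply the proportion of no-DIF comparisons in which \(G^{2}\) exceeds the critical value at the chosen \(\alpha\), which is how the rates reported in the abstract can be read.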
Item: An investigation of the sampling distributions of equating coefficients (1996). Baker, Frank B.
Using the characteristic curve method for dichotomously scored test items, the sampling distributions of equating coefficients were examined. Simulated data for broad-range and screening tests were analyzed using three equating contexts and three anchor-item configurations in horizontal and vertical equating situations. The results indicated that the sampling distributions were bell-shaped and that their standard deviations were uniformly small. There were few differences in the forms of the distributions of the obtained equating coefficients as a function of the anchor-item configurations or type of test. For the equating contexts studied, the sampling distributions of the equating coefficients appear to have acceptable characteristics, suggesting confidence in the values obtained by the characteristic curve method.
Index terms: anchor items, characteristic curve method, common metric, equating coefficients, sampling distributions, test equating.

Item: Is reliability obsolete? A commentary on "Are simple gain scores obsolete?" (1996). Collins, Linda M.
Williams & Zimmerman (1996) provided much-needed clarification on the reliability of gain scores. This commentary translates these ideas into recognizable patterns of change that tend to produce reliable or unreliable gain scores. It also questions the relevance of the traditional idea of reliability to the measurement of change.
Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.

Item: Item response theory models and spurious interaction effects in factorial ANOVA designs (1996). Embretson, Susan E.
In psychological experiments, interaction effects in factorial analysis of variance (ANOVA) designs are often estimated using total scores derived from classical test theory. However, interaction effects can be reduced or eliminated by nonlinear monotonic transformations of a dependent variable. Although cross-over interactions cannot be eliminated by transformations, the meaningfulness of other interactions hinges on achieving a measurement scale level for which nonlinear transformations are inappropriate (i.e., at least interval level). Classical total test scores do not provide interval-level measurement according to contemporary item response theory (IRT). Nevertheless, IRT models are rarely applied to achieve more optimal measurement properties and hence more meaningful interaction effects. This paper describes several conditions under which interaction effects estimated from classical total scores, rather than IRT trait scores, can be misleading. Using asymptotic expectations derived from an IRT model, interaction effects of zero on the IRT trait scale were often not estimated as zero on the total score scale. Further, when nonzero interactions were specified on the IRT trait scale, the estimated interaction effects were biased inward when estimated from the total score scale. Test difficulty level determined both the direction and the magnitude of the biased interaction effects.
Index terms: factorial designs, interaction effects, interval measurement, item response theory, level of measurement, measurement scales, statistical inference.
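A minimal numerical illustration of the effect Embretson describes, using made-up 2PL item parameters and cell means (nothing here is taken from the article): a 2 x 2 design with exactly zero interaction on the trait scale acquires a nonzero interaction once the cell means are passed through the test characteristic curve, i.e., mapped to expected number-correct scores.

```python
import numpy as np

def expected_total(theta, a, b):
    """Test characteristic curve: expected number-correct score at each theta
    for a set of 2PL items (a nonlinear but monotone transformation)."""
    theta = np.asarray(theta, dtype=float).reshape(-1, 1)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return p.sum(axis=1)

# A deliberately hard test: item difficulties sit above the group means.
a_params = np.full(30, 1.2)
b_params = np.linspace(0.5, 2.5, 30)

# 2 x 2 cell means on the trait scale (A1B1, A1B2, A2B1, A2B2):
# additive main effects only, so the interaction contrast is exactly zero.
theta_means = np.array([0.0, 0.5, 1.0, 1.5])
contrast = np.array([1.0, -1.0, -1.0, 1.0])

print(contrast @ theta_means)                                    # 0.0 on the trait scale
print(contrast @ expected_total(theta_means, a_params, b_params))
# nonzero on the total-score scale: a spurious interaction, whose sign depends
# on whether the test is hard or easy relative to the groups being compared
```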
Item: Linear dependence of gain scores on their components imposes constraints on their use and interpretation: Comment on "Are simple gain scores obsolete?" (1996). Humphreys, Lloyd G.
The properties of gain scores are linearly determined by the properties of their components. Thus, the reliability of a gain is uniquely determined by the reliabilities of the components, the correlation between them, and their standard deviations. Reliability is not inherently low, but the components of gains used in many investigations make low reliability likely. Correlations of the difference between two measures with a third variate are likewise determined uniquely by three correlations and two standard deviations. Raw score standard deviations frequently tell more about the measurement metric and how it is used than about the psychological processes underlying the measurements. Correlations involving gains/differences cannot be understood adequately unless the essential sample statistics of the components are known and reported.
Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.

Item: Linking multidimensional item calibrations (1996). Davey, Tim; Oshima, T. C.; Lee, Kevin
Invariance of trait scales across changing samples of items and examinees is a central tenet of item response theory (IRT). However, the scales defined by most IRT models are determined only up to certain linear transformations of the parameters. The problem is therefore to find the particular transformation that places separate calibrations on a common scale. A variety of procedures for estimating such transformations have been proposed for unidimensional models. This paper explores some of the issues involved in extending and adapting unidimensional linking procedures for use with multidimensional IRT models.
Index terms: equating, item response theory, linking, metric in IRT, multidimensional IRT, scale linking.

Item: Longitudinal models of reliability and validity: A latent curve approach (1996). Tisak, John; Tisak, Marie S.
The concepts of reliability and validity and their associated coefficients have typically been restricted to a single measurement occasion. This paper describes dynamic generalizations of reliability and validity that incorporate longitudinal or developmental models, using latent curve analysis. Initially, a latent curve model is formulated to depict change. This longitudinal model is then incorporated into the classical definitions of reliability and validity. This approach permits the separation of constancy or change from the indexes of reliability and validity. Statistical estimation and hypothesis testing can be achieved using standard structural equation modeling computer programs. These longitudinal models of reliability and validity are demonstrated on sociological and psychological data.
Index terms: concurrent validity, dynamic models, dynamic true score, latent curve analysis, latent trajectory, predictive validity, reliability, validity.
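To indicate the kind of dynamic generalization the Tisak and Tisak abstract refers to, one common way to write a latent curve model and an occasion-specific reliability derived from it is sketched below; the notation and the specific coefficient are generic illustrations, not necessarily the definitions adopted in the article.

\[
y_{it} \;=\; \eta_{0i} + \lambda_t\,\eta_{1i} + \varepsilon_{it},
\qquad
\rho_t \;=\; \frac{\operatorname{Var}(\eta_{0i} + \lambda_t\eta_{1i})}{\operatorname{Var}(\eta_{0i} + \lambda_t\eta_{1i}) + \operatorname{Var}(\varepsilon_{it})}
\;=\; \frac{\psi_{00} + 2\lambda_t\psi_{01} + \lambda_t^{2}\psi_{11}}{\psi_{00} + 2\lambda_t\psi_{01} + \lambda_t^{2}\psi_{11} + \theta_{t}} .
\]

Here \(\eta_{0i}\) and \(\eta_{1i}\) are the individual level and change factors, \(\lambda_t\) the basis coefficients tracing the curve, \(\psi\) their covariance matrix, and \(\theta_t\) the occasion-specific error variance. Writing reliability this way keeps constancy or change in the latent curve separate from the precision of measurement at each occasion, which is the separation the abstract highlights.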