Applied Psychological Measurement, Volume 20, 1996


Recent Submissions

  • Item
    Linking multidimensional item calibrations
    (1996) Davey, Tim; Oshima, T. C.; Lee, Kevin
    Invariance of trait scales across changing samples of items and examinees is a central tenet of item response theory (IRT). However, scales defined by most IRT models are truly invariant with respect to certain linear transformations of the parameters. The problem is to find the proper transformation that places separate calibrations on a common scale. A variety of procedures for estimating transformations have been proposed for unidimensional models. This paper explores some issues involved in extending and adapting unidimensional linking procedures for use with multidimensional IRT models. Index terms: equating, item response theory, linking, metric in IRT, multidimensional IRT, scale linking.
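    The indeterminacy described above can be illustrated in the simpler unidimensional case. The sketch below assumes a two-parameter logistic (2PL) model, which the abstract does not name, and uses hypothetical slope and intercept values `A` and `B`. It shows that rescaling the trait as theta* = A*theta + B, with the matching item-parameter transformation, leaves response probabilities unchanged.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rescale(theta, a, b, A, B):
    """Apply theta* = A*theta + B and the matching item-parameter transformation."""
    theta_star = A * theta + B
    a_star = a / A        # discrimination shrinks as the scale stretches
    b_star = A * b + B    # difficulty moves with the trait scale
    return theta_star, a_star, b_star

# Hypothetical person and item parameters, and a hypothetical linking transformation.
theta, a, b = 0.5, 1.2, -0.3
A, B = 1.7, 0.4

theta_s, a_s, b_s = rescale(theta, a, b, A, B)
p_original = p_2pl(theta, a, b)
p_rescaled = p_2pl(theta_s, a_s, b_s)
print(p_original, p_rescaled)  # the two probabilities agree up to rounding
```

    Linking two separate calibrations amounts to estimating A and B from items or examinees common to both; the multidimensional case replaces the scalar A with a matrix, which is where the issues explored in the paper arise.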
  • Item
    Multidimensional computerized adaptive testing in a certification or licensure context
    (1996) Luecht, Richard M.
    Multidimensional item response theory (MIRT) computerized adaptive testing, building on recent work by Segall (1996), is applied in a licensing/certification context. An example of a medical licensure test is used to demonstrate situations in which complex, integrated content must be balanced at the total test level for validity reasons, but items assigned to reportable subscore categories may be used under a MIRT adaptive paradigm to improve the reliability of the subscores. A heuristic optimization framework is outlined that generalizes to both univariate and multivariate statistical objective functions, with additional systems of constraints included to manage the content balancing or other test specifications on adaptively constructed test forms. Simulation results suggested that a multivariate treatment of the problem, although complicating somewhat the objective function used and the estimation of traits, nonetheless produces advantages from a psychometric perspective. Index terms: adaptive testing, computerized adaptive testing, information functions, licensure testing, multidimensional item response theory, sequential testing.
  • Item
    Assembling tests for the measurement of multiple traits
    (1996) Van der Linden, Wim J.
    For the measurement of multiple traits, this paper proposes assembling tests based on the targets for the (asymptotic) variance functions of the estimators of each of the traits. A linear programming model is presented that can be used to computerize the assembly process. Several cases of test assembly dealing with multidimensional traits are distinguished, and versions of the model applicable to each of these cases are discussed. An empirical example of a test assembly problem from a two-dimensional mathematics item pool is provided. Index terms: asymptotic variance functions, linear programming, multidimensional IRT, test assembly, test design.
  • Item
    A multidimensionality-based DIF analysis paradigm
    (1996) Roussos, Louis; Stout, William
    A multidimensionality-based differential item functioning (DIF) analysis paradigm is presented that unifies the substantive and statistical DIF analysis approaches by linking both to a theoretically sound and mathematically rigorous multidimensional conceptualization of DIF. This paradigm has the potential (1) to improve understanding of the causes of DIF by formulating and testing substantive dimensionality-based DIF hypotheses; (2) to reduce Type I error through a better understanding of the possible multidimensionality of an appropriate matching criterion; and (3) to increase power through the testing of bundles of items measuring similar dimensions. Using this approach, DIF analysis is shown to have the potential for greater integration in the overall test development process. Index terms: bias, bundle DIF, cluster analysis, DIF estimation, DIF hypothesis testing, differential item functioning, dimensionality, DIMTEST, item response theory, multidimensionality, sensitivity review, SIBTEST.
  • Item
    Conditional covariance-based nonparametric multidimensionality assessment
    (1996) Stout, William; Habing, Brian; Douglas, Jeff; Kim, Hae Rim; Roussos, Louis; Zhang, Jinming
    According to the weak local independence approach to defining dimensionality, the fundamental quantities for determining a test’s dimensional structure are the covariances of item-pair responses conditioned on examinee trait level. This paper describes three dimensionality assessment procedures (HCA/CCPROX, DIMTEST, and DETECT) that use estimates of these conditional covariances. All three procedures are nonparametric; that is, they do not depend on the functional form of the item response functions. These procedures are applied to a dimensionality study of the LSAT, which illustrates the capacity of the approaches to assess the lack of unidimensionality, identify groups of items manifesting approximate simple structure, determine the number of dominant dimensions, and measure the amount of multidimensionality. Index terms: approximate simple structure, conditional covariance, DETECT, dimensionality, DIMTEST, HCA/CCPROX, hierarchical cluster analysis, IRT, LSAT, local independence, multidimensionality, simple structure.
  • Item
    Graphical representation of multidimensional item response theory analyses
    (1996) Ackerman, Terry A.
    This paper illustrates how graphical analyses can enhance the interpretation and understanding of multidimensional item response theory (IRT) analyses. Many unidimensional IRT concepts, such as item response functions and information functions, can be extended to multiple dimensions; however, as dimensionality increases, new problems and issues arise, most notably how to represent these features within a multidimensional framework. Examples are provided of several different graphical representations, including item response surfaces, information vectors, and centroid plots of conditional two-dimensional trait distributions. All graphs are intended to supplement quantitative and substantive analyses and thereby assist the test practitioner in determining more precisely such information as the construct validity of a test, the degree of measurement precision, and the consistency of interpretation of the number-correct score scale. Index terms: dimensionality, graphical analysis, multidimensional item response theory, test analysis.
  • Item
    Commentary on the Commentaries of Collins and Humphreys
    (1996) Williams, Richard H.; Zimmerman, Donald W.
    The critiques of Collins (1996) and Humphreys (1996) certainly throw light on properties of gain scores and difference scores that have led to controversies in the past. Collins’ examples reveal that familiar formulas for the reliability of differences do not adequately reflect the precision of measures of change, because they do not allow for intraindividual change. Some additional examples are provided here, and a similar argument is applied to the reliability of a single test. As Collins implies, these arguments indeed disclose flaws, not only in the conventional approach to the reliability of gains and differences, but also in the basic concept of reliability in classical test theory. Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.
  • Item
    Linear dependence of gain scores on their components imposes constraints on their use and interpretation: Comment on "Are simple gain scores obsolete?"
    (1996) Humphreys, Lloyd G.
    The properties of gain scores are linearly determined by the properties of their components. Thus, the reliability of a gain is uniquely determined by the reliabilities of the components, the correlation between them, and their standard deviations. Reliability is not inherently low, but the components of gains used in many investigations make low reliability likely. Correlations of the difference between two measures and a third variate are also determined uniquely by three correlations and two standard deviations. Raw score standard deviations frequently tell more about the measurement metric and how it is used than about the psychological processes underlying the measurements. Correlations involving gains/differences cannot be understood adequately unless the essential sample statistics of the components are known and reported. Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.
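    The dependence described above can be written out with the standard classical test theory formulas for a difference D = Y - X. The sketch below assumes uncorrelated measurement errors; the function names and the numeric inputs are illustrative and are not taken from the paper.

```python
import math

def gain_reliability(rel_x, rel_y, r_xy, sd_x, sd_y):
    """Reliability of D = Y - X from component reliabilities, their correlation, and SDs."""
    cov_xy = r_xy * sd_x * sd_y
    true_var = rel_x * sd_x**2 + rel_y * sd_y**2 - 2.0 * cov_xy
    observed_var = sd_x**2 + sd_y**2 - 2.0 * cov_xy
    return true_var / observed_var

def gain_correlation(r_xy, r_xz, r_yz, sd_x, sd_y):
    """Correlation of D = Y - X with a third variate Z (three correlations, two SDs)."""
    sd_d = math.sqrt(sd_x**2 + sd_y**2 - 2.0 * r_xy * sd_x * sd_y)
    return (r_yz * sd_y - r_xz * sd_x) / sd_d

# Hypothetical values: equal SDs, component reliabilities of .80.
moderate = gain_reliability(0.8, 0.8, 0.5, 1.0, 1.0)   # pretest-posttest r = .50
high = gain_reliability(0.8, 0.8, 0.7, 1.0, 1.0)       # pretest-posttest r = .70
print(round(moderate, 3), round(high, 3))
```

    With these inputs the gain reliability drops as the correlation between the components rises, which is the pattern the abstract attributes to the components typically used rather than to gain scores as such.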
  • Item
    Longitudinal models of reliability and validity: A latent curve approach
    (1996) Tisak, John; Tisak, Marie S.
    The concepts of reliability and validity and their associated coefficients typically have been restricted to a single measurement occasion. This paper describes dynamic generalizations of reliability and validity that will incorporate longitudinal or developmental models, using latent curve analysis. Initially a latent curve model is formulated to depict change. This longitudinal model is then incorporated into the classical definitions of reliability and validity. This approach permits the separation of constancy or change from the indexes of reliability and validity. Statistical estimation and hypothesis testing may be achieved using standard structural equations modeling computer programs. These longitudinal models of reliability and validity are demonstrated on sociological and psychological data. Index terms: concurrent validity, dynamic models, dynamic true score, latent curve analysis, latent trajectory, predictive validity, reliability, validity.
  • Item
    Is reliability obsolete? A commentary on "Are simple gain scores obsolete?"
    (1996) Collins, Linda M.
    Williams & Zimmerman (1996) provided much-needed clarification on the reliability of gain scores. This commentary translates these ideas into recognizable patterns of change that tend to produce reliable or unreliable gain scores. It also questions the relevance of the traditional idea of reliability to the measurement of change. Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.
  • Item
    Identification of items that show nonuniform DIF
    (1996) Narayanan, Pankaja; Swaminathan, H.
    This study compared three procedures, the Mantel-Haenszel (MH), the simultaneous item bias (SIB), and the logistic regression (LR) procedures, with respect to their Type I error rates and power to detect nonuniform differential item functioning (DIF). Data were simulated to reflect a variety of conditions: The factors manipulated included sample size, ability distribution differences between the focal and the reference groups, proportion of DIF items in the test, DIF effect sizes, and type of item. A total of 384 conditions were studied. Both the SIB and LR procedures were equally powerful in detecting nonuniform DIF under most conditions. The MH procedure was not very effective in identifying nonuniform DIF items that had disordinal interactions. The Type I error rates were within the expected limits for the MH procedure and were higher than expected for the SIB and LR procedures; the SIB results showed an overall increase of approximately 1% over the LR results. Index terms: differential item functioning, logistic regression statistic, Mantel-Haenszel statistic, nondirectional DIF, simultaneous item bias statistic, SIBTEST, Type I error rate, unidirectional DIF.
  • Item
    A unidimensional item response model for unfolding responses from a graded disagree-agree response scale
    (1996) Roberts, James S.; Laughlin, James E.
    Binary or graded disagree-agree responses to attitude items are often collected for the purpose of attitude measurement. Although such data are sometimes analyzed with cumulative measurement models, recent studies suggest that unfolding models are more appropriate (Roberts, 1995; van Schuur & Kiers, 1994). Advances in item response theory (IRT) have led to the development of several parametric unfolding models for binary data (Andrich, 1988; Andrich & Luo, 1993; Hoijtink, 1991); however, IRT models for unfolding graded responses have not been proposed. A parametric IRT model for unfolding either binary or graded responses is developed here. The graded unfolding model (GUM) is a generalization of Andrich & Luo’s hyperbolic cosine model for binary data. A joint maximum likelihood procedure was implemented to estimate GUM parameters, and a subsequent recovery simulation showed that reasonably accurate estimates could be obtained with minimal data demands (e.g., as few as 100 respondents and 15 to 20 six-category items). The applicability of the GUM to common attitude testing situations is illustrated with real data on student attitudes toward capital punishment. Index terms: attitude measurement, graded unfolding model, hyperbolic cosine model, ideal point process, item response theory, Likert scale, Thurstone scale, unfolding model, unidimensional scaling.
  • Item
    A global information approach to computerized adaptive testing
    (1996) Chang, Hua-Hua; Ying, Zhiliang
    Most item selection in computerized adaptive testing is based on Fisher information (or item information). At each stage, an item is selected to maximize the Fisher information at the currently estimated trait level (θ). However, this application of Fisher information could be much less efficient than assumed if the estimators are not close to the true θ, especially at early stages of an adaptive test when the test length (number of items) is too short to provide an accurate estimate for true θ. It is argued here that selection procedures based on global information should be used, at least at early stages of a test when θ estimates are not likely to be close to the true θ. For this purpose, an item selection procedure based on average global information is proposed. Results from pilot simulation studies comparing the usual maximum item information item selection with the proposed global information approach are reported, indicating that the new method leads to improvement in terms of bias and mean squared error reduction under many circumstances. Index terms: computerized adaptive testing, Fisher information, global information, information surface, item information, item response theory, Kullback-Leibler information, local information, test information.
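    The contrast between local and global selection criteria can be sketched as follows. The code assumes a 2PL model and a hypothetical three-item pool; the averaged Kullback-Leibler criterion is a simplified grid average over an interval of half-width `delta` around the current estimate, not the authors' exact index.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Item (Fisher) information at a single trait level."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def kl_info(theta, theta_hat, a, b):
    """Kullback-Leibler divergence of the response distribution at theta_hat from that at theta."""
    p0, p1 = p_2pl(theta_hat, a, b), p_2pl(theta, a, b)
    return p0 * math.log(p0 / p1) + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p1))

def avg_kl_info(theta_hat, a, b, delta, n_points=41):
    """Average KL information over an evenly spaced grid on [theta_hat - delta, theta_hat + delta]."""
    step = 2.0 * delta / (n_points - 1)
    grid = [theta_hat - delta + step * k for k in range(n_points)]
    return sum(kl_info(t, theta_hat, a, b) for t in grid) / n_points

def select_item(pool, score):
    """Return the index of the item in the pool with the highest score."""
    return max(range(len(pool)), key=lambda i: score(*pool[i]))

# Hypothetical pool of (a, b) pairs and a provisional trait estimate.
pool = [(2.0, 0.0), (0.8, 0.0), (2.0, 1.5)]
theta_hat = 0.0

best_local = select_item(pool, lambda a, b: fisher_info(theta_hat, a, b))
best_global = select_item(pool, lambda a, b: avg_kl_info(theta_hat, a, b, delta=3.0))
print(best_local, best_global)
```

    The local criterion depends only on the information at the point estimate, whereas the averaged criterion rewards items that discriminate across the whole interval in which the true θ plausibly lies; shrinking `delta` as the test proceeds moves the second criterion toward the first.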
  • Item
    Item response theory models and spurious interaction effects in factorial ANOVA designs
    (1996) Embretson, Susan E.
    In many psychological experiments, interaction effects in factorial analysis of variance (ANOVA) designs are often estimated using total scores derived from classical test theory. However, interaction effects can be reduced or eliminated by nonlinear monotonic transformations of a dependent variable. Although cross-over interactions cannot be eliminated by transformations, the meaningfulness of other interactions hinges on achieving a measurement scale level for which nonlinear transformations are inappropriate (i.e., at least interval scale level). Classical total test scores do not provide interval level measurement according to contemporary item response theory (IRT). Nevertheless, rarely are IRT models applied to achieve more optimal measurement properties and hence more meaningful interaction effects. This paper provides several conditions under which interaction effects that are estimated from classical total scores, rather than IRT trait scores, can be misleading. Using derived asymptotic expectations from an IRT model, interaction effects of zero on the IRT trait scale were often not estimated as zero from the total score scale. Further, when nonzero interactions were specified on the IRT trait scale, the estimated interaction effects were biased inward when estimated from the total score scale. Test difficulty level determined both the direction and the magnitude of the biased interaction effects. Index terms: factorial designs, interaction effects, interval measurement, item response theory, level of measurement, measurement scales, statistical inference.
  • Item
    Computing elementary symmetric functions and their derivatives: A didactic
    (1996) Baker, Frank B.; Harwell, Michael R.
    The computation of elementary symmetric functions and their derivatives is an integral part of conditional maximum likelihood estimation of item parameters under the Rasch model. The conditional approach has the advantages of parameter estimates that are consistent (assuming the model is correct) and statistically rigorous goodness-of-fit tests. Despite these characteristics, the conditional approach has been limited by problems in computing the elementary symmetric functions. The introduction of recursive formulas for computing these functions and the availability of modern computers have largely mediated these problems; however, detailed documentation of how these formulas work is lacking. This paper describes how various recursion formulas work and how they are used to compute elementary symmetric functions and their derivatives. The availability of this information should promote a more thorough understanding of item parameter estimation in the Rasch model among both measurement specialists and practitioners. Index terms: algorithms, computational techniques, conditional maximum likelihood, elementary symmetric functions, Rasch model.
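    As an illustration of the kind of recursion discussed, the sketch below implements the simple summation recursion, in which items are added one at a time and each order r is updated from order r - 1. It is not necessarily one of the specific formulas treated in the paper, and the function names and example values are hypothetical. The inputs are item parameters on the multiplicative scale, eps_i = exp(-b_i).

```python
import math
from itertools import combinations

def esf(eps):
    """Elementary symmetric functions gamma_0..gamma_n via the summation recursion."""
    gamma = [1.0] + [0.0] * len(eps)
    for i, e in enumerate(eps, start=1):
        # Update from high order to low so gamma[r - 1] still excludes the new item.
        for r in range(i, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def esf_without(eps, i):
    """ESFs with item i omitted; entry r - 1 is the derivative of gamma_r w.r.t. eps[i]."""
    return esf(eps[:i] + eps[i + 1:])

def esf_brute(eps, r):
    """Direct definition: sum of products over all subsets of size r (for checking only)."""
    return sum(math.prod(c) for c in combinations(eps, r))

eps = [1.0, 2.0, 3.0]          # hypothetical item parameters
gamma = esf(eps)               # [1.0, 6.0, 11.0, 6.0]
first_partials = esf_without(eps, 0)  # [1.0, 5.0, 6.0]
print(gamma, first_partials)
```

    The recursion needs on the order of n² multiplications, whereas the direct definition enumerates every subset of items, which is why recursive formulas made the conditional approach practical for tests of realistic length.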
  • Item
    Multidimensional Rasch models for partial-credit scoring
    (1996) Kelderman, Henk
    Rasch models for partial-credit scoring are discussed and a multidimensional version of the model is formulated. A model may be specified in which consecutive item responses depend on an underlying latent trait. In the multidimensional partial-credit model, different responses may be explained by different latent traits. Data from van Kuyk’s (1988) size concept test and the Raven Progressive Matrices test were analyzed. Maximum likelihood estimation and goodness-of-fit testing are discussed and applied to these datasets. Goodness-of-fit statistics show that for both tests, multidimensional partial-credit models were more appropriate than the unidimensional partial-credit model. Index terms: χ² testing, exponential family model, multidimensional item response theory, multidimensional Rasch model, partial-credit models, Progressive Matrices test, Rasch model.
  • Item
    The influence of the presence of deviant item score patterns on the power of a person-fit statistic
    (1996) Meijer, Rob R.
    Studies investigating the power of person-fit statistics often assume that the item parameters that are used to calculate the statistics are estimated in a sample without misfitting item score patterns. However, in practical test applications calibration samples likely will contain such patterns. In the present study, the influence of the type and the number of misfitting patterns in the calibration sample on the detection rate of the ZU3 statistic was investigated by means of simulated data. An increase in the number of misfitting simulees resulted in a decrease in the power of ZU3. Furthermore, the type of misfit and the test length influenced the power of ZU3. The use of an iterative procedure to remove the misfitting patterns from the dataset was investigated. Results suggested that this method can be used to improve the power of ZU3. Index terms: aberrance detection, appropriateness measurement, nonparametric item response theory, person fit, person-fit statistic ZU3.
  • Item
    An empirical link of content and construct validity evidence
    (1996) Deville, Craig W.
    Since the 1940s, measurement specialists have called for an empirical validation technique that combines content- and construct-related evidence. This study investigated the value of such a technique. A self-assessment instrument designed to cover four traditional foreign language skills was administered to 1,404 college-level foreign language students. Four subject-matter experts were asked to provide item dissimilarity judgments, using whatever criteria they thought appropriate. The data from the students and the experts were examined separately using multidimensional scaling followed by cluster and discriminant analyses. Results showed that the structure of the data underlying both the student and expert scaling solutions corresponded closely to that specified in the instrument blueprint. In addition, using canonical correlation, a comparison of the two scaling solutions revealed a high degree of similarity in the two solutions. Index terms: canonical correlation, construct validity, content validity, item dissimilarities data, multidimensional scaling.
  • Item
    Monte Carlo studies in item response theory
    (1996) Harwell, Michael; Stone, Clement A.; Hsu, Tse-Chi; Kirisci, Levent
    Monte Carlo studies are being used in item response theory (IRT) to provide information about how validly these methods can be applied to realistic datasets (e.g., small numbers of examinees and multidimensional data). This paper describes the conditions under which Monte Carlo studies are appropriate in IRT-based research, the kinds of problems these techniques have been applied to, available computer programs for generating item responses and estimating item and examinee parameters, and the importance of conceptualizing these studies as statistical sampling experiments that should be subject to the same principles of experimental design and data analysis that pertain to empirical studies. The number of replications that should be used in these studies is also addressed. Index terms: analysis of variance, experimental design, item response theory, Monte Carlo techniques, multiple regression.