Applied Psychological Measurement, Volume 20, 1996
Persistent link for this collection: https://hdl.handle.net/11299/114836
Browsing Applied Psychological Measurement, Volume 20, 1996 by Issue Date
Now showing 1 - 20 of 27
Item: Computing elementary symmetric functions and their derivatives: A didactic (1996). Baker, Frank B.; Harwell, Michael R.
The computation of elementary symmetric functions and their derivatives is an integral part of conditional maximum likelihood estimation of item parameters under the Rasch model. The conditional approach has the advantages of parameter estimates that are consistent (assuming the model is correct) and statistically rigorous goodness-of-fit tests. Despite these characteristics, the conditional approach has been limited by problems in computing the elementary symmetric functions. The introduction of recursive formulas for computing these functions and the availability of modern computers have largely mitigated these problems; however, detailed documentation of how these formulas work is lacking. This paper describes how various recursion formulas work and how they are used to compute elementary symmetric functions and their derivatives. The availability of this information should promote a more thorough understanding of item parameter estimation in the Rasch model among both measurement specialists and practitioners. Index terms: algorithms, computational techniques, conditional maximum likelihood, elementary symmetric functions, Rasch model.

Item: Linear dependence of gain scores on their components imposes constraints on their use and interpretation: Comment on "Are simple gain scores obsolete?" (1996). Humphreys, Lloyd G.
The properties of gain scores are linearly determined by the properties of their components. Thus, the reliability of a gain is uniquely determined by the reliabilities of the components, the correlation between them, and their standard deviations. Reliability is not inherently low, but the components of gains used in many investigations make low reliability likely. Correlations of the difference between two measures and a third variate are also determined uniquely by three correlations and two standard deviations.
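The summation recursion that Baker & Harwell document can be illustrated with a minimal sketch (an illustration of the standard recursion, not code from the article; `eps` is assumed to hold the exponentiated item easiness parameters):

```python
def elementary_symmetric(eps):
    """Compute gamma_0..gamma_n, the elementary symmetric functions of
    eps = (eps_1, ..., eps_n), by adding one item at a time:
    adding item i updates gamma_r <- gamma_r + eps_i * gamma_{r-1}."""
    gamma = [1.0]  # gamma_0 = 1 for an empty set of items
    for e in eps:
        gamma.append(0.0)  # the new highest-order function starts at 0
        # update in place from high order to low so each gamma_{r-1}
        # still refers to the value before item e was added
        for r in range(len(gamma) - 1, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma
```

Because the derivative of gamma_r with respect to eps_i is the order-(r-1) elementary symmetric function of the remaining items, the derivatives discussed in the paper can be obtained by rerunning the same recursion with item i omitted.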
Raw score standard deviations frequently tell more about the measurement metric and how it is used than about the psychological processes underlying the measurements. Correlations involving gains/differences cannot be understood adequately unless the essential sample statistics of the components are known and reported. Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.

Item: An investigation of the likelihood ratio test for detection of differential item functioning (1996). Cohen, Allan S.; Kim, Seock-Ho; Wollack, James A.
Type I error rates for the likelihood ratio test for detecting differential item functioning (DIF) were investigated using monte carlo simulations. Two- and three-parameter item response theory (IRT) models were used to generate 100 datasets of a 50-item test for samples of 250 and 1,000 simulated examinees for each IRT model. Item parameters were estimated by marginal maximum likelihood for three IRT models: the three-parameter model, the three-parameter model with a fixed guessing parameter, and the two-parameter model. All DIF comparisons were simulated by randomly pairing two samples from each sample size and IRT model condition so that, for each sample size and IRT model condition, there were 50 pairs of reference and focal groups. Type I error rates for the two-parameter model were within theoretically expected values at each of the α levels considered. Type I error rates for the three-parameter model and the three-parameter model with a fixed guessing parameter, however, differed from the theoretically expected values at the α levels considered.
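Type I error rates estimated from 50 replicate comparisons, as in the Cohen, Kim, & Wollack design, are themselves monte carlo estimates with nontrivial sampling error. A generic sketch of their binomial standard error (not the article's own analysis):

```python
import math

def type_i_se(alpha, n_replications):
    """Binomial standard error of an estimated Type I error rate:
    the rejection proportion observed over n null replications."""
    return math.sqrt(alpha * (1 - alpha) / n_replications)
```

At a nominal α of .05 with 50 replications the standard error is about .031, so observed rates anywhere from roughly 0 to .11 are consistent with the nominal level; judging whether an observed rate "differs from the theoretically expected value" has to take this imprecision into account.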
Index terms: bias, differential item functioning, item bias, item response theory, likelihood ratio test for DIF.

Item: Multidimensional computerized adaptive testing in a certification or licensure context (1996). Luecht, Richard M.
Multidimensional item response theory (MIRT) computerized adaptive testing, building on recent work by Segall (1996), is applied in a licensing/certification context. An example of a medical licensure test is used to demonstrate situations in which complex, integrated content must be balanced at the total test level for validity reasons, but items assigned to reportable subscore categories may be used under a MIRT adaptive paradigm to improve the reliability of the subscores. A heuristic optimization framework is outlined that generalizes to both univariate and multivariate statistical objective functions, with additional systems of constraints included to manage the content balancing or other test specifications on adaptively constructed test forms. Simulation results suggested that a multivariate treatment of the problem, although it somewhat complicates the objective function and the estimation of traits, nonetheless produces advantages from a psychometric perspective. Index terms: adaptive testing, computerized adaptive testing, information functions, licensure testing, multidimensional item response theory, sequential testing.

Item: Assembling tests for the measurement of multiple traits (1996). Van der Linden, Wim J.
For the measurement of multiple traits, this paper proposes assembling tests based on the targets for the (asymptotic) variance functions of the estimators of each of the traits. A linear programming model is presented that can be used to computerize the assembly process. Several cases of test assembly dealing with multidimensional traits are distinguished, and versions of the model applicable to each of these cases are discussed.
An empirical example of a test assembly problem from a two-dimensional mathematics item pool is provided. Index terms: asymptotic variance functions, linear programming, multidimensional IRT, test assembly, test design.

Item: An investigation of the sampling distributions of equating coefficients (1996). Baker, Frank B.
Using the characteristic curve method for dichotomously scored test items, the sampling distributions of equating coefficients were examined. Simulated data for broad-range and screening tests were analyzed using three equating contexts and three anchor-item configurations in horizontal and vertical equating situations. The results indicated that the sampling distributions were bell-shaped and their standard deviations were uniformly small. There were few differences in the forms of the distributions of the obtained equating coefficients as a function of the anchor-item configurations or type of test. For the equating contexts studied, the sampling distributions of the equating coefficients appear to have acceptable characteristics, suggesting confidence in the values obtained by the characteristic curve method. Index terms: anchor items, characteristic curve method, common metric, equating coefficients, sampling distributions, test equating.

Item: Monte carlo studies in item response theory (1996). Harwell, Michael; Stone, Clement A.; Hsu, Tse-Chi; Kirisci, Levent
Monte carlo studies are being used in item response theory (IRT) to provide information about how validly these methods can be applied to realistic datasets (e.g., small numbers of examinees and multidimensional data).
This paper describes the conditions under which monte carlo studies are appropriate in IRT-based research, the kinds of problems these techniques have been applied to, available computer programs for generating item responses and estimating item and examinee parameters, and the importance of conceptualizing these studies as statistical sampling experiments that should be subject to the same principles of experimental design and data analysis that pertain to empirical studies. The number of replications that should be used in these studies is also addressed. Index terms: analysis of variance, experimental design, item response theory, monte carlo techniques, multiple regression.

Item: Is reliability obsolete? A commentary on "Are simple gain scores obsolete?" (1996). Collins, Linda M.
Williams & Zimmerman (1996) provided much-needed clarification on the reliability of gain scores. This commentary translates these ideas into recognizable patterns of change that tend to produce reliable or unreliable gain scores. It also questions the relevance of the traditional idea of reliability to the measurement of change. Index terms: change scores, classical test theory, difference scores, gain scores, intraindividual differences, measurement of growth, reliability, test theory, validity.

Item: A multidimensionality-based DIF analysis paradigm (1996). Roussos, Louis; Stout, William
A multidimensionality-based differential item functioning (DIF) analysis paradigm is presented that unifies the substantive and statistical DIF analysis approaches by linking both to a theoretically sound and mathematically rigorous multidimensional conceptualization of DIF.
This paradigm has the potential (1) to improve understanding of the causes of DIF by formulating and testing substantive dimensionality-based DIF hypotheses; (2) to reduce Type I error through a better understanding of the possible multidimensionality of an appropriate matching criterion; and (3) to increase power through the testing of bundles of items measuring similar dimensions. Using this approach, DIF analysis is shown to have the potential for greater integration in the overall test development process. Index terms: bias, bundle DIF, cluster analysis, DIF estimation, DIF hypothesis testing, differential item functioning, dimensionality, DIMTEST, item response theory, multidimensionality, sensitivity review, SIBTEST.

Item: A unidimensional item response model for unfolding responses from a graded disagree-agree response scale (1996). Roberts, James S.; Laughlin, James E.
Binary or graded disagree-agree responses to attitude items are often collected for the purpose of attitude measurement. Although such data are sometimes analyzed with cumulative measurement models, recent studies suggest that unfolding models are more appropriate (Roberts, 1995; van Schuur & Kiers, 1994). Advances in item response theory (IRT) have led to the development of several parametric unfolding models for binary data (Andrich, 1988; Andrich & Luo, 1993; Hoijtink, 1991); however, IRT models for unfolding graded responses have not been proposed. A parametric IRT model for unfolding either binary or graded responses is developed here. The graded unfolding model (GUM) is a generalization of Andrich & Luo's hyperbolic cosine model for binary data. A joint maximum likelihood procedure was implemented to estimate GUM parameters, and a subsequent recovery simulation showed that reasonably accurate estimates could be obtained with minimal data demands (e.g., as few as 100 respondents and 15 to 20 six-category items).
The applicability of the GUM to common attitude testing situations is illustrated with real data on student attitudes toward capital punishment. Index terms: attitude measurement, graded unfolding model, hyperbolic cosine model, ideal point process, item response theory, Likert scale, Thurstone scale, unfolding model, unidimensional scaling.

Item: Identification of items that show nonuniform DIF (1996). Narayanan, Pankaja; Swaminathan, H.
This study compared three procedures (the Mantel-Haenszel [MH], the simultaneous item bias [SIB], and the logistic regression [LR] procedures) with respect to their Type I error rates and power to detect nonuniform differential item functioning (DIF). Data were simulated to reflect a variety of conditions: The factors manipulated included sample size, ability distribution differences between the focal and the reference groups, proportion of DIF items in the test, DIF effect sizes, and type of item. In all, 384 conditions were studied. Both the SIB and LR procedures were equally powerful in detecting nonuniform DIF under most conditions. The MH procedure was not very effective in identifying nonuniform DIF items that had disordinal interactions. The Type I error rates were within the expected limits for the MH procedure and were higher than expected for the SIB and LR procedures; the SIB results showed an overall increase of approximately 1% over the LR results. Index terms: differential item functioning, logistic regression statistic, Mantel-Haenszel statistic, nondirectional DIF, simultaneous item bias statistic, SIBTEST, Type I error rate, unidirectional DIF.

Item: A study of a network-flow algorithm and a noncorrecting algorithm for test assembly (1996). Armstrong, R. D.; Jones, D. H.; Li, Xuan; Wu, Ing-Long
The network-flow algorithm (NFA) of Armstrong, Jones, & Wu (1992) and the average growth approximation algorithm (AGAA) of Luecht & Hirsch (1992) were evaluated as methods for automated test assembly.
The algorithms were used on ACT and ASVAB item banks, with and without error in the item parameters. Both algorithms matched a target test information function on the ACT item bank, both before and after error was introduced. The NFA matched the target on the ASVAB item bank; however, the AGAA did not, even without error in this item bank. The AGAA is a noncorrecting algorithm, and it made poor item selections early in the search process when using the ASVAB item bank. The NFA corrects for nonoptimal choices with a simplex search. The results indicate that reasonable error in item parameters is not harmful for test assembly using the NFA or AGAA on certain types of item banks. Index terms: algorithmic test construction, automated test assembly, greedy algorithm, heuristic algorithm, item response theory, marginal maximum likelihood, mathematical programming, simulation, test construction.

Item: The influence of the presence of deviant item score patterns on the power of a person-fit statistic (1996). Meijer, Rob R.
Studies investigating the power of person-fit statistics often assume that the item parameters that are used to calculate the statistics are estimated in a sample without misfitting item score patterns. However, in practical test applications calibration samples likely will contain such patterns. In the present study, the influence of the type and the number of misfitting patterns in the calibration sample on the detection rate of the ZU3 statistic was investigated by means of simulated data. An increase in the number of misfitting simulees resulted in a decrease in the power of ZU3. Furthermore, the type of misfit and the test length influenced the power of ZU3. The use of an iterative procedure to remove the misfitting patterns from the dataset was investigated. Results suggested that this method can be used to improve the power of ZU3.
Index terms: aberrance detection, appropriateness measurement, nonparametric item response theory, person fit, person-fit statistic ZU3.

Item: An assessment of Stout's index of essential unidimensionality (1996). Hattie, John; Krakowski, Krzysztof; Rogers, H. Jane; Swaminathan, Hariharan
A simulation study was conducted to evaluate the dependability of Stout's T index of unidimensionality as used in his DIMTEST procedure. DIMTEST was found to dependably provide indications of unidimensionality, to be reasonably robust, and to allow for a practical demarcation between one and many dimensions. The procedure was not affected by the method used to identify the initial subset of unidimensional items. It was, however, found to be sensitive to whether the multidimensional data arose from a compensatory model or a partially compensatory model. DIMTEST failed when the matrix of tetrachoric correlations was non-Gramian and hence is not appropriate in such cases. Index terms: DIMTEST, essential unidimensionality, factor analysis, item response models, Stout's test of unidimensionality, tetrachoric correlations, unidimensionality.

Item: Are simple gain scores obsolete? (1996). Williams, Richard H.; Zimmerman, Donald W.
It is widely believed that measures of gain, growth, or change, expressed as simple differences between pretest and posttest scores, are inherently unreliable. It is also believed that gain scores lack predictive validity with respect to other criteria. However, these conclusions are based on misleading assumptions about the values of parameters in familiar equations in classical test theory. The present paper examines modified equations for the validity and reliability of difference scores that describe applied testing situations more realistically and reveal that simple gain scores can be more useful in research than commonly believed.
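The classical-test-theory dependence that Humphreys and Williams & Zimmerman both discuss can be made concrete with the textbook formula for the reliability of a difference D = X − Y (a standard result, not code from either article):

```python
def gain_reliability(rxx, ryy, rxy, sx, sy):
    """Classical reliability of the gain/difference score D = X - Y,
    determined entirely by the component reliabilities (rxx, ryy),
    their correlation (rxy), and their standard deviations (sx, sy)."""
    num = sx**2 * rxx + sy**2 * ryy - 2 * rxy * sx * sy  # true-score variance of D
    den = sx**2 + sy**2 - 2 * rxy * sx * sy              # observed variance of D
    return num / den
```

For example, with equally reliable components (.80, equal standard deviations) correlated .80, the gain reliability is 0; dropping the correlation to .50 raises it to .60, which illustrates the papers' shared point that gain scores are not inherently unreliable but depend on the component statistics.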
Index terms: change scores, difference scores, gain scores, measurement of growth, reliability, test theory, validity.

Item: Longitudinal models of reliability and validity: A latent curve approach (1996). Tisak, John; Tisak, Marie S.
The concepts of reliability and validity and their associated coefficients typically have been restricted to a single measurement occasion. This paper describes dynamic generalizations of reliability and validity that incorporate longitudinal or developmental models, using latent curve analysis. Initially a latent curve model is formulated to depict change. This longitudinal model is then incorporated into the classical definitions of reliability and validity. This approach permits the separation of constancy or change from the indexes of reliability and validity. Statistical estimation and hypothesis testing can be achieved using standard structural equation modeling computer programs. These longitudinal models of reliability and validity are demonstrated on sociological and psychological data. Index terms: concurrent validity, dynamic models, dynamic true score, latent curve analysis, latent trajectory, predictive validity, reliability, validity.

Item: A quadratic curve equating method to equate the first three moments in equipercentile equating (1996). Wang, Tianyou; Kolen, Michael J.
A quadratic curve test equating method for equating different test forms under a random-groups data collection design is proposed. This new method extends the linear equating method by adding a quadratic term to the linear function and equating the first three central moments (mean, standard deviation, and skewness) of the test forms. Procedures for implementing the method and related issues are described and discussed. The quadratic curve method was evaluated using real test data and simulated data in terms of model fit and equating error, and was compared to linear equating and to unsmoothed and smoothed equipercentile equating.
It was found that the quadratic curve method fit most of the real test data examined and that when the model fit the population, this method could perform at least as well as, and often better than, the other equating methods studied. Index terms: equating, equipercentile equating, linear equating, model-based equating, quadratic curve equating, random-groups equating design, smoothing procedures.

Item: Conditional covariance-based nonparametric multidimensionality assessment (1996). Stout, William; Habing, Brian; Douglas, Jeff; Kim, Hae Rim; Roussos, Louis; Zhang, Jinming
According to the weak local independence approach to defining dimensionality, the fundamental quantities for determining a test's dimensional structure are the covariances of item-pair responses conditioned on examinee trait level. This paper describes three dimensionality assessment procedures (HCA/CCPROX, DIMTEST, and DETECT) that use estimates of these conditional covariances. All three procedures are nonparametric; that is, they do not depend on the functional form of the item response functions. These procedures are applied to a dimensionality study of the LSAT, which illustrates the capacity of the approaches to assess the lack of unidimensionality, identify groups of items manifesting approximate simple structure, determine the number of dominant dimensions, and measure the amount of multidimensionality. Index terms: approximate simple structure, conditional covariance, DETECT, dimensionality, DIMTEST, HCA/CCPROX, hierarchical cluster analysis, IRT, LSAT, local independence, multidimensionality, simple structure.

Item: Detecting faking on a personality instrument using appropriateness measurement (1996). Zickar, Michael J.; Drasgow, Fritz
Research has demonstrated that people can and often do consciously manipulate scores on personality tests. Test constructors have responded by using social desirability and lying scales in order to identify dishonest respondents.
Unfortunately, these approaches have had limited success. This study evaluated the use of appropriateness measurement for identifying dishonest respondents. A dataset was analyzed in which respondents were instructed either to answer honestly or to fake good. The item response theory approach classified a higher number of faking respondents at low rates of misclassification of honest respondents (false positives) than did a social desirability scale. At higher false positive rates, the social desirability approach did slightly better. Implications for operational testing and suggestions for further research are provided. Index terms: appropriateness measurement, detecting faking, item response theory, lying scales, person fit, personality measurement.

Item: Developments in Multidimensional Item Response Theory (1996). Ackerman, Terry A.