Applied Psychological Measurement, Volume 17, 1993

Persistent link for this collection

https://hdl.handle.net/11299/114833

Scale shrinkage in vertical equating
(1993) Camilli, Gregory; Yamamoto, Kentaro; Wang, Ming-mei
As an alternative to equipercentile equating in the area of multilevel achievement test batteries, item response theory (IRT) vertical equating has produced unexpected results. When expanded standard scores were obtained to link the Comprehensive Test of Basic Skills and the California Achievement Test, the variance of test scores diminished both within particular grade levels from fall to spring, and also from lower to upper grade levels. Equipercentile equating, on the other hand, has resulted in increasing variance both within and across grade levels, although the increases are not linear across grade levels. Three potential causes of scale shrinkage are discussed, and a more comprehensive, model-based approach to establishing vertical scales is described. Test data from the National Assessment of Educational Progress were used to estimate the distribution of ability at grades 4, 8, and 12 for several math achievement subtests. For each subtest, the variance of scores increased from grade 4 to grade 8; however, beyond grade 8 the results were not uniform. Index terms: developmental scores, equating, IRT scaling, maximum likelihood estimation, National Assessment of Educational Progress (NAEP), scale shrinkage, vertical equating.
Effect of estimation method on incremental fit indexes for covariance structure models
(1993) Sugawara, Hazuki M.; MacCallum, Robert C.
In a typical study involving covariance structure modeling, fit of a model or a set of alternative models is evaluated using several indicators of fit under one estimation method, usually maximum likelihood. This study examined the stability across estimation methods of incremental and nonincremental fit measures that use the information about the fit of the most restricted (null) model as a reference point in assessing the fit of a more substantive model to the data. A set of alternative models for a large empirical dataset was analyzed by asymptotically distribution-free, generalized least squares, maximum likelihood, and ordinary least squares estimation methods. Four incremental and four nonincremental fit indexes were compared. Incremental indexes were quite unstable across estimation methods: maximum likelihood and ordinary least squares solutions indicated better fit of a given model than asymptotically distribution-free and generalized least squares solutions. The cause of this phenomenon is explained and illustrated, and implications and recommendations for practice are discussed. Index terms: covariance structure models, goodness of fit, incremental fit index, maximum likelihood estimation, parameter estimation, structural equation models.
Information functions of the generalized partial credit model
(1993) Muraki, Eiji
The concept of information functions developed for dichotomous item response models is adapted for the partial credit model. The information function is explained in terms of the model parameters and scoring functions. The relationship between the item information function and the item response function also is discussed. The information function then is used to investigate the effect of collapsing and recoding categories of polytomously-scored items of the National Assessment of Educational Progress (NAEP). The NAEP writing items were calibrated and the item and test information is used to discuss desirable properties of polytomous items. Index terms: information function, item response model, National Assessment of Educational Progress (NAEP), partial credit model, polytomous item response model.
Methodology review: Statistical approaches for assessing measurement bias
(1993) Millsap, Roger E.; Everson, Howard T.
Statistical methods developed over the last decade for detecting measurement bias in psychological and educational tests are reviewed. Earlier methods for assessing measurement bias generally have been replaced by more sophisticated statistical techniques, such as the Mantel-Haenszel procedure, the standardization approach, logistic regression models, and item response theory approaches. The review employs a conceptual framework that distinguishes methods of detecting measurement bias based on either observed or unobserved conditional invariance models. Although progress has been made in the development of statistical methods for detecting measurement bias, issues related to the choice of matching variable, the nonuniform nature of measurement bias, the suitability of current approaches for new and emerging performance assessment methods, and insights into the causes of measurement bias remain elusive. Clearly, psychometric solutions to the problems of measurement bias will further understanding of the more central issue of construct validity. The continuing development of statistical methods for detecting and understanding the causes of measurement bias will continue to be an important scientific challenge. Index terms: bias detection, differential item functioning, item bias, measurement bias, test bias.
A method for severely constrained item selection in adaptive testing
(1993) Stocking, Martha L.; Swanson, Len
Previous attempts at incorporating expert test construction practices into computerized adaptive testing paradigms are described. A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests. Two examples, one for a verbal measure and the other for a quantitative measure, are provided of the successful use of the proposed method in designing adaptive tests. Index terms: adaptive test design, computerized adaptive testing, constrained adaptive testing, expert systems, test assembly algorithms.
Detection of differential item functioning in the graded response model
(1993) Cohen, Allan S.; Kim, Seock-Ho; Baker, Frank B.
Methods for detecting differential item functioning (DIF) have been proposed primarily for the item response theory dichotomous response model. Three measures of DIF for the dichotomous response model are extended to include Samejima’s graded response model: two measures based on area differences between item true score functions, and a χ² statistic for comparing differences in item parameters. An illustrative example is presented. Index terms: differential item functioning, graded response model, item response theory.
A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses
(1993) Andrich, David; Luo, Guanzhong
Social-psychological variables are typically measured using either cumulative or unfolding response processes. In the former, the greater the location of a person relative to the location of a stimulus on the continuum, the greater the probability of a positive response; in the latter, the closer the location of the person to the location of the statement, irrespective of direction, the greater the probability of a positive response. Formal probability models for these processes are, respectively, monotonically increasing and single-peaked as a function of the location of the person relative to the location of the statement. In general, these models have been considered to be independent of each other. However, if statements constructed on the basis of a cumulative model have three ordered response categories, the response function within the statement for the middle category is in fact single-peaked. Using this observation, a unidimensional model for responses to statements that have an unfolding structure was constructed from the cumulative Rasch model for ordered response categories. A location and unit of measurement parameter exist for each statement. A joint maximum likelihood estimation procedure was investigated. Analysis of a small simulation study and a small real dataset showed that the model is readily applicable. Index terms: attitude measurement, item response theory, latent trait theory, Rasch models, unfolding data, unidimensional scaling.
Equating tests under the nominal response model
(1993) Baker, Frank B.
Under item response theory, test equating involves finding the coefficients of a linear transformation of the metric of one test to that of another. A procedure for finding these equating coefficients when the items in the two tests are nominally scored was developed. A quadratic loss function based on the differences between response category probabilities in the two tests is employed. The gradients of this loss function needed by the iterative multivariate search procedure used to obtain the equating coefficients were derived for the nominal response case. Examples of both horizontal and vertical equating are provided. The empirical results indicated that tests scored under a nominal response model can be placed on a common metric in both horizontal and vertical equatings. Index terms: characteristic curve, equating, item response theory, nominal response model, quadratic loss function.
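As a rough illustration of what "the coefficients of a linear transformation" means here, the sketch below rescales a nominally scored item and checks that its category probabilities are unchanged. It assumes a slope-intercept form of the nominal response model, with category logit a_k·θ + c_k, and the transformation θ* = A·θ + K; the abstract states neither, and the function names and numeric values are hypothetical. It does not implement the quadratic-loss search that estimates A and K.

```python
import math

def nrm_probs(theta, slopes, intercepts):
    """Category probabilities under an assumed slope-intercept nominal model."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(z)  # subtract the max logit for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def rescale_item(slopes, intercepts, A, K):
    """Move item parameters onto the metric theta* = A * theta + K."""
    new_slopes = [a / A for a in slopes]
    new_intercepts = [c - (a / A) * K for a, c in zip(slopes, intercepts)]
    return new_slopes, new_intercepts

# Hypothetical three-category item and equating coefficients
slopes, intercepts = [-0.8, 0.1, 0.7], [0.3, 0.2, -0.5]
A, K = 1.2, -0.4
theta = 0.5

p_old = nrm_probs(theta, slopes, intercepts)
new_slopes, new_intercepts = rescale_item(slopes, intercepts, A, K)
p_new = nrm_probs(A * theta + K, new_slopes, new_intercepts)
```

Under these assumptions p_old and p_new agree, which is the invariance that a loss function on category probabilities can exploit.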
Standard errors of Levine linear equating
(1993) Hanson, Bradley A.; Zeng, Lingjia; Kolen, Michael J.
The delta method was used to derive standard errors (SEs) of the Levine observed score and Levine true score linear equating methods. SEs with a normality assumption as well as without a normality assumption were derived. Data from two forms of a test were used as an example to evaluate the derived SEs of equating. Bootstrap SEs also were computed for the purpose of comparison. The SEs derived without the normality assumption and the bootstrap SEs were very close. For the skewed score distributions, the SEs derived with the normality assumption differed from the SEs derived without the normality assumption and the bootstrap SEs. Index terms: equating, delta method, linear equating, score equating, standard errors of equating.
Estimating rater agreement in 2 × 2 tables: Correction for chance and intraclass correlation
(1993) Blackman, Nicole J.-M.; Koval, John J.
Many estimators of the measure of agreement between two dichotomous ratings of a person have been proposed. The results of Fleiss (1975) are extended, and it is shown that four estimators, namely Scott’s (1955) π coefficient, Cohen’s (1960) κ̂, Maxwell & Pilliner’s (1968) r₁₁, and Mak’s (1988) ρ̃, are interpretable both as chance-corrected measures of agreement and as intraclass correlation coefficients for different ANOVA models. Relationships among these estimators are established for finite samples. Under Kraemer’s (1979) model, it is shown that these estimators are equivalent in large samples, and that the equations for their large sample variances are equivalent. Index terms: index of agreement, interrater reliability, intraclass correlation, kappa statistic.
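A minimal sketch of two of the four estimators may help show how the chance correction differs between them. It uses the standard textbook definitions of Cohen's kappa and Scott's pi for a 2 × 2 table of counts; the function name and the example counts are hypothetical, and the other two estimators are not covered.

```python
def agreement_2x2(table):
    """Return (Cohen's kappa, Scott's pi) for a 2x2 table of counts.

    table[i][j] = number of persons rated i by rater 1 and j by rater 2.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    p_obs = (a + d) / n                  # observed proportion of agreement
    row1 = (a + b) / n                   # rater 1 marginal, first category
    col1 = (a + c) / n                   # rater 2 marginal, first category
    # Cohen: chance agreement from each rater's own marginals
    pe_kappa = row1 * col1 + (1 - row1) * (1 - col1)
    # Scott: chance agreement from the marginals pooled over raters
    pooled = (row1 + col1) / 2
    pe_pi = pooled ** 2 + (1 - pooled) ** 2
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    scott_pi = (p_obs - pe_pi) / (1 - pe_pi)
    return kappa, scott_pi

# Hypothetical counts for 100 rated persons
kappa, scott_pi = agreement_2x2([[40, 10], [5, 45]])
```

With these counts the two values differ only slightly, because the raters' marginal distributions are nearly the same.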
Sensitivity of the linear logistic test model to misspecification of the weight matrix
(1993) Baker, Frank B.
Under the linear logistic test model, a weight is assigned to each cognitive operation used to respond to an item. The allocation of these weights is open to misspecification that can result in faulty estimates of the basic parameters. The effect on root mean squares (RMSs) of the difference between the parameter estimates obtained under misspecification conditions and those obtained under correct specification conditions was examined. Six levels of misspecification and four sample sizes were used. Even a small number of errors in the weight specifications resulted in large RMS values. However, weight matrices with a high proportion of nonzero elements tended to yield RMSs that were approximately half as large as those with a small number of nonzero elements. Although sample size had some effect on the RMS values, it was quite small compared to that due to the level of misspecification of the weights. The results suggest that because specifying the elements in the weight matrix is a subjective process, it must be done with great care. Index terms: error rates, linear logistic test model, misspecification, parameter estimation, weight matrix.
A numerical approach for computing standard errors of linear equating
(1993) Zeng, Lingjia
A numerical approach for computing standard errors (SEs) of a linear equating is described. In the proposed approach, the first partial derivatives of the equating function needed to compute the SEs are derived numerically. Thus, the difficulty of deriving the analytical formulas of the partial derivatives for a complicated equating method is avoided. The numerical and analytical approaches were compared using the Tucker equating method. The SEs derived numerically were found to be indistinguishable from the SEs derived analytically. In a computer simulation of the numerical approach using the Levine equating method, the SEs based on the normality assumption were found to be less accurate than those derived without the normality assumption when the score distributions were skewed. Index terms: common-item design, Levine equating method, linear equating, standard error of equating, Tucker equating method.
Application of an automated item selection method to real data
(1993) Stocking, Martha L.; Swanson, Len; Pearlman, Mari
A method of automatically selecting items for inclusion in a test that has constraints on item content and statistical properties was applied to real data. Two tests were assembled by test specialists who assemble such tests on a routine basis. Using the same pool of items and the same constraints, the two tests were reassembled automatically. Test specialists not involved in the original manual assembly compared the tests constructed manually to the tests constructed automatically. The results indicated that the progress of automated test assembly methods lies in improving item banking systems, classification schemes, and quality control measures, rather than in the development of different algorithms or in the improvement of computer time and cost. Index terms: heuristic algorithms, mathematical programming, test assembly, test construction, test design.
A model and heuristic for solving very large item selection problems
(1993) Swanson, Len; Stocking, Martha L.
A model for solving very large item selection problems is presented. The model builds on previous work in binary programming applied to test construction. Expert test construction practices are applied to situations in which all specifications for item selection cannot necessarily be met. A heuristic for selecting items that satisfy the constraints in the model also is presented. The heuristic is particularly useful for situations in which the size of the test construction problem exceeds the limits of current implementations of linear programming algorithms. A variety of test construction problems involving real test specifications and item data from actual test assemblies were investigated using the model and the heuristic. Index terms: expert systems, heuristic algorithms, item response theory, linear programming, mathematical programming, test assembly, test construction, test design.
Appropriateness fit and criterion-related validity
(1993) Schmitt, Neal; Cortina, José M.; Whitney, David J.
Unmotivated or suspicious test takers in concurrent validation studies can cause numerous problems for test users. The effects of these problems, however, have not been carefully examined. This study used item response theory-based appropriateness fit indexes to identify and remove from a validation sample those examinees whose response patterns did not match their trait levels (e.g., examinees with low trait levels who answered difficult items correctly). The person-fit index lz described in Drasgow, Levine, & Williams (1985) had little effect on validities. The multitest index lzm described by Drasgow & Hulin (1990) was more promising. Implications for selection research and practice are discussed. Index terms: aberrant response patterns, appropriateness fit, concurrent validity, distorted responses, item response theory, person fit.
The effects on parameter estimation of correlated dimensions and a distribution-restricted trait in a multidimensional item response model
(1993) Batley, Rose-Marie; Boss, Marvin W.
This study was designed to assess the effects on parameter estimation of correlated dimensions and a distribution-restricted trait on one dimension using a two-dimensional item response theory model. Multidimensional analysis of simulated two-dimensional item response data fitting a multidimensional two-parameter logistic item response theory model (McKinley & Reckase, 1983a; Reckase & McKinley, 1991) was done using the program MIRTE (Carlson, 1987). Six datasets (2 trait distributions × 3 levels of correlation between dimensions) of 2,000 trait vectors over 104 items were generated. Each dataset was analyzed and replicated 100 times. Trait and item parameters generally were recovered adequately in the datasets in which both traits were normally distributed over the full range. In the datasets with a restricted range of trait level on the second dimension, recovery of the trait and item parameters was affected adversely. The results indicated that MIRTE recovers the structure of a multidimensional correlated space better than reported in earlier studies, especially when items are multidimensional. Index terms: correlated traits, multidimensional item parameter estimates, multidimensional item response theory, multidimensional trait estimates, restricted traits.
Analysis of cognitive structure using the linear logistic test model and quadratic assignment
(1993) Medina-Diaz, Maria
The cognitive structure of an algebra test was defined and validated using the linear logistic test model (LLTM) and quadratic assignment (QA), respectively. The LLTM is an extension of the Rasch model with a linear constraint that describes the difficulty of a test item in terms of the cognitive operations required to solve it. The cognitive structure of a test is specified using the weight matrix W. The cognitive structure defined here was based on a set of eight production rules that represented the mathematical procedures employed in solving linear equations with one variable. A 29-item test was constructed and administered to 235 ninth-graders. Item response data were analyzed using Fischer & Formann’s (1972) LLTM computer program. A QA confirmatory approach was used to validate the cognitive structure of the test. The structure was validated: examinees solved the items using the set of rules specified in the W matrix. The parameters estimated using the LLTM are quantitative indexes of the difficulties of each of the cognitive rules included in the W matrix. Index terms: componential models, confirmatory analysis, content validity, linear logistic test model, quadratic assignment, cognitive structure, validation.
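The linear constraint described above can be sketched in a few lines. The example assumes the usual LLTM form, in which the difficulty of item i is the weighted sum of rule difficulties, b_i = Σ_j w_ij·η_j + c, and the response probability follows the Rasch model. The weight matrix, rule difficulties, and function names are hypothetical and far smaller than the eight-rule, 29-item structure in the study.

```python
import math

def lltm_difficulties(W, eta, c=0.0):
    """Item difficulties implied by weight matrix W and rule difficulties eta."""
    return [sum(w * e for w, e in zip(row, eta)) + c for row in W]

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical structure: 3 items, 3 cognitive rules.
# W[i][j] = number of times rule j is needed to solve item i.
W = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 2],
]
eta = [0.5, 0.8, 0.3]  # hypothetical rule difficulties

b = lltm_difficulties(W, eta)
p = [rasch_prob(0.0, b_i) for b_i in b]  # person with trait level 0
```

Items that require more rule applications get larger difficulties and lower success probabilities, which is why a misplaced entry in W changes the implied difficulties directly.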
A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning
(1993) Rogers, H. Jane; Swaminathan, Hariharan
The Mantel-Haenszel (MH) procedure is sensitive to only one type of differential item functioning (DIF). It is not designed to detect DIF that has a nonuniform effect across trait levels. By generalizing the model underlying the MH procedure, a more general DIF detection procedure has been developed (Swaminathan & Rogers, 1990). This study compared the performance of this procedure, the logistic regression (LR) procedure, to that of the MH procedure in the detection of uniform and nonuniform DIF. A simulation study examined the distributional properties of the LR and MH test statistics and the relative power of the two procedures. For both the LR and MH test statistics, the expected distributions were obtained under nearly all conditions. The LR test statistic did not have the expected distribution for very difficult and highly discriminating items. The LR procedure was found to be more powerful than the MH procedure for detecting nonuniform DIF and as powerful in detecting uniform DIF. Index terms: differential item functioning, logistic regression, Mantel-Haenszel statistic, nonuniform DIF, uniform DIF.
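To make concrete why the MH procedure targets uniform DIF, the sketch below computes the standard Mantel-Haenszel common odds ratio, which pools a single odds ratio over the matched score levels. The function name and the counts are hypothetical, and neither the MH chi-square test nor the LR procedure is implemented.

```python
def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score levels.

    Each stratum is (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts: no DIF (odds equal in both groups at every level)
no_dif = [(20, 10, 20, 10), (30, 30, 15, 15)]
# Hypothetical counts: reference group favored at every level (uniform DIF)
uniform_dif = [(40, 10, 20, 20), (45, 5, 30, 10)]
# Hypothetical counts: direction of the effect reverses between levels
crossing_dif = [(40, 10, 20, 20), (20, 20, 40, 10)]

alpha_none = mh_common_odds_ratio(no_dif)
alpha_uniform = mh_common_odds_ratio(uniform_dif)
alpha_crossing = mh_common_odds_ratio(crossing_dif)
```

In the crossing case the two strata pull the pooled ratio in opposite directions, so the summary lands at 1 even though each level shows a clear group difference.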
A structural equation model for measuring residualized change and discerning patterns of growth or decline
(1993) Raykov, Tenko
This paper is concerned with two theoretically and empirically important issues in longitudinal research: (1) identifying correlates and predictors of change and (2) discerning patterns of change. Two traditional methods of change measurement, the residualized observed difference and the residualized gain score, are discussed. A general structural equation model for measuring residualized true change and studying patterns of true growth or decline is described. This approach allows consistent and efficient estimation of the degree of interrelationship between residualized change in a repeatedly assessed psychological construct and other variables, such as studied/presumed correlates and predictors of growth or decline on the latent dimension. Substantively interesting patterns of change on the trait level, such as regression to the mean, overcrossing, and fan-spreading, can be discerned. The model is useful in research situations in which it is of theoretical and empirical concern to identify those variables that correlate with, or can be used to predict, such patterns of true growth or decline that deviate from a group-specific trend in longitudinally-measured psychological constructs. The approach is illustrated using data from a cognitive intervention study of plasticity in fluid intelligence of aged adults (Baltes, Dittmann-Kohli, & Kliegl, 1986). Index terms: correlates of growth/decline, fan-spreading, measurement of change, overcrossing, predictors of growth, regression to the mean, structural equations modeling, true change.
A comparison of Lord's χ² and Raju's area measures in detection of DIF
(1993) Cohen, Allan S.; Kim, Seock-Ho
The area between item response functions estimated in different samples is often used as a measure of differential item functioning (DIF). Under item response theory, this area should be 0, except for errors of measurement. This study examined the effectiveness of two statistical tests of this area (a Z test for exact signed area and a Z test for exact unsigned area) for different test length, sample size, proportion of DIF items on the test, and item parameter estimation conditions using the two-parameter model. Errors in detection made using these two statistics were compared with errors made using Lord’s χ². Differences between all three statistics were relatively small; however, the χ² statistic was more effective than either of the two Z tests at detecting simulated DIF. The Z test for the exact signed area was the least effective and was the most likely to result in false negative errors. Index terms: area measures, differential item functioning, item response theory, item bias, Lord’s χ².
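The difference between signed and unsigned area can be illustrated numerically. The sketch below integrates the gap between two two-parameter logistic item response functions with the trapezoid rule; it assumes the scaling constant D = 1.7, hypothetical item parameters, and a hypothetical function name, and it does not compute the closed-form exact areas or the Z tests used in the study.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def areas(a1, b1, a2, b2, lo=-40.0, hi=40.0, steps=80000):
    """Trapezoid-rule signed and unsigned area between two 2PL curves."""
    h = (hi - lo) / steps
    signed = unsigned = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        diff = irf_2pl(theta, a1, b1) - irf_2pl(theta, a2, b2)
        w = 0.5 if i in (0, steps) else 1.0  # half weight at the endpoints
        signed += w * diff * h
        unsigned += w * abs(diff) * h
    return signed, unsigned

# Equal discriminations, shifted difficulty: the curves never cross
signed, unsigned = areas(a1=1.0, b1=0.0, a2=1.0, b2=0.5)
# Equal difficulty, different discriminations: the curves cross at theta = 0
crossing_signed, crossing_unsigned = areas(a1=0.6, b1=0.0, a2=1.4, b2=0.0)
```

When the curves cross, positive and negative differences cancel in the signed area while the unsigned area stays clearly above zero. That cancellation is one plausible reading of why a signed-area test can miss DIF, although the abstract itself does not give the reason.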