Applied Psychological Measurement, Volume 17, 1993

Persistent link for this collection

https://hdl.handle.net/11299/114833

Analysis of cognitive structure using the linear logistic test model and quadratic assignment
(1993) Medina-Diaz, Maria
The cognitive structure of an algebra test was defined and validated using the linear logistic test model (LLTM) and quadratic assignment (QA), respectively. The LLTM is an extension of the Rasch model with a linear constraint that describes the difficulty of a test item in terms of the cognitive operations required to solve it. The cognitive structure of a test is specified using the weight matrix W. The cognitive structure defined here was based on a set of eight production rules that represented the mathematical procedures employed in solving linear equations with one variable. A 29-item test was constructed and administered to 235 ninth-graders. Item response data were analyzed using Fischer & Formann’s (1972) LLTM computer program. A QA confirmatory approach was used to validate the cognitive structure of the test. The structure was validated: examinees solved the items using the set of rules specified in the W matrix. The parameters estimated using the LLTM are quantitative indexes of the difficulties of each of the cognitive rules included in the W matrix. Index terms: componential models, confirmatory analysis, content validity, linear logistic test model, quadratic assignment, cognitive structure, validation.
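The linear constraint described in this abstract can be illustrated with a minimal sketch. It assumes the usual LLTM form, in which each item difficulty is a weighted sum of basic operation parameters plus a normalization constant. The weight matrix, parameter values, and function names below are invented for illustration and are not taken from the study.

```python
import math

def lltm_item_difficulties(W, eta, c=0.0):
    """Item difficulty beta_i = sum_k W[i][k] * eta[k] + c (c is an assumed constant)."""
    return [sum(w * e for w, e in zip(row, eta)) + c for row in W]

def rasch_probability(theta, beta):
    """Rasch probability of a correct response for trait level theta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# Hypothetical example: 3 items, 2 cognitive operations.
W = [[1, 0], [0, 1], [1, 1]]   # which operations each item requires
eta = [0.5, 1.0]               # assumed difficulty of each operation
betas = lltm_item_difficulties(W, eta)
probs = [rasch_probability(0.0, b) for b in betas]
```

In this toy case the third item requires both operations, so it is the hardest and has the lowest success probability for a given examinee.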
Application of an automated item selection method to real data
(1993) Stocking, Martha L.; Swanson, Len; Pearlman, Mari
A method of automatically selecting items for inclusion in a test that has constraints on item content and statistical properties was applied to real data. Two tests were assembled by test specialists who assemble such tests on a routine basis. Using the same pool of items and the same constraints, the two tests were reassembled automatically. Test specialists not involved in the original manual assembly compared the tests constructed manually to the tests constructed automatically. The results indicated that the progress of automated test assembly methods lies in improving item banking systems, classification schemes, and quality control measures, rather than in the development of different algorithms or in the improvement of computer time and cost. Index terms: heuristic algorithms, mathematical programming, test assembly, test construction, test design.
Appropriateness fit and criterion-related validity
(1993) Schmitt, Neal; Cortina, José M.; Whitney, David J.
Unmotivated or suspicious test takers in concurrent validation studies can cause numerous problems for test users. The effects of these problems, however, have not been carefully examined. This study used item response theory-based appropriateness fit indexes to identify and remove from a validation sample those examinees whose response patterns did not match their trait levels (e.g., examinees with low trait levels who answered difficult items correctly). The person-fit index lzm described in Drasgow, Levine, & Williams (1985) had little effect on validities. The multitest index lzm described by Drasgow & Hulin (1990) was more promising. Implications for selection research and practice are discussed. Index terms: aberrant response patterns, appropriateness fit, concurrent validity, distorted responses, item response theory, person fit.
Assessing essential unidimensionality of real data
(1993) Nandakumar, Ratna
The capability of DIMTEST in assessing essential unidimensionality of item responses to real tests was investigated. DIMTEST found that some test data fit an essentially unidimensional model and other data did not. Essentially unidimensional test data identified by DIMTEST then were combined to form two-dimensional test data. The power of Stout’s statistic T was examined for these two-dimensional data. DIMTEST results on real tests replicated findings from simulated tests: T discriminated well between essentially unidimensional and multidimensional tests. T was also highly sensitive to major traits and insensitive to relatively minor traits that influenced item responses. Index terms: DIMTEST, essential unidimensionality, essential independence, multidimensionality, unidimensionality.
A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning
(1993) Rogers, H. Jane; Swaminathan, Hariharan
The Mantel-Haenszel (MH) procedure is sensitive to only one type of differential item functioning (DIF). It is not designed to detect DIF that has a nonuniform effect across trait levels. By generalizing the model underlying the MH procedure, a more general DIF detection procedure has been developed (Swaminathan & Rogers, 1990). This study compared the performance of this procedure, the logistic regression (LR) procedure, to that of the MH procedure in the detection of uniform and nonuniform DIF in a simulation study which examined the distributional properties of the LR and MH test statistics and the relative power of the two procedures. For both the LR and MH test statistics, the expected distributions were obtained under nearly all conditions. The LR test statistic did not have the expected distribution for very difficult and highly discriminating items. The LR procedure was found to be more powerful than the MH procedure for detecting nonuniform DIF and as powerful in detecting uniform DIF. Index terms: differential item functioning, logistic regression, Mantel-Haenszel statistic, nonuniform DIF, uniform DIF.
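As a rough illustration of why the MH procedure targets uniform DIF, the sketch below computes the MH common odds ratio, which pools one 2 × 2 table per matched score level into a single value. The function name and the tuple layout are assumptions made for this example; the simulation design of the study is not reproduced.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    Each stratum is a tuple (ref_correct, ref_incorrect,
    focal_correct, focal_incorrect). Layout is an assumption
    of this sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator
```

A value near 1 suggests no uniform DIF. Because the statistic is a single pooled ratio, group differences that change sign across score levels can cancel, which is the nonuniform case the abstract says the MH procedure is not designed to detect.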
A comparison of Lord's χ² and Raju's area measures in detection of DIF
(1993) Cohen, Allan S.; Kim, Seock-Ho
The area between item response functions estimated in different samples is often used as a measure of differential item functioning (DIF). Under item response theory, this area should be 0, except for errors of measurement. This study examined the effectiveness of two statistical tests of this area (a Z test for exact signed area and a Z test for exact unsigned area) for different test length, sample size, proportion of DIF items on the test, and item parameter estimation conditions using the two-parameter model. Errors in detection made using these two statistics were compared with errors made using Lord’s χ². Differences between all three statistics were relatively small; however, the χ² statistic was more effective than either of the two Z tests at detecting simulated DIF. The Z test for the exact signed area was the least effective and was the most likely to result in false negative errors. Index terms: area measures, differential item functioning, item response theory, item bias, Lord’s χ².
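The signed and unsigned areas discussed in this abstract can be approximated numerically. The sketch below uses plain trapezoid integration over a wide trait range rather than the exact closed-form area formulas on which the study's Z tests are based; the function names, integration limits, and the scaling constant D = 1.7 are assumptions of this example.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def area_between_irfs(a1, b1, a2, b2, lo=-20.0, hi=20.0, steps=40000):
    """Trapezoid approximations of the signed and unsigned areas."""
    h = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        diff = irf_2pl(theta, a1, b1) - irf_2pl(theta, a2, b2)
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end weights
        signed += weight * diff * h
        unsigned += weight * abs(diff) * h
    return signed, unsigned
```

When two curves differ only in discrimination they cross, so positive and negative regions cancel in the signed area while the unsigned area stays positive. This may help explain why a signed-area test can miss some DIF, in line with the false negative result reported above.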
Detection of differential item functioning in the graded response model
(1993) Cohen, Allan S.; Kim, Seock-Ho; Baker, Frank B.
Methods for detecting differential item functioning (DIF) have been proposed primarily for the item response theory dichotomous response model. Three measures of DIF for the dichotomous response model are extended to include Samejima’s graded response model: two measures based on area differences between item true score functions, and a χ² statistic for comparing differences in item parameters. An illustrative example is presented. Index terms: differential item functioning, graded response model, item response theory.
Effect of estimation method on incremental fit indexes for covariance structure models
(1993) Sugawara, Hazuki M.; MacCallum, Robert C.
In a typical study involving covariance structure modeling, fit of a model or a set of alternative models is evaluated using several indicators of fit under one estimation method, usually maximum likelihood. This study examined the stability across estimation methods of incremental and nonincremental fit measures that use the information about the fit of the most restricted (null) model as a reference point in assessing the fit of a more substantive model to the data. A set of alternative models for a large empirical dataset was analyzed by asymptotically distribution-free, generalized least squares, maximum likelihood, and ordinary least squares estimation methods. Four incremental and four nonincremental fit indexes were compared. Incremental indexes were quite unstable across estimation methods: maximum likelihood and ordinary least squares solutions indicated better fit of a given model than asymptotically distribution-free and generalized least squares solutions. The cause of this phenomenon is explained and illustrated, and implications and recommendations for practice are discussed. Index terms: covariance structure models, goodness of fit, incremental fit index, maximum likelihood estimation, parameter estimation, structural equation models.
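To make the role of the null model concrete, the sketch below computes two widely used incremental indexes from chi-square values. The abstract does not name the four incremental indexes that were compared, so the normed fit index and the Tucker-Lewis index are used here only as familiar examples, and the function names are assumptions.

```python
def normed_fit_index(chi2_null, chi2_model):
    """NFI: proportional reduction in chi-square relative to the null model."""
    return (chi2_null - chi2_model) / chi2_null

def tucker_lewis_index(chi2_null, df_null, chi2_model, df_model):
    """TLI: compares chi-square/df ratios of the null and substantive models."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)
```

Both functions take the null-model chi-square as an input, so any difference in that value from one estimation method to another carries directly into the index.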
The effects on parameter estimation of correlated dimensions and a distribution-restricted trait in a multidimensional item response model
(1993) Batley, Rose-Marie; Boss, Marvin W.
This study was designed to assess the effects on parameter estimation of correlated dimensions and a distribution-restricted trait on one dimension using a two-dimensional item response theory model. Multidimensional analysis of simulated two-dimensional item response data fitting a multidimensional two-parameter logistic item response theory model (McKinley & Reckase, 1983a; Reckase & McKinley, 1991) was done using the program MIRTE (Carlson, 1987). Six datasets (2 trait distributions × 3 levels of correlation between dimensions) of 2,000 trait vectors over 104 items were generated. Each dataset was analyzed and replicated 100 times. Trait and item parameters generally were recovered adequately in the datasets in which both traits were normally distributed over the full range. In the datasets with a restricted range of trait level on the second dimension, recovery of the trait and item parameters was affected adversely. The results indicated that MIRTE recovers the structure of a multidimensional correlated space better than reported in earlier studies, especially when items are multidimensional. Index terms: correlated traits, multidimensional item parameter estimates, multidimensional item response theory, multidimensional trait estimates, restricted traits.
Equating tests under the nominal response model
(1993) Baker, Frank B.
Under item response theory, test equating involves finding the coefficients of a linear transformation of the metric of one test to that of another. A procedure for finding these equating coefficients when the items in the two tests are nominally scored was developed. A quadratic loss function based on the differences between response category probabilities in the two tests is employed. The gradients of this loss function needed by the iterative multivariate search procedure used to obtain the equating coefficients were derived for the nominal response case. Examples of both horizontal and vertical equating are provided. The empirical results indicated that tests scored under a nominal response model can be placed on a common metric in both horizontal and vertical equatings. Index terms: characteristic curve, equating, item response theory, nominal response model, quadratic loss function.
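The effect of a linear metric transformation on nominally scored items can be sketched as follows. The example assumes the common slope-intercept form of the nominal response model and a transformation theta* = A * theta + K; the symbols A and K, the function names, and the parameter values are assumptions of this sketch. The search for the equating coefficients by minimizing the quadratic loss function is not shown.

```python
import math

def nominal_probabilities(theta, slopes, intercepts):
    """Category probabilities: softmax of a_k * theta + c_k."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    z_max = max(z)  # subtract the maximum for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def transform_item(slopes, intercepts, A, K):
    """Re-express item parameters on the metric theta* = A * theta + K."""
    new_slopes = [a / A for a in slopes]
    new_intercepts = [c - a * K / A for a, c in zip(slopes, intercepts)]
    return new_slopes, new_intercepts
```

With these definitions, an examinee at theta on the old metric and at A * theta + K on the new metric has identical category probabilities. That invariance is what allows a loss function based on differences between response category probabilities to be driven toward zero by suitable coefficients.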
Estimating rater agreement in 2 × 2 tables: Correction for chance and intraclass correlation
(1993) Blackman, Nicole J.-M.; Koval, John J.
Many estimators of the measure of agreement between two dichotomous ratings of a person have been proposed. The results of Fleiss (1975) are extended, and it is shown that four estimators, Scott’s (1955) π coefficient, Cohen’s (1960) κ̂, Maxwell & Pilliner’s (1968) r₁₁, and Mak’s (1988) ρ̃, are interpretable both as chance-corrected measures of agreement and as intraclass correlation coefficients for different ANOVA models. Relationships among these estimators are established for finite samples. Under Kraemer’s (1979) model, it is shown that these estimators are equivalent in large samples, and that the equations for their large sample variances are equivalent. Index terms: index of agreement, interrater reliability, intraclass correlation, kappa statistic.
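Two of the chance-corrected estimators named in this abstract can be computed directly from the four cell counts of a 2 × 2 table. The sketch below uses the textbook definitions of Cohen's kappa and Scott's pi; the cell labels and function name are assumptions, and the other two estimators and the ANOVA interpretations are not reproduced.

```python
def agreement_2x2(a, b, c, d):
    """Return (Cohen's kappa, Scott's pi) for a 2 x 2 rating table.

    a = both raters positive, b = rater 1 positive only,
    c = rater 2 positive only, d = both raters negative.
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    p1 = (a + b) / n  # rater 1 positive rate
    p2 = (a + c) / n  # rater 2 positive rate
    # Kappa: chance agreement from each rater's own marginals.
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
    # Pi: chance agreement from the pooled marginal.
    p_bar = (p1 + p2) / 2
    pe_pi = p_bar ** 2 + (1 - p_bar) ** 2
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    pi = (p_obs - pe_pi) / (1 - pe_pi)
    return kappa, pi
```

The two values coincide when both raters have the same marginal rates and differ only through the chance-agreement term otherwise.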
Further comments on reliability and power of significance tests
(1993) Humphreys, Lloyd G.
The controversy about the relationship between reliability and the power of significance tests exists because statisticians obtain numerical solutions by varying independently the parameters of the power of statistical tests. In contrast, researchers have empirical limitations placed on them in varying the same parameters. Reliability and power can legitimately be decoupled by selection of the population from which to sample (Zimmerman & Williams, 1986), but this is an undependable way to increase power (Humphreys, 1991). Reducing population variance by selection of the sample can be considered a special case of (and a crude approximation to) the analysis of covariance, which is also a more effective way of controlling individual differences in true scores than the use of difference scores. Both the regressed differences and the raw differences are less reliable within treatments than their components, but can have more power in statistical tests. As the reliability of derived scores increases, however, power increases. Index terms: difference scores, error of measurement, planning experiments, power, reliability, significance tests, t tests, true scores.
A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses
(1993) Andrich, David; Luo, Guanzhong
Social-psychological variables are typically measured using either cumulative or unfolding response processes. In the former, the greater the location of a person relative to the location of a stimulus on the continuum, the greater the probability of a positive response; in the latter, the closer the location of the person to the location of the statement, irrespective of direction, the greater the probability of a positive response. Formal probability models for these processes are, respectively, monotonically increasing and single-peaked as a function of the location of the person relative to the location of the statement. In general, these models have been considered to be independent of each other. However, if statements constructed on the basis of a cumulative model have three ordered response categories, the response function within the statement for the middle category is in fact single-peaked. Using this observation, a unidimensional model for responses to statements that have an unfolding structure was constructed from the cumulative Rasch model for ordered response categories. A location and unit of measurement parameter exist for each statement. A joint maximum likelihood estimation procedure was investigated. Analysis of a small simulation study and a small real dataset showed that the model is readily applicable. Index terms: attitude measurement, item response theory, latent trait theory, Rasch models, unfolding data, unidimensional scaling.
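The single-peaked response function described in this abstract can be sketched by taking the middle-category probability of a three-category Rasch model with symmetric thresholds, which reduces to a ratio involving the hyperbolic cosine. The parameter names (delta for the statement location, lam for the unit parameter) and the function name are assumptions of this sketch, and the estimation procedure is not shown.

```python
import math

def hcm_probability(theta, delta, lam):
    """P(positive) = exp(lam) / (exp(lam) + 2 * cosh(theta - delta))."""
    return math.exp(lam) / (math.exp(lam) + 2.0 * math.cosh(theta - delta))
```

The probability peaks when the person location theta equals the statement location delta and falls off symmetrically in both directions, which is the unfolding behavior the abstract contrasts with monotonically increasing cumulative models.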
Information functions of the generalized partial credit model
(1993) Muraki, Eiji
The concept of information functions developed for dichotomous item response models is adapted for the partial credit model. The information function is explained in terms of the model parameters and scoring functions. The relationship between the item information function and the item response function also is discussed. The information function then is used to investigate the effect of collapsing and recoding categories of polytomously-scored items of the National Assessment of Educational Progress (NAEP). The NAEP writing items were calibrated and the item and test information is used to discuss desirable properties of polytomous items. Index terms: information function, item response model, National Assessment of Educational Progress (NAEP), partial credit model, polytomous item response model.
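An item information function for a polytomous item can be sketched in terms of category probabilities and a scoring function, as the abstract describes. The example assumes a partial credit parameterization with a slope parameter, integer category scores 0, 1, ..., m, and a scaling constant D = 1.7; function names and parameter values are assumptions, and no NAEP data are involved.

```python
import math

def gpcm_probabilities(theta, a, thresholds, D=1.7):
    """Category probabilities for scores 0..m given m step parameters."""
    z = [0.0]
    for b in thresholds:
        z.append(z[-1] + D * a * (theta - b))  # cumulative step logits
    z_max = max(z)
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def gpcm_information(theta, a, thresholds, D=1.7):
    """Information = (D * a)^2 times the variance of the category score."""
    probs = gpcm_probabilities(theta, a, thresholds, D)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return (D * a) ** 2 * (second - mean ** 2)
```

With a single step parameter the item is dichotomous and the expression reduces to the familiar two-parameter logistic information. Passing fewer step parameters for the same item gives a rough way to explore how collapsing categories changes information.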
A method for severely constrained item selection in adaptive testing
(1993) Stocking, Martha L.; Swanson, Len
Previous attempts at incorporating expert test construction practices into computerized adaptive testing paradigms are described. A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests. Two examples, one for a verbal measure and the other for a quantitative measure, are provided of the successful use of the proposed method in designing adaptive tests. Index terms: adaptive test design, computerized adaptive testing, constrained adaptive testing, expert systems, test assembly algorithms.
Methodology review: Statistical approaches for assessing measurement bias
(1993) Millsap, Roger E.; Everson, Howard T.
Statistical methods developed over the last decade for detecting measurement bias in psychological and educational tests are reviewed. Earlier methods for assessing measurement bias generally have been replaced by more sophisticated statistical techniques, such as the Mantel-Haenszel procedure, the standardization approach, logistic regression models, and item response theory approaches. The review employs a conceptual framework that distinguishes methods of detecting measurement bias based on either observed or unobserved conditional invariance models. Although progress has been made in the development of statistical methods for detecting measurement bias, issues related to the choice of matching variable, the nonuniform nature of measurement bias, the suitability of current approaches for new and emerging performance assessment methods, and insights into the causes of measurement bias remain elusive. Clearly, psychometric solutions to the problems of measurement bias will further understanding of the more central issue of construct validity. The continuing development of statistical methods for detecting and understanding the causes of measurement bias will continue to be an important scientific challenge. Index terms: bias detection, differential item functioning, item bias, measurement bias, test bias.
A model and heuristic for solving very large item selection problems
(1993) Swanson, Len; Stocking, Martha L.
A model for solving very large item selection problems is presented. The model builds on previous work in binary programming applied to test construction. Expert test construction practices are applied to situations in which all specifications for item selection cannot necessarily be met. A heuristic for selecting items that satisfy the constraints in the model also is presented. The heuristic is particularly useful for situations in which the size of the test construction problem exceeds the limits of current implementations of linear programming algorithms. A variety of test construction problems involving real test specifications and item data from actual test assemblies were investigated using the model and the heuristic. Index terms: expert systems, heuristic algorithms, item response theory, linear programming, mathematical programming, test assembly, test construction, test design.
A numerical approach for computing standard errors of linear equating
(1993) Zeng, Lingjia
A numerical approach for computing standard errors (SEs) of a linear equating is described. In the proposed approach, the first partial derivatives of the equating function needed to compute the SEs are derived numerically. Thus, the difficulty of deriving the analytical formulas of the partial derivatives for a complicated equating method is avoided. The numerical and analytical approaches were compared using the Tucker equating method. The SEs derived numerically were found to be indistinguishable from the SEs derived analytically. In a computer simulation of the numerical approach using the Levine equating method, the SEs based on the normality assumption were found to be less accurate than those derived without the normality assumption when the score distributions were skewed. Index terms: common-item design, Levine equating method, linear equating, standard error of equating, Tucker equating method.
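The idea of replacing analytical partial derivatives with numerical ones can be sketched with central differences and the delta method. The equating function below is a plain linear function of four moments, used as a stand-in for the Tucker and Levine methods examined in the study; the function names, step size, and the covariance matrix passed in are all hypothetical.

```python
def linear_equate(x, params):
    """Linear equating of score x; params = (mu_x, sd_x, mu_y, sd_y)."""
    mu_x, sd_x, mu_y, sd_y = params
    return mu_y + (sd_y / sd_x) * (x - mu_x)

def numerical_gradient(func, params, eps=1e-5):
    """Central-difference partial derivatives of func at params."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((func(up) - func(down)) / (2 * eps))
    return grad

def delta_method_se(x, params, cov):
    """SE of the equated score: sqrt(g' * cov * g) with a numerical g."""
    grad = numerical_gradient(lambda p: linear_equate(x, p), params)
    k = len(grad)
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(k) for j in range(k))
    return var ** 0.5
```

Only `linear_equate` would need to change to try a more complicated equating function, which is the convenience the abstract points to.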
Reliability of measurement and power of significance tests based on differences
(1993) Zimmerman, Donald W.; Williams, Richard H.; Zumbo, Bruno D.
The power of significance tests based on difference scores is indirectly influenced by the reliability of the measures from which differences are obtained. Reliability depends on the relative magnitude of true score and error score variance, but statistical power is a function of the absolute magnitude of these components. Explicit power calculations reaffirm the paradox put forward by Overall & Woodward (1975, 1976): that significance tests of differences can be powerful even if the reliability of the difference scores is 0. This anomaly arises because power is a function of observed score variance but is not a function of reliability unless either true score variance or error score variance is constant. Provided that sample size, significance level, directionality, and the alternative hypothesis associated with a significance test remain the same, power always increases when population variance decreases, independently of reliability. Index terms: difference scores, error of measurement, power, significance tests, t test, test reliability, true scores.
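The distinction drawn in this abstract between relative and absolute variance components can be illustrated with a small power calculation. The sketch below substitutes a one-sided z test with known variance for the t tests discussed in the article, and all function names and numerical values are assumptions chosen for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reliability(true_var, error_var):
    """Ratio of true score variance to observed score variance."""
    return true_var / (true_var + error_var)

def power_one_sided_z(mean_diff, true_var, error_var, n, z_crit=1.6449):
    """Power of a one-sided z test on n difference scores."""
    observed_sd = math.sqrt(true_var + error_var)
    noncentrality = mean_diff * math.sqrt(n) / observed_sd
    return 1.0 - normal_cdf(z_crit - noncentrality)

# Same observed variance (4.0), very different reliabilities.
power_unreliable = power_one_sided_z(0.5, 0.0, 4.0, 50)  # reliability 0
power_reliable = power_one_sided_z(0.5, 3.0, 1.0, 50)    # reliability 0.75
```

In this sketch the two settings give identical power because only the sum of the variance components enters the calculation. Shrinking that sum raises power whatever the reliability.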
Reliability, power, functions, and relations: A reply to Humphreys
(1993) Zimmerman, Donald W.; Williams, Richard H.; Zumbo, Bruno D.
Index terms: difference scores, error of measurement, power, significance tests, t test, test reliability, true scores.