Applied Psychological Measurement, Volume 16, 1992

Persistent link for this collection

https://hdl.handle.net/11299/114746

Inferential conditions in the statistical detection of measurement bias
(1992) Millsap, Roger E.; Meredith, William
Measurement bias in an observed variable Y as a measure of an unobserved variable W exists when the relationship of Y to W varies among populations of interest. Bias is often studied by examining population differences in the relationship of Y to a second observed measure Z that serves as a substitute for W. Whether the results of such studies have implications for measurement bias is addressed by first defining two forms of invariance: one corresponding to the relationship of Y to the unmeasured W, and one corresponding to the relationship of Y to the observed Z. General theoretical conditions are provided that justify the inference of one form of invariance from the other. The implications of these conditions for bias detection in two broad areas of application are discussed: differential item functioning and predictive bias in employment and educational settings. It is concluded that the conditions for inference are restrictive, and that bias investigations that rely strictly on observed measures are not, in general, diagnostic of measurement bias or the lack of bias. Some alternative approaches to bias detection are discussed. Index terms: differential item functioning, invariance, item bias, item response theory, measurement bias, predictive bias.
Effect of sample size, number of biased items, and magnitude of bias on a two-stage item bias estimation method
(1992) Miller, M. David; Oshima, T. C.
A two-stage procedure for estimating item bias was examined with six indexes of item bias and with the Mantel-Haenszel (MH) statistic; the sample size, the number of biased items, and the magnitude of the bias were varied. The second stage of the procedure did not identify substantial numbers of false positives (unbiased items identified as biased). However, the identification of true positives in the second stage was useful only when the magnitude of the bias was not small and the number of biased items was large (20% or 40% of the test). The weighted indexes tended to identify more true and false positives than their unweighted item response theory counterparts. Finally, the MH statistic identified fewer false positives, but did not identify small bias as well as the item response theory indexes. Index terms: differential item functioning, item bias, Mantel-Haenszel statistic, two-stage bias estimation.
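As a rough illustration of the Mantel-Haenszel approach mentioned in this abstract, the sketch below computes the MH common odds ratio across matched score levels. It is not the authors' procedure: the function names, the per-stratum count layout, and the -2.35 ln(alpha) rescaling (a common delta-scale convention for reporting DIF) are assumptions brought in for the example, and neither the two-stage procedure nor the six IRT-based indexes are reproduced.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels (hypothetical helper).

    Each stratum is a tuple (a, b, c, d) of counts:
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_d_dif(alpha):
    """Rescale the odds ratio to the delta metric (assumed convention)."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for one item at two matched score levels.
alpha = mantel_haenszel_odds_ratio([(30, 10, 20, 20), (40, 10, 30, 20)])
print(round(alpha, 3), round(mh_d_dif(alpha), 3))
```

A ratio near 1 (a rescaled value near 0) suggests the item behaves similarly for both groups once examinees are matched on score; values far from 1 flag the item for closer review.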
Testing hypotheses about methods, traits, and communalities in the direct-product model
(1992) Bagozzi, Richard P.; Yi, Youjae
The direct-product model has been suggested as a procedure for estimating multiplicative effects of traits and methods in multitrait-multimethod matrices. Research on the direct-product model is extended in two ways. First, hierarchically nested models are derived for explicitly testing the overall and specific patterns of method and trait factors. Second, formal tests are developed for the pattern of communalities. These procedures are illustrated with data from Lawler (1967). Index terms: direct-product model, method factors, multiplicative model, multitrait-multimethod matrix, trait factors.
Polynomial algorithms for item matching
(1992) Armstrong, Ronald D.; Jones, Douglas H.
To estimate test reliability and to create parallel tests, test items frequently are matched. Items can be matched by splitting tests into parallel test halves, by creating T splits, or by matching a desired test form. Problems often occur. Algorithms are presented to solve these problems. The algorithms are based on optimization theory in networks (graphs) and have polynomial complexity. Computational results from solving sample problems with several hundred decision variables are reported. Index terms: branch-and-bound algorithm, classical test theory, complexity, item matching, non-deterministic polynomial complete, parallel tests, polynomial algorithms, test construction.
Effects of response format on diagnostic assessment of scholastic achievement
(1992) Birenbaum, Menucha; Tatsuoka, Kikumi K.; Gutvirtz, Yaffa
The effect of response format on diagnostic assessment of students’ performance on an algebra test was investigated. Two sets of parallel, open-ended (OE) items and a set of multiple-choice (MC) items (which were stem-equivalent to one of the OE item sets) were compared using two diagnostic approaches: a "bug" analysis and a rule-space analysis. Items with identical format (parallel OE items) were more similar than items with different formats (OE vs. MC). Index terms: bug analysis, diagnostic assessment, free-response, item format, multiple-choice, rule space.
The effect of test length and IRT model on the distribution and stability of three appropriateness indexes
(1992) Noonan, Brian W.; Boss, Marvin W.; Gessaroli, Marc E.
The extent to which three appropriateness indexes (Z, ECIZ₄, and W, a variation of Wright’s person-fit statistic) are well-standardized was investigated in a monte carlo study. To assess the effects of the item response theory (IRT) model and test length on the distribution of the indexes and their cutoff values at three false positive rates, nonaberrant response patterns were generated. ECIZ₄ most closely approximated a normal distribution, showing less skewness and kurtosis than Z and W. The ECIZ₄ cutoff values were affected less by test length and the IRT model than were those of Z and W. In contrast, the distribution of W was the least stable over replications, and its cutoff values varied greatly depending on the IRT model and test length. Index terms: appropriateness measurement, caution index, item response theory (person fit), person-fit statistics, unusual response patterns.
The nominal response model in computerized adaptive testing
(1992) De Ayala, R. J.
Although most computerized adaptive tests (CATs) use dichotomous item response theory (IRT) models, research on the use of polytomous IRT models in CAT has shown promising results. This study implemented a CAT based on the nominal response model (NR CAT). Item pool requirements for the NR CAT were examined. The performance of the NR CAT and a CAT based on the three-parameter logistic (3PL) model was compared. For two-, three-, and four-category items, items with maximum information of at least .16 produced reasonably accurate trait estimation for tests with a minimum test length of approximately 15 to 20 items. The NR CAT was able to produce trait estimates comparable to those of the 3PL CAT. Implications of these results are discussed. Index terms: adaptive testing; computerized adaptive testing; EAP estimation; nominal response model; polytomous models.
The ordered partition model: An extension of the partial credit model
(1992) Wilson, Mark
An item response model, called the ordered partition model, is designed for a measurement context in which the categories of response to an item cannot be completely ordered. For example, two different solution strategies may lead to an equivalent degree of success because both strategies may result in the same score, but an examiner may want to maintain the distinction between the strategies. Thus, the data would be neither nominal nor completely ordered, and so may not be suitable for other polytomous item response models such as the partial credit or the graded response models. The ordered partition model is described as an extension of the partial credit model, its relationship to other models is discussed, and two examples are presented. Index terms: ordered partition model, partial credit model, partial order model, polytomous IRT model, Rasch model.
A constrained PARAFAC method for positive manifold data
(1992) Krijnen, Wim P.; Ten Berge, Jos M. F.
A set of non-negatively correlated variables, referred to as positive manifold data, displays a peculiar pattern of loadings in principal components analysis (PCA). If a small set of principal components is rotated to a simple structure, the variables correlate positively with all components, thus displaying positive manifold. However, this phenomenon is critically dependent on the freedom of rotation, as is evident from the unrotated loadings. That is, although the first principal component is without contrast (which means that all variables correlate either positively or negatively with the first component), subsequent components have mixtures of positive and negative loadings, which means that positive manifold is absent. PARAFAC is a generalization of PCA that has unique components, which means that rotations are not allowed. This paper examines how PARAFAC behaves when applied to positive manifold data. It is shown that PARAFAC does not always produce positive manifold solutions. For cases in which PARAFAC does not produce a positive manifold solution, a constrained PARAFAC method is offered that restores positive manifold by introducing non-negativity constraints. Thus, noncontrast PARAFAC components can be found that explain only a negligible amount of variance less than the PARAFAC components. These noncontrast components cannot be degenerate and cannot be partially unique in the traditional sense. Index terms: degenerate components; noncontrast components; non-negativity constraints; PARAFAC; positive manifold.
Unidimensional calibrations and interpretations of composite traits for multidimensional tests
(1992) Luecht, Richard M.; Miller, Timothy R.
A two-stage process that considers the multidimensionality of tests under the framework of unidimensional item response theory (IRT) is described and evaluated. In the first stage, items are clustered in a multidimensional latent space with respect to their direction of maximum discrimination. The separate item clusters are subsequently calibrated using a unidimensional IRT model to provide item parameter and trait estimates for composite traits in the context of the multidimensional trait space. This application is proposed as a workable compromise to some of the estimation, indeterminacy, and interpretation problems that affect the direct use of multidimensional IRT procedures for item calibration and trait estimation. The findings of a study based on simulated multidimensional data indicate that there are identifiable gains in estimation robustness and score interpretation with almost no sacrifice in goodness-of-fit using this two-stage approach to modeling composite latent traits. Index terms: item response theory, model fit, multidimensionality, parameter estimation, person fit, reference composites, trait estimation.
Measuring the difference between two models
(1992) Levine, Michael V.; Drasgow, Fritz; Williams, Bruce; McCusker, Christopher; Thomasson, Gary L.
Two psychometric models with very different parametric formulas and item response functions can make virtually the same predictions in all applications. By applying some basic results from the theory of hypothesis testing and from signal detection theory, the power of the most powerful test for distinguishing the models can be computed. Measuring model misspecification by computing the power of the most powerful test is proposed. If the power of the most powerful test is low, then the two models will make nearly the same prediction in every application. If the power is high, there will be applications in which the models will make different predictions. This measure, that is, the power of the most powerful test, places various types of model misspecification (item parameter estimation error, multidimensionality, local independence failure, learning and/or fatigue during testing) on a common scale. The theory supporting the method is presented and illustrated with a systematic study of misspecification due to item response function estimation error. In these studies, two joint maximum likelihood estimation methods (LOGIST 2B and LOGIST 5) and two marginal maximum likelihood estimation methods (BILOG and ForScore) were contrasted by measuring the difference between a simulation model and a model obtained by applying an estimation method to simulation data. Marginal estimation was found generally to be superior to joint estimation. The parametric marginal method (BILOG) was superior to the nonparametric method only for three-parameter logistic models. The nonparametric marginal method (ForScore) excelled for more general models. Of the two joint maximum likelihood methods studied, LOGIST 5 appeared to be more accurate than LOGIST 2B. Index terms: BILOG; forced-choice experiment; ForScore; ideal observer method; item response theory, estimation, models; LOGIST; multilinear formula score theory.
Using the extreme groups strategy when measures are not normally distributed
(1992) Fowler, Robert L.
The extreme groups research strategy is a two-stage measurement procedure that may be employed when it is relatively simple and inexpensive to obtain data on a psychological variable (X) in the first stage of investigation, but it is quite complex and expensive to measure subsequently a second variable (Y). This strategy is related to the selection of upper and lower groups for item discrimination analysis (Kelley, 1939) and to the treatments × blocks design in which participants are first "blocked" on the X variable and then only the extreme (highest and lowest means) blocks are compared on the Y variable, usually by a t test or an analysis of variance. Feldt (1961) showed analytically that if the population correlation coefficient between X and Y is ρ = .10, the power of the t test is maximized if each extreme group consists of 27% of the population tested on the X variable. However, Feldt’s derivation assumes that the X and Y variables are normally distributed. The present study employed a monte carlo simulation to explore the question of how to optimize power in the extreme groups strategy when sampling from non-normal distributions. The results showed that the optimum percent for the extreme group selection was approximately the same for all population shapes except for the extremely platykurtic (uniform) distribution. The power of the extreme groups strategy under conditions of normality was compared to the power of other research strategies, and an extension of the extreme groups approach was developed and applied in an example. Index terms: construct validation; extreme-group design; monte carlo technique; non-normal distributions; statistical power; upper-lower index.
Multidimensionality and item bias in item response theory
(1992) Oshima, T. C.; Miller, M. David
This paper demonstrates empirically how item bias indexes based on item response theory (IRT) identify bias that results from multidimensionality. When a test is multidimensional (MD) with a primary trait and a nuisance trait that affects a small portion of the test, item bias is defined as a mean difference on the nuisance trait between two groups. Results from a simulation study showed that although IRT-based bias indexes clearly distinguished multidimensionality from item bias, even with the presence of a between-group difference on the primary trait, the bias detection rate depended on the degree to which the item measured the nuisance trait, the values of MD discrimination, and the number of MD items. It was speculated that bias defined from the MD perspective was more likely to be detected when the test data met the essential unidimensionality assumption. Index terms: item bias, item response theory, mean differences, multidimensionality.
Correlated effects in generalizability studies
(1992) Smith, Philip L.; Luecht, Richard M.
The analytical model typically used to perform generalizability analysis assumes that design effects are uncorrelated. Often, the assessment of behavioral data involves designs that employ multiple occasions or repeated trials (as in many observational and rating studies). In these cases, design effects may be serially correlated. The implications of serially correlated effects on the results of generalizability analyses are discussed. Simulated data are provided that demonstrate the biases that serially correlated effects introduce into the results. Index terms: correlated effects, estimation of variance components, generalizability theory, observational studies, repeated trials, serial correlation.
A review of regression diagnostics for behavioral research
(1992) Chatterjee, Sangit; Yilmaz, Mustafa
Influential data points can affect the results of a regression analysis; for example, the usual summary statistics and tests of significance may be misleading. The importance of regression diagnostics in detecting influential points is discussed, and five statistics are recommended for the applied researcher. The suggested diagnostics were used on a small dataset to detect an influential data point, and the effects were analyzed. Colinearity-based diagnostics also are discussed and illustrated on the same dataset. The nonrobustness of the least squares estimates in the presence of influential points is emphasized. Diagnostics for multiple influential points, multivariate regression, multicolinearity, nonlinear regression, and other multivariate procedures also are discussed. Index terms: Andrew-Pregibon measure, colinearity, Cook’s distance, covariance ratio, influential observations, measurement error, partial residual plot, regression diagnostics.
Test of the hypothesis that the intraclass reliability coefficient is the same for two measurement procedures
(1992) Alsawalmeh, Yousef M.; Feldt, Leonard S.
An approximate statistical test is derived for the hypothesis that the intraclass reliability coefficients associated with two measurement procedures are equal. Control of Type I error is investigated by comparing empirical sampling distributions of the test statistic with its derived theoretical distribution. A numerical illustration of the procedure is also presented. Index terms: intraclass reliability, reliability, sampling theory, Spearman-Brown extrapolation, statistical test.
The knowledge or random guessing model for matching tests
(1992) Van der Ven, A. H. G. S.; Gremmen, F. M.
The knowledge or random guessing (KRG) model was applied to matching tests. A matching test typically consists of two lists of alternatives. The response alternatives in the first list might consist of several terms to be defined, and the question alternatives in the second list would then consist of the definitions. Examinees are instructed to match the question alternatives to the response alternatives. According to the KRG model, if an examinee knows the correct answer, the correct answer will be chosen; however, if the examinee does not know the correct match, he/she will select the question alternative by guessing at random. Reliability formulas for the number of correct matchings based on the KRG model are given by Zimmerman and Williams (1982). Before applying these formulas, an appropriate statistical test should be used to test whether the model holds. A goodness-of-fit test is developed that is especially sensitive to the assumption of random guessing. Moreover, a simplified version of the model is presented in which the alternatives are ordered according to a Guttman scale. Three examples are given in which the model is applied to real data. It appears that in many cases examinees use coping strategies that violate the assumption of random guessing. A suggestion is made for the development of a somewhat more complex model that takes into account examinee coping strategies and that can be considered an extension of the KRG model. Index terms: achievement testing, guessing in matching tests, knowledge or random guessing model, matching tests.
A generalized partial credit model: Application of an EM algorithm
(1992) Muraki, Eiji
The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed to a location and a threshold parameter, following Andrich’s (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope parameters, yielded better fit to the data than did the PCM. Index terms: item response model, National Assessment of Educational Progress, nominal response model, partial credit model, polytomous response model, rating scale model.
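To make the model in this abstract concrete, the sketch below computes category probabilities for one item under a generalized partial credit formulation. It is a minimal illustration and not the paper's EM estimation procedure. The function name, argument names, and the sign convention (each step parameter taken as location minus threshold) are assumptions of the example.

```python
import math


def gpcm_category_probabilities(theta, slope, location, thresholds):
    """Category probabilities for one polytomous item (hypothetical helper).

    theta      : examinee trait level
    slope      : item slope parameter (varies by item in the GPCM)
    location   : item location parameter
    thresholds : one threshold per step; an item with m steps has
                 m + 1 response categories (0..m)
    """
    # Cumulative step sums; category 0 is fixed at 0 by convention.
    z = [0.0]
    for d in thresholds:
        # Assumed convention: step parameter = location - threshold.
        z.append(z[-1] + slope * (theta - (location - d)))
    # Subtract the maximum before exponentiating for numerical stability.
    z_max = max(z)
    weights = [math.exp(value - z_max) for value in z]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-category item.
probs = gpcm_category_probabilities(
    theta=0.5, slope=1.2, location=0.0, thresholds=[0.8, -0.8]
)
print([round(p, 3) for p in probs])
```

Fixing the slope at 1 for every item recovers a partial credit form, and with a single step the expression reduces to a two-category logistic response function.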
A method for investigating the intersection of item response functions in Mokken's nonparametric IRT model
(1992) Sijtsma, Klaas; Meijer, Rob R.
For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons by items binary matrix Hᵀ has a non-negative value. Based on this result, a method is proposed for using Hᵀ to investigate whether a set of IRFs intersect. Results from a monte carlo study support the proposed use of Hᵀ. These results support the use of Hᵀ as an extension to Mokken’s nonparametric item response theory approach. Index terms: double monotonicity, Hᵀ coefficient, intersection of item response functions, item response theory, Mokken models, nonparametric models.
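The idea of applying the H coefficient to a transposed data matrix can be sketched as follows. This is an illustrative reading, not the authors' implementation: it uses the common covariance-ratio form of Loevinger's H (the sum of pairwise covariances divided by the sum of their maxima given the marginals), and the function names are hypothetical.

```python
def loevinger_h(matrix):
    """Scalability coefficient H for the columns of a binary matrix.

    matrix: list of rows (persons), each a list of 0/1 scores (items).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Proportion of 1s in each column.
    p = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            p_ij = sum(row[i] * row[j] for row in matrix) / n_rows
            cov_sum += p_ij - p[i] * p[j]
            # Largest covariance attainable with these marginals.
            cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
    if cov_max_sum == 0:
        raise ValueError("H is undefined: no column pair has variance")
    return cov_sum / cov_max_sum


def h_transposed(matrix):
    """Apply H to the transposed matrix, so that persons play the role of items."""
    transposed = [list(column) for column in zip(*matrix)]
    return loevinger_h(transposed)


# A perfect Guttman pattern: 4 persons by 3 items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(loevinger_h(data), h_transposed(data))
```

In this reading, a negative value from `h_transposed` would count as evidence against nonintersecting IRFs, in line with the non-negativity result stated in the abstract.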
A conceptual analysis of differential item functioning in terms of a multidimensional item response model
(1992) Camilli, Gregory
Differential item functioning (DIF) has been informally conceptualized as multidimensionality. Recently, more formal descriptions of DIF as multidimensionality have become available in the item response theory literature. This approach assumes that DIF is not a difference in the item parameters of two groups; rather, it is a shift in the distribution of ability along a secondary trait that influences the probability of a correct item response. That is, one group is relatively more able on an ability such as test-wiseness. The parameters of the secondary distribution are confounded with item parameters by unidimensional DIF detection models, and this manifests as differences between estimated item parameters. However, DIF is confounded with impact in multidimensional tests, which may be a serious limitation of unidimensional detection methods in some situations. In the multidimensional approach, DIF is considered to be a function of the educational histories of the examinees. Thus, a better tool for understanding DIF may be provided through structural modeling with external variables that describe background and schooling experience. Index terms: differential item functioning, factor analysis, IRT, item bias, LISREL, multidimensionality.