Applied Psychological Measurement, Volume 13, 1989

Recent Submissions

  • Item
    A comparison of two observed-score equating methods that assume equally reliable, congeneric tests
    (1989) MacCann, Robert G.
    For the external-anchor test equating model, two observed-score methods are derived using the slope and intercept assumptions of univariate selection theory and the assumptions that the tests to be equated are congeneric and equally reliable. The first derivation, Method 1, is then shown to give the same set of equations as Levine’s equations for random groups and unequally reliable tests and the "Z predicting X and Y" method. The second derivation, Method 2, is shown to give the same equations as Potthoff’s (1966) Method B and the "X and Y predicting Z" method. Methods 1 and 2 are compared empirically with Tucker’s and Levine’s equations for equally reliable tests; the conditions for which they may be appropriately applied are discussed. Index terms: Angoff’s Design V equations, congeneric tests, equally reliable tests, Levine’s equations (equally reliable), linear equating, observed-score equating, test equating, Tucker’s equations.
  • Item
    Congeneric modeling of reliability using censored variables
    (1989) Brown, R. L.
    This paper explores the use of Jöreskog’s (1970) congeneric modeling approach to reliability using censored quantitative variables, and discusses the compound problem of non-normality and attenuation that occurs when estimating censored continuous variables. Two Monte Carlo studies were conducted. The first study demonstrated the inappropriateness of using normal theory generalized least-squares (NTGLS) for estimating reliability on censored variables. The second study compared three different estimation procedures (NTGLS, asymptotically distribution free (ADF) estimators, and latent TOBIT estimators) as to their efficiency in estimating individual and composite reliability on censored variables. Results from the studies indicate that problems of non-normality and attenuation must be addressed before accurate reliability estimates may be obtained. Index terms: censored variables, congeneric model, covariance modeling, Monte Carlo study, reliability, TOBIT correlations.
  • Item
    Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items
    (1989) Ackerman, Terry A.
    The characteristics of unidimensional ability estimates obtained from data generated using multidimensional compensatory models were compared with estimates from noncompensatory IRT models. Reckase, Carlson, Ackerman, and Spray (1986) reported that when a compensatory model is used and item difficulty is confounded with dimensionality, the composition of the unidimensional ability estimates differs for different points along the unidimensional ability (θ) scale. Eight datasets (four compensatory, four noncompensatory) were generated for four different levels of correlated two-dimensional θs. In each dataset, difficulty was confounded with dimensionality and then calibrated using LOGIST and BILOG. The confounding of difficulty and dimensionality affected the BILOG calibration of response vectors using matched multidimensional item parameters more than it affected the LOGIST calibration. As the correlation between the generated two-dimensional θs increased, the response data became more unidimensional, as shown in bivariate plots of the mean θ̂₁ versus the mean θ̂₂ for specified unidimensional quantiles. Index terms: BILOG, compensatory IRT models, IRT ability estimation, LOGIST, multidimensional item response theory, noncompensatory IRT models.
  • Item
    Paradoxes, contradictions, and illusions
    (1989) Humphreys, Lloyd G.; Drasgow, Fritz
    There is no contradiction between a powerful significance test based on a difference score and the necessity for reliable measurement of the dependent measure in a controlled experiment. In fact, the former requires the latter. In this paper we review the conclusions that were drawn by Humphreys and Drasgow (1989) and show that Overall’s (1989) "contradiction" is an illusion derived from imprecise language. Index terms: analysis of covariance, baseline correction, control of individual differences, difference scores, measurement of change, reliability of the marginal distribution, statistical power, within-group reliabilities.
  • Item
    Distinguishing between measurements and dependent variables
    (1989) Overall, John E.
    Humphreys and Drasgow (1989b) recognize two types of dependent variables: the original measurements collected in an experiment and mathematical variables that are subjected to statistical analysis. Overall and Woodward (1975) were explicitly concerned with the latter, whereas Humphreys and Drasgow contend that they were concerned with reliability of the original measurements from which difference scores may be computed. These are quite different matters. Criticisms should focus on points of disagreement, and there has never been any disagreement concerning the importance of reliability of the original measurements. The notion that treatment effects should be considered a part of the true variance for calculation of reliability estimates is rejected as stemming from their failure to understand the basic difference between reliability and validity. Index terms: control of individual differences, difference scores, measurement of change, reliability of the marginal distribution, statistical power, within-group reliabilities.
  • Item
    Contradictions can never a paradox resolve
    (1989) Overall, John E.
    The fact that difference scores tend to be less reliable than the original measurements from which they are calculated should not be a matter of concern in testing the significance of treatment-induced change. The reliabilities of the original measurements are important because unreliability attenuates correlation, and substantial correlation between prescores and postscores is required for difference scores to be of value in controlling for individual differences. Reliability notwithstanding, difference scores provide superior control over true baseline differences in quasi-experimental research, whereas the analysis of covariance (ANCOVA) is generally preferable for baseline control in randomized experimental designs. Index terms: analysis of covariance, baseline correction, difference scores, measurement of change, reliability.
  • Item
    Some comments on the relation between reliability and statistical power
    (1989) Humphreys, Lloyd G.; Drasgow, Fritz
    Several articles have discussed the curious fact that a difference score with zero reliability can nonetheless allow a powerful test of change. This statistical legerdemain should not be overemphasized for three reasons. First, although the reliability of the difference score may be unrelated to power, the reliabilities of the variables used to create the difference scores are directly related to the power of the test. Second, with what some will regard as additional legerdemain, it is possible to define reliability in the context of a difference score in such a way that power is a direct function of reliability. The third and most serious objection to the conclusion that the reliability of a difference score is unimportant is that the underlying statistical model used in its derivation is rarely appropriate for psychological data. Index terms: control of individual differences, difference scores, reliability, reliability of the marginal distribution, statistical power, within-group reliabilities.
  • Item
    Psychometric properties of finite-state scores versus number-correct and formula scores: A simulation study
    (1989) García-Pérez, Miguel A.; Frary, Robert B.
    As developed by García-Pérez (1987), finite-state scores are nonlinear transformations of the proportions of conventional multiple-choice responses that are correct, incorrect, and omitted. They estimate the proportions of item alternatives that examinees had the knowledge to classify (as correct or incorrect) before seeing them together in the items. The present study used simulation techniques to generate conventional test responses and to track the proportions of alternatives the examinees could classify independently before taking the test and the proportions they could classify after taking the test. The finite-state scores were then computed and compared with these actual values and with number-correct and formula scores based on the conventional responses. Highly favorable results were obtained, leading to recommendations for the use of finite-state scores. These results were almost the same when the simulation proceeded according to the model and when it was based on a naturalistic process completely independent of the model. Hence the scoring procedures on which finite-state scores are based are both accurate and robust. Index terms: applied measurement models, examinee behavior, finite-state scores, guessing, multiple-choice tests, test scoring.
  • Item
    A comparison of pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters in the three-parameter IRT model
    (1989) Skaggs, Gary; Stevenson, José
    This study compared pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters for the three-parameter logistic model in item response theory. Two programs, ASCAL and LOGIST, which employ the two methods were compared using data simulated from a three-parameter model. Item responses were generated for sample sizes of 2,000 and 500, test lengths of 35 and 15, and examinees of high, medium, and low ability. The results showed that the item characteristic curves estimated by the two methods were more similar to each other than to the generated item characteristic curves. Pseudo-Bayesian estimation consistently produced more accurate item parameter estimates for the smaller sample size, whereas joint maximum likelihood was more accurate as test length was reduced. Index terms: ASCAL, item response theory, joint maximum likelihood estimation, LOGIST, parameter estimation, pseudo-Bayesian estimation, three-parameter model.
  • Item
    Adaptive estimation when the unidimensionality assumption of IRT is violated
    (1989) Folk, Valerie G.; Green, Bert F.
    This study examined some effects of using a unidimensional IRT model when the assumption of unidimensionality was violated. Adaptive and nonadaptive tests were formed from two-dimensional item sets. The tests were administered to simulated examinee populations with different correlations of the two underlying abilities. Scores from the adaptive tests tended to be related to one or the other ability rather than to a composite. Similar but less disparate results were obtained with IRT scoring of nonadaptive tests, whereas the conventional standardized number-correct score was equally related to both abilities. Differences in item selection from the adaptive administration and in item parameter estimation were also examined and related to differences in ability estimation. Index terms: ability estimation, adaptive testing, item parameter estimation, item response theory, multidimensionality.
  • Item
    Adaptive and conventional versions of the DAT: The first complete test battery comparison
    (1989) Henly, Susan J.; Klebe, Kelli J.; McBride, James R.; Cudeck, Robert
    A group of covariance structure models was examined to ascertain the similarity between conventionally administered and computerized adaptive (CAT) versions of the complete battery of the Differential Aptitude Tests (DAT). Two factor analysis models developed from classical test theory and three models with a multiplicative structure for these multitrait-multimethod data were developed and then fit to sample data in a double cross-validation design. All three direct-product models performed better than the factor analysis models in both calibration and cross-validation subsamples. The cross-validated, disattenuated correlation between the administration methods in the best-performing direct-product model was very high in both groups (.98 and .97), suggesting that the CAT version of the DAT is an adequate representation of the conventional test battery. However, some evidence suggested that there are substantial differences between the printed and computerized versions of the one speeded test in the battery. Index terms: adaptive tests, computerized adaptive testing, covariance structure, cross-validation, Differential Aptitude Tests, direct-product models, factor analysis, multitrait-multimethod matrices.
  • Item
    Confirmatory factor analyses of multitrait-multimethod data: Many problems and a few solutions
    (1989) Marsh, Herbert W.
    During the last 15 years there has been a steady increase in the popularity and sophistication of the confirmatory factor analysis (CFA) approach to multitrait-multimethod (MTMM) data. This approach, however, incurs some important problems, the most serious being the ill-defined solutions that plague MTMM studies and the assumption that so-called method factors reflect primarily the influence of method effects. In three different MTMM studies, ill-defined solutions were frequent and alternative parameterizations designed to solve this problem tended to mask the symptoms instead of eliminating the problem. More importantly, so-called method factors apparently represented trait variance in addition to, or instead of, method variance for at least some models in all three studies. Further support for this counterinterpretation of method factors was found when external validity criteria were added to the MTMM models and correlated with trait and so-called method factors. This problem, when it exists, invalidates the traditional interpretation of trait and method factors and the comparison of different MTMM models. A new specification of method effects as correlated uniquenesses instead of method factors was less prone to ill-defined solutions and, apparently, to the confounding of trait and method effects. Index terms: confirmatory factor analysis, construct validity, convergent validity, correlated uniquenesses, discriminant validity, empirical underidentification, LISREL, method effects, multitrait-multimethod analysis.
  • Item
    Two-dimensional configurations on unidimensional stimulus sets in nonmetric multidimensional scaling
    (1989) Davison, Mark L.; Hearn, Marsha
    When unidimensional stimulus sets are subjected to a nonmetric scaling in two dimensions, the stimuli frequently form a C- or S-shaped configuration. In simulated unidimensional data scaled in two dimensions, stimuli formed a C-shaped configuration when the monotone function relating distances to dissimilarity data was negatively accelerating. They formed an S-shaped configuration when the monotone function was positively accelerating. Results suggest that when unidimensional stimulus sets are scaled in two dimensions using a rational starting configuration, the nature of the two-dimensional configuration can indicate the general form of the function mapping psychological dissimilarity, represented as distance in the scaling model, onto the observed response scale. Index terms: data transformations, multidimensional scaling, paired comparisons, proximity data, unidimensional scaling, unidimensionality.
  • Item
    PACM: A two-stage procedure for analyzing structural models
    (1989) Lehmann, Donald R.; Gupta, Sunil
    An alternative procedure for estimating structural equations models is described. The two-stage procedure, Path Analysis of Covariance Matrix (PACM), separately estimates the measurement and structural models using standard least-squares procedures. PACM was empirically compared to simultaneous maximum likelihood estimation of measurement and structural models using LISREL. PACM produced results similar to LISREL in many cases; it also seems to have advantages when dealing with large-scale problems, model misspecifications, collinearity among indicators, and missing data. Index terms: causal models, confirmatory factor analysis, LISREL, path analysis, structural equations models.
  • Item
    Modeling incorrect responses to multiple-choice items with multilinear formula score theory
    (1989) Drasgow, Fritz; Levine, Michael V.; Williams, Bruce; McLaughlin, Mary E.; Candell, Gregory L.
    Multilinear formula score theory (Levine, 1984, 1985, 1989a, 1989b) provides powerful methods for addressing important psychological measurement problems. In this paper, a brief review of multilinear formula scoring (MFS) is given, with specific emphasis on estimating option characteristic curves (OCCs). MFS was used to estimate OCCs for the Arithmetic Reasoning subtest of the Armed Services Vocational Aptitude Battery. A close match was obtained between empirical proportions of option selection for examinees in 25 ability intervals and the modeled probabilities of option selection. In a second analysis, accurately estimated OCCs were obtained for simulated data. To evaluate the utility of modeling incorrect responses to the Arithmetic Reasoning test, the amounts of statistical information about ability were computed for dichotomous and polychotomous scorings of the items. Consistent with earlier studies, moderate gains in information were obtained for low to slightly above average abilities. Index terms: item response theory, marginal maximum likelihood estimation, maximum likelihood estimation, multilinear formula scoring, option characteristic curves, polychotomous measurement, test information function.
  • Item
    The reliability of a linear composite of nonequivalent subtests
    (1989) Rozeboom, William W.
    Traditional formulas for estimating the reliability of a composite test from its internal item statistics are inappropriate to judge the reliability of multiple regressions and other weighted composites of subtests that are appreciably nonequivalent. Formulas are provided here for the reliability of such a composite given the reliabilities of its component subtests, followed by a comparison of the composite’s reliability to that of its components. Compositing can easily incur a substantial loss of reliability, though gains are entirely possible as well. Index terms: combining nonequivalent subtests, composite reliability, item weighting, nonequivalent subtests, nonhomogeneous item composites.
  • Item
    A comparison of three linear equating methods for the common-item nonequivalent-populations design
    (1989) Woodruff, David J.
    Three linear equating methods for the common-item nonequivalent-populations design are compared using an analytical method. The analysis investigated the behavior of the three methods when the true-score correlation between the test and anchor was less than unity, a situation that may occur in practice. The analysis is graphically illustrated using data from a test equating situation. Conclusions derived from the analysis have implications for the practical application of these equating methods. Index terms: congeneric model, Levine equating method, linear equating, Tucker equating method.
  • Item
    The effects of test disclosure on equated scores and pass rates
    (1989) Gilmer, Jerry S.
    This paper examines the effects of test item disclosure on resulting examinee equated scores and population passing rates. The equating model studied was the common-item nonequivalent-populations design under Tucker linear equating procedures. The research involved simulating disclosure by placing correct answers of "disclosed" items into response vectors of selected examinees. The degree of exposure the disclosed items received in the population was manipulated by varying the number of items disclosed and the number of examinee records receiving the correct answers. Other factors considered among the 10 experimental conditions included the characteristics of the disclosed items (difficulty of disclosed items and whether they were anchor or nonanchor test items) and the ability level of the subgroup receiving the disclosed items. Results suggest that effects of disclosure depend on the nature of the released items. Specific effects of disclosure on particular examinees are also discussed. Index terms: equated scores, licensing exams, passing rates, simulated disclosure, test disclosure.
  • Item
    Modeling guessing behavior: A comparison of two IRT models
    (1989) Waller, Michael I.
    This study compared the fit of the three-parameter model to that of the Ability Removing Random Guessing (ARRG) model (Waller, 1973) on data from a wide range of tests of cognitive ability in three representative samples. Although both models were designed to remove only the effects of random guessing, the results of this study indicated that the three-parameter model also makes an adjustment for partial-knowledge guessing. Fit of the three-parameter model with guessing parameters fixed at a constant value of 1 divided by the number of alternatives was compared to fit with individually estimated guessing parameters. The individually estimated parameters were found to produce far superior fit. A solution to the convergence problems often encountered with the three-parameter model is discussed. Index terms: Ability Removing Random Guessing model, convergence in three-parameter estimation procedures, item response theory, maximum likelihood estimation, partial-knowledge guessing, random guessing.
  • Item
    A simulation study of the difference chi-square statistic for comparing latent class models under violation of regularity conditions
    (1989) Holt, Judith A.; Macready, George B.
    This study explored the robustness of the likelihood ratio difference statistic to the violation of a regularity condition when used to assess differences in fit provided by pairs of latent class models. Under regularity conditions, the additive property of the likelihood ratio statistic can be used to assess the statistical difference between pairs of hierarchically related models (i.e., one model is a constrained form of the other). However, when one of the two models being compared is obtained by fixing parameters of the other model at boundary values (i.e., 0 or 1), a regularity condition is violated and the difference statistic is not necessarily distributed as χ². The effects of three independent variables on the distribution of the difference statistics were studied for two generation models and a variety of subsuming models. Differential effects in terms of the direction and the extent of deviation were produced according to the types of model comparisons; these effects negate the application of a simple correction to the statistic to achieve a χ² distribution. Recommendations are made regarding how this statistic might reasonably be used under violation of the regularity condition. Index terms: latent class model, likelihood ratio chi-square, mixture model, regularity conditions, tests of fit.
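
Several of the abstracts above (Skaggs and Stevenson; Waller; Drasgow et al.) concern the three-parameter logistic (3PL) model of item response theory. As a minimal sketch for readers unfamiliar with it, the standard 3PL item response function can be written in a few lines; the function name and parameter values below are illustrative, not taken from any of the papers listed.

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function.

    Returns the probability of a correct response given ability theta,
    discrimination a, difficulty b, and lower asymptote c (the
    pseudo-guessing parameter). D = 1.7 is the conventional scaling
    constant that brings the logistic close to the normal ogive.
    """
    D = 1.7
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b the curve passes through the midpoint between c and 1,
# i.e., (1 + c) / 2. With c = 0.2 that midpoint is 0.6.
p_mid = p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```

Waller’s comparison of fixing c at 1 divided by the number of alternatives versus estimating it per item amounts to constraining or freeing the `c` argument above.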
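
The exchange among Humphreys, Drasgow, and Overall turns on the classical-test-theory reliability of a difference score. A short sketch of the standard formula shows why a difference score can have zero reliability even when both component measures are reliable; the function name is illustrative and the numbers are hypothetical, not drawn from the papers.

```python
def diff_score_reliability(sx, sy, rxx, ryy, rxy):
    """Classical-test-theory reliability of a difference score D = Y - X.

    sx, sy   -- standard deviations of prescore X and postscore Y
    rxx, ryy -- reliabilities of X and Y
    rxy      -- observed correlation between X and Y
    """
    num = sx**2 * rxx + sy**2 * ryy - 2.0 * rxy * sx * sy
    den = sx**2 + sy**2 - 2.0 * rxy * sx * sy
    return num / den

# With equal variances and reliabilities the formula reduces to
# (rxx - rxy) / (1 - rxy).  When rxx = ryy = .8 and rxy = .8, the
# difference score has zero reliability even though each measure is
# quite reliable -- the situation behind the "paradox" debated above.
r_d = diff_score_reliability(1.0, 1.0, 0.8, 0.8, 0.8)
```

As Humphreys and Drasgow note, the power of a test of change depends on the component reliabilities (rxx, ryy), not on r_d itself.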