Applied Psychological Measurement, Volume 13, 1989
Browsing by Type "Article". Showing 20 of 34 articles.
Item: Adaptive and conventional versions of the DAT: The first complete test battery comparison (1989)
Henly, Susan J.; Klebe, Kelli J.; McBride, James R.; Cudeck, Robert
A group of covariance structure models was examined to ascertain the similarity between conventionally administered and computerized adaptive (CAT) versions of the complete battery of the Differential Aptitude Tests (DAT). Two factor analysis models developed from classical test theory and three models with a multiplicative structure for these multitrait-multimethod data were developed and then fit to sample data in a double cross-validation design. All three direct-product models performed better than the factor analysis models in both calibration and cross-validation subsamples. The cross-validated, disattenuated correlation between the administration methods in the best-performing direct-product model was very high in both groups (.98 and .97), suggesting that the CAT version of the DAT is an adequate representation of the conventional test battery. However, some evidence suggested that there are substantial differences between the printed and computerized versions of the one speeded test in the battery.
Index terms: adaptive tests, computerized adaptive testing, covariance structure, cross-validation, Differential Aptitude Tests, direct-product models, factor analysis, multitrait-multimethod matrices.

Item: Adaptive estimation when the unidimensionality assumption of IRT is violated (1989)
Folk, Valerie G.; Green, Bert F.
This study examined some effects of using a unidimensional IRT model when the assumption of unidimensionality was violated. Adaptive and nonadaptive tests were formed from two-dimensional item sets. The tests were administered to simulated examinee populations with different correlations of the two underlying abilities. Scores from the adaptive tests tended to be related to one or the other ability rather than to a composite.
Similar but less disparate results were obtained with IRT scoring of nonadaptive tests, whereas the conventional standardized number-correct score was equally related to both abilities. Differences in item selection from the adaptive administration and in item parameter estimation were also examined and related to differences in ability estimation.
Index terms: ability estimation, adaptive testing, item parameter estimation, item response theory, multidimensionality.

Item: A comparison of pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters in the three-parameter IRT model (1989)
Skaggs, Gary; Stevenson, José
This study compared pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters for the three-parameter logistic model in item response theory. Two programs, ASCAL and LOGIST, which employ the two methods, were compared using data simulated from a three-parameter model. Item responses were generated for sample sizes of 2,000 and 500, test lengths of 35 and 15, and examinees of high, medium, and low ability. The results showed that the item characteristic curves estimated by the two methods were more similar to each other than to the generated item characteristic curves. Pseudo-Bayesian estimation consistently produced more accurate item parameter estimates for the smaller sample size, whereas joint maximum likelihood was more accurate as test length was reduced.
Index terms: ASCAL, item response theory, joint maximum likelihood estimation, LOGIST, parameter estimation, pseudo-Bayesian estimation, three-parameter model.

Item: A comparison of three linear equating methods for the common-item nonequivalent-populations design (1989)
Woodruff, David J.
Three linear equating methods for the common-item nonequivalent-populations design are compared using an analytical method.
The analysis investigated the behavior of the three methods when the true-score correlation between the test and anchor was less than unity, a situation that may occur in practice. The analysis is graphically illustrated using data from a test equating situation. Conclusions derived from the analysis have implications for the practical application of these equating methods.
Index terms: congeneric model, Levine equating method, linear equating, Tucker equating method.

Item: A comparison of two observed-score equating methods that assume equally reliable, congeneric tests (1989)
MacCann, Robert G.
For the external-anchor test equating model, two observed-score methods are derived using the slope and intercept assumptions of univariate selection theory and the assumptions that the tests to be equated are congeneric and equally reliable. The first derivation, Method 1, is then shown to give the same set of equations as Levine’s equations for random groups and unequally reliable tests and the "Z predicting X and Y" method. The second derivation, Method 2, is shown to give the same equations as Potthoff’s (1966) Method B and the "X and Y predicting Z" method. Methods 1 and 2 are compared empirically with Tucker’s and Levine’s equations for equally reliable tests; the conditions for which they may be appropriately applied are discussed.
Index terms: Angoff’s Design V equations, congeneric tests, equally reliable tests, Levine’s equations (equally reliable), linear equating, observed-score equating, test equating, Tucker’s equations.

Item: Confirmatory factor analyses of multitrait-multimethod data: Many problems and a few solutions (1989)
Marsh, Herbert W.
During the last 15 years there has been a steady increase in the popularity and sophistication of the confirmatory factor analysis (CFA) approach to multitrait-multimethod (MTMM) data.
This approach, however, incurs some important problems, the most serious being the ill-defined solutions that plague MTMM studies and the assumption that so-called method factors primarily reflect the influence of method effects. In three different MTMM studies, ill-defined solutions were frequent, and alternative parameterizations designed to solve this problem tended to mask the symptoms instead of eliminating the problem. More importantly, so-called method factors apparently represented trait variance in addition to, or instead of, method variance for at least some models in all three studies. Further support for this counterinterpretation of method factors was found when external validity criteria were added to the MTMM models and correlated with trait and so-called method factors. This problem, when it exists, invalidates the traditional interpretation of trait and method factors and the comparison of different MTMM models. A new specification of method effects as correlated uniquenesses instead of method factors was less prone to ill-defined solutions and, apparently, to the confounding of trait and method effects.
Index terms: confirmatory factor analysis, construct validity, convergent validity, correlated uniquenesses, discriminant validity, empirical underidentification, LISREL, method effects, multitrait-multimethod analysis.

Item: Congeneric modeling of reliability using censored variables (1989)
Brown, R. L.
This paper explores the use of Jöreskog’s (1970) congeneric modeling approach to reliability using censored quantitative variables, and discusses the compound problem of non-normality and attenuation that occurs when estimating censored continuous variables. Two Monte Carlo studies were conducted. The first study demonstrated the inappropriateness of using normal-theory generalized least squares (NTGLS) for estimating reliability on censored variables.
The second study compared three estimation procedures (NTGLS, asymptotically distribution-free (ADF) estimators, and latent TOBIT estimators) with respect to their efficiency in estimating individual and composite reliability on censored variables. Results from the studies indicate that problems of non-normality and attenuation must be addressed before accurate reliability estimates can be obtained.
Index terms: censored variables, congeneric model, covariance modeling, Monte Carlo study, reliability, TOBIT correlations.

Item: A consumer's guide to LOGIST and BILOG (1989)
Mislevy, Robert J.; Stocking, Martha L.
Since its release in 1976, Wingersky, Barton, and Lord’s (1982) LOGIST has been the most widely used computer program for estimating the parameters of the three-parameter logistic item response model. An alternative program, Mislevy and Bock’s (1983) BILOG, has recently become available. This paper compares the approaches taken by the two programs and offers some guidelines for choosing between them for particular applications.
Index terms: Bayesian estimation, BILOG, IRT estimation procedures, LOGIST, marginal maximum likelihood, maximum likelihood, three-parameter logistic model estimation procedures.

Item: Contradictions can never a paradox resolve (1989)
Overall, John E.
The fact that difference scores tend to be less reliable than the original measurements from which they are calculated should not be a matter of concern in testing the significance of treatment-induced change. The reliabilities of the original measurements are important because unreliability attenuates correlation, and substantial correlation between prescores and postscores is required for difference scores to be of value in controlling for individual differences.
Reliability notwithstanding, difference scores provide superior control over true baseline differences in quasi-experimental research, whereas the analysis of covariance (ANCOVA) is generally preferable for baseline control in randomized experimental designs.
Index terms: analysis of covariance, baseline correction, difference scores, measurement of change, reliability.

Item: Correction of an orthogonal procrustes rotation procedure described by Guilford and Hoepfner (1989)
Ten Berge, Jos M. F.
Index terms: factor matching, least-squares rotation, target rotation.

Item: Detection of invalid response patterns on the California Psychological Inventory (1989)
Lanning, Kevin
When faced with the task of responding to a personality questionnaire, an individual may respond with a number of strategies or test-taking attitudes. Among these, deceptive (fake) and disengaged (random) attitudes are of particular interest, for these can potentially mislead and misinform test users. A two-stage model was devised to detect deceptive and disengaged protocols on the California Psychological Inventory. Using parameters from signal detection theory, this model is found to be highly sensitive in detecting invalidity.
Index terms: California Psychological Inventory, expected utility, faking on personality inventories, personality assessment, random response patterns, signal detection theory.

Item: Distinguishing between measurements and dependent variables (1989)
Overall, John E.
Humphreys and Drasgow (1989b) recognize two types of dependent variables: the original measurements collected in an experiment and mathematical variables that are subjected to statistical analysis. Overall and Woodward (1975) were explicitly concerned with the latter, whereas Humphreys and Drasgow contend that they were concerned with reliability of the original measurements from which difference scores may be computed. These are quite different matters.
Criticisms should focus on points of disagreement, and there has never been any disagreement concerning the importance of reliability of the original measurements. The notion that treatment effects should be considered part of the true variance when calculating reliability estimates is rejected as stemming from a failure to understand the basic difference between reliability and validity.
Index terms: control of individual differences, difference scores, measurement of change, reliability of the marginal distribution, statistical power, within-group reliabilities.

Item: The effects of test disclosure on equated scores and pass rates (1989)
Gilmer, Jerry S.
This paper examines the effects of test item disclosure on resulting examinee equated scores and population passing rates. The equating model studied was the common-item nonequivalent-populations design under Tucker linear equating procedures. The research involved simulating disclosure by placing correct answers of "disclosed" items into the response vectors of selected examinees. The degree of exposure the disclosed items received in the population was manipulated by varying the number of items disclosed and the number of examinee records receiving the correct answers. Other factors considered among the 10 experimental conditions included the characteristics of the disclosed items (their difficulty, and whether they were anchor or nonanchor test items) and the ability level of the subgroup receiving the disclosed items. Results suggest that the effects of disclosure depend on the nature of the released items. Specific effects of disclosure on particular examinees are also discussed.
Index terms: equated scores, licensing exams, passing rates, simulated disclosure, test disclosure.

Item: Estimating measures of pass-fail reliability from parallel half-tests (1989)
Woodruff, David J.; Sawyer, Richard L.
Two methods are derived for estimating measures of pass-fail reliability.
The methods require only a single test administration and are computationally simple. Both are based on the Spearman-Brown formula for estimating stepped-up reliability. The non-distributional method requires only that the test be divisible into parallel half-tests; the normal method makes the additional assumption of normally distributed test scores. Bias for the two procedures is investigated by simulation. For nearly normal test score distributions, the normal method performed slightly better than the non-distributional method, but for moderately to severely skewed or symmetric platykurtic test score distributions the non-distributional method was superior. Test results from a licensure examination are used to illustrate the methods.
Index terms: Cohen’s kappa, licensure examinations, pass-fail reliability, reliability, Spearman-Brown formula.

Item: Estimating reliabilities of computerized adaptive tests (1989)
Divgi, D. R.
This paper presents two methods for estimating the reliability of a computerized adaptive test (CAT) without using item response theory. The required data consist of CAT and paper-pencil (PP) scores from identical or equivalent samples, and scores for all examinees on one or more covariates. Multiple R's and communalities are used to compute the ratio of the CAT and PP reliabilities. When combined with the PP reliability calculated by a conventional procedure, these ratios yield estimates of CAT reliability.
Index terms: computerized adaptive testing, item response theory, predictive validity, reliability, tailored testing.

Item: Estimating unrestricted population parameters from restricted sample data in employment testing (1989)
Burke, Michael J.; Normand, Jacques; Doran, Lucinda
This study examined the accuracy of Alexander, Alliger, and Hanges’ (1984) method for estimating unrestricted univariate predictor means and variances from sample data drawn from three populations in two personnel selection contexts: (1) where there was direct nonstrict truncation on the predictor, and (2) where there was direct strict truncation on the predictor. In addition, the accuracy of corrected (estimated unrestricted) validity coefficients based on estimated population predictor standard deviations was assessed in the nonstrict truncation condition. In general, the accuracy of the population predictor mean and standard deviation estimates was inconsistent across the present datasets and conditions. Caution is advised in the interpretation and reporting of corrected validity coefficients in employment testing based on estimated population predictor standard deviations.
Index terms: employment testing, personnel selection, range restriction, true validity estimation, unrestricted population parameters.

Item: An evaluation of marginal maximum likelihood estimation for the two-parameter logistic model (1989)
Drasgow, Fritz
The accuracy of marginal maximum likelihood estimates of the item parameters of the two-parameter logistic model was investigated. Estimates were obtained for four sample sizes and four test lengths; joint maximum likelihood estimates were also computed for the two longer test lengths. Each condition was replicated 10 times, which allowed evaluation of the accuracy of estimated item characteristic curves, item parameter estimates, and estimated standard errors of item parameter estimates for individual items.
Items typical of a widely used job satisfaction scale, and moderately easy tests, had satisfactory marginal estimates for all sample sizes and test lengths. Larger samples were required for items with extreme difficulty or discrimination parameters. Marginal estimation was substantially better than joint maximum likelihood estimation.
Index terms: Fletcher-Powell algorithm, item parameter estimation, item response theory, joint maximum likelihood estimation, marginal maximum likelihood estimation, two-parameter logistic model.

Item: Grouped versus randomized format: An investigation of scale convergent and discriminant validity using LISREL confirmatory factor analysis (1989)
Schriesheim, Chester A.; Solomon, Esther; Kopelman, Richard E.
LISREL maximum likelihood confirmatory factor analyses (Jöreskog & Sörbom, 1984) were conducted to explore the effects of two questionnaire formats (grouping versus randomizing items) on the convergent and discriminant validity of two sets of questionnaire measures. The first set of measures consisted of satisfaction scales that had demonstrated acceptable psychometric properties in earlier studies; the second set consisted of job characteristics measures that had shown discriminant validity problems in previous research. Correlational data were collected from two groups of employed business administration students (N = 80 in each group) concurrently (Study 1) and at two points in time (Study 2). The results showed that the grouped format was superior to the random format, particularly with respect to the weaker measures (the job characteristics scales). The results also illustrated and supported the usefulness of LISREL confirmatory factor analysis in studies of convergent and discriminant validity.
Index terms: confirmatory factor analysis, convergent validity, discriminant validity, LISREL analysis, questionnaire formats, scale validity.

Item: Inhibition in prolonged work tasks (1989)
Van der Ven, Ad H. G. S.; Smit, J. C.; Jansen, R. W. T. L.
A new model is presented that explains reaction time fluctuations in prolonged work tasks. The model extends the so-called Poisson-Erlang model and can account for long-term trend effects in the reaction time curve. The model is consistent with Spearman’s hypothesis that inhibition increases during work and decreases during rest. Predictions concerning the long-term trend were tested against data from the Bourdon-Vos cancellation test. The long-term trend in both the mean and the variance was perfectly described by the model, and the model was also supported by a goodness-of-fit test comparing frequency distributions of observed and simulated reaction times.
Index terms: concentration, continuous work, distraction, inhibition, prolonged work, reaction time, response time.

Item: Measuring change by means of a hybrid variant of the linear logistic model with relaxed assumptions (1989)
Formann, Anton K.; Spiel, Christiane
The linear logistic model with relaxed assumptions (LLRA) was developed for measuring change in qualitative data. It assumes item-specific person parameters and thus does not require homogeneous items to be presented to the persons at two points in time. The hybrid variant of this model maintains the multidimensionality of the person parameters, but allows for different sets of items, each of which is presented only once. In the model, a Rasch-homogeneous item at t2, with possibly differing difficulty, corresponds to each item at t1. A short description of both models is followed by a first application of the hybrid LLRA to empirical data from a study on text comprehension. This example not only serves to demonstrate possible results when applying the LLRA, but is also used to outline the principle of hypothesis testing and model controls.
Index terms: dichotomous data, linear logistic model, measuring change, Rasch model, text comprehension.
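Two classical formulas recur in the abstracts above: the Spearman-Brown step-up on which Woodruff and Sawyer base their pass-fail reliability estimates, and the correction for attenuation behind the disattenuated correlations reported by Henly et al. The sketch below shows only these textbook formulas for orientation; it is not taken from any of the papers listed, and the function names and numeric values are illustrative.

```python
from math import sqrt

def spearman_brown(r_half: float) -> float:
    """Step up the correlation between two parallel half-tests to the
    estimated reliability of the full-length test (textbook formula,
    not the specific estimators derived by Woodruff and Sawyer)."""
    return 2 * r_half / (1 + r_half)

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for measurement error in both
    variables (classical correction for attenuation)."""
    return r_xy / sqrt(rel_x * rel_y)

# Illustrative values only (not data from the studies above):
full_rel = spearman_brown(0.60)          # half-test correlation .60 -> .75
true_r = disattenuate(0.80, 0.85, 0.90)  # observed .80, reliabilities .85 and .90
```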