Applied Psychological Measurement, Volume 10, 1986
Persistent link for this collection: https://hdl.handle.net/11299/100667
Browsing Applied Psychological Measurement, Volume 10, 1986 by Title
Now showing 1 - 20 of 33
Assessing the dimensionality of a set of test items (1986)
Hambleton, Ronald K.; Rovinelli, Richard J.
This study compared four methods of determining the dimensionality of a set of test items: linear factor analysis, nonlinear factor analysis, residual analysis, and a method developed by Bejar (1980). Five artificial test datasets (for 40 items and 1,500 examinees) were generated to be consistent with the three-parameter logistic model and the assumption of either a one- or a two-dimensional latent space. Two variables were manipulated: (1) the correlation between the traits (r = .10 or r = .60) and (2) the percent of test items measuring each trait (50% measuring each trait, or 75% measuring the first trait and 25% measuring the second trait). While linear factor analysis in all instances overestimated the number of underlying dimensions in the data, nonlinear factor analysis with linear and quadratic terms led to correct determination of the item dimensionality in the three datasets where it was used. Both the residual analysis method and Bejar’s method proved disappointing. These results suggest the need for extreme caution in using linear factor analysis, residual analysis, and Bejar’s method until more investigations of these methods can confirm their adequacy. Nonlinear factor analysis appears to be the most promising of the four methods, but more experience in applying the method seems necessary before wide-scale use can be recommended.

Banking non-dichotomously scored items (1986)
Masters, Geoffrey N.; Evans, John
A method for constructing a bank of items scored in two or more ordered response categories is described and illustrated. This method enables multistep problems, rating scale items, question "clusters," and other items using partial credit scoring to be calibrated and incorporated into an item bank, and it provides a mechanism for computer adaptive testing with items of this type.
Procedures are described for calibrating an initial set of items, for testing the fit of items to the underlying measurement model, and for linking new items to an existing item bank. The method is illustrated using items from the Watson-Glaser Critical Thinking Appraisal.

A cautionary note on the use of LISREL's automatic start values in confirmatory factor analysis studies (1986)
Brown, R. L.
The accuracy of parameter estimates provided by the major computer programs for confirmatory factor analysis studies is questioned. This note demonstrates an inconsistency in parameter estimates across two of the major programs (LISREL and EQS), with the inconsistency attributed to the use of LISREL VI’s automatic start values for the estimation of generalized least squares models.

The changing conception of measurement in education and psychology (1986)
Van der Linden, Wim J.
Since the era of Binet and Spearman, classical test theory and the ideal of the standard test have gone hand in hand, in part because both are based on the same paradigm of experimental control by manipulation and randomization. Their longevity is a consequence of this mutually beneficial symbiosis. A new type of theory and practice in testing is replacing the standard test by the test item bank, and classical test theory by item response theory. In this paper it is shown how these also reinforce and complete each other.

The changing conception of measurement: A commentary (1986)
Hambleton, Ronald K.
This paper comments on the contributions to this special issue on item banking. An historical framework for viewing the papers is provided by brief reviews of the literature in the areas of item response theory, item banking, and computerized testing.
In general, the eight papers are viewed as contributing valuable technical knowledge for implementing testing programs with the aid of item banks.

A comparison of the eigenvalue method and the geometric mean procedure for ratio scaling (1986)
Budescu, David V.; Zwick, Rami; Rapoport, Amnon
This article evaluates and compares the performance of two ratio scaling methods, the eigenvalue method proposed by Saaty (1977, 1980) and the geometric mean procedure advocated by Williams and Crawford (1980), given random data. The two methods were examined in a series of Monte Carlo simulations for two response methods (direct estimation and constant sum) and various numbers of stimuli and response scales. The sampling distributions of the measures of consistency of the two methods were tabulated, rules for detecting and rejecting inconsistent respondents are outlined, and approximation formulas for other designs are derived. Overall, there was a high level of agreement and correspondence between the results from the two scaling techniques even when the data were random.

Covariance and regression slope models for studying validity generalization (1986)
Raju, Nambury S.; Fralicx, Rodney; Steinhaus, Stephen D.
Two new models, the covariance and regression slope models, are proposed for assessing validity generalization. The new models are less restrictive in that they require only one hypothetical distribution (distribution of range restriction for the covariance model and distribution of predictor reliability for the regression slope model) for their implementation, in contrast to the correlation model which requires hypothetical distributions for criterion reliability, predictor reliability, and range restriction. The new models, however, are somewhat limited in their applicability since they both assume common metrics for predictors and criteria across validation studies.
Several simulation (Monte Carlo) studies showed the new models to be quite accurate in estimating the mean and variance of population true covariances and regression slopes. The results also showed that the accuracy of the covariance, regression slope, and correlation models is affected by the degree to which hypothetical distributions of artifacts match their true distributions; the regression slope model appears to be slightly more robust than the other two models.

Development of a testing service system (1986)
Van Thiel, Catharina C.; Zwarts, Michel A.
The development of an integrated system for the storage of items and the construction and analysis of tests is described. The system is being developed both as a general facility for the Dutch Institute of Educational Measurement and as a support system for the use and maintenance of item banks in schools. The methodology of developing the system is described with attention to the system architecture and to the results of the first stage of the system development.

Effect of dissimulation motivation and anxiety on response pattern appropriateness measures (1986)
Birenbaum, Menucha
This study examined the effect of anxiety and dissimulation motivation of job applicants on their performance on an ability test. Two aspects of performance were considered: the total score and the appropriateness score. Four IRT-based appropriateness indices for detecting aberrant response patterns were employed in this study. The results indicate a negative effect of dissimulation motivation on the performance of low anxiety scorers, with respect to both the total score and the appropriateness score, with a greater effect on the latter. This effect was evidenced by an erratic or aberrant response pattern on the ability test; that is, missing relatively easy items while answering more difficult ones correctly.
The results are discussed in light of the diverse interpretations concerning the meaning of Lie scales.

Effect of examinee group on equating relationships (1986)
Harris, Deborah J.; Kolen, Michael J.
Many educational tests make use of multiple test forms, which are then horizontally equated to establish interchangeability among forms. To have confidence in this interchangeability, the equating relationships should be robust to the particular group of examinees on which the equating is conducted. This study investigated the effects of ability of the examinee group used to establish the equating relationship on linear, equipercentile, and three-parameter logistic IRT estimated true score equating methods. The results show all of the methods to be reasonably independent of examinee group, and suggest that population independence is not a good reason for selecting one method over another.

An empirical Bayesian approach to item banking (1986)
Van der Linden, Wim J.; Eggen, Theo J. H. M.
A procedure for the sequential optimization of the calibration of an item bank is given. The procedure is based on an empirical Bayesian approach to a reformulation of the Rasch model as a model for paired comparisons between the difficulties of test items in which ties are allowed to occur. First, it is shown how a paired-comparisons design deals with the usual incompleteness of calibration data and how the item parameters can be estimated using this design. Next, the procedure for a sequential optimization of the item parameter estimators is given, both for individuals responding to pairs of items and for item and examinee groups of any size.
The paper concludes with a discussion of the choice of the first priors in the procedure and the problems involved in its generalization to other item response models.

Equivalence of conventional and computer presentation of speed tests (1986)
Greaud, Valerie A.; Green, Bert F.
This study examined the effects of computer presentation on speeded clerical tests. Two ratio scores (average number of correct responses per minute and its inverse, average number of seconds per correct response) were examined as variants of the conventional score, number of correct responses in a fixed interval of time. Ratio scores were more reliable than number-correct scores and were less sensitive to testing time. Tests administered on the computer were found to be at least as reliable as conventionally administered tests, but examinees were much faster in the computer mode. Correlations between paper-and-pencil and computer modes were high, except when task differences were introduced by computer implementation.

An estimator of examinee-level measurement error variance that considers test form difficulty adjustments (1986)
Jarjoura, David
A model and estimator for examinee-level measurement error variance are developed. Although the binomial distribution is basic to the modeling, the proposed error model provides some insights into problems associated with simple binomial error, and yields estimates of error that are quite distinct from binomial error. By taking into consideration test form difficulty adjustments often used in standardized tests, the model is linked also to indices designed for identifying unusual item response patterns. In addition, average error variance under the model is approximately that which would be obtained through a KR-20 estimate of reliability, thus providing a unique justification for this popular index.
Empirical results using odd-even and alternate-forms measures of error variance tend to favor the proposed model over the binomial.

An exploration of the robustness of four test equating models (1986)
Skaggs, Gary; Lissitz, Robert W.
This Monte Carlo study explored how four commonly used test equating methods (linear, equipercentile, and item response theory methods based on the Rasch and three-parameter models) responded to tests of different psychometric properties. The four methods were applied to generated data sets where mean item difficulty and discrimination as well as level of chance scoring were manipulated. In all cases, examinee ability was matched to the level of difficulty of the tests. The results showed the Rasch model not to be very robust to violations of the equal discrimination and non-chance scoring assumptions. There were also problems with the three-parameter model, but these were due primarily to estimation and linking problems. The recommended procedure for tests similar to those studied is the equipercentile method.

Factor indeterminacy in generalizability theory (1986)
Ward, David G.
Generalizability theory and common factor analysis are based upon the random effects model of the analysis of variance, and both are subject to the factor indeterminacy problem: The unobserved random variables (common factor scores or universe scores) are indeterminate. In the one-facet (repeated measures) design, the extent to which true or universe scores and common factor scores are not uniquely defined is shown to be a function of the dependability (reliability) of the data. The minimum possible correlation between equivalent common factor scores is a lower bound estimate of reliability.

Graphical analysis of item response theory residuals (1986)
Ludlow, Larry H.
A graphical comparison of empirical versus simulated residual variation is presented as one way to assess the goodness of fit of an item response theory model.
The two forms of residual variation were generated through the separate calibration of empirical data and data "tailored" to fit the model, given the empirical parameter estimates. A variety of techniques illustrate the utility of using tailored residuals as a specific baseline against which empirical residuals may be understood. This paper presents an analytic method for isolating and identifying departures from the fit of an item response theory (IRT) model. The specific techniques employed focus on the graphical comparison of empirical residual variation to baseline residual variation. The baseline variation is the result of data generated to fit the model, given the empirical parameter estimates. The baseline residuals thus serve as the reference background for interpreting the empirical residuals. Although the Rasch model is applied in this paper, the principles that are discussed and illustrated hold for the residual analysis of any IRT model.

Item banking in computer-based instructional systems (1986)
Baker, Frank B.
This paper examines item banking within computer-based instructional systems from both a systems and a measurement perspective. Traditionally, computer-aided instruction involves little testing, although there is a trend to incorporate posttests in the sessions. However, computer-managed instruction has incorporated testing since its inception. The tests employed are similar in most respects to teacher-made classroom tests. The test results are used as the basis for diagnosis, prescription, and management procedures for individual or small groups of students. At the classroom level, test banking may be more appropriate than item banking.
Because of the tight linkage of the tests to instructional procedures, the basic measurement issue appears to be the degree to which the approaches evolved from standardized achievement testing can be applied to the large number of short tests employed in computer-based instructional systems.

Linking item parameters onto a common scale (1986)
Vale, C. David
An item bank typically contains items from several tests that have been calibrated by administering them to different groups of examinees. The parameters of the items must be linked onto a common scale. A linking technique consists of an anchoring design and a transformation method. Four basic anchoring designs are the unanchored, anchor-items, anchor-group, and double-anchor designs. The transformation design consists of the system of equations that is used to translate the anchor information and put the item parameters on a common scale. Several transformation methods are discussed briefly. A simulation study is presented that compared the equivalent-groups method with the anchor-items method, using varying numbers of common items, applied both to the situation in which the groups were equivalent and one in which they were not. The results confirm previous findings that the equivalent-groups method is adequate when the groups are in fact equivalent. When the groups are not equivalent, accurate linking can be obtained with as few as two common items. Linking using a more efficient interlaced anchor-items design can provide accurate linking without the expense of including explicit common items in each of the tests.

Methodology review: Analysis of multitrait-multimethod matrices (1986)
Schmitt, Neal; Stults, Daniel M.
Procedures for analyzing multitrait-multimethod (MTMM) matrices are reviewed. Confirmatory factor analysis (Jöreskog, 1974) is presented as a general model allowing evaluation of the discriminant and convergent validity of MTMM matrices, both as a whole and in individual trait-method units.
However, it is noted that this model is deficient with regard to analysis of trait-method interactions of the type described by Campbell and O’Connell (1967, 1982). Composite direct product models described by Browne (1984) are one possible solution to this problem. Further, more systematic use of hypothesis testing regarding convergent and discriminant validity in nested hierarchical models is recommended (Widaman, 1985), as well as the use of a procedure to cross-validate models of MTMM matrices described by Cudeck and Browne (1983).

The Mokken scale: A critical discussion (1986)
Roskam, Edward E.; Van den Wollenberg, Arnold L.; Jansen, Paul G. W.
The Mokken scale is critically discussed. It is argued that Loevinger’s H, adapted by Mokken and advocated as a coefficient of scalability, is sensitive to properties of the item set which are extraneous to Mokken’s requirement of holomorphy of item response curves. Therefore, when defined in terms of H, the Mokken scale is ambiguous. It is furthermore argued that item-selection free statistical inferences concerning the latent person order appear to be insufficiently based on double monotony alone, and that the Rasch model is the only item response model fulfilling this requirement. Finally, it is contended that the Mokken scale is an unfruitful compromise between the requirements of a Guttman scale and the requirements of classical test theory.