Applied Psychological Measurement, Volume 10, 1986
Persistent link for this collection: https://hdl.handle.net/11299/100667
Showing 20 of the 33 items in this collection.
Item: Simple and weighted unfolding threshold models for the spatial representation of binary choice data (1986). DeSarbo, Wayne S.; Hoffman, Donna L.
This paper describes the development of an unfolding methodology designed to analyze "pick any" or "pick any/n" binary choice data (e.g., decisions to buy or not to buy various products). Maximum likelihood estimation procedures are used to obtain a joint space representation of both persons and objects. A review of the relevant literature concerning the spatial treatment of such binary choice data is presented. The nonlinear logistic model is described, as is the alternating maximum likelihood algorithm used to estimate the parameter values. Results of an application of the spatial choice model to a synthetic data set in a Monte Carlo analysis are presented, and an application concerning consumer (intended) choices for nine competitive brands of sports cars is discussed. Future research may provide a means of generalizing the model to accommodate three-way choice data.

Item: Assessing the dimensionality of a set of test items (1986). Hambleton, Ronald K.; Rovinelli, Richard J.
This study compared four methods of determining the dimensionality of a set of test items: linear factor analysis, nonlinear factor analysis, residual analysis, and a method developed by Bejar (1980). Five artificial test datasets (40 items, 1,500 examinees) were generated to be consistent with the three-parameter logistic model and the assumption of either a one- or a two-dimensional latent space. Two variables were manipulated: (1) the correlation between the traits (r = .10 or r = .60) and (2) the percentage of test items measuring each trait (50% measuring each trait, or 75% measuring the first trait and 25% measuring the second).
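The three-parameter logistic model used to generate these artificial datasets gives the probability of a correct response as a function of a latent trait, an item's discrimination (a), difficulty (b), and pseudo-guessing lower asymptote (c). A minimal sketch, with illustrative parameter values rather than those of the study:

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function: probability
    that an examinee with ability theta answers the item correctly.
    a = discrimination, b = difficulty, c = pseudo-guessing asymptote.
    The 1.7 scaling constant is the usual normal-ogive approximation."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Illustrative values: at theta == b the logistic part equals 1/2,
# so the probability is c + (1 - c) / 2.
p_at_difficulty = p_correct(theta=0.0, a=1.0, b=0.0, c=0.2)
```

The function rises monotonically from c toward 1 as theta increases, which is what distinguishes the 3PL from the one- and two-parameter logistic models.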
While linear factor analysis overestimated the number of underlying dimensions in every instance, nonlinear factor analysis with linear and quadratic terms correctly determined the item dimensionality in the three datasets where it was used. Both the residual analysis method and Bejar's method proved disappointing. These results suggest the need for extreme caution in using linear factor analysis, residual analysis, and Bejar's method until further investigation confirms their adequacy. Nonlinear factor analysis appears to be the most promising of the four methods, but more experience in applying it seems necessary before wide-scale use can be recommended.

Item: A multivariate perspective on the analysis of categorical data (1986). Zwick, Rebecca; Cramer, Elliot M.
Psychological research often involves analysis of an I x J contingency table consisting of the responses of J groups of individuals on a criterion variable with I nominal categories. The conventional statistical approach for comparing responses across groups is the Pearson chi-square test. Alternatively, this analysis can be viewed as a multivariate analysis of variance with binary dependent variables, a canonical correlation analysis with two sets of binary variables, or a form of correspondence analysis. Although these approaches stem from different traditions, they produce equivalent results when applied to an I x J table.

Item: Linking item parameters onto a common scale (1986). Vale, C. David
An item bank typically contains items from several tests that have been calibrated by administering them to different groups of examinees. The parameters of the items must be linked onto a common scale. A linking technique consists of an anchoring design and a transformation method. Four basic anchoring designs are the unanchored, anchor-items, anchor-group, and double-anchor designs.
The transformation method consists of the system of equations used to translate the anchor information and put the item parameters on a common scale. Several transformation methods are discussed briefly. A simulation study is presented that compared the equivalent-groups method with the anchor-items method, using varying numbers of common items, both when the groups were equivalent and when they were not. The results confirm previous findings that the equivalent-groups method is adequate when the groups are in fact equivalent. When the groups are not equivalent, accurate linking can be obtained with as few as two common items. A more efficient interlaced anchor-items design can provide accurate linking without the expense of including explicit common items in each test.

Item: Factor indeterminacy in generalizability theory (1986). Ward, David G.
Generalizability theory and common factor analysis are both based on the random effects model of the analysis of variance, and both are subject to the factor indeterminacy problem: the unobserved random variables (common factor scores or universe scores) are indeterminate. In the one-facet (repeated measures) design, the extent to which true or universe scores and common factor scores are not uniquely defined is shown to be a function of the dependability (reliability) of the data. The minimum possible correlation between equivalent common factor scores is a lower-bound estimate of reliability.

Item: A cautionary note on the use of LISREL's automatic start values in confirmatory factor analysis studies (1986). Brown, R. L.
The accuracy of parameter estimates provided by the major computer programs for confirmatory factor analysis studies is questioned.
This note demonstrates an inconsistency in parameter estimates across two of the major programs (LISREL and EQS), attributable to the use of LISREL VI's automatic start values in the estimation of generalized least squares models.

Item: Survey research measurement issues in evaluating change: A laboratory investigation (1986). Armenakis, Achilles A.; Buckley, M. Ronald; Bedeian, Arthur G.
Efforts to operationalize the alpha/beta/gamma change typology have suffered from a notable limitation: virtually all have been conducted in field settings, limiting the degree of experimental control over outcome criteria. Recognizing this limitation, the present study employed a laboratory methodology to investigate two research questions related to scale recalibration (beta change) in temporal survey research. This methodology permitted random respondent assignment, exact replication of stimuli, and systematic variation of the time interval in the pretest-posttest design. These procedures also permitted testing the use of the retrospective design in assessing organizational change. Implications of the findings for the measurement of change are discussed.

Item: Some applications of optimization algorithms in test design and adaptive testing (1986). Theunissen, T. J. J. M.
Some test design problems can be seen as combinatorial optimization problems. Several suggestions are presented, with various possible applications. Results obtained thus far are promising; the methods suggested can also be used with highly structured test specifications.

Item: An estimator of examinee-level measurement error variance that considers test form difficulty adjustments (1986). Jarjoura, David
A model and estimator for examinee-level measurement error variance are developed.
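The simple binomial error model that serves as the baseline here treats an examinee's number-correct score on an n-item test as a binomial draw, so the conditional error variance is n * zeta * (1 - zeta) for true proportion-correct zeta. A minimal sketch (the values are illustrative):

```python
def binomial_error_variance(n_items, true_proportion):
    """Conditional error variance of the number-correct score under the
    simple binomial error model: Var(X | zeta) = n * zeta * (1 - zeta)."""
    return n_items * true_proportion * (1.0 - true_proportion)

# Illustrative: error variance peaks for mid-range examinees and
# shrinks toward the extremes of the score scale.
mid_range = binomial_error_variance(40, 0.5)   # 40 * .25, i.e. 10.0
extreme = binomial_error_variance(40, 0.9)     # 40 * .09, about 3.6
```

The per-examinee dependence on zeta is what makes examinee-level error estimates differ from a single test-level figure such as the one implied by KR-20.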
Although the binomial distribution is basic to the modeling, the proposed error model provides some insights into problems associated with simple binomial error and yields estimates of error that are quite distinct from binomial error. By taking into account the test form difficulty adjustments often used in standardized tests, the model is also linked to indices designed for identifying unusual item response patterns. In addition, average error variance under the model is approximately that which would be obtained through a KR-20 estimate of reliability, providing a unique justification for this popular index. Empirical results using odd-even and alternate-forms measures of error variance tend to favor the proposed model over the binomial.

Item: The Mokken scale: A critical discussion (1986). Roskam, Edward E.; Van den Wollenberg, Arnold L.; Jansen, Paul G. W.
The Mokken scale is critically discussed. It is argued that Loevinger's H, adapted by Mokken and advocated as a coefficient of scalability, is sensitive to properties of the item set that are extraneous to Mokken's requirement of holomorphy of item response curves. Therefore, when defined in terms of H, the Mokken scale is ambiguous. It is further argued that item-selection-free statistical inferences concerning the latent person order appear to be insufficiently based on double monotony alone, and that the Rasch model is the only item response model fulfilling this requirement. Finally, it is contended that the Mokken scale is an unfruitful compromise between the requirements of a Guttman scale and those of classical test theory.

Item: Rejoinder to "The Mokken scale: A critical discussion" (1986). Mokken, Robert J.; Lewis, Charles; Sijtsma, Klaas
The nonparametric approach to constructing and evaluating tests based on binary items proposed by Mokken has been criticized by Roskam, van den Wollenberg, and Jansen.
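Loevinger's H, the scalability coefficient at the center of this exchange, compares the Guttman errors actually observed in a binary data matrix with those expected if the items were independent. A minimal sketch of the scale-level coefficient (the example matrices are illustrative):

```python
def mokken_H(X):
    """Loevinger's H for a binary person-by-item matrix X (list of rows),
    as adapted by Mokken: H = 1 - (observed Guttman errors) /
    (errors expected under marginal independence), summed over item pairs
    with items ordered by popularity (proportion of 1s)."""
    n = len(X)
    k = len(X[0])
    p = [sum(row[j] for row in X) / n for j in range(k)]
    order = sorted(range(k), key=lambda j: -p[j])   # easiest item first
    observed = expected = 0.0
    for a in range(k - 1):
        for b in range(a + 1, k):
            i, j = order[a], order[b]               # item i easier than item j
            # Guttman error: passing the harder item while failing the easier one
            observed += sum(1 for row in X if row[i] == 0 and row[j] == 1)
            expected += n * (1.0 - p[i]) * p[j]
    return 1.0 - observed / expected

# A perfect Guttman scale produces no observed errors, so H = 1:
perfect = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 0]]
```

H near 0 indicates responses close to independence; Roskam et al.'s complaint is that intermediate H values also reflect item-set properties beyond the shape of the item response curves.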
It is contended that their arguments misrepresent the objectives of this approach, that their criticisms of the role of the H coefficient in the procedures are irrelevant or erroneous, and that they fail to distinguish the inherent requirements (and limitations) of general nonparametric models and procedures from those of parametric ones. It is concluded that Mokken's procedures provide a useful tool for researchers in the social sciences who wish to construct and evaluate tests for measuring theoretically meaningful latent traits while avoiding the strong parametric assumptions of traditional item response theory.

Item: Covariance and regression slope models for studying validity generalization (1986). Raju, Nambury S.; Fralicx, Rodney; Steinhaus, Stephen D.
Two new models, the covariance and regression slope models, are proposed for assessing validity generalization. The new models are less restrictive in that they require only one hypothetical distribution for their implementation (a distribution of range restriction for the covariance model, and a distribution of predictor reliability for the regression slope model), in contrast to the correlation model, which requires hypothetical distributions for criterion reliability, predictor reliability, and range restriction. The new models are somewhat limited in their applicability, however, since both assume common metrics for predictors and criteria across validation studies. Several Monte Carlo simulation studies showed the new models to be quite accurate in estimating the mean and variance of population true covariances and regression slopes.
The results also showed that the accuracy of the covariance, regression slope, and correlation models is affected by the degree to which the hypothetical distributions of artifacts match their true distributions; the regression slope model appears to be slightly more robust than the other two.

Item: Optimal detection of certain forms of inappropriate test scores (1986). Drasgow, Fritz; Levine, Michael V.
Optimal appropriateness indices, recently introduced by Levine and Drasgow (1984), provide the highest rates of detection of aberrant response patterns that can be obtained from item responses. In this article they are used to study three important problems in appropriateness measurement. First, the maximum detection rates of two particular forms of aberrance are determined for a long unidimensional test; these detection rates are shown to be moderately high. Second, two versions of the standardized l0 appropriateness index are compared to optimal indices. At low false alarm rates, one standardized l0 index has detection rates about 65% as large as optimal for spuriously high (cheating) test scores. However, for the spuriously low scores expected from persons with ill-advised testing strategies or reading problems, both standardized l0 indices are far from optimal. Finally, detection rates for polychotomous and dichotomous scorings of the item responses are compared. Dichotomous scoring is shown to cause serious decreases in the detectability of some aberrant response patterns. Consequently, appropriateness measurement constitutes one practical testing problem in which significant gains result from the use of a polychotomous item response model.

Item: Graphical analysis of item response theory residuals (1986). Ludlow, Larry H.
A graphical comparison of empirical versus simulated residual variation is presented as one way to assess the goodness of fit of an item response theory model.
The two forms of residual variation were generated through separate calibration of empirical data and of data "tailored" to fit the model, given the empirical parameter estimates. The paper presents an analytic method for isolating and identifying departures from the fit of an item response theory (IRT) model, with techniques focused on the graphical comparison of empirical residual variation to baseline residual variation; the baseline residuals, produced by the tailored data, serve as the reference background against which the empirical residuals may be understood. Although the Rasch model is applied in this paper, the principles discussed and illustrated hold for the residual analysis of any IRT model.

Item: The robustness of Rasch estimates (1986). Van de Vijver, Fons J.
The small-scale applicability of Rasch estimates was investigated under simulated conditions of guessing and heterogeneity in item discrimination. The accuracy of the Rasch estimates was evaluated by means of the correlation between the item/person parameters and their estimates, the standard deviations of the estimates, and the difference as well as the root mean squared difference between parameters and estimates. Within the range of the present investigation (10 to 50 items and 25 to 500 persons), these criteria yielded favorable results under conditions of heterogeneous item discrimination. Under conditions of guessing, robustness could be demonstrated only for the correlational criterion; guessing affects the difference measures between parameter values and estimates quite strongly and systematically.
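The Rasch model at issue gives P(correct) = exp(theta - b) / (1 + exp(theta - b)), a function of the person-item difference alone. One way to simulate the guessing condition is to mix in a fixed success probability for examinees who would not otherwise answer correctly; the mixing form and values below are an illustrative sketch, not the study's exact design:

```python
import math
import random

def rasch_p(theta, b):
    """Rasch model: P(correct) depends only on the difference theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_response(theta, b, guess_c, rng):
    """One simulated binary response when non-knowers guess correctly with
    probability guess_c, so P = c + (1 - c) * rasch_p (illustrative form)."""
    p = guess_c + (1.0 - guess_c) * rasch_p(theta, b)
    return 1 if rng.random() < p else 0

rng = random.Random(0)
# Guessing inflates success rates most for low-ability examinees, so data
# generated this way depart systematically from the Rasch model being fit.
responses = [simulate_response(-2.0, 0.0, 0.25, rng) for _ in range(1000)]
```

Fitting Rasch estimates to such contaminated data is what produces the systematic parameter-estimate differences the study reports.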
It is argued that, notwithstanding these estimation errors, the Rasch model is to be preferred over nonstandard estimation procedures, whose validity is unclear, or over the three-parameter model, with its computational problems in small samples.

Item: The changing conception of measurement in education and psychology (1986). Van der Linden, Wim J.
Since the era of Binet and Spearman, classical test theory and the ideal of the standard test have gone hand in hand, in part because both are based on the same paradigm of experimental control by manipulation and randomization. Their longevity is a consequence of this mutually beneficial symbiosis. A new type of theory and practice in testing is replacing the standard test with the test item bank, and classical test theory with item response theory. This paper shows how these, too, reinforce and complete each other.

Item: Banking non-dichotomously scored items (1986). Masters, Geoffrey N.; Evans, John
A method for constructing a bank of items scored in two or more ordered response categories is described and illustrated. This method enables multistep problems, rating scale items, question "clusters," and other items using partial credit scoring to be calibrated and incorporated into an item bank, and it provides a mechanism for computerized adaptive testing with items of this type. Procedures are described for calibrating an initial set of items, for testing the fit of items to the underlying measurement model, and for linking new items to an existing item bank. The method is illustrated using items from the Watson-Glaser Critical Thinking Appraisal.

Item: The changing conception of measurement: A commentary (1986). Hambleton, Ronald K.
This paper comments on the contributions to this special issue on item banking. A historical framework for viewing the papers is provided by brief reviews of the literature on item response theory, item banking, and computerized testing.
In general, the eight papers are viewed as contributing valuable technical knowledge for implementing testing programs with the aid of item banks.

Item: Small N does not always justify Rasch model (1986). De Gruijter, Dato N. M.
In many applications of item response theory, it is of little consequence whether the Rasch model or a more accurate but more complicated item response model is used. With small sample sizes, it might be advantageous to employ the Rasch model. A clear counterexample is the case of optimal item selection under guessing.

Item: Perspective on educational measurement (1986). Gulliksen, Harold
An important but usually neglected aspect of the training of teachers is instruction in the art of writing good classroom tests. Such training should emphasize various forms of objective items (e.g., multiple-choice, master list, matching, greater-less-same, best-worst answer, and matrix format). The proper formulation and accurate grading of essay items should be included, as should the use of various types of free-answer items (e.g., the brief answer, interlinear, and "fill in the blanks in the following paragraph" forms). For courses involving laboratory work, such as science, machine shop, and home economics, performance and identification tests based on the laboratory work should be used. A second point is that organizations developing aptitude tests for nonacademic areas, such as police work, fire fighting, and licensing, should emphasize the client's use of a valid, reliable, and unbiased criterion. Organizations developing academic aptitude tests should also (1) be alert to the accuracy of criterion measures such as grades and rank in class; (2) call teachers' attention to defects in grading; and (3) help guide teachers and schools in improving these procedures.
In recent decades, there have been few instances in which a testing organization has apprised teachers of the fact that their criteria (among others, grades on tests and student papers) are often quite unreliable, are influenced by characteristics such as work habits and attitude in class, and could be improved by using better tests to evaluate student performance. The characteristics of the group used for determining validity are also critical.