Applied Psychological Measurement, Volume 06, 1982

Now showing 1 - 20 of 36
  • Item
    The development and application of a computerized information-processing test battery
    (1982) Barrett, Gerald V.; Alexander, Ralph A.; Doverspike, Dennis; Cellar, Douglas; Thomas, Jay C.
    To bridge the gap between computerized testing and information-processing-based measurement, a battery of computerized information-processing-based ability and preference measures was developed. The information-processing and preference measures and a battery of paper-and-pencil tests were administered to 64 college students. Although the internal-consistency reliabilities of the computerized information-processing measures were adequate, test-retest reliabilities were lower than desirable for ability measures. The computerized information-processing measures possessed moderate convergent validity but had low correlations with traditional paper-and-pencil measures. Of the computerized preference measures, the most promising results were obtained with the Stimulus Pace measure. A major problem with the use of the computerized information-processing measures in applied settings would be administration time, as the battery took approximately 4 hours. In addition, problems with the stability of results over time and substantial practice effects suggest that even longer testing sessions would be required to obtain reliable measures. Although information-processing measures of short-term memory have, at best, low correlations with traditional intelligence tests, their ability to predict performance on real-world tasks has yet to be sufficiently researched.
  • Item
    Improving measurement quality and efficiency with adaptive testing
    (1982) Weiss, David J.
    Approaches to adaptive (tailored) testing based on item response theory are described and research results summarized. Through appropriate combinations of item pool design and use of different test termination criteria, adaptive tests can be designed (1) to improve both measurement quality and measurement efficiency, resulting in measurements of equal precision at all trait levels; (2) to improve measurement efficiency for test batteries using item pools designed for conventional test administration; and (3) to improve the accuracy and efficiency of testing for classification (e.g., mastery testing). Research results show that tests based on item response theory (IRT) can achieve measurements of equal precision at all trait levels, given an adequately designed item pool; these results contrast with those of conventional tests, which require a tradeoff of bandwidth for fidelity/precision of measurements. Data also show reductions in bias, inaccuracy, and root mean square error of ability estimates. Improvements in test fidelity observed in simulation studies are supported by live-testing data, which showed adaptive tests requiring half as many items as conventional tests to achieve equal levels of reliability, and almost one-third as many to achieve equal levels of validity. When used with item pools from conventional tests, both simulation and live-testing results show reductions in test battery length relative to conventional tests, with no reductions in the quality of measurements. Adaptive tests designed for dichotomous classification also represent improvements over conventional tests designed for the same purpose. Simulation studies show reductions in test length and improvements in classification accuracy for adaptive vs. conventional tests; live-testing studies in which adaptive tests were compared with "optimal" conventional tests support these findings. Thus, the research data show that IRT-based adaptive testing takes advantage of the capabilities of IRT to improve the quality and/or efficiency of measurement for each examinee.
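
The item-selection principle underlying such adaptive tests, administering at each step the item that is most informative at the examinee's current trait estimate, can be sketched briefly. The sketch below assumes a two-parameter logistic (2PL) response model; the function names and the small item pool are illustrative, not taken from the article.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta_hat, pool, administered):
    """Select the unadministered item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# Illustrative pool of (discrimination a, difficulty b) pairs.
pool = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.2)]
nxt = pick_next_item(0.0, pool, administered={1})
```

After each response, the ability estimate is updated and the selection step repeats, which is what concentrates precision at the examinee's own trait level.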
  • Item
    Standard error of an equating by item response theory
    (1982) Lord, Frederic M.
    A formula is derived for the asymptotic standard error of a true-score equating by item response theory. The equating method is applicable when the two tests to be equated are administered to different groups along with an anchor test. Numerical standard errors are shown for an actual equating (1) comparing the standard errors of the IRT, linear, and equipercentile methods and (2) illustrating the effect of the length of the anchor test on the standard error of the equating.
  • Item
    Latent trait models and ability parameter estimation
    (1982) Andersen, Erling B.
    In recent years several authors have viewed latent trait models for binary data as special models for contingency tables. This connection to contingency table analysis is used as the basis for a survey of various latent trait models. This article discusses estimation of item parameters by conditional, direct, and marginal maximum likelihood methods, and estimation of individual latent parameters as opposed to an estimation of the parameters of a latent population density. Various methods for testing the goodness of fit of the model are also described. Several of the estimators and tests are applied to a data set concerning consumer complaint behavior.
  • Item
    Adaptive EAP estimation of ability in a microcomputer environment
    (1982) Bock, R. Darrell; Mislevy, Robert J.
    Expected a posteriori (EAP) estimation of ability, based on numerical evaluation of the mean and variance of the posterior distribution, is shown to have unusually good properties for computerized adaptive testing. The calculations are not complex, proceed noniteratively by simple summation of log likelihoods as items are added, and require only values of the response function obtainable from precalculated tables at a limited number of quadrature points. Simulation studies are reported showing the near equivalence of the posterior standard deviation and the standard error of measurement. When adaptive testing terminates at a fixed posterior standard deviation criterion of .90 or better, the regression of the EAP estimator on true ability is virtually linear with slope equal to the reliability, and the measurement error is homogeneous, in the range ±2.5 standard deviations.
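
The noniterative computation the abstract describes, accumulating log likelihoods over a fixed grid of quadrature points and then taking the posterior mean and standard deviation, can be sketched as follows. A two-parameter logistic response function and a standard normal prior are assumed here for illustration; the function name and parameter values are not from the article.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_quad=61, lo=-4.0, hi=4.0):
    """EAP ability estimate: posterior mean and SD over fixed quadrature points.

    responses: list of 0/1 item scores; items: list of (a, b) 2PL parameters.
    Prior: standard normal, evaluated on an equally spaced grid.
    """
    step = (hi - lo) / (n_quad - 1)
    nodes = [lo + k * step for k in range(n_quad)]
    # Accumulate log prior + log likelihood at each node (noniterative).
    log_post = [-0.5 * t * t for t in nodes]
    for u, (a, b) in zip(responses, items):
        for k, t in enumerate(nodes):
            p = p_2pl(t, a, b)
            log_post[k] += math.log(p) if u == 1 else math.log(1.0 - p)
    # Normalize and take posterior mean and standard deviation.
    m = max(log_post)
    w = [math.exp(lp - m) for lp in log_post]
    total = sum(w)
    mean = sum(t * wk for t, wk in zip(nodes, w)) / total
    var = sum((t - mean) ** 2 * wk for t, wk in zip(nodes, w)) / total
    return mean, math.sqrt(var)

ability, psd = eap_estimate([1, 0, 1], [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)])
```

Each new item adds one term to every node's log posterior, so the estimate and its posterior standard deviation can be updated after every response during adaptive testing.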
  • Item
    A nonparametric approach to the analysis of dichotomous item responses
    (1982) Mokken, Robert J.; Lewis, Charles
    An item response theory is discussed which is based on purely ordinal assumptions about the probabilities that people respond positively to items. It is considered as a natural generalization of both Guttman scaling and classical test theory. A distinction is drawn between construction and evaluation of a test (or scale) on the one hand and the use of a test to measure and make decisions about persons’ abilities on the other. Techniques to deal with each of these aspects are described and illustrated with examples.
  • Item
    Some applications of logistic latent trait models with linear constraints on the parameters
    (1982) Fischer, Gerhard H.; Formann, Anton K.
    The linear logistic test model (LLTM), a Rasch model with linear constraints on the item parameters, is described. Three methods of parameter estimation are dealt with, giving special consideration to the conditional maximum likelihood approach, which provides a basis for the testing of structural hypotheses regarding item difficulty. Standard areas of application of the LLTM are surveyed, including many references to empirical studies in item analysis, item bias, and test construction; and a novel type of application to response-contingent dynamic processes is presented. Finally, the linear logistic model with relaxed assumptions (LLRA) for measuring change is introduced as a special case of an LLTM; it allows the characterization of individuals in a multidimensional latent space and the testing of hypotheses regarding effects of treatments.
  • Item
    Linear versus nonlinear models in item response theory
    (1982) McDonald, Roderick P.
    A broad framework for examining the class of unidimensional and multidimensional models for item responses is provided by nonlinear factor analysis, with a classification of models as strictly linear, linear in their coefficients, or strictly nonlinear. These groups of models are compared and contrasted with respect to the associated problems of estimation, testing fit, and scoring an examinee. The invariance of item parameters is related to the congruence of common factors in linear theory.
  • Item
    Advances in item response theory and applications: An introduction
    (1982) Hambleton, Ronald K.; Van der Linden, Wim J.
    Test theories can be divided roughly into two categories. The first is classical test theory, which dates back to Spearman’s conception of the observed test score as a composite of true and error components, and which was introduced to psychologists at the beginning of this century. Important milestones in its long and venerable tradition are Gulliksen’s Theory of Mental Tests (1950) and Lord and Novick’s Statistical Theories of Mental Test Scores (1968). The second is item response theory, or latent trait theory, as it has been called until recently. At the present time, item response theory (IRT) is having a major impact on the field of testing. Models derived from IRT are being used to develop tests, to equate scores from nonparallel tests, to investigate item bias, and to report scores, as well as to address many other pressing measurement problems (see, e.g., Hambleton, 1983; Lord, 1980). IRT differs from classical test theory in that it assumes a different relation of the test score to the variable measured by the test. Although there are parallels between models from IRT and psychophysical models formulated around the turn of the century, only in the last 10 years has IRT had any impact on psychometricians and test users. Work by Rasch (1960/1980), Fischer (1974), Birnbaum (1968), Wright and Panchapakesan (1969), Bock (1972), and Lord (1974) has been especially influential in this turnabout; and Lazarsfeld’s pioneering work on latent structure analysis in sociology (Lazarsfeld, 1950; Lazarsfeld & Henry, 1968) has also provided impetus. One objective of this introduction is to review the conceptual differences between classical test theory and IRT. A second objective is to introduce the goals of this special issue on item response theory and the seven papers. Some basic problems with classical test theory are reviewed in the next section. Then, IRT approaches to educational and psychological measurement are presented and compared to classical test theory. The final two sections present the goals for this special issue and an outline of the seven invited papers.
  • Item
    A comparison of the accuracy of four methods for clustering jobs
    (1982) Zimmerman, Ray; Jacobs, Rick; Farr, James L.
    Four methods of cluster analysis were examined for their accuracy in clustering simulated job analytic data. The methods included hierarchical mode analysis, Ward’s method, the k-means method from a random start, and k-means based on the results of Ward’s method. Thirty data sets, which differed according to number of jobs, number of population clusters, number of job dimensions, degree of cluster separation, and size of population clusters, were generated using a Monte Carlo technique. The results from each of the four methods were then compared to the actual classifications. The performance of hierarchical mode analysis was significantly poorer than that of the other three methods. Correlations were computed to determine the effects of the five data set variables on the accuracy of each method. From an applied perspective, these relationships indicate which method is most appropriate for a given data set. These results are discussed in the context of certain limitations of this investigation. Suggestions are also made regarding future directions for cluster analysis research.
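
Of the four methods compared, k-means from a random start is the simplest to sketch. The minimal implementation below (plain Python, illustrative names, with no ties to the article's simulation design) alternates assignment and centroid-update steps until the centers stop moving; Ward's method and hierarchical mode analysis are not reproduced here.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means from a random start on a list of feature-vector tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((x - c[i]) ** 2 for i, x in enumerate(p)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to its cluster mean.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(sum(m[i] for m in members) / len(members)
                                         for i in range(dim)))
            else:
                new_centers.append(c)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated "job" profiles in two dimensions (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, clusters = kmeans(points, 2)
```

Seeding k-means with the cluster means produced by Ward's method, as in the fourth method the abstract mentions, only changes the initialization line.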
  • Item
    Sequential testing for selection
    (1982) Weitzman, R. A.
    In sequential testing for selection, an applicant for school or work responds via a computer terminal to one item at a time until an acceptance or rejection decision can be made with a preset probability of error. The test statistic, as a function of item difficulties for standardization subgroups scoring within successive quantiles of the criterion, is an approximation of a Waldian probability ratio that should improve as the number of quantiles increases. Monte Carlo simulation of 1,000 first-year college students under 96 different testing conditions indicated that a quantile number as low as four could yield observed error rates that are close to their nominal values with mean test lengths between 5 and 47. Application to real data, for which interpolative estimation of the quantile item difficulties was necessary, produced, with quantile numbers of four and five, even more accurate observed error rates than the Monte Carlo studies did. Truncation at 70 items narrowed the range of mean test lengths for the real data to between 5 and 19. Important for use in selection, the critical values of the test statistics are functions not only of the nominal error rates but also, alternatively, of the selection ratio, the base-rate success probability, and the success probability among selectees, which a test user is free to choose.
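
The decision logic rests on Wald's sequential probability ratio test: responses accumulate a log likelihood ratio until it crosses an acceptance or rejection boundary fixed by the nominal error rates. A minimal sketch for simple Bernoulli hypotheses is given below; it illustrates the Waldian boundaries only, not the article's quantile-based test statistic, and all names are illustrative.

```python
import math

def sprt_decision(responses, p0, p1, alpha, beta):
    """Wald sequential probability ratio test on 0/1 responses.

    H0: success probability p0 (reject the applicant);
    H1: success probability p1 > p0 (accept the applicant).
    alpha, beta: nominal error rates. Returns (decision, log likelihood ratio).
    """
    upper = math.log((1.0 - beta) / alpha)   # accept H1 when llr >= upper
    lower = math.log(beta / (1.0 - alpha))   # accept H0 when llr <= lower
    llr = 0.0
    for u in responses:
        if u == 1:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "accept", llr
        if llr <= lower:
            return "reject", llr
    return "continue", llr

decision, llr = sprt_decision([1, 1, 1, 1, 1], p0=0.4, p1=0.8, alpha=0.05, beta=0.05)
```

Because testing stops as soon as either boundary is crossed, clear-cut applicants are decided after a handful of items while borderline applicants answer more, which is why mean test length varies so widely across conditions.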
  • Item
    Bounds on the k out of n reliability of a test, and an exact test for hierarchically related items
    (1982) Wilcox, Rand R.
    Consider an n-item multiple-choice test where it is decided that an examinee knows the answer if and only if he/she gives the correct response. The k out of n reliability of the test, Qk, is defined to be the probability that, for a randomly sampled examinee, at least k correct decisions are made about whether the examinee knows the answer to an item. The paper describes and illustrates how an extension of a recently proposed latent structure model can be used in conjunction with results in Sathe, Pradhan, and Shah (1980) to estimate upper and lower bounds on Qk. A method of empirically checking the model is discussed.
  • Item
    A study of pre-equating based on item response theory
    (1982) Bejar, Isaac I.; Wingersky, Marilyn S.
    This article reports a feasibility study using item response theory (IRT) as a means of equating the Test of Standard Written English (TSWE). The study focused on the possibility of pre-equating, that is, deriving the equating transformation prior to the final administration of the test. The three-parameter logistic model was postulated as the response model, and its fit was assessed at the item, subscore, and total score levels. Minor problems were found at each of these levels; but, on the whole, the three-parameter model was found to portray the data well. The adequacy of the equating provided by IRT procedures was investigated in two TSWE forms. It was concluded that pre-equating does not appear to present problems beyond those inherent in IRT equating.
  • Item
    Choice of test model for appropriateness measurement
    (1982) Drasgow, Fritz
    Several theoretical and empirical issues that must be addressed before appropriateness measurement can be used by practitioners are investigated in this paper. These issues include selection of a latent trait model for multiple-choice tests, selection of a particular appropriateness index, and the sample size required for parameter estimation. The three-parameter logistic model is found to provide better detection of simulated spuriously low examinees than the Rasch model for the Graduate Record Examination, Verbal Section. All three appropriateness indices proposed by Levine and Rubin (1979) provide good detection of simulated spuriously low examinees but poor detection of simulated spuriously high examinees. A reason for this discrepancy is provided.
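
Appropriateness indices of this kind are typically functions of the log likelihood of an examinee's response pattern under the fitted model. The sketch below implements a standardized log-likelihood index in the style of lz under a two-parameter logistic model; it illustrates the general idea and is not necessarily one of the three Levine and Rubin (1979) indices.

```python
import math

def lz_index(responses, items, theta):
    """Standardized log-likelihood appropriateness index (lz-style) under 2PL.

    responses: 0/1 item scores; items: (a, b) parameters; theta: ability estimate.
    Large negative values flag response patterns that fit the model poorly.
    """
    l0 = expected = var = 0.0
    for u, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        q = 1.0 - p
        l0 += math.log(p) if u == 1 else math.log(q)           # observed log likelihood
        expected += p * math.log(p) + q * math.log(q)          # its expectation
        var += p * q * (math.log(p / q)) ** 2                  # its variance
    return (l0 - expected) / math.sqrt(var)

# Five items ordered easy to hard; theta = 0 for both patterns (illustrative).
items = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
consistent = lz_index([1, 1, 1, 0, 0], items, theta=0.0)
aberrant = lz_index([0, 0, 0, 1, 1], items, theta=0.0)
```

A spuriously low examinee (failing easy items, passing hard ones) produces a far more negative index than a model-consistent pattern, which is the detection behavior the study evaluates.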
  • Item
    Academic achievement and individual differences in the learning processes of basic skills students in the university
    (1982) Moss, Carolyn J.
    This study analyzed the relationship between the academic achievement and information-processing habits of basic skills students in the university. Academic achievement was measured by grade-point average (GPA) and American College Testing Program Assessment (ACT) scores. Information-processing habits were determined by the Inventory of Learning Processes (ILP). There was no significant difference in the ILP profiles of high- and low-achieving basic skills students, whether they were grouped by ACT or GPA. Study Methods was the only scale that showed a significant correlation with academic achievement, namely a negative correlation with ACT. A path analysis indicated that the effect of Study Methods on GPA is indirect, as mediated by ACT. Since ACT assesses prior achievement (i.e., high-school performance), it appears that learning style has an effect prior to college entrance. Basic skills students with low ACT scores tend to substitute conventional study methods for deep elaborative processing, but these students are low achievers in college, as indicated by their GPA. A multivariate analysis of variance showed no significant sex or ethnic differences in information-processing habits. Evidently, a low achiever is a low achiever regardless of sex or ethnicity.
  • Item
    Comparison of factor analytic results with two-choice and seven-choice personality item formats
    (1982) Comrey, Andrew L.; Montag, I.
    A translated version of the Comrey Personality Scales (CPS) using a two-choice item format was administered to 159 male applicants for a motor vehicle operator’s license in Israel. Total scores were computed for the 40 homogeneous item subgroups that define the eight personality factors in the taxonomy underlying the CPS. Factor analysis of the intercorrelations among these 40 subvariables resulted in substantial replication of factors found in a previous study employing a seven-choice item format. On the average, higher intercorrelations among subvariables measuring the same factor and higher factor loadings were obtained for the seven-choice item format results. These findings suggest a superiority for the seven-choice over the two-choice item format for personality inventories.
  • Item
    An application of singular value decomposition to the factor analysis of MMPI items
    (1982) Reddon, John R.; Marceau, Roger; Jackson, Douglas N.
    Several measurement problems were identified in the literature concerning the fidelity with which the Minnesota Multiphasic Personality Inventory (MMPI) assesses psychopathology. A straightforward solution to some of these problems is to develop an orthogonal basis in the MMPI; however, there are 550 items, and this is a cumbersome task even for modern computers. The method of alternating least squares was employed to yield a singular value decomposition of these measures on 682 prison inmates. Unsystematic or sample-specific error variance was minimized through a two-stage least squares split thirds replication design. The relative explanatory power of models of psychopathology based on external, internal, naive, and construct-oriented measurement strategies is discussed.
  • Item
    Identifying test items that perform differentially in population subgroups: A partial correlation index
    (1982) Stricker, Lawrence J.
    Verbal items on the GRE Aptitude Test were analyzed for race (white vs. black) and sex differences in their functioning, using a new procedure, item partial correlations with subgroup standing (race or sex) controlling for total score, as well as two standard methods: comparisons of subgroups’ item characteristic curves and comparisons of item difficulties. The partial correlation index agreed with the item characteristic curve index in the proportions of items identified as performing differentially for each race and sex. These two indexes also agreed in the particular items that they identified as functioning differentially for the sexes, but not in the items that they identified as performing differently for the races. The partial correlation index consistently disagreed with the item difficulty index in the proportions of items identified as functioning differentially and in the particular items involved. The items identified by the partial correlation index as performing differentially, like the items identified by the other indexes, generally did not differ in type or content from items not so identified, with one major exception: this index identified items with female content as functioning differently for the sexes.
  • Item
    Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study
    (1982) Hulin, Charles L.; Lissak, Robin I.; Drasgow, Fritz
    This Monte Carlo study assessed the accuracy of simultaneous estimation of item and person parameters in item response theory. Item responses were simulated using the two- and three-parameter logistic models. Samples of 200, 500, 1,000, and 2,000 simulated examinees and tests of 15, 30, and 60 items were generated. Item and person parameters were then estimated using the appropriate model. The root mean squared error between recovered and actual item characteristic curves served as the principal measure of estimation accuracy for items. The accuracy of ability estimates was assessed by both correlation and root mean squared error. The results indicate that minimum sample sizes and test lengths depend upon the response model and the purposes of an investigation. With item responses generated by the two-parameter model, tests of 30 items and samples of 500 appear adequate for some purposes. Estimates of ability and item parameters were less accurate in small sample sizes when item responses were generated by the three-parameter logistic model. Here samples of 1,000 examinees with tests of 60 items seem to be required for highly accurate estimation. Tradeoffs between sample size and test length are apparent, however.
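
The data-generation step of such a Monte Carlo study, drawing dichotomous responses from the three-parameter logistic model, can be sketched in a few lines. The 1.7 scaling constant is the conventional normal-ogive approximation, and all parameter values below are illustrative rather than the article's settings.

```python
import math
import random

def simulate_3pl(abilities, items, seed=0):
    """Simulate 0/1 item responses under the three-parameter logistic model.

    abilities: list of theta values, one per simulated examinee.
    items: list of (a, b, c) = discrimination, difficulty, lower asymptote.
    """
    rng = random.Random(seed)
    data = []
    for theta in abilities:
        row = []
        for a, b, c in items:
            # 3PL response probability with the conventional 1.7 scaling.
            p = c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# Illustrative run: two examinees, 200 identical items.
data = simulate_3pl([-2.0, 2.0], [(1.0, 0.0, 0.2)] * 200)
```

Recovery studies then fit the model to matrices like this and compare the estimated item characteristic curves with the generating ones, for example by root mean squared error over a grid of ability values.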
  • Item
    Communication apprehension: An assessment of Australian and United States data
    (1982) Hansford, B. C.; Hattie, John
    This study assessed the claims of unidimensionality for a measure of oral communication apprehension (Personal Report of Communication Apprehension). Eighteen independent samples, drawn from Australian and United States sources, were used, and comparisons were made between the samples. Although similarities were found among the data sets with respect to internal consistency, frequency distributions, and item-total correlations, the claim of unidimensionality in the measure was rejected. It was also found that there were no overall differences between Australian and United States samples, no sex differences, and no age differences.