Applied Psychological Measurement, Volume 08, 1984
Item: The validity of item bias techniques with math word problems (1984). Ironson, Gail; Homan, Susan; Willis, Ruth; Signer, Barbara.
Item bias research has compared methods empirically using both computer simulation with known amounts of bias and real data with unknown amounts of bias. This study extends previous research by "planting" biased items in the realistic context of math word problems. "Biased" items are those whose reading level is too high for a group of students, so that the items cannot assess those students' math knowledge. Of the three methods assessed (Angoff's transformed difficulty, Camilli's full chi-square, and Linn and Harnisch's item response theory (IRT) approach), only the IRT approach performed well. Removing the biased items had a minor effect on validity for the minority group.

Item: Two simple models for rater effects (1984). De Gruijter, Dato N. M.
In many examinations, the essays of different examinees are rated by different rater pairs. This paper discusses the estimation of rater effects for rating designs in which rater pairs overlap in a special way. Two models for rater effects are considered: an additive model and a nonlinear model. An illustration with empirical data is provided.

Item: Relationships between the Thurstone, Coombs, and Rasch approaches to item scaling (1984). Jansen, Paul G. W.
Andrich (1978) derived a formal equivalency between Thurstone's Case V specialization of the law of comparative judgment for paired comparisons, with a logistic function substituted for the normal, and the Rasch model for direct responses. The equivalency was corroborated by a specific substantive psychological interpretation of the Rasch binary item response probability. Studying the relationship between the Thurstone and Rasch models from a perspective other than Andrich's, namely a data-theoretical point of view, it appears that the equivalency rests on an implicit assumption about the subject population.
This assumption (1) is rather restrictive, so that its empirical validity seems low, and (2) appears to contradict the substantive reasoning corroborating the Thurstone-Rasch equivalency. It is argued that the Thurstone model cannot be considered the sample-independent pair comparison counterpart of the Rasch model. An alternative pair comparison equivalent of the Rasch model is tentatively proposed. Finally, the theoretical and practical implications of Andrich's study and of the present study are discussed.

Item: Errors of measurement and standard setting in mastery testing (1984). Kane, Michael T.; Wilson, Jennifer.
A number of studies have estimated the dependability of domain-referenced mastery tests for a fixed cutoff score. Other studies have estimated the dependability of judgments about the cutoff score. Each of these two types of dependability introduces error. Brennan and Lockwood (1980) analyzed the two kinds of error together but assumed that the two sources of error were uncorrelated. This paper extends that analysis of the total error in estimates of the difference between the domain score and the cutoff score to allow for covariance between the two types of error.

Item: An application of latent class models to assessment data (1984). Haertel, Edward.
Responses of 17-year-olds to selected 1977-78 National Assessment of Educational Progress (NAEP) mathematics exercises were analyzed using latent class models. A single model was fitted to data from five independent samples of examinees, each of which responded to a different set of six algebra or prealgebra exercises. Four categories of items were found, defining five levels of content mastery, ranging from examinees unable to solve any of the exercises (43%) to those able to solve all of them (19%).
The methods demonstrated are broadly applicable to assessment data, including matrix-sampled data, and provide an aggregate description of examinee abilities that is independent of the specific characteristics of the individual exercises administered.

Item: An investigation of methods for reducing sampling error in certain IRT procedures (1984). Wingersky, Marilyn S.; Lord, Frederic M.
The sampling errors of maximum likelihood estimates of item response theory parameters are studied for the case in which both person and item parameters are estimated simultaneously. A check on the validity of the standard error formulas is carried out. The effects of varying sample size, test length, and the shape of the ability distribution are investigated. Finally, the effect of anchor-test length on the standard error of item parameters is studied numerically for the situation, common in equating studies, in which two groups of examinees each take a different test form together with the same anchor test. The results encourage the use of rectangular or bimodal ability distributions, and also the use of very short anchor tests.

Item: Reply to van der Linden's "Thoughts on the use of decision theory to set cutoff scores" (1984). De Gruijter, Dato N. M.; Hambleton, Ronald K.

Item: Item profile analysis for tests developed according to a table of specifications (1984). Kolen, Michael J.; Jarjoura, David.
An approach to analyzing items is described that emphasizes the heterogeneous nature of many achievement and professional certification tests. The approach focuses on the categories of a table of specifications, which often serves as a blueprint for constructing such tests. It is characterized by profile comparisons of observed and expected correlations of item scores with category scores. A multivariate generalizability theory model provides the foundation for the approach, and the concept of a profile of expected correlations is derived from the model.
Data from a professional certification testing program are used for illustration, and an attempt is made to provide links with test development issues and generalizability theory.

Item: Relationship between corresponding Armed Services Vocational Aptitude Battery (ASVAB) and computerized adaptive testing (CAT) subtests (1984). Moreno, Kathleen E.; Wetzel, C. Douglas; McBride, James R.; Weiss, David J.
The relationships between selected subtests from the Armed Services Vocational Aptitude Battery (ASVAB) and corresponding subtests administered as computerized adaptive tests (CAT) were investigated using Marine recruits as subjects. Three adaptive subtests were shown to correlate as well with the ASVAB as did a second administration of the ASVAB, even though the CAT subtests contained only half as many items. Factor analysis showed the CAT subtests to load on the same factors as the corresponding ASVAB subtests, indicating that the same abilities were being measured. Preenlistment Armed Forces Qualification Test (AFQT) composite scores were predicted as well from the CAT subtest scores as from the retest ASVAB subtest scores, even though the CAT contained only three of the four AFQT subtests. It is concluded that CAT can achieve the same measurement precision as a conventional test with half the number of items.

Item: Ability metric transformations involved in vertical equating under item response theory (1984). Baker, Frank B.
The metric transformations of the ability scales involved in three equating techniques (external anchor test, internal anchor test, and a pooled-groups procedure) were investigated. Simulated item response data for two unique tests and a common test were obtained for two groups that differed in mean ability and variability. The obtained metrics for various combinations of groups and tests were transformed to a common metric and then to the underlying ability metric.
The results showed reasonable agreement between the transformed obtained metrics and the underlying ability metric. They also showed that the largest errors in the ability score statistics occurred under the external anchor test procedure and the smallest under the pooled-groups procedure. Although the pooled-groups procedure performed well, it was affected by unequal variances in the two groups of examinees.

Item: Examination of an extension of Guttman's model of ability tests (1984). Tziner, Aharon; Rimmer, Avigdor.
An extension of Guttman's structural model of ability tests was devised and investigated with two samples consisting, respectively, of 335 and 225 males. The examinees in the first sample came for vocational guidance after their military service and were administered a 17-test battery. The second sample consisted of applicants for various jobs in an organization and was administered a 14-test battery. For each sample, a matrix of intercorrelations between scores was obtained based on the number of correct responses. The matrices were submitted to Guttman-Lingoes Smallest Space Analysis. The two-dimensional structure found was a radex in which (1) the facet of the language of presentation radially divided the space and (2) the facet of mental operation formed concentric rings. The significance of these findings for theoretical and applied problems relating to ability tests is discussed.

Item: Multivariate generalizability theory in educational measurement: An empirical study (1984). Nußbaum, Albert.
Multivariate generalizability theory was applied to the assessment of student achievement in art education. Twenty-five art students rated the paintings of 60 fourth-grade students with regard to three criteria. Paintings were made on four different topics. The results indicate that generalizability is low with respect to different raters and moderate with respect to different topics. The three ratings a rater gave on a single painting were moderately correlated.
As the covariance-component results indicate, nearly half of the covariance between the three criteria arose because the three ratings came from the same rater. Expected values for Q²(∆) are reported for different D study designs.

Item: Evaluating reading diagnostic tests: An application of confirmatory factor analysis to multitrait-multimethod data (1984). Marsh, Herbert W.; Butler, Susan.
Diagnostic reading tests, in contrast to achievement tests, claim to measure specific components of ability hypothesized to be important for diagnosis or remediation. A minimal condition for demonstrating the construct validity of such tests is that they can validly differentiate between the reading traits they claim to measure (e.g., comprehension, sound discrimination, blending). This condition is rarely tested, but multitrait-multimethod (MTMM) designs are ideally suited for the purpose. This is demonstrated in two studies based on the 1966 version of the Stanford Diagnostic Reading Test (SDRT). In each study, the application of the Campbell-Fiske guidelines and confirmatory factor analysis (CFA) to the MTMM data indicated that the SDRT subscales could be explained in terms of a method/halo effect and a general reading factor not specific to any of the subscales; this refutes the construct validity of the 1966 version of the SDRT as a diagnostic test. Other diagnostic tests probably suffer the same weakness and should also be evaluated in MTMM studies.

Item: Scaling distortion in numerical conjoint measurement (1984). Nickerson, Carol A.; McClelland, Gary H.
Proponents of numerical conjoint measurement generally assume that the technique's goodness-of-fit measure will detect an inappropriate composition rule or the presence of random response error.
In this paper, a number of hypothetical and real preference rank orderings are analyzed using both axiomatic conjoint measurement and numerical conjoint measurement to demonstrate that this assumption is not warranted and may result in a distorted scaling.

Item: Comparison of two methods to identify major personality factors (1984). Comrey, Andrew L.
Both Howarth and Comrey have developed taxonomies of personality traits and inventories to measure them. The Howarth Personality Questionnaire and Additional Personality Factor inventories include 20 factors, whereas the Comrey Personality Scales (CPS) taxonomy includes eight factors. Howarth identified his factors through factor analysis of items, whereas Comrey identified his primary-level factors through factor analysis of conceptually distinct clusters of homogeneous items, called Factored Homogeneous Item Dimensions (FHIDs), while avoiding the inclusion of highly redundant variables in the same analysis. Data for all three inventories were collected from the same subjects and factor analyzed. The Howarth factor scales were narrower in content and more highly overlapping than the CPS factor scales. Most of the Howarth factor scales were good marker variables for the CPS primary factors. Five CPS factors had major loadings for more than one of the Howarth factor scales. The CPS Emotional Stability vs. Neuroticism (S) primary-level factor was split into several lower-level factors in the Howarth system. Factor analysis of items is recommended to identify FHIDs. Factor analysis of FHIDs, in which no two FHIDs are merely alternate forms of the same conceptual variable, is recommended to identify the major primary factors of personality.

Item: On problems encountered using decision theory to set cutoff scores (1984). De Gruijter, Dato N. M.; Hambleton, Ronald K.
In the decision-theoretic approach to determining a cutoff score, the cutoff score chosen is the one that maximizes the expected utility of pass/fail decisions.
This approach is not without its problems. In this paper several of them are considered: inaccurate parameter estimates, choice of test model and its consequences, choice of subpopulations, optimal cutoff scores on various occasions, and cutoff scores as targets. It is suggested that these problems will need to be overcome, or at least understood more thoroughly, before the full potential of the decision-theoretic approach can be realized in practice.

Item: Thorndike, Thurstone, and Rasch: A comparison of their methods of scaling psychological and educational tests (1984). Engelhard, George, Jr.
The purpose of this study is to describe and compare the methods used by Thorndike, Thurstone, and Rasch for calibrating test items. Thorndike and Thurstone represent a traditional psychometric approach to this problem, whereas Rasch represents a more modern conceptualization derived from latent trait theory. These three major theorists in psychological and educational measurement were concerned with a common set of issues that seem to recur in a cyclical manner in psychometric theory. One such issue involves the invariance of item parameters. Each recognized the importance of eliminating the effects of an arbitrary sample in the estimation of item parameters; the differences generally arise from the specific methods chosen to deal with the problem. Thorndike attempted to solve the problem of item invariance by adjusting for mean differences in ability distributions. Thurstone extended Thorndike's work by proposing two adjustments, adding an adjustment for differences in the dispersions of ability to Thorndike's adjustment for mean differences. Rasch's method implies a third adjustment, which involves the addition of a response model for each person in the sample.
Data taken from Trabue (1916) are used to illustrate and compare how Thorndike, Thurstone, and Rasch would approach a common problem, namely, the calibration of a single set of items administered to several groups.

Item: Eigenvalue shrinkage in principal components based factor analysis (1984). Bobko, Philip; Schemmer, F. Mark.
The concept of shrinkage, as (1) a statistical phenomenon of estimator bias and (2) a reduction in explained variance resulting from cross-validation, is explored for statistics based on sample eigenvalues. Analytic solutions and previous research imply that the magnitude of eigenvalue shrinkage is a function of the type of shrinkage, sample size, the number of variables in the correlation matrix, the ordinal root position, the population eigenstructure, and the choice of principal components analysis or principal factors analysis. Hypotheses relating these independent variables to the magnitude of shrinkage were tested by means of a Monte Carlo simulation. In particular, the independent variable of population eigenstructure is shown to have an important effect on shrinkage. Finally, regression equations are derived that describe the linear relation of population and cross-validated eigenvalues to the original eigenvalues, sample size, ordinal position, and the number of variables factored. These equations are a valuable tool that allows researchers to predict eigenvalue shrinkage accurately from available sample information.

Item: Comparison of direct and indirect methods for setting minimum passing scores (1984). Reilly, Richard R.; Zink, Donald L.; Israelski, Edmond W.
Several studies have compared different judgmental methods of setting passing scores by estimating item difficulties for the minimally competent examinee. Usually, a direct method of estimating item difficulties has been compared with an indirect method suggested by Nedelsky (1954).
Nedelsky's method has usually resulted in a substantially lower cutoff score than that arrived at with a direct method. Two studies were carried out to compare a direct method of setting passing scores with an indirect method that allowed judges to estimate the probability of the minimally competent examinee eliminating each incorrect alternative. In Study 1, a sample of 52 first-level supervisors used both methods to estimate passing scores on a content-oriented selection test for building maintenance specialists. In Study 2, a sample of 62 first-level supervisors used both methods to estimate passing scores on an entry-level auto mechanics test. Results of both studies showed that the variance component for method was relatively small and that for raters was relatively large. Reliability estimates of judgments and correlations between judged difficulties and empirical difficulties showed the Angoff (1971) approach to be slightly superior. Results showed no particular advantage to using an indirect approach for estimating minimal competence.

Item: Homogeneity analysis of test score data: A confrontation with the latent trait approach (1984). De Gruijter, Dato N. M.
In homogeneity analysis, or dual scaling, weights for item categories are obtained that maximize Cronbach's alpha. In this paper these weights are compared with the optimal scoring weights in the latent trait approach. This is done on the basis of data generated according to the two-parameter logistic model. As expected from a theoretical analysis, the homogeneity weights show less variation than the optimal weights of latent trait theory. It is argued that the homogeneity weights should not be used for item selection.
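The setup described in the final abstract can be made concrete with a minimal sketch (not taken from the paper itself): generate dichotomous responses under the two-parameter logistic (2PL) model, then compute Cronbach's alpha, the criterion that homogeneity analysis maximizes through its choice of category weights. All parameter values and the helper name cronbach_alpha below are illustrative assumptions, not quantities from the study.

```python
# Illustrative sketch, assuming arbitrary 2PL parameter values:
# simulate item responses and compute Cronbach's alpha for the
# unweighted (unit-weight) scoring that homogeneity analysis improves upon.
import numpy as np

rng = np.random.default_rng(0)

n_persons, n_items = 1000, 10
theta = rng.normal(size=n_persons)        # latent abilities
a = rng.uniform(0.5, 2.0, size=n_items)   # item discriminations
b = rng.normal(size=n_items)              # item difficulties

# 2PL model: P(X = 1 | theta) = 1 / (1 + exp(-a * (theta - b)))
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
x = (rng.random((n_persons, n_items)) < p).astype(float)

def cronbach_alpha(scores):
    """Cronbach's alpha for an n_persons x n_items score matrix."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

alpha = cronbach_alpha(x)
print(f"alpha for unit weights: {alpha:.3f}")
```

Dual scaling searches over category weights to push this alpha as high as possible, whereas the latent trait approach scores items by weights derived from the fitted 2PL parameters; the paper's comparison is between those two weight sets, not something this sketch reproduces.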