Applied Psychological Measurement, Volume 04, 1980
Browsing by Title; showing items 1-20 of 42.
Agreement coefficients as indices of dependability for domain-referenced tests (1980)
Kane, Michael T.; Brennan, Robert L.

A large number of seemingly diverse coefficients have been proposed as indices of dependability, or reliability, for domain-referenced and/or mastery tests. In this paper it is shown that most of these indices are special cases of two generalized indices of agreement: one that is corrected for chance and one that is not. The special cases of these two indices are determined by assumptions about the nature of the agreement function or, equivalently, the nature of the loss function for the testing procedure. For example, indices discussed by Huynh (1976), Subkoviak (1976), and Swaminathan, Hambleton, and Algina (1974) employ a threshold agreement, or loss, function, whereas indices discussed by Brennan and Kane (1977a, 1977b) and Livingston (1972a) employ a squared-error loss function. Since all of these indices are discussed within a single general framework, the differences among them in their assumptions, properties, and uses can be exhibited clearly. For purposes of comparison, norm-referenced generalizability coefficients are also developed and discussed within this general framework.

An approach to measuring the achievement or proficiency of an examinee (1980)
Wilcox, Rand R.

Various school systems are developing proficiency tests which are conceptualized as representing a variety of skills, with one or more items per skill. This paper discusses how certain recent technical advances might be extended to examine these tests. In contrast to previous analyses, errors at the item level are included, and it is shown that inclusion of these errors implies that a substantially longer test might be needed.
One approach to this problem is described, and directions for future research are suggested.

Behavioral expectation scales versus nonanchored and trait rating systems: A sales personnel application (1980)
Ivancevich, John M.

Several empirical comparisons between behavioral expectation scales (BES) and other rating scales are presently available (Bernardin, 1977; Borman & Dunnette, 1975; Burnaska & Hollman, 1974; Keaveny & McGann, 1975). In many of these studies the rater-ratee population has consisted of faculty members and students (Bernardin, 1977; Keaveny & McGann, 1975; Schwab, Heneman, & Decotiis, 1975). Only a handful of scientifically sound investigations comparing BES and other rating scales have used manager-subordinate populations. The importance of sales personnel in organizations, the general lack of previous research using sales employees as subjects in examining BES, and the significance of performance evaluation prompted the present study. Specifically, the study examined estimates of rating leniency, halo error, interrater agreement, and the degree of ratee differentiation of BES, nonanchored, and trait evaluation systems.

Some new measures of profile dissimilarity (1980)
Budescu, David V. (Applied Psychological Measurement, 4, 261-272; doi:10.1177/014662168000400212)

Four new measures of multidimensional profile dissimilarity are proposed that are (1) either symmetric or asymmetric and (2) either conditional or unconditional on profile shape. The four similarity indices are based on alternative normalizations of the regular distance (D) statistic of Cronbach and Gleser (1953), all taking values between 0 and 1.
Methods of calculation and interpretations of the indices are demonstrated and discussed, and several generalizations are suggested.

Calculation of adjusted response frequencies using least squares regression methods (1980)
Overall, John E.

The use of general linear regression methods for the analysis of categorical data is recommended. The general linear model analysis of a 0,1-coded response variable produces estimates of the same response probabilities that might otherwise be estimated from frequencies in a multiway contingency table. When factors in the design are correlated, the regression analysis estimates the same response probabilities that would be estimated from the simple marginal frequencies in a balanced orthogonal design. The independent effects that are estimated by the regression analysis are the unweighted means of the response probabilities in various cells of a cross-classification design; however, it is not necessary that all cells in a complex design be filled in order for the estimates to have that interpretation. The advantages of the general linear model analysis include the familiarity of most psychologists with the methods, the availability of computer programs, and the ease of application to problems that are too complex for development of complete multiway contingency tables.

Comments on criterion-referenced testing (1980)
Livingston, Samuel A.

The six papers in this issue summarize 10 years of theory development, empirical research, and practical experience in criterion-referenced testing. Much of the theory development has focused on questions and issues raised by Popham and Husek (1969), who pointed out that much of traditional psychometric theory did not work well when applied to criterion-referenced tests. The six papers, taken together, represent an attempt to answer four basic questions: 1. How should the reliability of a criterion-referenced test be measured? 2.
How should it be decided how many items are needed in a criterion-referenced test? 3. How should criterion-referenced tests be used to make decisions about the people taking the tests? 4. What kind of evidence should be provided for the validity of a criterion-referenced test? Attempts to answer these questions have been complicated by the lack of a universally accepted, unambiguous definition of the term "criterion-referenced test." Glaser’s (1963) article, in which the term first appeared, defined criterion-referenced measures as those that "depend on an absolute standard of quality" (p. 519). However, Glaser went on to say that "the standard against which a student’s performance is compared when measured in this manner is the behavior which defines each point along the achievement continuum" (p. 519) and that "we need to behaviorally specify minimum levels of performance..." (p. 520). These two ideas, absolute standards and behavioral test content specifications, received varying degrees of emphasis from the different individuals who attempted to develop criterion-referenced tests and to theorize about criterion-referenced testing. As a result, there are now several different answers to some of the questions that Popham and Husek (1969) raised.

The comparative validity of questionnaire data (16PF scales) and objective test data (O-A Battery) in predicting five peer-rating criteria (1980)
Goldberg, Lewis R.; Norman, Warren T.; Schwartz, Edward

Thirty tests from the 1955 edition of Cattell’s Objective-Analytic (O-A) Test Battery, plus Forms A and B of the Sixteen Personality Factor Questionnaire (16PF), were administered to 82 male undergraduates. In addition, each subject was rated by 7 to 11 close associates on each of 20 bipolar rating scales, 4 scales tapping each of 5 peer-rating factors. These peer ratings were used as criterion variables to be predicted by the 16PF scales and by the O-A Battery.
The O-A Battery measures were slightly more highly related to one peer-rating factor (Culture); the 16PF scales were slightly more highly related to another (Conscientiousness); and the two sets of test variables were essentially equivalent in predicting the other three factors (two of which showed no significant relationships with either instrument). The lack of any consistent superiority of the objective test scores over the questionnaire scales, coupled with some criticisms of the objective tests on purely logical grounds, should make one cautious in accepting the claims made for the comparative validity of the O-A Battery.

A comparison of an actuarial and a linear model for predicting organizational behavior (1980)
Frank, Blake A.

Using an actuarial and a linear model for predicting organizational behavior, employee subgroups were identified through a hierarchical and convergent clustering of assessment variable profiles in a validation sample (N = 2,899) and cross-validated by assigning a holdout sample (N = 2,899) to the original subgroups on the basis of a minimum-distance qualifier. Subgroup membership in both samples was significantly associated with current employment status and job performance. A linear discriminant function analysis of employment status and a linear regression analysis of job performance also yielded significant results. A comparison of the two models in terms of predictive accuracy indicated that they were essentially equivalent. However, it was concluded that the actuarial model was superior to the linear model, since a descriptive and behavioral taxonomy based on stable, homogeneous employee subgroups could be developed.

A comparison of four clustering methods using MMPI Monte Carlo data (1980)
Blashfield, Roger K.; Morey, Leslie C.

Monte Carlo procedures were used to generate data sets that resembled MMPI psychotic (8-6), neurotic (1-3-2), and personality disorder (4-9) patterns.
Lorr’s clumping method, inverse factor analysis, average linkage, and Ward’s method were the clustering methods compared. The solutions were found to vary in terms of misclassifications and coverage. The clustering solutions also varied as a function of the different optional parameters associated with each method.

A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory (1980)
Brennan, Robert L.; Lockwood, Robert E.

Nedelsky (1954) and Angoff (1971) have suggested procedures for establishing a cutting score based on raters’ judgments about the likely performance of minimally competent examinees on each item in a test. In this paper generalizability theory is used to characterize and quantify the expected variance in cutting scores resulting from each procedure. Experimental test data are used to illustrate this approach and to compare the two procedures. Consideration is also given to the impact of rater disagreement on some issues of measurement reliability or dependability. Results suggest that the differences between the Nedelsky and Angoff procedures may be of greater consequence than their apparent similarities. In particular, the restricted nature of the Nedelsky (inferred) probability scale may constitute a basis for seriously questioning the applicability of this procedure in certain contexts.

Contributions to criterion-referenced testing technology: An introduction (1980)
Hambleton, Ronald K.

Glaser (1963) and Popham and Husek (1969) were the first researchers to draw attention to the need for criterion-referenced tests, which were to be tests specifically designed to provide score information in relation to sets of well-defined objectives or competencies.
They felt that test score information referenced to clearly specified domains of content was needed (1) by teachers for monitoring student progress and diagnosing student instructional needs in objectives-based programs and (2) by evaluators for determining program effectiveness. Norm-referenced tests were not deemed appropriate for providing the necessary test score information. Many definitions of criterion-referenced tests have been offered in the last 10 years (Gray, 1978; Nitko, 1980). In fact, Gray (1978) reported the existence of 57 different definitions. Popham’s definition, reported by Hambleton (1981) in a slightly modified form, is probably the most widely used: A criterion-referenced test is constructed to assess the performance levels of examinees in relation to a set of well-defined objectives (or competencies).

The criterion problem: What measure of success in graduate education? (1980)
Hartnett, Rodney T.; Willingham, Warren W.

A wide variety of potential indicators of graduate student performance are reviewed. Based on a scrutiny of the relevant research literature and on experience with recent and current research projects, the various indicators are considered in two ways. First, they are analyzed within the framework of the traditional "criterion problem," that is, with respect to their adequacy as criteria in predicting graduate school performance. In this case, emphasis is given to problems with the criteria that make it difficult to draw valid inferences about the relationship between selection measures and performance measures. Second, the various indicators are considered as an important part of the graduate program itself.
In this case, attention is given to their adequacy as procedures for the evaluation of student performance, e.g., their clarity, fairness, and usefulness as feedback to students.

Decision models for use with criterion-referenced tests (1980)
Van der Linden, Wim J.

The problem of mastery decisions and of optimizing cutoff scores on criterion-referenced tests is considered. This problem can be formalized as an (empirical) Bayes problem with decision rules of a monotone shape. Next, the derivation of optimal cutoff scores for threshold, linear, and normal ogive loss functions is addressed, alternately using such psychometric models as the classical model, the beta-binomial model, and the bivariate normal model. One important distinction made is between decisions with an internal and an external criterion. A natural solution to the problem of reliability and validity analysis of mastery decisions is an analysis based on a standardization of the Bayes risk (coefficient delta). It is indicated how this analysis proceeds and how, in a number of cases, it leads to coefficients already known from classical test theory. Finally, some new lines of research are suggested, along with other aspects of criterion-referenced testing that can be approached from a decision-theoretic point of view.

Dependent variable reliability and determination of sample size (1980)
Maxwell, Scott E.

Arguments have recently been put forth that standard textbook procedures for determining the sample size necessary to achieve a certain level of power in a completely randomized design are incorrect when the dependent variable is fallible. In fact, however, there are several correct procedures, one of which is the standard textbook approach, because there are several ways of defining the magnitude of group differences. The standard formula is appropriate when group differences are defined relative to the within-group standard deviation of observed scores.
Advantages and disadvantages of the various approaches are discussed.

Determining the length of a criterion-referenced test (1980)
Wilcox, Rand R.

When determining how many items to include on a criterion-referenced test, practitioners must resolve various nonstatistical issues before a particular solution can be applied. A fundamental problem is deciding which of three true scores should be used. The first is based on the probability that an examinee is correct on a "typical" test item; the second is the probability of having acquired a typical skill among a domain of skills; and the third is based on latent trait models. Once a particular true score is settled upon, there are several perspectives that might be used to determine test length. The paper reviews and critiques these solutions. Some new results are described that apply when latent structure models are used to estimate an examinee’s true score.

Dimensionality of hierarchical and proximal data structures (1980)
Krus, David J.; Krus, Patricia H.

The coefficient of correlation is a fairly general measure that subsumes other, more primitive relationships. At the fundamental classification level, similarities among objects and cladistic relationships were conceptualized as generic concepts underlying the formation of proximal and hierarchical structures. Examples of these structures were isolated from data obtained by replicating Thurstone’s classical study of nationality preferences and were subsequently interpreted.

Dimensionality of the California Preschool Social Competency Scale (1980)
Flint, David L.; Hick, Thomas L.; Horan, Mary D.; Irvine, David J.; Kukuk, Susan E.

The structure and construct validity of the California Preschool Social Competency Scale as used with disadvantaged children (N = 1,723) in New York State were investigated through factor analysis.
Five factors were extracted and interpreted as (1) Considerateness, (2) Extraversion, (3) Task Orientation, (4) Verbal Facility, and (5) Response to the Unfamiliar. The first three of these were found to be empirically similar to the three dimensions of the Classroom Behavior Inventory. These three factors, plus the fourth, Verbal Facility, appeared to be conceptually similar to factors isolated in a number of other research-based social competency scales.

Ear differences and implied cerebral lateralization on some intellective auditory factors (1980)
Stankov, Lazar

A battery of auditory tests was given under conditions of monaural and binaural presentation. The results indicated that both primary and second-order factors were similar to those found earlier with the same tests. The hierarchical solution also indicated that most of the differences between the conditions of presentation occurred at the lowest order of factoring. Differences between the means showed the same trends as those reported in the literature on hemispheric specialization. The obtained first-order factors were interpreted as Tonal Memory, Speech Perception Under Distraction/Distortion, and Maintaining and Judging Rhythm, all representing a measure of General Auditory Function. In addition, a broad first-order factor of Fluid Intelligence was identified, along with Temporal Tracking, representing an interesting new component. Although General Auditory Function is a broad perceptual factor akin to General Visualization, it differs from the latter in an important way.
It is suggested that competition between auditory messages may be typical of General Auditory Function but that hemispheric localization is not.

The effect of misinformation, partial information, and guessing on expected multiple-choice test item scores (1980)
Frary, Robert B.

Six response/scoring methods for multiple-choice tests are analyzed with respect to expected item scores under various levels of information and misinformation. It is shown that misinformation always and necessarily results in expected item scores lower than those associated with complete ignorance. Moreover, it is shown that some response/scoring methods penalize all conditions of misinformation equally, while others impose varying penalties according to the number of wrong choices the misinformed examinee has categorized with the correct choice. One method exacts the greatest penalty when a specific wrong choice is believed correct; two other methods impose the maximum penalty when the examinee is confident only that the correct choice is incorrect. Partial information is shown to yield substantially different expected item scores from one method to another. Guessing is analyzed under the assumption that examinees guess whenever it is advantageous to do so under the scoring method used and that these conditions are made clear to the examinee. Additional guessing is shown to have no effect on expected item scores in some cases, though in others it lowers the expected item score. These outcomes are discussed with respect to the validity and reliability of the resulting total scores and also with respect to test content and examinee characteristics.

A framework for methodological advances in criterion-referenced testing (1980)
Berk, Ronald A.

A vast body of methodological research on criterion-referenced testing has been amassed over the past decade. Much of that research is synthesized in the articles contained in this issue.
The fact that this issue is devoted exclusively to criterion-referenced testing sets it apart as a quintessential journal publication on the topic. This paper is intended to provide a broad framework for understanding and evaluating the individual contributions in the context of the literature. The six articles appear to fall into four major categories: (1) test length, (2) validity, (3) standard setting, and (4) reliability. These categories correspond to most of the technical topics in the test development process (see, e.g., Berk, 1980b; Hambleton, 1980).
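To make the distinction in the Kane and Brennan entry concrete: under a threshold loss function, agreement indices reduce to the proportion of examinees given the same mastery classification on two test administrations, either taken at face value or corrected for chance agreement. The sketch below illustrates both forms on hypothetical classification data; it is an illustration of the general idea, not code from any of the papers listed here.

```python
def agreement_indices(decisions_a, decisions_b):
    """Return (p0, kappa): raw and chance-corrected classification agreement."""
    n = len(decisions_a)
    # Raw agreement: proportion of examinees classified the same way twice.
    p0 = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    # Chance agreement expected from the marginal mastery rates alone.
    pa = sum(decisions_a) / n
    pb = sum(decisions_b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    # Chance-corrected agreement (Cohen's kappa for a 2 x 2 table).
    kappa = (p0 - p_chance) / (1 - p_chance)
    return p0, kappa

# Hypothetical mastery (1) / nonmastery (0) decisions on two parallel forms.
form_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
form_b = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
p0, kappa = agreement_indices(form_a, form_b)
```

As the abstract notes, the squared-error-loss indices of Brennan and Kane arise from the same two generalized indices under a different agreement function; the threshold case above is only the simplest special case.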
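The two standard-setting procedures compared by Brennan and Lockwood can also be sketched briefly. An Angoff rater states, for each item, the probability that a minimally competent examinee answers correctly; a Nedelsky rater instead eliminates the options such an examinee would recognize as wrong, and the item contributes the reciprocal of the number of remaining options. The ratings below are hypothetical, and the sketch covers a single rater only (the papers' generalizability analysis of rater variance is not shown).

```python
def angoff_cutoff(probabilities):
    """Cutting score: sum of judged item probabilities for one rater."""
    return sum(probabilities)

def nedelsky_cutoff(options_remaining):
    """Cutting score: sum over items of 1 / (options not eliminated)."""
    return sum(1.0 / k for k in options_remaining)

angoff_ratings = [0.9, 0.6, 0.75, 0.5]   # one rater's judgments, four items
remaining = [2, 4, 2, 3]                 # options left after elimination
cut_a = angoff_cutoff(angoff_ratings)
cut_n = nedelsky_cutoff(remaining)
```

Note how the Nedelsky item contributions can only take values of the form 1/k; this is the restricted (inferred) probability scale that the abstract identifies as a basis for questioning the procedure in certain contexts.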
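Frary's claim that misinformation always yields expected item scores below complete ignorance is easy to verify for one familiar case: rights-minus-wrongs formula scoring, where a wrong answer on a k-option item scores -1/(k-1). This is a sketch of that one scoring method on hypothetical response probabilities, not a reproduction of the six methods the paper analyzes.

```python
def expected_formula_score(p_choices, k):
    """Expected item score under rights-minus-wrongs formula scoring.

    p_choices[i] is the probability the examinee marks option i; option 0
    is keyed correct, and each wrong answer scores -1/(k-1).
    """
    return p_choices[0] - sum(p_choices[1:]) / (k - 1)

k = 4
ignorance = [1 / k] * k                # random guessing over all k options
misinformed = [0.0, 1.0, 0.0, 0.0]     # certain a specific wrong option is right
e_ignorance = expected_formula_score(ignorance, k)
e_misinformed = expected_formula_score(misinformed, k)
```

Random guessing under complete ignorance has expected item score zero under this method, while the misinformed examinee's expected score is negative, matching the ordering the abstract describes.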