Applied Psychological Measurement, Volume 04, 1980

Persistent link for this collection

https://conservancy.umn.edu/handle/11299/97275

Now showing 1 - 20 of 42

Comments on criterion-referenced testing
(1980) Livingston, Samuel A.
The six papers in this issue summarize 10 years of theory development, empirical research, and practical experience in criterion-referenced testing. Much of the theory development has focused on questions and issues raised by Popham and Husek (1969), who pointed out that much of traditional psychometric theory did not work well when applied to criterion-referenced tests. The six papers, taken together, represent an attempt to answer four basic questions: 1. How should the reliability of a criterion-referenced test be measured? 2. How should it be decided how many items are needed in a criterion-referenced test? 3. How should criterion-referenced tests be used to make decisions about the people taking the tests? 4. What kind of evidence should be provided for the validity of a criterion-referenced test? Attempts to answer these questions have been complicated by the lack of a universally accepted, unambiguous definition of the term "criterion-referenced test." Glaser’s (1963) article, in which the term first appeared, defined criterion-referenced measures as those that "depend on an absolute standard of quality" (p. 519). However, Glaser went on to say that "the standard against which a student’s performance is compared when measured in this manner is the behavior which defines each point along the achievement continuum" (p. 519) and that "we need to behaviorally specify minimum levels of performance..." (p. 520). These two ideas, absolute standards and behavioral test content specifications, received varying degrees of emphasis from the different individuals who attempted to develop criterion-referenced tests and to theorize about criterion-referenced testing. As a result, there are now several different answers to some of the questions that Popham and Husek (1969) raised.
A framework for methodological advances in criterion-referenced testing
(1980) Berk, Ronald A.
A vast body of methodological research on criterion-referenced testing has been amassed over the past decade. Much of that research is synthesized in the articles contained in this issue. The fact that this issue is devoted exclusively to criterion-referenced testing sets it apart as a quintessential journal publication on the topic. This paper is intended to provide a broad framework for understanding and evaluating the individual contributions in the context of the literature. The six articles appear to fall into four major categories: (1) test length; (2) validity; (3) standard setting; and (4) reliability. These categories correspond to most of the technical topics in the test development process (see, e.g., Berk, 1980b; Hambleton, 1980).
Reliability of test scores and decisions
(1980) Traub, Ross E.; Rowley, Glenn L.
A criterion-referenced test can be viewed as testing either a continuous or a binary variable, and the scores on a test can be used as measurements of the variable or to make decisions (e.g., pass or fail). Recent work on the reliability of criterion-referenced tests has focused on the use of scores from tests of continuous variables for decision-making purposes. This work can be categorized according to type of loss function: threshold, linear, or quadratic. It is the loss function that is used either explicitly or implicitly to evaluate the goodness of the decisions that are made on the basis of the test scores. The literature in which a threshold loss function is employed can be further subdivided according to whether the goodness of decisions is assessed as the probability of making an erroneous decision or as a measure of the consistency of decisions over repeated testing occasions. This review points to the need for simple procedures by which to estimate the probability of decision errors.
Issues of validity for criterion-referenced measures
(1980) Linn, Robert L.
It has sometimes been assumed that validity of criterion-referenced tests is guaranteed by the definition of the domain and the process used to generate items. These are important considerations for content validity. It is argued that the proper focus for content validity is on the items of a test rather than on examinee responses to those items. Content validity is important for criterion-referenced measures, but it is not sufficient. This claim is discussed and the case is made that interpretations and uses of criterion-referenced tests require support of other kinds of evidence and logical analysis. The inferences that are made should dictate the kinds of evidence and logical arguments that are needed to support claims of validity. Illustrations of aspects of the validation process are provided in two concrete examples.
The nature and use of state mastery models
(1980) Macready, George B.; Dayton, C. Mitchell
This paper provides a review of a class of probabilistic models that has been developed for use in the assessment of trait or competency acquisition. Consideration is given to the relative merits and limitations of this class of state models, under which trait acquisition is conceived as being "all-or-none," as compared with those occurring under an alternative conceptual framework, in which trait acquisition is assumed to be gradual. In addition, some of the applications of these state models are presented, including the establishment of mastery classification decisions and the assessment of consistency with respect to items and classification. Finally, some extensions to the class of state models, which may be helpful in increasing the applicability of this class of models, are presented.
Decision models for use with criterion-referenced tests
(1980) Van der Linden, Wim J.
The problem of mastery decisions and optimizing cutoff scores on criterion-referenced tests is considered. This problem can be formalized as an (empirical) Bayes problem with decision rules of a monotone shape. Next, the derivation of optimal cutoff scores for threshold, linear, and normal ogive loss functions is addressed, alternately using such psychometric models as the classical model, the beta-binomial, and the bivariate normal model. One important distinction made is between decisions with an internal and an external criterion. A natural solution to the problem of reliability and validity analysis of mastery decisions is to analyze with a standardization of the Bayes risk (coefficient delta). It is indicated how this analysis proceeds and how, in a number of cases, it leads to coefficients already known from classical test theory. Finally, some new lines of research are suggested along with other aspects of criterion-referenced testing that can be approached from a decision-theoretic point of view.
Standard setting issues and methods
(1980) Shepard, Lorrie
Previous methodological reviews and the controversy regarding the adequacy of standard-setting technology are summarized. The judgmental nature of all standard-setting methods is examined, and the debate about whether fallible standards are better than none is recast in the context of three different test uses: pupil diagnosis, pupil certification (for high school graduation or professional licensure), and program evaluation. Exemplary standard-setting methods are reviewed, representing the following major approaches: (1) judgments of test content; (2) judgments about mastery-nonmastery groups; (3) norms and passing rates; (4) empirical methods for discovering standards; and (5) empirical methods for adjusting cutoff scores, given a standard on an external criterion measure. Standards based on the performance of judged mastery groups (the Contrasting Groups method) and certain uses of normative data are likened to Known Groups validation. Recommendations are made for selecting standard-setting techniques depending on test use, including pupil diagnosis, pupil certification, and program evaluation. Future research on standard setting is discussed in the context of improving practical aspects of judgmental methods.
Determining the length of a criterion-referenced test
(1980) Wilcox, Rand R.
When determining how many items to include on a criterion-referenced test, practitioners must resolve various nonstatistical issues before a particular solution can be applied. A fundamental problem is deciding which of three true scores should be used. The first is based on the probability that an examinee is correct on a "typical" test item. The second is the probability of having acquired a typical skill among a domain of skills, and the third is based on latent trait models. Once a particular true score is settled upon, there are several perspectives that might be used to determine test length. The paper reviews and critiques these solutions. Some new results are described that apply when latent structure models are used to estimate an examinee’s true score.
Contributions to criterion-referenced testing technology: An introduction
(1980) Hambleton, Ronald K.
Glaser (1963) and Popham and Husek (1969) were the first researchers to draw attention to the need for criterion-referenced tests, which were to be tests specifically designed to provide score information in relation to sets of well-defined objectives or competencies. They felt that test score information referenced to clearly specified domains of content was needed by (1) teachers for successfully monitoring student progress and diagnosing student instructional needs in objectives-based programs and by (2) evaluators for determining program effectiveness. Norm-referenced tests were not deemed appropriate for providing the necessary test score information. Many definitions of criterion-referenced tests have been offered in the last 10 years (Gray, 1978; Nitko, 1980). In fact, Gray (1978) reported the existence of 57 different definitions. Popham’s definition, reported by Hambleton (1981) in a slightly modified form, is probably the most widely used: A criterion-referenced test is constructed to assess the performance levels of examinees in relation to a set of well-defined objectives (or competencies).
Large sample estimators for standard errors of functions of correlation coefficients
(1980) Bobko, Philip; Rieck, Angela
Standard errors of estimators that are functions of correlation coefficients are shown to be quite different in magnitude than standard errors of the initial correlations. A general large-sample methodology, based upon Taylor series expansions and asymptotic correlational results, is developed for the computation of such standard errors. Three exemplary analyses are conducted on a correction for attenuation, a correction for range restriction, and an indirect effect in path analysis. Derived formulae are consistent with several previously proposed estimators and provide excellent approximations to the standard errors obtained in computer simulations, even for moderate sample size (n = 100). It is shown that functions of correlations can be considerably more variable than product-moment correlations. Additionally, appropriate hypothesis tests are derived for these corrected coefficients and the indirect effect. It is shown that in the range restriction situation, the appropriate hypothesis test based on the corrected coefficient is asymptotically more powerful than the test utilizing the uncorrected coefficient. Bias is also discussed as a by-product of the methodology.
Limitations of additive conjoint scaling procedures: Detecting nonadditivity when additivity is known to be violated
(1980) Nygren, Thomas E.
Two sets of three-outcome gambles were constructed to vary factorially along the factors Amount to Lose, Amount to Win, Probability of Losing, and Probability of Winning. Single stimulus ratings of attractiveness and risk were obtained for each of the constructed gambles from 19 subjects. In addition, paired comparison strength of preference and difference in risk judgments were obtained for a subset of these gambles. Two additive conjoint scaling procedures, Carroll’s (1972) MDPREF and Johnson’s (1975) NMRG, were used to generate predicted paired comparison preference and risk judgments from the single stimulus ratings for each subject. These predictions were then compared with the observed paired comparison judgments. Results indicated that although the goodness-of-fit measures associated with each of the scaling models indicated that the subject’s data were being fit very well by the additive models, additivity among the payoff and probability factors was clearly violated. A procedure for detecting nonadditivity is outlined and illustrated with the data. The limitations of using these additive conjoint scaling procedures as predictive techniques when additivity is violated are shown and their implications are discussed.
Measures for the study of maternal teaching strategies
(1980) Laosa, Luis M.
A technique to measure maternal teaching strategies was developed for possible use in research and evaluation studies. Scores derived from the technique describe both quality and quantity of behaviors used by mothers to teach cognitive-perceptual tasks to their own young children. The maternal teaching observation technique (MTOT) yields scores on the following teaching strategy dimensions: inquiry, directive, praise, negative verbal feedback or disapproval, modeling, visual cue, physical affection, positive physical control, and negative physical control. English and Spanish versions of the technique were developed. The technique was administered to 83 different mother-child dyads of two sociocultural and language groups, Anglo-American and Chicano. The tasks and procedures were sufficiently engaging and appealing, in terms of difficulty level and ability to elicit and to maintain the subjects’ attention, for mothers and their 5-year-old children in both groups. Interobserver reliabilities and parallel-form consistency were adequate for both groups, indicating that each MTOT scale measures a moderately stable attribute of maternal behavior. Group differences in intercorrelations suggest that construct invariance might not exist across sociocultural or language groups.
A paper-and-pencil inventory for the assessment of Piaget's tasks
(1980) Patterson, Henry O.; Milakofsky, Louis
Although science educators conversant with Piaget’s work have recognized the importance of adapting instruction and curricula to the cognitive level of their students, such attempts have been difficult because of a lack of appropriate cognitive assessment instruments. To meet such a need, a comprehensive, objective paper-and-pencil inventory was investigated using 542 subjects, 8 years through adulthood, in order to determine its usefulness for normal and retarded students. The results showed that the inventory was acceptably reliable and valid and had advantages over other Piaget tests. With some suggested improvements, it was concluded that the instrument had potential as an educational and theoretical research tool.
A test of graphicacy in children
(1980) Wainer, Howard
A test of graphicacy was developed, administered to third- through fifth-grade schoolchildren, and scored using the Rasch model with Gustafsson’s conditional maximum likelihood estimation method. After removing children with scores at or below chance, the model fit well. It was found that of the four types of displays used (tables, line charts, bar charts, pie charts), the line chart was inferior to the others, which were all equal. There was some interaction between the kind of question asked and the display technique. Third-grade children were much poorer at reading graphs than fourth- or fifth-grade children, but the differences between these latter two groups were modest.
Item analysis with small samples
(1980) Nevo, Baruch
Traditional item analysis centers on the characteristics of individual items, typically on the item’s level of difficulty and discrimination power. In constructing new tests, attempts are therefore made to obtain large samples of subjects in order to decrease the standard error of measurement of the item’s characteristics. However, there are common test situations in which the exact parameters of individual items are not of much importance. Rather, the focus of interest is on the position of the items in relation to one another or in relation to some critical statistical value. Five such test situations are described. Quasi-simulations of item analyses were performed to determine the optimal sample sizes required in such test situations. These simulations consisted of analyzing responses of 5,200 university applicants, each of whom completed three different multiple-choice tests. Sample sizes of 16, 32, 64, 128, 256, 512, and 1,024 were chosen; and for each size, eight samples were randomly drawn from the population of applicants. For three of five different indices of accuracy that were employed, the results showed that the sample size needed for the pretest stage in test construction is considerably smaller than the traditionally recommended size.
Dimensionality of hierarchical and proximal data structures
(1980) Krus, David J.; Krus, Patricia H.
The coefficient of correlation is a fairly general measure which subsumes other, more primitive relationships. At the fundamental classification level, similarities among objects and cladistic relationships were conceptualized as generic concepts underlying formation of proximal and hierarchical structures. Examples of these structures were isolated from data obtained by replicating Thurstone’s classical study of nationality preferences and were subsequently interpreted.
Is a behavioral measure the best estimate of behavioral parameters? Perhaps not.
(1980) Howard, George S.; Maxwell, Scott E.; Wiener, Richard L.; Boynton, Kathy S.; Rooney, William M.
In many areas of psychological research various measurement procedures are employed in order to obtain estimates of some set of parameter values. A common practice is to validate one measurement device by demonstrating its relationship to some criterion. However, in many cases the measurement of that criterion is less than a perfect estimate of true parameters. Self-report measures are often validated by comparing them with behavioral measures of the dimension of interest. This procedure is only justifiable insofar as the behavioral measure represents an accurate estimate of population parameters. Three studies, dealing with the assessment of assertiveness, students’ in-class verbal and nonverbal behaviors, and a number of teacher-student in-class interactions, tested the adequacy of behavioral versus self-report measures as accurate estimates of behavioral parameters. In Studies 2 and 3 self-reports were found to be as good as behavioral measures as estimates of behavioral parameters, while Study 1 found self-reports to be significantly superior.
The criterion problem: What measure of success in graduate education?
(1980) Hartnett, Rodney T.; Willingham, Warren W.
A wide variety of potential indicators of graduate student performance are reviewed. Based on a scrutiny of relevant research literature and experience with recent and current research projects, the various indicators are considered in two ways. First, they are analyzed within the framework of the traditional "criterion problem," that is, with respect to their adequacy as criteria in predicting graduate school performance. In this case, emphasis is given to problems with the criteria that make it difficult to draw valid inferences about the relationship between selection measures and performance measures. Second, the various indicators are considered as an important process of the graduate program. In this case, attention is given to their adequacy as procedures for the evaluation of student performance, e.g., their clarity, fairness, and usefulness as feedback to students.
Budescu, David V. (1980). Some new measures of profile dissimilarity. Applied Psychological Measurement, 4, 261-272. doi:10.1177/014662168000400212
(1980) Budescu, David V.
Four new measures of multidimensional profile dissimilarity are proposed that are (1) either symmetric or asymmetric and (2) either conditional or unconditional on profile shape. The four similarity indices are based on alternative normalizations of the regular distance (D) statistic of Cronbach and Gleser (1953), all taking values between 0 and 1. Methods of calculation and interpretations of the indices are demonstrated and discussed, and several generalizations are suggested.
Dependent variable reliability and determination of sample size
(1980) Maxwell, Scott E.
Arguments have recently been put forth that standard textbook procedures for determining the sample size necessary to achieve a certain level of power in a completely randomized design are incorrect when the dependent variable is fallible. In fact, however, there are several correct procedures (one of which is the standard textbook approach) because there are several ways of defining the magnitude of group differences. The standard formula is appropriate when group differences are defined relative to the within-group standard deviation of observed scores. Advantages and disadvantages of the various approaches are discussed.