Applied Psychological Measurement, Volume 03, 1979

  • Item
    Systematic Errors in Approximations to the Standard Error of Measurement and Reliability
    (1979) Kleinke, David J.
    Lord’s approximation to the standard error of measurement of a test uses only n, the number of items. Millman’s is based on n and p̄, the mean difficulty. Saupe has used Lord’s approximation to derive an approximation to the reliability. Through an empirical demonstration involving 200 classroom tests, all three approximations are shown to be biased. The Lord and Millman approximations overestimate s_x√(1 − KR20), and thus Saupe’s underestimates r_xx′ for these tests. The unweighted mean of the tests’ mean item difficulties was .68, supporting Lord’s original warning that his approximation be used cautiously with tests that are either very difficult or very easy. Still, the approximations did correlate very highly with their criteria, supporting their continued limited use.
  • Item
    On the robustness of a class of naive estimators
    (1979) Wainer, Howard; Thissen, David
    A class of naive estimators of correlation was tested for robustness, accuracy, and efficiency against Pearson’s r, Tukey’s r*, and Spearman’s rho. This class of estimators proved superior in some respects: it is less affected by outliers, reasonably efficient, and frequently more easily calculated. The definition and details of the use of these naive estimators are the subject of this paper.
  • Item
    The reliability of dichotomous judgments: Unequal numbers of judges per subject
    (1979) Fleiss, Joseph L.; Cuzick, Jack
    Consider a reliability study in which different subjects are judged on a dichotomous trait by different sets of judges, possibly unequal in number. A kappa-like measure of reliability is proposed, its correspondence to an intraclass correlation coefficient is pointed out, and a test for its statistical significance is presented. A numerical example is given.
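A kappa-like statistic of this form can be sketched in a few lines: one minus the ratio of observed to chance-expected within-subject disagreement, pooled over subjects rated by varying numbers of judges. The exact expression and all names below are illustrative assumptions in the spirit of the abstract, not a transcription of Fleiss and Cuzick's estimator.

```python
def kappa_unequal(x, n):
    """Kappa-like agreement for a dichotomous trait.

    x[i] = number of positive judgments for subject i
    n[i] = number of judges who rated subject i (may differ by subject)
    """
    total_judgments = sum(n)
    p_bar = sum(x) / total_judgments          # overall proportion positive
    q_bar = 1.0 - p_bar
    # Observed within-subject disagreement, summed over subjects.
    disagreement = sum(xi * (ni - xi) / ni for xi, ni in zip(x, n))
    # Chance-expected disagreement for the same judge counts.
    expected = p_bar * q_bar * sum(ni - 1 for ni in n)
    return 1.0 - disagreement / expected

# Perfect within-subject agreement yields kappa = 1.
print(kappa_unequal([3, 0, 4], [3, 3, 4]))  # -> 1.0
```

With all judges agreeing on every subject, the observed disagreement term is zero, so the statistic attains its maximum of 1 regardless of the judge counts.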
  • Item
    Ordering power of separate versus grouped true-false tests: Interaction of type of test with knowledge levels of examinees
    (1979) Hsu, Louis M.
    The ordering power of an objective test was defined in terms of the probability that this test led to the correct ranking of examinees. A comparison of the relative ordering power of separate and grouped-items true-false (T-F) tests indicated that neither type of test was uniformly superior to the other across all levels of knowledge of examinees. Instead, separate-items T-F tests were found to be superior in discriminating among examinees with medium and high levels of knowledge, and grouped-items T-F tests with two and three items per cluster were found to be superior for discriminating among examinees with low levels of knowledge. These findings do not support blanket recommendations such as Ebel’s (1978) that "test constructors should avoid constructing items in multiple-choice form which are essentially collections of T-F statements" (p. 43) or that, in general, "it is better to present such statements as independent T-F items" (p. 43). Rather, they are similar to Lord’s (1977) findings concerning the relative efficiency of multiple-choice tests with different numbers of options per question for examinees of differing ability levels.
  • Item
    Dominance, information, and hierarchical scaling of variance space
    (1979) Krus, David J.; Ceurvorst, Robert W.
    A method for computation of dominance relations and for construction of their corresponding hierarchical structures is presented. It is shown that variance can be computed from the squared pairwise differences between scores and that dominance indices are actually linear transformations of variances. The interpretation of variance as a quantitative measure of information is suggested, and a conceptual partition of variance into components associated with relational spaces is proposed. The link between dominance and variance allows integration of the mathematical theory of information with least squares statistical procedures without recourse to logarithmic transformations of the data.
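The claim that variance can be recovered from squared pairwise score differences rests on a standard identity: for n scores, the population variance equals the sum of all squared pairwise differences divided by 2n². A minimal numerical check (function names and data are illustrative):

```python
def variance_from_pairs(scores):
    """Population variance via the identity
    var = sum_{i,j} (x_i - x_j)^2 / (2 * n^2)."""
    n = len(scores)
    ss = sum((a - b) ** 2 for a in scores for b in scores)
    return ss / (2 * n * n)

def variance_direct(scores):
    """Population variance from squared deviations about the mean."""
    n = len(scores)
    m = sum(scores) / n
    return sum((x - m) ** 2 for x in scores) / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance_from_pairs(data))  # -> 4.0, same as variance_direct(data)
```

The pairwise form never references the mean, which is what makes the connection to dominance indices (built from pairwise comparisons) possible.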
  • Item
    Evaluation of implied orders as a basis for tailored testing with simulation data
    (1979) Cliff, Norman; Cudeck, Robert; McCormick, Douglas J.
    Monte Carlo research with TAILOR, a program using implied orders as a basis for tailored testing, is reported. Birnbaum’s (1968) three-parameter logistic model was used to generate data matrices under a variety of simulated conditions. It was found that TAILOR typically required about half the available items to estimate, for each simulated examinee, the responses on the remainder. The validity of tailored scores against true scores was within a few points of that of complete-test scores against true scores. Increasing item discrimination affected the efficiency of the tailored test, but the procedure was little affected by any of a variety of other factors.
  • Item
    The feasibility of informed pretests in attenuating response-shift bias
    (1979) Howard, George S.; Dailey, Patrick R.; Gulanick, Nancy A.
    Response-shift bias has been shown to contaminate self-reported pretest/posttest evaluations of various interventions. To eliminate the detrimental effects of response shifts, retrospective measures have been employed as substitutes for the traditional self-reported pretest. Informed pretests, wherein subjects are provided information about the construct being measured prior to completing the pretest self-report, are considered in the present studies as an alternative method to retrospective pretests in reducing response-shift effects. In Study 1 subjects were given a 20-minute presentation on assertiveness, which failed to significantly improve the accuracy of self-reported assertiveness. Other procedural influences hypothesized to improve self-report accuracy (previous experience with the objective measure of assertiveness and previous completion of the self-report measure) also were not related to increased self-report accuracy. In a second study, information about interviewing skills was provided at pretest, using behaviorally anchored rating scales, to participants in a workshop on interviewing skills. Response-shift bias was not attenuated by providing subjects with information about interviewing prior to the intervention. Change measures which employed retrospective pretest measures demonstrated somewhat higher (although nonsignificant) validity coefficients than measures of change utilizing informed pretest data.
  • Item
    Bipolar scales with pictorial anchors: Some characteristics and a method for their use
    (1979) Beard, Arthur D.
    Visual aesthetic preferences seem to be based upon a judgmental mechanism that processes nonverbal cues. Yet the usual methods for measuring these cues are verbal. A nonverbal method for identifying and measuring the component cues present in slides of nonrepresentational paintings is described. The method was used to develop pictorially anchored scales that were easy for subjects to use and that elicited reliable cue ratings.
  • Item
    Validity and cross-validity of metric and nonmetric multiple regression
    (1979) MacCallum, Robert C.; Cornelius, Edwin T., III; Champney, Timothy
    Several questions are raised concerning differences between traditional metric multiple regression, which assumes all variables to be measured on interval scales, and nonmetric multiple regression, which treats variables measured on any scale. Both models are applied to 30 derivation and cross-validation samples drawn from two sets of empirical data composed of ordinally scaled variables. Results indicate that the nonmetric model is, on the average, far superior in fitting derivation samples but that it exhibits much more shrinkage than the metric model. The metric technique fits better than the nonmetric in cross-validation samples. In addition, results produced by the nonmetric model are more unstable across repeated samples. A probable cause of these results is presented, and the need for further research is discussed.
  • Item
    Item-option weighting of achievement tests: Comparative study of methods
    (1979) Downey, Ronald G.
    Previous research has studied the effects of different methods of item-option weighting on the reliability and the concurrent and predictive validity of achievement tests. Generally, increases in reliability are found, but with mixed results for validity. This research attempted to interrelate several methods of producing option weights (i.e., Guttman internal and external weights and judges’ weights) and examined their effects on reliability and on concurrent, predictive, and face validity. Option weights to maximize reliability produced cross-validated (N = 974) increases in Hoyt reliability over rights-only scoring (.82 versus .58, respectively); decreases in correlations with other achievement tests; few changes in predictive validity; and a loss in face validity (i.e., some correct options had lower weights than incorrect options). Weights to maximize validity did not cross-validate and led to a reduction in reliability and to mixed validity results. Judges’ weights produced increases in reliability and mixed results with validity. The sizes of the Guttman weights were shown to interact with item-option and test characteristics. It was concluded that option weighting offered limited, if any, improvement over unit weighting.
  • Item
    The Rasch model, objective measurement, equating, and robustness
    (1979) Slinde, Jeffrey A.; Linn, Robert L.
    This study investigated the adequacy of the Rasch model in providing objective measurement when equating existing standardized reading achievement tests with groups of examinees not widely separated in ability. To provide the context for the assessment of objectivity with the Rasch model, information relevant to several assumptions of the model was provided. An anchor test procedure was used to equate the various pairs of existing achievement tests. Despite the considerable lack of fit of the data to the model found for all tests used, the Rasch difficulty estimates were reasonably invariant for replications with random samples as well as samples that differed in ability by one grade level. Furthermore, with the exception of the data for one test pair and one grade level, the Rasch model using the anchor test procedure provided a reasonably satisfactory means of equating three test pairs on the log ability scale for the examinees at two grade levels.
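For orientation, the Rasch model gives the probability of a correct response as a logistic function of the difference between ability and item difficulty, which is why difficulty estimates can be invariant up to a common shift of the log ability scale. A minimal sketch of the model itself (not of the anchor-test equating procedure; names are illustrative):

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability of a correct response given
    ability theta and item difficulty b, both in logits."""
    return 1 / (1 + math.exp(-(theta - b)))

# Probabilities depend only on theta - b, so adding a constant to every
# ability and every difficulty leaves the model unchanged (scale invariance).
print(rasch_p(1.0, 0.0) == rasch_p(3.0, 2.0))  # -> True
```

This shift invariance is the sense in which the equated difficulty estimates can agree across samples that differ in overall ability.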
  • Item
    What can the WISC-R measure?
    (1979) Conger, Anthony J.; Conger, Judith Cohen; Farrell, Albert D.; Ward, David
    The WISC-R was investigated by using measures of profile (multivariate) reliability in order to determine its most reliable dimensions and the precision and similarity of the multivariate structure across age groups. Due to differences among the 11 age groups in both subscale reliabilities and true score covariance matrices, it was concluded that the precision of measurement differed across age groups. This finding was further supported by a comparison of canonical reliability coefficients and composites computed for each age group. However, exhaustive analyses of Varimax rotated profile dimensions indicated that the structure of the WISC-R subscales is rather stable across age groups, but the reliability of that structure differs systematically. A synthesis of the analyses indicated that (1) the WISC-R allows highly reliable comparisons of profile levels (Full-Scale IQ) at each age level; that (2) reasonably reliable comparisons of Verbal-Performance differences can be made at each age level; but that (3) for other comparisons, caution should be exercised because of age group differences and potentially high unreliability. Two strategies for the interpretation of WISC-R profiles, which take into account the above findings, are offered.
  • Item
    The reliability of Oltman's rod-and-frame test with grade-school children
    (1979) De Lisi, Richard; Smith, Jeffrey K.
    The Portable Rod-and-Frame Test (PRFT) was developed by Oltman (1968) to measure field dependence-independence in a lighted room. Oltman suggested that the lighted room would be more appropriate for use with children than measures that require a darkened room. Reliability data on the use of the PRFT with school children are scant. Dreyer, Dreyer, and Nebelkopf (1971) reported a test-retest reliability (one-month interval between testings) of .96 for 46 kindergarten children. It appears that there are no other reliability data on the Oltman PRFT when used with school children (Cox & Witkin, 1978). The purpose of this study was to assess reliability of the PRFT with school-age children and to examine grade and sex differences.
  • Item
    Binomial test models and item difficulty
    (1979) Van der Linden, Wim J.
    In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty. In this paper these conditions are examined for both a deterministic and a stochastic conception of item responses. It appears that they are more restrictive than is generally understood and differ for both conceptions. When the binomial model is applied to a fixed examinee, the deterministic conception imposes no conditions on item difficulty but requires instead that all items have characteristic functions of the Guttman type. In contrast, the stochastic conception allows non-Guttman items but requires that all characteristic functions must intersect at the same point, which implies equal classically defined difficulty. The beta-binomial model assumes identical characteristic functions for both conceptions, and this also implies equal difficulty. Finally, the compound binomial model entails no restrictions on item difficulty.
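The binomial model in question treats a fixed examinee's number-correct score as binomially distributed, which is exactly where the equal-difficulty condition enters: every item must carry the same success probability ζ for that examinee. A minimal sketch of the distribution (names are illustrative):

```python
from math import comb

def binomial_pmf(x, n, zeta):
    """P(X = x) for the number-correct score X on an n-item test when
    every item has the same success probability zeta for this examinee
    (the equal-difficulty condition the binomial model relies on)."""
    return comb(n, x) * zeta ** x * (1 - zeta) ** (n - x)

# Distribution of the number-correct score on a 4-item test, zeta = 0.5
print([round(binomial_pmf(x, 4, 0.5), 4) for x in range(5)])
# -> [0.0625, 0.25, 0.375, 0.25, 0.0625]
```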
  • Item
    Estimators of the squared cross-validity coefficient: A Monte Carlo investigation
    (1979) Drasgow, Fritz; Dorans, Neil J.; Tucker, Ledyard R.
    A Monte Carlo experiment was used to evaluate four procedures for estimating the population squared cross-validity of a sample least squares regression equation. Four levels of population squared multiple correlation (R_p²) and three levels of number of predictors (n) were factorially crossed to produce 12 population covariance matrices. Random samples at four levels of sample size (N) were drawn from each population. The levels of N, n, and R_p² were carefully selected to ensure relevance of simulation results for much applied research. The least squares regression equation from each sample was applied in its respective population to obtain the actual population squared cross-validity (R_cv²). Estimates of R_cv² were computed using three formula estimators and the double cross-validation procedure. The results of the experiment demonstrate that two estimators which have previously been advocated in the literature were negatively biased and exhibited poor accuracy. The negative bias for these two estimators increased as R_p² decreased and as the ratio of N to n decreased. As a consequence, their biases were most evident in small samples, where cross-validation is imperative. In contrast, the third estimator was quite accurate and virtually unbiased within the scope of this simulation. This third estimator is recommended for applied settings which are adequately approximated by the correlation model.
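Double cross-validation, one of the procedures compared here, can be sketched for the one-predictor case: split the sample in half, fit a least squares equation in each half, score the opposite half with it, and average the two squared cross-correlations. The multi-predictor case is analogous; all names and the simulated data below are illustrative, not the study's design.

```python
import random
random.seed(0)

def fit_ols(xs, ys):
    """Least squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return b, my - b * mx

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def double_cross_validity(xs, ys):
    """Fit in each half-sample, score the other half, and average
    the two squared cross-correlations."""
    h = len(xs) // 2
    halves = [((xs[:h], ys[:h]), (xs[h:], ys[h:])),
              ((xs[h:], ys[h:]), (xs[:h], ys[:h]))]
    r2 = []
    for (train_x, train_y), (test_x, test_y) in halves:
        b, a = fit_ols(train_x, train_y)
        preds = [a + b * x for x in test_x]
        r2.append(corr(preds, test_y) ** 2)
    return sum(r2) / 2

xs = [random.gauss(0, 1) for _ in range(200)]
ys = [1.5 * x + random.gauss(0, 0.5) for x in xs]
print(round(double_cross_validity(xs, ys), 2))  # near the population R^2 of 0.9
```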
  • Item
    Estimating item characteristic curves
    (1979) Ree, Malcolm James
    A simulation study of the effectiveness of four item characteristic curve estimation programs was conducted. Using the three-parameter logistic model, three groups of 2,000 simulated subjects were administered 80-item tests. These simulated test responses were then calibrated using the four programs. The estimated item parameters were compared to the known item parameters in four analyses for each program in all three data sets. It was concluded that the selection of an item calibration procedure should be dependent on the distribution of ability in the calibration sample, the later uses of the item parameters, and the computer resources available.
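The three-parameter logistic model used to generate the simulated responses has a standard closed form. A minimal sketch of response generation under it (the item parameters here are arbitrary illustrations, not the study's values):

```python
import math
import random
random.seed(1)

def p_correct(theta, a, b, c):
    """Three-parameter logistic item characteristic curve:
    P(theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b))),
    with discrimination a, difficulty b, and guessing parameter c."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def simulate_responses(theta, items):
    """One simulated examinee's 0/1 responses to (a, b, c) items."""
    return [1 if random.random() < p_correct(theta, a, b, c) else 0
            for a, b, c in items]

items = [(1.0, 0.0, 0.2), (1.5, -1.0, 0.25), (0.8, 1.0, 0.2)]
print(simulate_responses(theta=0.5, items=items))
```

At theta = b the curve passes through c + (1 − c)/2, halfway between the guessing floor and 1, which is a handy sanity check on any implementation.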
  • Item
    Identifying the situationally variable subject: Correspondence among different self-report formats
    (1979) Turner, Robert G.; Gilliam, Bob J.
    The present study compared the results obtained from three different procedures for obtaining self-reports of behavioral consistency versus inconsistency: (1) the traditional bipolar rating scale; (2) the Nisbett, Caputo, Legant, and Marecek (1973) procedure whereby subjects check the term, its antonym, or the phrase "it depends on the situation"; and (3) subject-generated lists of self-descriptive traits. Results showed a moderate association between self-reports of situational variability and central responses on the scaled format; however, omission of a term in self-generated lists was not strongly associated with either central responses on the scale format or situational responses on the inventory formulated according to Nisbett et al.
  • Item
    The causal influence of anxiety on academic achievement for students of differing intellectual ability
    (1979) Heinrich, Darlene L.
    The present study examined the relationship between anxiety and learning within the context of drive theory and trait-state anxiety theory. It was hypothesized that trait anxiety (A-trait) would influence state anxiety (A-state), which in turn would influence academic achievement. The subjects were 86 students enrolled in a graduate education course for whom measures of A-state, A-trait, and achievement were obtained concurrently at three times during the course. GRE scores were used as measures of intellectual ability. Data were analyzed using the frequency-of-change-in-product-moment technique (Yee & Gage, 1968), a causal analysis statistic which permits the determination of source and direction of causal influence in lagged correlational data. Results showed that A-trait influenced A-state and achievement, but the relationship between A-state and achievement was ambiguous. When intellectual ability was considered, A-trait was found to influence A-state and achievement, but only for high-ability students.
  • Item
    The hierarchical structure of formal operational tasks
    (1979) Bart, William M.; Mertens, Donna M.
    The hierarchical structure of the formal operational period of Piaget’s theory of cognitive development was explored through the application of ordering theoretic methods to a set of data that systematically tapped the various formal operational schemes. The results suggest that the tasks within some schemes are empirically equivalent. While the response patterns were quite varied, the results do suggest that some common structure may underlie performance on the tasks, thus supporting Piaget’s notion of the integrative structure of the period.
  • Item
    Dimensions and clusters: A hybrid approach to classification
    (1979) Skinner, Harvey A.
    A hybrid strategy is described for integrating the dimensional and discrete clusters approaches to classification research. First, a parsimonious set of dimensions is sought through a multiple replications design. The computations employ a two-stage least squares solution that is based on a sequential application of the Eckart and Young (1936) decomposition. Second, relatively homogeneous subgroups are identified within this low dimensional space using a clustering or density search algorithm. To facilitate interpretation of the final solution, an ideal type concept is introduced that is similar to the "idealized individual" interpretation of multidimensional scaling. Depending upon the model chosen, the independent contribution of elevation, scatter, and shape parameters may be differentiated in defining profile similarity.