Applied Psychological Measurement, Volume 08, 1984

Persistent link for this collection: https://hdl.handle.net/11299/100664

Bias and information of Bayesian adaptive testing
(1984) Weiss, David J.; McBride, James R.
Monte Carlo simulation was used to investigate score bias and information characteristics of Owen’s Bayesian adaptive testing strategy and to examine possible causes of score bias. Factors investigated in three related studies included effects of an accurate prior θ estimate, effects of item discrimination, and effects of fixed versus variable test length. Data were generated from a three-parameter logistic model for 3,100 simulees in each of eight data sets, and Bayesian adaptive tests were administered, drawing items from a "perfect" item pool. Results showed that the Bayesian adaptive test yielded unbiased θ estimates and relatively flat information functions only in the situation in which an accurate prior θ estimate was used. When a constant prior θ estimate was used with a fixed test length, severe bias was observed that varied with item discrimination. A different pattern of bias was observed with variable test length and a constant prior. Information curves for the constant prior conditions generally became more peaked and asymmetric with increasing item discrimination. In the variable test length condition, the test length required to achieve a specified level of the posterior variance of θ estimates was an increasing function of θ level. These results indicate that θ estimates from Owen’s Bayesian adaptive testing method are affected by the prior θ estimate used and that the method does not provide measurements that are unbiased and equiprecise except when an accurate prior θ estimate is used.
Multivariate generalizability theory in educational measurement: An empirical study
(1984) Nußbaum, Albert
Multivariate generalizability theory was applied to the assessment of student achievement in art education. Twenty-five art students rated the paintings of 60 fourth-grade students with regard to three criteria. Paintings were made on four different topics. The results indicate that generalizability is low with respect to different raters and moderate with respect to different topics. The three ratings a rater gave on a single painting were moderately correlated. As indicated by the results for the covariance components, nearly half of the covariance between the three criteria was because the three ratings were from the same rater. Expected values for Q²(∆) are reported for different D study designs.
Effects of local item dependence on the fit and equating performance of the three-parameter logistic model
(1984) Yen, Wendy M.
Unidimensional item response theory (IRT) has become widely used in the analysis and equating of educational achievement tests. If an IRT model is true, item responses must be locally independent when the trait is held constant. This paper presents several measures of local dependence that are used in conjunction with the three-parameter logistic model in the analysis of unidimensional and two-dimensional simulated data and in the analysis of three mathematics achievement tests at Grades 3 and 6. The measures of local dependence (called Q₂ and Q₃) were useful for identifying subsets of items that were influenced by the same factors (simulated data) or that had similar content (real data). Item pairs with high Q₂ or Q₃ values tended to have similar item parameters, but most items with similar item parameters did not have high Q₂ or Q₃ values. Sets of locally dependent items tended to be difficult and discriminating if the items involved an accumulation of the skills involved in the easier items in the rest of the test. Locally dependent items that were independent of the other items in the test did not have unusually high or low difficulties or discriminations. Substantial unsystematic errors of equating were found from the equating of tests involving collections of different dimensions, but substantial systematic errors of equating were only found when the two tests measured quite different dimensions that were presumably taught sequentially.
Comparison of IRT true-score and equipercentile observed-score "equatings"
(1984) Lord, Frederic M.; Wingersky, Marilyn S.
Two methods of "equating" tests are compared, one using true scores, the other using equipercentile equating of observed scores. The theory of equating is discussed. For the data studied, the two methods yield almost indistinguishable results.
Eigenvalue shrinkage in principal components based factor analysis
(1984) Bobko, Philip; Schemmer, F. Mark
The concept of shrinkage, as (1) a statistical phenomenon of estimator bias, and (2) a reduction in explained variance resulting from cross-validation, is explored for statistics based on sample eigenvalues. Analytic solutions and previous research imply that the magnitude of eigenvalue shrinkage is a function of the type of shrinkage, sample size, the number of variables in the correlation matrix, the ordinal root position, the population eigenstructure, and the choice of principal components analysis or principal factors analysis. Hypotheses relating these specific independent variables to the magnitude of shrinkage were tested by means of a Monte Carlo simulation. In particular, the independent variable of population eigenstructure is shown to have an important effect on shrinkage. Finally, regression equations are derived that describe the linear relation of population and cross-validated eigenvalues to the original eigenvalues, sample size, ordinal position, and the number of variables factored. These equations are a valuable tool that allows researchers to accurately predict eigenvalue shrinkage based on available sample information.
Correcting for range restriction when the population variance is unknown
(1984) Alexander, Ralph A.; Alliger, George M.; Hanges, Paul J.
Correction of correlations diminished by range restriction is a commonly suggested psychometric technique. Such corrections may be appropriate in applied settings, such as educational or personnel selection, or in more theoretical applications, such as meta-analysis. However, an important limitation on the practice of range restriction corrections exists: an estimate of the unrestricted population variance is required. This article outlines and examines the accuracy of a method for estimating the unrestricted variance of a variable from the restricted sample itself. This method is based on the observation that it is possible to table a function of the truncated normal distribution that will allow the extent or point of truncation to be estimated (Cohen, 1959). The correlation of the truncated variable with other variables may then be corrected by standard restriction of range formulas. The method also allows for correction of the mean of the restricted variable.
Comparison of direct and indirect methods for setting minimum passing scores
(1984) Reilly, Richard R.; Zink, Donald L.; Israelski, Edmond W.
Several studies have compared different judgmental methods of setting passing scores by estimating item difficulties for the minimally competent examinee. Usually, a direct method of estimating item difficulties has been compared with an indirect method suggested by Nedelsky (1954). Nedelsky’s method has usually resulted in a substantially lower cutoff score than that arrived at with a direct method. Two studies were carried out for the purpose of comparing a direct method of setting passing scores with an indirect method that allowed judges to estimate the probability of the minimally competent examinee eliminating each incorrect alternative. In Study 1 a sample of 52 first-level supervisors used both methods to estimate passing scores on a content-oriented selection test for building maintenance specialists. In Study 2 a sample of 62 first-level supervisors used both methods to estimate passing scores on an entry level auto mechanics test. Results of both studies showed that the variance component for method was relatively small and that for raters was relatively large. Reliability estimates of judgments and correlations between judged difficulties and empirical difficulties showed the Angoff (1971) approach to be slightly superior. Results showed no particular advantage to using an indirect approach for estimating minimal competence.
Item format and the structure of the Personal Orientation Inventory
(1984) Velicer, Wayne F.; DiClemente, Carlo C.; Corriveau, Donald P.
Two versions of the Personal Orientation Inventory were administered to 317 subjects. One version employed the standard two-choice response format. The other version used a six-choice response format. The purpose of this study was (1) to determine if a multiple-response format resulted in improved psychometric properties, (2) to compare the component structure of the two versions, and (3) to compare the empirically derived scales with the theoretically defined scales. The results showed a slight improvement for the multiple-response format, but with poorly defined component patterns. The change in format resulted in a change in component structure. The components derived from both versions did not correspond to the theoretical scales. An analysis indicated that the only well-defined component from either response format could be interpreted as measuring social desirability responding rather than measuring content. A follow-up questionnaire indicated greater subject acceptance of the six-choice version.
Comparison of two methods to identify major personality factors
(1984) Comrey, Andrew L.
Both Howarth and Comrey have developed taxonomies of personality traits and inventories to measure them. The Howarth Personality Questionnaire and Additional Personality Factor inventories include 20 factors, whereas the Comrey Personality Scales (CPS) taxonomy includes eight factors. Howarth identified his factors through factor analysis of items, whereas Comrey identified his primary level factors through factor analysis of conceptually distinct clusters of homogeneous items, called Factored Homogeneous Item Dimensions (FHIDs), while avoiding the inclusion of highly redundant variables in the same analysis. Data for all three inventories were collected from the same subjects and factor analyzed. The Howarth factor scales were narrower in content and more highly overlapping than the CPS factor scales. Most of the Howarth factor scales were good marker variables for the CPS primary factors. Five CPS factors had major loadings for more than one of the Howarth factor scales. The CPS Emotional Stability vs. Neuroticism (S) primary level factor was split into several lower level factors in the Howarth system. Factor analysis of items is recommended to identify FHIDs. Factor analysis of FHIDs, in which no two FHIDs are merely alternate forms of the same conceptual variable, is recommended to identify the major primary factors of personality.
The validity of item bias techniques with math word problems
(1984) Ironson, Gail; Homan, Susan; Willis, Ruth; Signer, Barbara
Item bias research has compared methods empirically using both computer simulation with known amounts of bias and real data with unknown amounts of bias. This study extends previous research by "planting" biased items in the realistic context of math word problems. "Biased" items are those in which the reading level is too high for a group of students so that the items are unable to assess the students’ math knowledge. Of the three methods assessed (Angoff’s transformed difficulty, Camilli’s full chi-square, and Linn and Harnisch’s item response theory, IRT, approach), only the IRT approach performed well. Removing the biased items had a minor effect on the validity for the minority group.
Homogeneity analysis of test score data: A confrontation with the latent trait approach
(1984) De Gruijter, Dato N. W.
In homogeneity analysis, or dual scaling, weights for item categories are obtained that maximize Cronbach’s alpha. In this paper these weights are compared with the optimal scoring weights in the latent trait approach. This is done on the basis of data generated according to the two-parameter logistic model. As expected from a theoretical analysis, the homogeneity weights show less variation than the optimal weights of latent trait theory. It is argued that the homogeneity weights should not be used for item selection.
Relationships between the Thurstone, Coombs, and Rasch approaches to item scaling
(1984) Jansen, Paul G. W.
Andrich (1978) derived a formal equivalency between Thurstone’s Case V specialization of the law of comparative judgment for paired comparisons, with a logistic function substituted for the normal, and the Rasch model for direct responses. The equivalency was corroborated by a specific substantial-psychological interpretation of the Rasch binary item response probability. Studying the relationships between the Thurstone and Rasch models from another perspective than Andrich’s, namely, from a data-theoretical point of view, it appears that the equivalency is based on an implicit assumption with respect to the subject population. This assumption (1) is rather restrictive, and therefore its empirical validity seems to be low, and (2) seems to contradict the substantial reasoning corroborating the Thurstone-Rasch equivalency. It is argued that the Thurstone model cannot be considered the sample-independent pair comparison counterpart of the Rasch model. An alternative pair comparison equivalent of the Rasch model is tentatively proposed. Finally, the theoretical and practical implications of Andrich’s and of the present study are discussed.
An investigation of methods for reducing sampling error in certain IRT procedures
(1984) Wingersky, Marilyn S.; Lord, Frederic M.
The sampling errors of maximum likelihood estimates of item response theory parameters are studied in the case when both people and item parameters are estimated simultaneously. A check on the validity of the standard error formulas is carried out. The effect of varying sample size, test length, and the shape of the ability distribution is investigated. Finally, the effect of anchor-test length on the standard error of item parameters is studied numerically for the situation, common in equating studies, when two groups of examinees each take a different test form together with the same anchor test. The results encourage the use of rectangular or bimodal ability distributions, and also the use of very short anchor tests.
An application of latent class models to assessment data
(1984) Haertel, Edward
Responses of 17-year-olds to selected 1977-78 National Assessment of Educational Progress (NAEP) mathematics exercises were analyzed, using latent class models. A single model was fitted to data from five independent samples of examinees, each of which responded to a different set of six algebra or prealgebra exercises. Four categories of items were found, defining five levels of content mastery, ranging from examinees unable to solve any of the exercises (43%) through those able to solve all the exercises (19%). The methods demonstrated are broadly applicable to assessment data, including matrix-sampled data, and provide an aggregate description of examinee abilities independent of the specific characteristics of individual exercises administered.
Item profile analysis for tests developed according to a table of specifications
(1984) Kolen, Michael J.; Jarjoura, David
An approach to analyzing items is described that emphasizes the heterogeneous nature of many achievement and professional certification tests. The approach focuses on the categories of a table of specifications, which often serves as a blueprint for constructing such tests. The approach is characterized by profile comparisons of observed and expected correlations of item scores with category scores. A multivariate generalizability theory model provides the foundation for the approach, and the concept of a profile of expected correlations is derived from the model. Data from a professional certification testing program are used for illustration and an attempt is made to provide links with test development issues and generalizability theory.
Evaluating reading diagnostic tests: An application of confirmatory factor analysis to multitrait-multimethod data
(1984) Marsh, Herbert W.; Butler, Susan
Diagnostic reading tests, in contrast to achievement tests, claim to measure specific components of ability hypothesized to be important for diagnosis or remediation. A minimal condition for demonstrating the construct validity of such tests is that they are able to differentiate validly between the reading traits that they claim to measure (e.g., comprehension, sound discrimination, blending). This condition is rarely tested, but multitrait-multimethod (MTMM) designs are ideally suited for this purpose. This is demonstrated in two studies based on the 1966 version of the Stanford Diagnostic Reading Test (SDRT). In each study, the application of the Campbell-Fiske guidelines and confirmatory factor analysis (CFA) to the MTMM data indicated that the SDRT subscales could be explained in terms of a method/halo effect and a general reading factor that was not specific to any of the subscales; this refutes the construct validity of the 1966 version of the SDRT as a diagnostic test. Other diagnostic tests probably suffer the same weakness and should also be evaluated in MTMM studies.
Procedures for assessing the validities of tests using the "known-groups" method
(1984) Hattie, John; Cooksey, Ray W.
If a test is "valid," one criterion could be that test scores must discriminate across groups that are theoretically known to differ. A procedure is outlined to assess the discrimination across groups that uses only information from means. The method can be applied to many published tests, it provides information that relates to the construct validity of the test, and it presents a way to identify how a new sample can be related to previous studies.
Comparison of difficulties and reliabilities of quantitative word problems in completion and multiple-choice item formats
(1984) Oosterhof, Albert C.; Coats, Pamela K.
Quantitative word problems were written as parallel completion and multiple-choice items, and were administered to 232 undergraduate students to compare the reliabilities and item difficulties associated with these formats. The multiple-choice options were written using specific numerical responses for each of five alternatives, revised by replacing the fifth option with "none of the above," and also by replacing each of the five responses with ranges of numerical values. Differences in distributions of scores imply a need to reestablish standards if changes are made in the proportions of completion and multiple-choice items included in a test. Findings did not support camouflaging the correct response by using "none of the above" or ranges of numerical values as multiple-choice alternatives. The increased time required to develop and administer a multiple-choice test with reliability equal to that of a completion test suggests use of the latter even in classes with relatively large enrollments.
Ability metric transformations involved in vertical equating under item response theory
(1984) Baker, Frank B.
The metric transformations of the ability scales involved in three equating techniques (external anchor test, internal anchor test, and a pooled groups procedure) were investigated. Simulated item response data for two unique tests and a common test were obtained for two groups that differed with respect to mean ability and variability. The obtained metrics for various combinations of groups and tests were transformed to a common metric and then to the underlying ability metric. The results showed that there was reasonable agreement between the transformed obtained metrics and the underlying ability metric. They also showed that the largest errors in the ability score statistics occurred under the external anchor test procedure and the smallest under the pooled procedures. Although the pooled procedure performed well, it was affected by unequal variances in the two groups of examinees.
Comparison of three techniques to assess group-level beta and gamma change
(1984) Schmitt, Neal; Pulakos, Elaine D.; Lieblein, Amy
Alpha, beta, and gamma change concerning student attitudes toward a college course were assessed before and after the first examination in that course for an experimental and control group. Three methodologies were used to assess change. Those proposed by Terborg, Howard, and Maxwell (1980) and Schmitt (1982) produced reasonably similar conclusions concerning change, while the methodology suggested by Zmud and Armenakis (1978) produced relatively different conclusions. The relative advantages and limitations of the procedures are discussed. The major conclusion is that much additional use and comparison of these methodologies for assessing change is necessary before researchers or practitioners can interpret the practical significance of beta and gamma change or the relative utility of various approaches to the measurement of beta and gamma change.

University Digital Conservancy

University of Minnesota Twin Cities