Applied Psychological Measurement, Volume 15, 1991

Persistent link for this collection: https://hdl.handle.net/11299/103307



Adjustments for rater effects in performance assessment
(1991) Houston, Walter M.; Raymond, Mark R.; Svec, Joseph C.
Alternative methods to correct for rater leniency/stringency effects (i.e., rater bias) in performance ratings were investigated. Rater bias effects are of concern when candidates are evaluated by different raters. The three correction methods evaluated were ordinary least squares (OLS), weighted least squares (WLS), and imputation of the missing data (IMPUTE). In addition, the usual procedure of averaging the observed ratings was investigated. Data were simulated from an essentially τ-equivalent measurement model, with true scores and error scores normally distributed. The variables manipulated in the simulations were method of correction (OLS, WLS, IMPUTE, averaging the observed ratings), amount of missing data (50% missing, 75% missing), rater bias (low, high), and number of examinees or candidates (N = 50, N = 100). The accuracy of the methods in estimating true scores was assessed based on the square root of the average squared difference between the estimated and known true scores. The three correction methods consistently outperformed the procedure of averaging the observed ratings. IMPUTE was superior to the least squares methods. Index terms: EM algorithm, incomplete data, incomplete rating designs, least squares adjustments, performance assessment, rater calibration.
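The accuracy criterion described in this abstract (the square root of the average squared difference between estimated and known true scores) can be sketched as follows. This is a minimal illustration only; the function name `rmse` and the sample values are hypothetical and do not come from the study.

```python
import math


def rmse(estimated, true):
    """Root of the mean squared difference between estimated and known true scores."""
    if len(estimated) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_diffs = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


if __name__ == "__main__":
    # Hypothetical estimated and true scores for three candidates.
    estimated_scores = [3.0, 4.0, 5.0]
    true_scores = [3.0, 4.0, 7.0]
    print(rmse(estimated_scores, true_scores))
```

A smaller value indicates that a correction method recovers the true scores more closely.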
A comparison of bivariate smoothing methods in common-item equipercentile equating
(1991) Hanson, Bradley A.
The effectiveness of smoothing the bivariate distributions of common and noncommon item scores in the frequency estimation method of common-item equipercentile equating was examined. The mean squared error of equating was computed for several equating methods and sample sizes, for two sets of population bivariate distributions of equating and nonequating item scores defined using data from a professional licensure exam. Eight equating methods were compared: five equipercentile methods and three linear methods. One of the equipercentile methods was unsmoothed equipercentile equating. Four methods of smoothed equipercentile (SEP) equating were considered: two based on log-linear models, one based on the four-parameter beta binomial model, and one based on the four-parameter beta compound binomial model. The three linear equating methods were the Tucker method, the Levine Equally Reliable method, and the Levine Unequally Reliable method. The results indicated that smoothed distributions produced more accurate equating functions than the unsmoothed distributions, even for the largest sample size. Tucker linear equating produced more accurate results than SEP equating when the systematic error introduced by assuming a linear equating function was small relative to the random error of the methods of SEP equating. Index terms: common-item equating, equating, log-linear models, smoothing, strong true score models.
The use of prior distributions in marginalized Bayesian item parameter estimation: A didactic
(1991) Harwell, Michael R.; Baker, Frank B.
The marginal maximum likelihood estimation (MMLE) procedure (Bock & Lieberman, 1970; Bock & Aitkin, 1981) has led to advances in the estimation of item parameters in item response theory. Mislevy (1986) extended this approach by employing the hierarchical Bayesian estimation model of Lindley and Smith (1972). Mislevy’s procedure posits prior probability distributions for both ability and item parameters, and is implemented in the PC-BILOG computer program. This paper extends the work of Harwell, Baker, and Zwarts (1988), who provided the mathematical and implementation details of MMLE in an earlier didactic paper, by encompassing Mislevy’s marginalized Bayesian estimation of item parameters. The purpose was to communicate the essential conceptual and mathematical details of Mislevy’s procedure to practitioners and to users of PC-BILOG, thus making it more accessible. Index terms: Bayesian estimation, BILOG, item parameter estimation, item response theory.
The discriminating power of items that measure more than one dimension
(1991) Reckase, Mark D.; McKinley, Robert L.
Determining a correct response to many test items frequently requires more than one ability. This paper describes the characteristics of items of this type by proposing generalizations of the item response theory concepts of discrimination and information. The conceptual framework for these statistics is presented, and the formulas for the statistics are derived for the multidimensional extension of the two-parameter logistic model. Use of the statistics is demonstrated for a form of the ACT Mathematics Usage Test. Index terms: item discrimination, item information, item response theory, multidimensional item response theory.
Influence of the criterion variable on the identification of differentially functioning test items using the Mantel-Haenszel statistic
(1991) Clauser, Brian E.; Mazor, Kathleen; Hambleton, Ronald K.
This study investigated the effectiveness of the Mantel-Haenszel (MH) statistic in detecting differentially functioning (DIF) test items when the internal criterion was varied. Using a dataset from a statewide administration of a life skills examination, item response sets from 1,000 Anglo-American and 1,000 Native American examinees were analyzed. The MH procedure was first applied to all the items involved. The items were then categorized as belonging to one or more of four subtests based on the skills or knowledge needed to select the correct response. Each subtest was then analyzed as a separate test, using the MH procedure. Three control subtests were also established using random assignment of test items and were analyzed using the MH procedure. The results revealed that the choice of criterion, total test score versus subtest score, had a substantial influence on the classification of items as to whether or not they were differentially functioning in the Anglo-American and Native American groups. Evidence for the convergence of judgmental and statistical procedures was found in the unusually high proportion of DIF items within one of the classifications and in the results of the reanalysis of this group of items. Index terms: differential item functioning, item bias, Mantel-Haenszel statistic, test bias.
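As a rough illustration of how a Mantel-Haenszel analysis conditions on a criterion score, the sketch below computes the standard MH common odds ratio estimate for one item across score strata. It is not taken from the study: the function name, the tuple layout, and the counts are assumptions, and the study may have used a different form of the statistic (for example, the MH chi-square).

```python
def mh_common_odds_ratio(strata):
    """Standard Mantel-Haenszel common odds ratio estimate for one item.

    Each stratum is one level of the matching criterion (e.g., total or
    subtest score) given as a tuple of counts:
    (reference correct, reference incorrect, focal correct, focal incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for two score strata.
    print(mh_common_odds_ratio([(10, 10, 10, 10), (20, 5, 20, 5)]))
```

A value near 1 suggests that, among examinees matched on the criterion, the two groups have similar odds of answering the item correctly. Changing the criterion changes how examinees are assigned to strata, which is the manipulation the abstract describes.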
An investigation of ordinal true score test theory
(1991) Donoghue, John R.; Cliff, Norman
The validity of the assumptions underlying Cliff’s (1989) ordinal true score theory (OTST) was investigated in a three-stage study. OTST makes only ordinal assumptions about the data, and provides a means of converting ordinal item information into summary ordinal information about examinees. Stage 1 was a simulation based on a classical (weak true score) test theory model. Stage 2 used a long empirical test to approximate the true order. Stage 3 was an extensive simulation based on the three-parameter logistic model. The results of all three studies were consistent; the assumption of local ordinal uncorrelatedness was violated in that partial item-item gamma (γ) correlations were positive instead of 0. The assumption of proportional distribution of ties was violated: pairs tied on one item were not distributed on the other as prescribed. The item-true order tau (τ) correlation was consistently overestimated, although the estimated τ correlated highly with the true τ. The τ correlation between total score and true order was also consistently overestimated. Stage 3 showed that these effects occurred under all conditions, although they were smaller under some conditions. Index terms: classical test theory, item response models, local independence, monte carlo simulation, nonparametric test models, ordinal regression, ordinal test models, test theory.
Coefficients for interrater agreement
(1991) Zegers, Frits E.
The degree of agreement between two raters who rate a number of objects on a certain characteristic can be expressed by means of an association coefficient (e.g., the product-moment correlation). A large number of association coefficients have been proposed, many of which belong to the class of Euclidean coefficients (ECs). A discussion of desirable properties of ECs demonstrates how the identity coefficient and its generalizations, which constitute a family of ECs, can be used to assess interrater agreement. This family of ECs contains coefficients for both nominal and non-nominal (ordinal and metric) data. In particular, it is pointed out which information contained in the data is accounted for by the various coefficients and which information is ignored. Index terms: association coefficients, correlation, Euclidean coefficients, generalized identity coefficients, interrater agreement.
An equal-level approach to the investigation of multitrait-multimethod matrices
(1991) Schweizer, Karl
An equal-level approach that yields new information for the evaluation of multitrait-multimethod (MTMM) matrices is described. The procedure is based on the analysis of item-composite relations, composite-composite relations, composites, and facets. A main characteristic of the equal-level approach is the induction of equality in data-level prior to carrying out comparisons between coefficients, because in many cases such inequalities may lead to inaccurate conclusions. Methods are proposed for ensuring comparability of coefficients even if an MTMM design includes different numbers of items for traits and methods. The concept of disaggregation is assigned a key position in the investigation of convergent and discriminant validity. In addition, measures are proposed for avoiding other distortions resulting from partial self-correlations. Index terms: disaggregated correlations, equal-level approach, multitrait-multimethod analysis, partial self-correlations, Spearman-Brown formula.
On the efficiency of IRT models when applied to different sampling designs
(1991) Berger, Martijn P. F.
The problem of obtaining designs that result in the greatest precision of the parameter estimates is encountered in at least two situations in which item response theory (IRT) models are used. In so-called two-stage testing procedures, certain designs may be specified that match difficulty levels of test items with abilities of examinees. The advantage of such designs is that the variance of the estimated parameters can be controlled. In situations in which IRT models are applied to different groups, efficient multiple-matrix sampling designs are applicable. The choice of matrix sampling designs will also influence the variance of the estimated parameters. Heuristic arguments are given here to formulate the efficiency of a design in terms of an asymptotic generalized variance criterion, and a comparison is made of the efficiencies of several designs. It is shown that some designs may be found to be most efficient for the one- and two-parameter models, but not necessarily for the three-parameter model. Index terms: efficiency, generalized variance, item response theory, optimal design.
An empirical study of the effects of small datasets and varying prior variances on item parameter estimation in BILOG
(1991) Harwell, Michael R.; Janosky, Janine E.
Long-standing difficulties in estimating item parameters in item response theory (IRT) have been addressed recently with the application of Bayesian estimation models. The potential of these methods is enhanced by their availability in the BILOG computer program. This study investigated the ability of BILOG to recover known item parameters under varying conditions. Data were simulated for a two-parameter logistic IRT model under conditions of small numbers of examinees and items, and different variances for the prior distributions of discrimination parameters. The results suggest that for samples of at least 250 examinees and 15 items, BILOG accurately recovers known parameters using the default variance. The quality of the estimation suffers for smaller numbers of examinees under the default variance, and for larger prior variances in general. This raises questions about how practitioners select a prior variance for small numbers of examinees and items. Index terms: BILOG, item parameter estimation, item response theory, parameter recovery, prior distributions, simulation.
A comparison of two area measures for detecting differential item functioning
(1991) Kim, Seock-ho; Cohen, Allan S.
The area between two item response functions is often used as a measure of differential item functioning under item response theory. This area can be measured over either an open interval (i.e., exact) or closed interval. Formulas are presented for computing the closed-interval signed and unsigned areas. Exact and closed-interval measures were estimated on data from a test with embedded items intentionally constructed to favor one group over another. No real differences in detection of these items were found between exact and closed-interval methods. Index terms: BILOG, closed interval, differential item functioning, item response functions, open interval, signed area, unsigned area.
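The closed-interval idea can be illustrated numerically. The sketch below approximates the signed and unsigned areas between two item response functions over a closed interval using the trapezoidal rule. It is an assumption-laden illustration, not the paper's formulas: the two-parameter logistic form, the function names, the interval, and the item parameters are all hypothetical.

```python
import math


def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def closed_interval_areas(item1, item2, lo=-4.0, hi=4.0, steps=2000):
    """Trapezoidal approximation of the signed and unsigned areas between
    two item response functions over the closed interval [lo, hi].

    item1 and item2 are (a, b) parameter pairs.
    """
    h = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    prev = irf_2pl(lo, *item1) - irf_2pl(lo, *item2)
    for i in range(1, steps + 1):
        theta = lo + i * h
        cur = irf_2pl(theta, *item1) - irf_2pl(theta, *item2)
        signed += 0.5 * (prev + cur) * h
        unsigned += 0.5 * (abs(prev) + abs(cur)) * h
        prev = cur
    return signed, unsigned


if __name__ == "__main__":
    # Hypothetical parameters for the same item calibrated in two groups.
    signed_area, unsigned_area = closed_interval_areas((1.0, 0.0), (1.0, 0.5))
    print(signed_area, unsigned_area)
```

When the two functions cross, positive and negative differences cancel in the signed area but not in the unsigned area, which is why both measures are usually reported.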
The relationship of power of statistical tests to range of talent: A correction and amplification
(1991) Humphreys, Lloyd G.
Appropriate moderated regression and inappropriate research strategy: A demonstration of information loss due to scale coarseness
(1991) Russell, Craig J.; Pinto, Jeffrey K.; Bobko, Philip
Paunonen and Jackson (1988) demonstrated that stepwise moderated regression provides a test of interaction effects that protects the nominal Type I error rate. However, the stepwise procedure has also been characterized as failing to detect interaction effects in empirical studies. This issue has led to questions regarding the method’s statistical power (Bobko, 1986; Zedeck, 1971) in applied research. It is demonstrated that, because of a research strategy frequently used in empirical investigations, the probability of Type II error in detecting a true interaction effect is unknown. Specifically, the number of scale steps used in measuring the dependent variable is shown to result in a form of systematic error that can spuriously increase or decrease the expected effect size of the interaction. The problem is also discussed in the context of testing more complex models. Recommendations for eliminating this problem in future research designs are provided. Index terms: information loss, interaction effects, Likert scales, moderated regression, response transformation.
Effects of passage and item scrambling on equating relationships
(1991) Harris, Deborah J.
This study investigated the effects of passage and item scrambling on equipercentile and item response theory equating using a random groups design. For all four tests and for both scramblings used, differences in item and examinee statistics were found to exist between all three forms used (the base form and the two scrambled forms). Up to 50% of the examinees administered a scrambled form would have received a different scale score if the base form equating, rather than the scrambled form equating, had been used to convert their number-correct scores. It is, therefore, suggested that caution be used when scrambled forms are being administered, because in applications such as that studied here, the effects of applying the equating results obtained using a base form to the number-correct scores obtained on a scrambled form can be quite substantial in terms of the numbers of examinees who would receive different scores. Index terms: context effects, equating, item scrambling.
The effect of numbers of experts and common items on cutting score equivalents based on expert judgment
(1991) Norcini, John; Shea, Judy; Grosso, Louis
The effect of different numbers of experts and common items on the scaling of cutting scores derived by experts’ judgments was investigated. Four test forms were created from each of two examinations; each form from the first examination shared a block of items with one form from the second examination. Small groups of experts set standards on each using a modification of Angoff’s (1971) method. Cutting score equivalents were estimated for the matched forms using different group sizes and numbers of common items; they were compared with cutting score equivalents based on score equating. Results showed that a reduction in error is associated with using more experts or having more items in common between the two forms. For 25 or more common items and five or more judges, the error was about one item on a 100-item test. More than five experts or 25 common items made only a very small difference in error. Index terms: cutting scores, equating, expert judgment, standard setting.
Expert-system scores for complex constructed-response quantitative items: A study of convergent validity
(1991) Bennett, Randy Elliot; Sebrechts, Marc M.; Rock, Donald A.
This study investigated the convergent validity of expert-system scores for four mathematical constructed-response item formats. A five-factor model comprised of four constructed-response format factors and a Graduate Record Examination (GRE) General Test quantitative factor was posed. Confirmatory factor analysis was used to test the fit of this model and to compare it with several alternatives. The five-factor model fit well, although a solution comprised of two highly correlated dimensions (GRE-quantitative and constructed-response) represented the data almost as well. These results extend the meaning of the expert system’s constructed-response scores by relating them to a well-established quantitative measure and by indicating that they signify the same underlying proficiency across item formats. Index terms: automatic scoring, constructed response, expert system, free-response items, open-ended items.
The influence of test characteristics on the detection of aberrant response patterns
(1991) Reise, Steven P.; Due, Allan M.
Statistical methods to assess the congruence between an item response pattern and a specified item response theory model have recently proliferated. This "person fit" research has focused on the question: To what extent can person-fit indices identify well-defined forms of aberrant item response? This study extended previous person-fit research in two ways. First, an unexplored model for generating aberrant response patterns was explicated. The data-generation model is based on the theory that aberrant item responses result in less psychometric information for the individual than predicted by the parameters of a specified response model. Second, the proposed response aberrancy generation model was implemented to investigate how the aberrancy detection power of a person-fit statistic is influenced by test properties (e.g., the spread of item difficulties). Results indicated that detecting aberrant response patterns was especially problematic for tests with fewer than 20 items, and for tests with limited ranges of item difficulty. An applied consequence of these results is that certain types of test designs (e.g., peaked tests) and administration procedures (e.g., adaptive tests) potentially act to limit the detection of aberrant item responses. Index terms: aberrancy detection, IRT, person fit, response aberrancy, Z₁ index.
Appropriateness measurement for some multidimensional test batteries
(1991) Drasgow, Fritz; Levine, Michael V.; McLaughlin, Mary E.
Model-based methods for the detection of individuals inadequately measured by a test have generally been limited to unidimensional tests. Extensions of unidimensional appropriateness indices are developed here for multi-unidimensional tests (i.e., multidimensional tests composed of unidimensional subtests). Simulated and real data were used to evaluate the effectiveness of the multitest appropriateness indices. Very high rates of detection of spuriously high and spuriously low response patterns were obtained with the simulated data. These detection rates were comparable to rates obtained for long unidimensional tests (both simulated and real) with approximately the same number of items. For real data, similarly high detection rates were obtained in the spuriously high condition; slightly lower detection rates were observed for the spuriously low condition. Several directions for future research are described. Index terms: appropriateness measurement, item response theory, multidimensional tests, optimal appropriateness measurement, polychotomous measurement.
Sequential reliability tests
(1991) Eiting, Mindert H.
Sequential tests for a stepped-up reliability estimator and coefficient alpha are developed. In a series of monte carlo experiments, the efficiency of the tests relative to each other and to fixed-sample tests is established, as well as the robustness of the alpha test. Both tests proved to be efficient, and the alpha test proved to be reasonably robust to deviations from normality and deviations from equal item error variances. On average, 47% of the sample size can be saved if a sequential test is applied instead of a fixed-sample test. Index terms: monte carlo simulation, reliability hypothesis tests, sampling theory for reliability estimators, sequential probability ratio test.
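For orientation, the two reliability estimators named in this abstract can be sketched in their standard textbook forms: the Spearman-Brown stepped-up reliability and coefficient alpha. The sketch covers only the estimators, not the sequential tests developed in the paper; the function names and the sample scores are hypothetical.

```python
from statistics import variance


def spearman_brown(reliability, length_factor):
    """Stepped-up reliability of a test lengthened by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def coefficient_alpha(scores):
    """Coefficient alpha from a persons-by-items list of score rows."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two persons and two items")
    # Sample variance of each item across persons.
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of the total score across persons.
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical scores: three persons, two items.
    print(spearman_brown(0.5, 2))
    print(coefficient_alpha([[1, 1], [2, 2], [3, 3]]))
```

Both functions return point estimates only. A sequential test would evaluate such an estimate repeatedly as observations accumulate, stopping once a decision boundary is crossed.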
The measurement of latent traits by proximity items
(1991) Hoijtink, Herbert
A probabilistic parallelogram model for the measurement of latent traits by proximity items (the PARELLA model) is introduced. This model assumes that the responses of persons to items result from proximity relations: the smaller the distance between person and item, the larger the probability that the person will agree with the content of the item. The model is unidimensional and assigns locations to items and persons on the latent trait. The parameters of the PARELLA model are estimated by marginal maximum likelihood and expectation maximization. The efficiency of the estimation procedure is illustrated, a diagnostic for the fit of items to the model is presented, and the PARELLA model is used for the analysis of three empirical datasets. Index terms: expectation maximization, latent trait theory, marginal maximum likelihood, nonmonotone trace lines, single-peaked preference functions, unfolding.

University Digital Conservancy

University of Minnesota Twin Cities
