Item Response Theory for Scores on Tests Including Polytomous Items with Ordered Responses

David Thissen, University of North Carolina at Chapel Hill
Mary Pommerich, American College Testing
Kathleen Billeaud, University of North Carolina at Chapel Hill
Valerie S. L. Williams, National Institute of Statistical Sciences

Item response theory (IRT) provides procedures for scoring tests including any combination of rated constructed-response and keyed multiple-choice items, in that each response pattern is associated with some modal or expected a posteriori estimate of trait level. However, various considerations that frequently arise in large-scale testing make response-pattern scoring an undesirable solution. Methods are described based on IRT that provide scaled scores, or estimates of trait level, for each summed score for rated responses, or for combinations of rated responses and multiple-choice items. These methods may be used to combine the useful scale properties of IRT-based scores with the practical virtues of a scale based on a summed score for each examinee. Index terms: graded response model, item response theory, ordered responses, polytomous models, scaled scores.

Item response theory (IRT) provides a score scale that is more useful for many purposes (e.g., for the construction of developmental scales or for the calibration of tests comprising different types of items or exercises) than the summed score, percentage correct, or percentile scales. With the exception of the Rasch family of models, for which the summed score is a sufficient statistic for the characterization of the latent variable ($\theta$) (Masters & Wright, 1984; Rasch, 1960), under IRT models each response pattern is usually associated with a unique estimate of $\theta$. These estimates of $\theta$ may be used as scaled response pattern scores; they have the advantage that they extract all information available in the item responses, if the model is appropriate for the data.
In addition, the IRT model produces estimates of the probability that each response pattern will be observed in a sample from a specified population.

In applied measurement contexts, however, it is often desirable to consider the implications of IRT analysis for summed scores, rather than response patterns, even if the IRT model used is not part of the Rasch family. For example, in a large-scale testing program it may be desirable to tabulate the IRT scaled scores associated with each summed score on operational forms, using item parameter estimates obtained from item tryout data, before the operational forms are administered. In addition, it may be useful to compute model-based estimates of the summed score distribution (e.g., to create percentile tables for use as an interpretive aid for score reporting). Model-based estimates of the summed score distribution also may have value as a statistical diagnostic of the goodness of fit of the IRT model, including the validity of the assumed underlying population distribution.

APPLIED PSYCHOLOGICAL MEASUREMENT Vol. 19, No. 1, March 1995, pp. 39-49 © Copyright 1995 Applied Psychological Measurement Inc. 0146-6216/95/010039-11$1.80 Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Many contemporary tests include extended constructed-response items, for which the item scores are ordered categorical ratings provided by raters. In some cases, the constructed-response items comprise the entire test; in other cases, there are multiple-choice items as well. In either case, some total score is often required, combining the ratings of the constructed-response items (and the item scores on the multiple-choice items, if any are present).
Simple summed scores may not be useful in this context, because of the problems associated with the selection of relative weights for the different items and item types, and because the constructed-response items are often on forms of widely varying difficulty. If the collection of items is sufficiently well represented by a unidimensional IRT model, scaled summed scores may be a viable scoring alternative.

IRT for Summed Scores

For any IRT model for items indexed by $i$ with ordered item scores $k_i = 0, \ldots, K_i$, collected in the vector $\mathbf{k}$, the likelihood for any summed score $j = \sum k_i$ is

\[ L_j(\theta) = \sum_{j = \sum k_i}^{\text{patterns}} L(\mathbf{k} \mid \theta), \tag{1} \]

where the summation is over the response patterns with total score $j$. The likelihood for each response pattern is

\[ L(\mathbf{k} \mid \theta) = \prod_i T_{k_i}(\theta)\, \phi(\theta), \tag{2} \]

where $T_{k_i}(\theta)$ is the category response function (CRF) for category $k$ of item $i$ (i.e., the conditional probability of response $k$ to item $i$ given $\theta$) and $\phi(\theta)$ is the population density. Thus, the likelihood for each score is

\[ L_j(\theta) = \sum_{j = \sum k_i}^{\text{patterns}} \prod_i T_{k_i}(\theta)\, \phi(\theta). \tag{3} \]

Therefore, the probability of each score $j$ is

\[ P_j = \int L_j(\theta)\, d\theta, \tag{4} \]

or

\[ P_j = \int \sum_{j = \sum k_i}^{\text{patterns}} L(\mathbf{k} \mid \theta)\, d\theta, \tag{5} \]

or

\[ P_j = \int \sum_{j = \sum k_i}^{\text{patterns}} \prod_i T_{k_i}(\theta)\, \phi(\theta)\, d\theta. \tag{6} \]

Given an algorithm to compute the integrand in Equation 6, it is straightforward to compute the average [or expected a posteriori (EAP)] scaled score (Bock & Mislevy, 1982) associated with each score,

\[ \mathrm{EAP}\left(\theta \mid j = \textstyle\sum k_i\right) = \frac{\int \theta\, L_j(\theta)\, d\theta}{P_j}, \tag{7} \]

and the corresponding standard deviation (SD),

\[ \mathrm{SD}\left(\theta \mid j = \textstyle\sum k_i\right) = \left\{ \frac{\int \left[\theta - \mathrm{EAP}\left(\theta \mid j = \sum k_i\right)\right]^2 L_j(\theta)\, d\theta}{P_j} \right\}^{1/2}. \tag{8} \]

The values computed using Equation 7 may be tabulated and used as the IRT scaled-score transformation of the summed scores, and the values of Equation 8 may be used as a standard description of the uncertainty associated with those scaled scores.
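As a concrete reading of Equations 3 through 8, the following minimal sketch enumerates every response pattern for a very short test and evaluates the integrals by rectangular quadrature. It is an illustration only, not the authors' software: the helper names (`binary_item`, `summed_score_table`), the two-parameter logistic form of the items, and the standard normal $\phi(\theta)$ are assumptions made for the example; the three (a, b) pairs are those of the binary items in the article's numerical example.

```python
import itertools
import math

def normal_density(theta):
    """Standard normal density, used here as an assumed phi(theta)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def binary_item(a, b):
    """Hypothetical helper: [T_0, T_1] for a two-parameter logistic item."""
    def t1(theta):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    def t0(theta):
        return 1.0 - t1(theta)
    return [t0, t1]

def summed_score_table(items, grid, density=normal_density):
    """Brute-force P_j, EAP, and SD for each summed score j.

    items[i][k] is a function theta -> T_{k_i}(theta); grid holds equally
    spaced quadrature points. Returns a list of (j, P_j, EAP, SD) tuples."""
    step = grid[1] - grid[0]
    max_score = sum(len(item) - 1 for item in items)
    # like[j][q] accumulates Equation 3 at quadrature point q.
    like = [[0.0] * len(grid) for _ in range(max_score + 1)]
    for pattern in itertools.product(*(range(len(item)) for item in items)):
        j = sum(pattern)
        for q, theta in enumerate(grid):
            value = density(theta)
            for item, k in zip(items, pattern):
                value *= item[k](theta)
            like[j][q] += value
    table = []
    for j, ordinates in enumerate(like):
        p = sum(ordinates) * step                                    # Equation 6
        eap = sum(t * v for t, v in zip(grid, ordinates)) * step / p  # Equation 7
        var = sum((t - eap) ** 2 * v for t, v in zip(grid, ordinates)) * step / p
        table.append((j, p, eap, math.sqrt(var)))                    # Equation 8
    return table

# 46 points from -4.5 to 4.5 at intervals of .2
grid = [-4.5 + 0.2 * q for q in range(46)]
items = [binary_item(0.5, -1.0), binary_item(1.0, 0.0), binary_item(1.5, 1.0)]
table = summed_score_table(items, grid)
for j, p, eap, sd in table:
    print(j, round(p, 4), round(eap, 2), round(sd, 2))
```

The enumeration visits every pattern, so its cost grows as the product of the numbers of categories; it is useful mainly as a check on faster methods for a handful of items.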
The score histogram created using the values of Equation 6 may be used to construct summed-score percentile tables; if the IRT model fits the data, this can be done accurately using only the item parameters for any group with a known population density. Thus, percentile tables for summed scores can be constructed using item tryout data, before the operational test is administered. This same histogram may also prove useful as a diagnostic statistic for the goodness of fit of the model by comparing the modeled representation of the score distribution to the observed data.

Algorithms for Computing $L_j(\theta)$

Lord (1953) used heuristic procedures to describe the difference between the distribution of summed scores, $L_j(\theta)$, and the underlying distribution of $\theta$, $\phi(\theta)$ (see also Lord & Novick, 1968, pp. 387-392). However, practical calculation of the summed score distribution implied by an IRT model has awaited both contemporary computational power and solutions to the apparently intractable computational problem.

The Brute-Force Method

An exact numerical brute-force evaluation of Equation 6, requiring the computation of $\prod_i (K_i + 1)$ likelihoods, is possible for a few items; but it is inconceivable for many items. Brute-force may be extended to approximately 20 items by using an algorithm involving the computation of each pattern likelihood from some other previously computed pattern likelihood by a single (list) multiplication; this approach is used in the computer program TESTFACT (Wilson, Wood, & Gibbons, 1991).
For binary items, by carefully ordering the computation of the likelihoods for the $2^n$ patterns (where $n$ is the number of items), such an algorithm can compute all $2^n$ likelihoods at a computational cost of only a single (list) multiplication for each (Thissen, Pommerich, & Williams, 1993). Nevertheless, due to the exponential computational complexity of this approach, this algorithm cannot be extended to more items regardless of improvements in computational speed.

An Approximation Method

Lord & Novick (1968) stated that "...approximations appear inevitable..." (p. 525), and suggested the use of an approximation to the compound binomial, attributed to Walsh (1963), to compute the likelihood of a summed score for binary items as a function of $\theta$. For $n$ items, this Taylor-series expansion has $n$ terms; however, in practice the first two terms suffice for acceptable accuracy. The two-term version of the approximation is:

\[ \sum_{j = \sum k_i}^{\text{patterns}} \prod_i T_{k_i}(\theta) \approx P_n(j) + \frac{n}{2} V_2 C_2(j), \tag{9} \]

where

\[ P_n(j) = \begin{cases} \binom{n}{j} M^j (1 - M)^{n-j} & \text{for } j = 0, 1, \ldots, n; \\ 0 & \text{otherwise,} \end{cases} \tag{10} \]

\[ C_2(j) = \sum_{v=0}^{2} (-1)^{v+1} \binom{2}{v} P_{n-2}(j - v), \tag{11} \]

\[ V_2 = \frac{1}{n} \sum_i \left[ T_{1_i}(\theta) - M \right]^2, \tag{12} \]

and

\[ M = \frac{1}{n} \sum_i T_{1_i}(\theta). \tag{13} \]

Yen (1984) used this approximation to develop an algorithm to compute the mode of

\[ \sum_{j = \sum k_i}^{\text{patterns}} \prod_i T_{k_i}(\theta) \tag{14} \]

for use as a scaled score for examinees with summed score $j$ on a binary test using the three-parameter logistic model. She reported that the two-term Taylor expansion produced noticeably better results than the one-term solution, which is simply an inverse transformation of the test response function; but the three- and four-term solutions appeared not to add useful precision.
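The two-term approximation of Equation 9 can be sketched for a single value of $\theta$ as follows. This is a hedged illustration: it assumes the binomial form of $P_n(j)$ and the definitions of $M$, $V_2$, and $C_2(j)$ given in Equations 10 through 13, and the function names are hypothetical. The trial input is the set of IRF ordinates at $\theta = 0$ for the three binary items of the article's numerical example; with only three items the approximation is visibly rougher than the 20-item results the text reports.

```python
import math

def binomial(n, j, m):
    """P_n(j) of Equation 10: binomial probability, 0 outside 0..n."""
    if j < 0 or j > n:
        return 0.0
    return math.comb(n, j) * m ** j * (1.0 - m) ** (n - j)

def two_term_score_probs(p_correct):
    """Approximate P(summed score = j | theta) for j = 0..n (Equation 9).

    p_correct holds T_1i(theta) for the n binary items at one theta."""
    n = len(p_correct)
    m = sum(p_correct) / n                                  # Equation 13
    v2 = sum((p - m) ** 2 for p in p_correct) / n           # Equation 12
    probs = []
    for j in range(n + 1):
        c2 = sum((-1) ** (v + 1) * math.comb(2, v) * binomial(n - 2, j - v, m)
                 for v in range(3))                         # Equation 11
        probs.append(binomial(n, j, m) + 0.5 * n * v2 * c2)
    return probs

# Ordinates T_1i(0) for the three binary items of the numerical example.
approx = two_term_score_probs([0.62246, 0.50000, 0.18243])
print([round(x, 4) for x in approx])
```

For two items the two-term form reproduces the exact score probabilities, which makes a convenient check on any implementation.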
The approximation in Equation 9 also may be substituted for the sum of products in Equations 6, 7, and 8 to compute $P_j$, $\mathrm{EAP}(\theta \mid j = \sum k_i)$, and $\mathrm{SD}(\theta \mid j = \sum k_i)$. When the results for the two-term approximation were compared to the exact results for one of the 20-item examples used by Yen (1984), the error of approximation was usually less than .001 for $\mathrm{EAP}(\theta \mid j = \sum k_i)$ and $\mathrm{SD}(\theta \mid j = \sum k_i)$ (Thissen et al., 1993). Perfect scores were the exception because the second term of the two-term approximation was 0; for those scores, the approximation was inexact by as much as .05. The error of approximation for $P_j$ was approximately .0001. For practical use in constructing score reporting tables, which usually use no greater precision than tenths of a SD for the scores and their standard errors, and integral values for percentile tables, this degree of precision appears to be sufficient. However, the approximation in Equation 9 is still somewhat computationally burdensome, and no generalization has been offered for items with more than two response categories.

A Recursive Algorithm

The problem of the computational burden is solved by an alternative procedure briefly described by Lord & Wingersky (1984). Abandoning the contention of Lord & Novick (1968) that "...approximation is inevitable..." (p. 525), Lord and Wingersky described a simple recursive algorithm for the computation of

\[ L_j(\theta) = \sum_{j = \sum k_i}^{\text{patterns}} \prod_i T_{k_i}(\theta) \tag{15} \]

for binary items. The algorithm is based on the distributive law, and generalizes readily to items with any number of response categories.
The generalization follows: Let $i = 0, 1, \ldots, n$ index the items (it is somewhat unusual to index the items from 0 to $n$ for $n + 1$ items; however, in this case the correspondence of that system with the usual practice of indexing the scores from 0, and the common practice of indexing the item response categories from 0, simplifies both the notation and the software implementation); let $k = 0, 1, \ldots, K_i$ index the response categories for item $i$; and let $T_{k_i}(\theta)$ be the CRF for category $k$ of item $i$. In addition, the summed scores for a set of items $[0 \ldots n^*]$ are $j = 0, 1, \ldots, \sum_{i=0}^{n^*} K_i$, and the likelihood for summed score $j$ for a set of items $[0 \ldots n^*]$ is $L_j^{n^*}(\theta)$; the population distribution is $\phi(\theta)$. The generalized recursive algorithm is:

Set $n^* = 0$
    $L_j^{n^*}(\theta) = T_{j_{n^*}}(\theta)$, for $j = 0, 1, \ldots, K_{n^*}$.
Repeat:
    For item $n^* + 1$ and scores $j = 0, 1, \ldots, \sum_{i=0}^{n^*+1} K_i$,

\[ L_j^{n^*+1}(\theta) = \sum_{k \in \text{item } n^*+1} L_{j-k}^{n^*}(\theta)\, T_{k_{n^*+1}}(\theta), \tag{16} \]

    where terms with $j - k$ outside the range of scores attainable on items $[0 \ldots n^*]$ are omitted.
    Set $n^* = n^* + 1$
Until $n^* = n$.

For a sample from a population with distribution $\phi(\theta)$, the likelihood for score $j$ is

\[ L_j(\theta) = L_j^{n}(\theta)\, \phi(\theta), \tag{17} \]

and $\mathrm{EAP}(\theta \mid j = \sum k_i)$, $\mathrm{SD}(\theta \mid j = \sum k_i)$, and $P_j$ can be computed by integrating $L_j(\theta)$.

No particular parametric form for the CRFs is assumed in the formulation of the recursive algorithm. In the North Carolina testing program, for which some of this system was developed, the three-parameter logistic model is used with binary-scored multiple-choice items and Samejima's (1969) graded model is used for multiple-category rated items. However, in principle, any CRFs could be used, such as the nonparametric kernel smooths described by Ramsay (1991).
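The recursion in Equation 16 is short enough to sketch directly. The version below works at a single value of $\theta$ and takes the CRF ordinates as plain lists; the function name and data layout are assumptions for illustration, not the LISP-STAT implementation mentioned in the article. The trial input is the set of ordinates at $\theta = 0$ for the three binary items of the article's numerical example.

```python
def summed_score_likelihoods(crf_values):
    """Recursive computation of the summed-score likelihoods at one theta.

    crf_values[i][k] is T_{k_i}(theta), the probability of category k on
    item i. Element j of the result is the likelihood of summed score j,
    before multiplication by the population density (Equation 17)."""
    like = list(crf_values[0])                 # n* = 0: L_j = T_{j_0}
    for item in crf_values[1:]:                # bring in item n* + 1
        new = [0.0] * (len(like) + len(item) - 1)
        for j_prev, l_prev in enumerate(like):
            for k, t in enumerate(item):
                new[j_prev + k] += l_prev * t  # one term of Equation 16
        like = new
    return like

# Ordinates [T_0, T_1] at theta = 0 for the three binary items.
at_zero = [[0.37754, 0.62246], [0.50000, 0.50000], [0.81757, 0.18243]]
likelihoods = summed_score_likelihoods(at_zero)
print([round(x, 5) for x in likelihoods])
```

Each item is processed once, so the work grows with the number of items times the number of attainable scores rather than with the number of response patterns. Items with more than two categories need no special handling: their lists are simply longer.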
The algorithm would produce accurate, but meaningless, results if it were used with items for which the responses are not ordered. The results would be meaningless because the response patterns included in any particular summed score would not have likelihoods concentrated near the same values of $\theta$; therefore, such summed-score likelihoods would tend to be very flat with very large SDs. Nevertheless, the algorithm is completely general.

[An implementation for the LISP-STAT computing environment (Tierney, 1990) is available from the author.] For simplicity of programming, it uses rectangular quadrature, or the "repeated midpoint formula" (Stroud, 1974, p. 120), to compute the values of the integrals. Stroud described a number of alternative methods for numerical evaluation of such integrals. Some of the more complex methods, such as Gauss-Hermite quadrature, have often been used in IRT. Stroud (1974) noted that although "often a Gauss formula will be much superior to any other formula with the same number of points... It is not true, however, that a Gauss formula is always the best" (p. 187). For the integration of functions that depend on a large number of unknown parameters, such as those considered here, Stroud recommended that various quadrature methods be compared over a wide variety of possible values of the parameter set to determine the best method. If such a comparison were to be done, it would be very useful for many other applications of IRT, as well as that discussed here.

A Numerical Example

This example is based on three binary items with $a_0 = .5$, $b_0 = -1.0$, $a_1 = 1.0$, $b_1 = 0.0$, $a_2 = 1.5$, $b_2 = 1.0$, and seven quadrature points at $\theta = -3, -2, -1, 0, 1, 2,$ and $3$. Numerical representations of the item response functions (IRFs) and a number of intermediate results are shown in Table 1. The uppermost section shows the values of the IRFs at the seven values of $\theta$.
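The IRF ordinates in that uppermost section can be regenerated from the item parameters. The sketch below assumes a two-parameter logistic IRF, $T_1(\theta) = 1 / \{1 + \exp[-a(\theta - b)]\}$, with no 1.7 scaling constant; the text does not state the functional form, so this is an inference from the tabled values (for example, .26894 for the first item at $\theta = -3$), and the names are hypothetical.

```python
import math

def irf(theta, a, b):
    """Assumed two-parameter logistic IRF (no 1.7 scaling constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

ITEMS = [(0.5, -1.0), (1.0, 0.0), (1.5, 1.0)]   # (a_i, b_i) for i = 0, 1, 2
THETAS = [-3, -2, -1, 0, 1, 2, 3]                # the seven quadrature points

# ordinates[i][q] is T_1 for item i at THETAS[q]; T_0 is its complement.
ordinates = [[irf(t, a, b) for t in THETAS] for a, b in ITEMS]
for i, row in enumerate(ordinates):
    print(i, [round(x, 5) for x in row])
```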
For $n^* = 0$, there are only two possible scores, 0 and 1, and $L_j^0(\theta)$ is equal to $T_{j_0}(\theta)$. Then, as $n^*$ increases and each successive item is used, the likelihood for a score is the sum of two terms: the product of the likelihood for that score on the preceding items and $T_{0_{n^*}}(\theta)$, and the product of the likelihood for that score minus 1 on the preceding items and $T_{1_{n^*}}(\theta)$ [except, of course, for the summed scores of 0 and $n^* + 1$, the minimum and maximum, which involve only a single product]. For polytomous items, again excepting scores less than the number of response categories for the item and scores near the maximum attainable, for each value of $n^*$ the sum involves $K_{n^*}$ terms.

For tests with more than the four score categories illustrated here, seven-point rectangular quadrature is not adequate. However, the relative robustness of the method to quadrature is illustrated by the fact that if quadrature in the example in Table 1 is increased from the seven points at unit intervals shown in the table to 46 points between -4.5 and 4.5 with an interval of .2, the final values of the proportion in each score group differ by less than .0001, and the values of the EAPs differ by less than .01.

Table 1
A Numerical Example for Three Binary Items for Values of i and j and n* = 0, 1, 2 at Seven Levels of θ
(In the IRF section, the j column gives the response category; T_ji denotes category j of item i.)

Variable                          i   j    θ=-3     θ=-2     θ=-1     θ=0      θ=1      θ=2      θ=3
Initialization: IRF ordinates T_ji(θ)
                                  0   0   .73106   .62246   .50000   .37754   .26894   .18243   .11920
                                      1   .26894   .37754   .50000   .62246   .73106   .81757   .88080
                                  1   0   .95257   .88080   .73106   .50000   .26894   .11920   .04743
                                      1   .04743   .11920   .26894   .50000   .73106   .88080   .95257
                                  2   0   .99753   .98901   .95257   .81757   .50000   .18243   .04743
                                      1   .00247   .01099   .04743   .18243   .50000   .81757   .95257
Initialization: For n* = 0
L_0^0 = T_00                          0   .73106   .62246   .50000   .37754   .26894   .18243   .11920
L_1^0 = T_10                          1   .26894   .37754   .50000   .62246   .73106   .81757   .88080
For n* = 1
L_0^1 = L_0^0 T_01                    0   .69639   .54826   .36553   .18877   .07233   .02175   .00565
L_1^1 = L_1^0 T_01 + L_0^0 T_11       1   .29086   .40674   .50000   .50000   .39322   .25814   .15532
L_2^1 = L_1^0 T_11                    2   .01276   .04500   .13447   .31123   .53445   .72012   .83902
For n* = 2
L_0^2 = L_0^1 T_02                    0   .69467   .54224   .34819   .15433   .03616   .00397   .00027
L_1^2 = L_1^1 T_02 + L_0^1 T_12       1   .29186   .40829   .49362   .44322   .23278   .06487   .01275
L_2^2 = L_2^1 T_02 + L_1^1 T_12       2   .01344   .04898   .15181   .34567   .46384   .34241   .18775
L_3^2 = L_2^1 T_12                    3   .00003   .00049   .00638   .05678   .26722   .58875   .79923

Example Applications

Polytomous Data

Data from the North Carolina End-of-Grade 3 Social Studies exam were used. The test consisted of three open-ended items, which were administered to 23,374 students in the spring of 1993. The responses to the open-ended items were rated on a 4-point scale; item scores ranged from 0-3 and summed scores ranged from 0-9. The parameter estimates for these items were obtained using Samejima's (1969) graded model and the computer program MULTILOG (Thissen, 1991) (see Table 2).

Figure 1 shows the posterior density for the three response patterns that had a summed score of 1: 100, 010, and 001. Of these response patterns, 001 was the most frequently observed in the data (2,830 examinees), followed by 010 (2,008 examinees), and then 100 (811 examinees).
This pattern reflected the differential difficulty of the items (see Table 2): Item 3 had the lowest threshold ($b$) for Category 1 ($b_1 = .08$), so it was most likely that if an examinee received a single 1 and two 0s, the 1 would be for Item 3. It was not much more difficult to obtain a score of 1 on Item 2 ($b_1 = .12$), but obtaining a 1 on Item 1 was substantially more difficult ($b_1 = .65$). Table 2 shows that Item 2 was the most discriminating ($a$), followed by Item 1, and then Item 3. Thus, the posterior distributions for the three response patterns shown in Figure 1 had different locations: The averages, or EAPs, for the response patterns 010, 100, and 001 were .08, -.15, and -.38, respectively. Response-pattern scaled scores reflect the differences among examinees with different response patterns. For example, examinees with response pattern 010 had somewhat higher $\theta$s than those with response patterns 001 or 100.

Table 2
Item Parameter Estimates for the Three-Item Social Studies Test [θ was Distributed N(0,1)]

Item     a      b1      b2      b3
1      1.87    .65    1.97    3.14
2      2.66    .12    1.57    2.69
3      1.24    .08    2.03    4.30

Figure 1 also shows the posterior distribution for all examinees who obtained a summed score of 1, which is the total of the three posterior distributions for response patterns 100, 010, and 001, computed using the item parameters in Table 2 and the recursive algorithm described above. The average of this posterior distribution, or $\mathrm{EAP}(\theta \mid j = \sum k_i)$, may be used to describe the average ability of examinees who obtained a summed score of 1, in the same way that the three different response pattern EAPs described the ability of examinees with a particular response pattern.
For the example in Figure 1, the summed-score EAP [$\mathrm{EAP}(\theta \mid j = 1)$] was -.18, and the associated SD was .61. The summed-score EAP is, approximately, a weighted average of the EAPs for the response patterns that yield that summed score, and the SD of the summed-score EAP tends to be slightly larger than the SD of the EAPs for most of the patterns with that summed score. For the example of a summed score of 1 on this test, the SDs for the pattern EAPs were .60 (001), .54 (010), and .57 (100). Thus, although there was some loss of precision entailed in computing scaled scores for each summed-score group instead of for each response-pattern group, that loss of precision appears small. For the most frequent response pattern with a summed score of 1 (001), the difference between the SD of the pattern posterior and that for the summed score posterior was .01.

Figure 1. The Posterior Density for the Three Response Patterns on the Grade 3 Social Studies Test That Had a Summed Score of 1 (100, 010, and 001) and the Posterior Distribution for all Examinees Obtaining a Summed Score of 1 (θ was Standardized). [Figure not reproduced; curves are labeled Summed Score = 1, 001, 010, and 100, and the horizontal axis is θ from -3 to +3.]

Table 3 shows the range of response pattern EAPs and the associated SDs for each summed score on this three-item test, as well as the EAPs and SDs for the most common and least common response patterns, with the summed-score EAPs and SDs. For items that differ in discrimination as these did, the response pattern EAPs may be highly variable (as much as a standard unit) within any particular summed score; however, the variation was mostly accounted for by the few examinees who produced unusual response patterns. Most of the responses were in a few common response patterns, and the summed-score EAPs and SDs were very similar to those for the most common response patterns within each score.
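The pattern and summed-score posteriors for this example can be sketched from the Table 2 estimates. The code below is illustrative rather than the authors' program: it assumes the logistic form of Samejima's graded model with no 1.7 scaling constant, a standard normal population density, and rectangular quadrature at 46 points from -4.5 to 4.5; all function names are hypothetical. Under those assumptions it should land close to the values quoted in the text for a summed score of 1.

```python
import math

ITEMS = [  # (a, [b1, b2, b3]) from Table 2
    (1.87, [0.65, 1.97, 3.14]),
    (2.66, [0.12, 1.57, 2.69]),
    (1.24, [0.08, 2.03, 4.30]),
]
GRID = [-4.5 + 0.2 * q for q in range(46)]

def graded_crfs(theta, a, bs):
    """Category probabilities for one item under the assumed graded model."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(bs) + 1)]

def pattern_posterior(pattern):
    """Unnormalized posterior ordinates for one response pattern (Equation 2)."""
    ordinates = []
    for theta in GRID:
        value = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        for (a, bs), k in zip(ITEMS, pattern):
            value *= graded_crfs(theta, a, bs)[k]
        ordinates.append(value)
    return ordinates

def summarize(ordinates):
    """Return (proportion, EAP, SD) by rectangular quadrature."""
    step = GRID[1] - GRID[0]
    p = sum(ordinates) * step
    eap = sum(t * v for t, v in zip(GRID, ordinates)) * step / p
    var = sum((t - eap) ** 2 * v for t, v in zip(GRID, ordinates)) * step / p
    return p, eap, math.sqrt(var)

# Patterns are written (Item 1, Item 2, Item 3), matching 100, 010, 001.
posts = {pat: pattern_posterior(pat) for pat in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]}
score1 = [sum(values) for values in zip(*posts.values())]
p1, eap1, sd1 = summarize(score1)
p0, eap0, sd0 = summarize(pattern_posterior((0, 0, 0)))
for pat, post in posts.items():
    print(pat, [round(x, 2) for x in summarize(post)])
print("summed score 1:", round(p1, 3), round(eap1, 2), round(sd1, 2))
```

Summing the three pattern posteriors point by point gives the same ordinates that the recursive algorithm would produce for a summed score of 1, which is why the summed-score EAP behaves like a probability-weighted average of the pattern EAPs.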
In general, the increase in the SDs from the smallest values for the response-pattern EAPs to the summed-score EAPs was approximately 10%. This is similar to the values that Birnbaum (1968, p. 477) reported in his study of the difference between summed scores and response-pattern scores, and is also approximately the same value observed in most applications of this procedure. The 10% loss of precision (on the scale of the SDs, which are reported as the standard errors of EAP scaled scores) represents the cost of assigning scores at the summed-score level rather than at the more precise response-pattern level.

Table 3
For Each Summed Score for the Three-Item Social Studies Test, Response-Pattern EAP Scaled Scores and Their SDs, Summed Score EAPs and Their SDs, and Observed and Modeled Proportion in Each Score Group
(Summed scores 0 and 9 each arise from a single response pattern, so one pattern EAP and SD are shown.)

Summed   Pattern          Pattern        Most Common    Least Common   (θ|Score)     Observed     Modeled
Score    EAP Range        SD Range       Pattern        Pattern                      Score Group  Score Group
                                         EAP    SD      EAP    SD      EAP    SD     Proportion   Proportion
0        -.88             .70                                          -.88   .70    .321         .325
1        -.38 to .08      .54 to .60     -.38   .60     -.15   .57     -.18   .61    .242         .241
2        -.21 to .59      .50 to .62      .39   .51     0.00   .61      .33   .57    .195         .183
3        -.18 to 1.10     .48 to .73      .80   .48      .74   .73      .74   .55    .120         .123
4         .44 to 1.45     .49 to .65     1.00   .49      .44   .47     1.12   .54    .065         .069
5         .76 to 1.87     .48 to .67     1.56   .48      .76   .65     1.48   .54    .030         .035
6        1.42 to 2.21     .47 to .63     1.88   .47     2.21   .63     1.84   .54    .014         .016
7        1.62 to 2.41     .49 to .60     2.36   .51     1.62   .60     2.21   .54    .008         .006
8        2.29 to 2.72     .53 to .54     2.72   .54     2.29   .53     2.62   .56    .003         .002
9        2.99             .56                                          2.99   .56    .0007        .0003

Table 3 also shows the observed proportions obtaining each summed score on this test and the modeled proportion computed using Equation 6; Figure 2
shows those two distributions. The modeled distribution is very close to the observed distribution. The distribution is very skewed, because this test was extraordinarily difficult. Nevertheless, the modeled proportions obtaining each score were computed using a Gaussian distribution as $\phi(\theta)$, illustrating Lord's (1953) argument that the summed-score distribution does not directly reflect the shape of the population distribution for the trait.

Figure 2. The Observed Proportions Obtaining Each Summed Score on the Grade 3 Social Studies Test and the Modeled Proportion Computed Using Equation 6 (Error Bars Show Pointwise Twice the Binomial Standard Error for the Proportions). [Figure not reproduced; series are labeled Observed and Modeled, and the horizontal axis is Summed Score from 0 to 9.]

Dichotomous Data

For the second example, data from the spring 1992 administration of two preliminary forms of the North Carolina End-of-Grade 3 Mathematics exam were used. The two forms each contained 80 four-alternative multiple-choice items. Form 301 was administered to 1,053 examinees, and Form 303 was administered to 1,071 examinees. Three-parameter logistic item parameter estimates were computed using MULTILOG (Thissen, 1991), with the population distribution specified as N(0,1).

Figure 3 shows plots of the empirical percentiles for each observed summed score plotted against the model-derived percentiles computed by accumulating $P_j$ for Forms 301 (Figure 3a) and 303 (Figure 3b).
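Accumulating $P_j$ into percentiles, and attaching the binomial error bars used in the figures, involves only a few lines. The sketch below is an illustration under stated assumptions: the percentile for a score is taken to be the percentage of examinees at or below that score (the article does not spell out its convention), the function names are hypothetical, and because the $P_j$ values for Forms 301 and 303 are not tabulated in the article, the modeled proportions for the Social Studies test in Table 3 serve as stand-in input.

```python
import math

# Modeled proportions for summed scores 0..9, copied from Table 3.
modeled = [0.325, 0.241, 0.183, 0.123, 0.069, 0.035, 0.016, 0.006, 0.002, 0.0003]

def cumulative_percentiles(proportions):
    """Percentage at or below each score (an assumed percentile convention).

    The proportions are renormalized so rounding in the input cannot push
    the last value past 100."""
    total = sum(proportions)
    running, out = 0.0, []
    for p in proportions:
        running += p
        out.append(100.0 * running / total)
    return out

def twice_binomial_se(p, n):
    """Half-width of an error bar: twice the binomial standard error of p."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)

percentiles = cumulative_percentiles(modeled)
print([round(x, 1) for x in percentiles])
# Error-bar half-width, in percentile units, at the median with N = 1,000.
print(round(100.0 * twice_binomial_se(0.5, 1000), 2))
```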
The maximum absolute difference between the observed and model-based percentiles was less than 3.0 for both plots; that is approximately twice the maximum pointwise standard error of the empirical percentiles, which for N = 1,000 was approximately 1.5 near the middle of the distribution. Because Figure 3 shows the data from which the item parameters were estimated, the plots show an aspect of the goodness of fit of the IRT model (including the underlying normal population distribution) to the data: The model reproduced the observed score distribution fairly accurately.

Figure 3. Empirical Percentiles for Each Observed Summed Score Plotted Against the Model-Derived Percentiles Computed by Accumulating $P_j$ (Error Bars Show Pointwise Twice the Binomial Standard Error for the Percentiles). [Figure not reproduced; panels are a. Form 301 and b. Form 303, and the horizontal axis of each is Percentiles from IRT Model, from 0 to 100.]

Discussion

As Yen (1984) noted, IRT scaled scores can be effectively computed for each observed summed score on a test, providing the usefulness of the IRT score scale without the problems associated with response-pattern scoring. Although some loss of information follows from the simplification of scoring from response patterns to summed scores, that loss of information is small: the corresponding change in the reported standard error would often not result in a visible change in the number of decimals usually reported.
The loss may be counterbalanced by more practical or socially acceptable score reporting: Scaled score reporting based on summed scores is obviously more practical than response-pattern scoring, because the scaled scores may be obtained from a compact score-translation table; that is not possible, in general, for pattern scoring. Score reporting based on summed scores is often more socially acceptable for many consumers of test scores, because questions of perceived unfairness arise when examinees with the same number-correct score are given different scaled scores.

For the most part, the results reported here correspond to Yen's findings with modal scores based on the likelihood of the item responses alone, without consideration of the population distribution. The differences between the results found here and those reported by Yen (1984) may be accounted for by the fact that the Gaussian population distribution and $\mathrm{EAP}(\theta \mid j = \sum k_i)$ were used here in place of the likelihood alone and its mode. Yen reported greatly increased variability with scores at the extremes of the distribution; those problems do not arise when the full model is used.

In addition, when the population distribution is included in the IRT model, computation of the observed score distribution itself is straightforward. The expected distribution can be used to provide smoothed percentile tables for the current form of the test, or preoperational percentile tables for tests assembled based on IRT, using the item parameters and the parameters of the population distribution.
If the population distribution assumed in the IRT model does not represent the distribution of $\theta$ well for the examinees, then the inferred score distribution will depart from the observed score distribution, which has both positive and negative implications. A positive implication is that such a departure should be useful as a diagnostic suggesting misspecification of the population distribution. On the negative side, the inferred score distribution will not be accurate as a source of preoperational percentile tables or for other similar uses. The extent to which the inferred score distribution might be sensitive to misspecification of the population distribution has not yet been examined. In all cases in which it has been used thus far with the North Carolina testing program, the assumption of a normal population distribution for $\theta$ has produced score distributions very much like the observed score distributions.

In principle, the recursive algorithm used here for the computation of $\mathrm{EAP}(\theta \mid j = \sum k_i)$, $\mathrm{SD}(\theta \mid j = \sum k_i)$, and $P_j$ may be used for tests that combine binary-scored multiple-choice sections with open-ended items scored in multiple categories, simply by accumulating the "points" for each item into the score $j$. However, in practice this solution may not produce scaled scores with adequate precision. The problem is that rated "points" on open-ended items may reflect very different changes in scaled scores than do the "points" associated with each correct response to the multiple-choice items. Indeed, the rated "points" for the open-ended items may be associated with very different increases in scaled scores at different levels on the $\theta$ continuum.
For example, for the third grade Social Studies test (see Table 3), an increase of one "point" from 0 to 1 was associated with an increase in the scaled score from -.88 to -.18, or .7 standard units; however, an increase of one "point" between scores of 8 and 9 was associated with an increase in the scaled score of only half as much (.37). For the multiple-choice tests (percentiles are depicted in Figure 3), an increase in the summed score of one "point" was associated with a difference in scaled scores of between .05 and .2 standard units, depending on the location on the score scale. To some extent, this difference in the relative value of "points" may be adjusted by scoring the open-ended items with more "points"; this analysis indicates that the difference between open-ended rating values was approximately 4-5 multiple-choice points.

A better solution to the problem of combining binary-scored multiple-choice sections with open-ended items scored in multiple categories may involve a hybridization of summed-score and response-pattern computation of scaled scores. To implement this approach, compute $L_j^{MC}(\theta)$ for the multiple-choice section and $L_{j'}^{OE}(\theta)$ for the open-ended section using the recursive algorithm. Next, for each combination of a given summed score on the multiple-choice section with any summed score on the open-ended section, compute the product $L_j^{MC}(\theta)\, L_{j'}^{OE}(\theta)\, \phi(\theta)$. Then, taking these products as the posterior density for that response pattern (score $j$ on the multiple-choice section and score $j'$ on the open-ended section), compute the expected values and SDs for each of those posterior distributions. The resulting two-way score translation table would provide scaled scores and their standard errors for each "response pattern," where the "pattern" refers to the ordered pair (score $j$ on the multiple-choice section, score $j'$ on the open-ended section).
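The two-way table described above can be sketched by running the recursion once per section at each quadrature point and multiplying the results. Everything in the example is an assumption made for illustration: the "multiple-choice section" reuses the three binary items of the numerical example with an assumed two-parameter logistic IRF, the "open-ended section" reuses the Table 2 graded-model estimates in logistic form, $\phi(\theta)$ is standard normal, and the function names are hypothetical. The article does not report results for such a combined test.

```python
import math

def score_likelihoods(crf_values):
    """Summed-score likelihoods for one section at one theta (Equation 16)."""
    like = list(crf_values[0])
    for item in crf_values[1:]:
        new = [0.0] * (len(like) + len(item) - 1)
        for j_prev, l_prev in enumerate(like):
            for k, t in enumerate(item):
                new[j_prev + k] += l_prev * t
        like = new
    return like

def mc_section(theta):
    """Illustrative multiple-choice section: three assumed 2PL binary items."""
    out = []
    for a, b in [(0.5, -1.0), (1.0, 0.0), (1.5, 1.0)]:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        out.append([1.0 - p, p])
    return out

def oe_section(theta):
    """Illustrative open-ended section: Table 2 items, assumed graded model."""
    out = []
    for a, bs in [(1.87, [0.65, 1.97, 3.14]), (2.66, [0.12, 1.57, 2.69]),
                  (1.24, [0.08, 2.03, 4.30])]:
        star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
        out.append([star[k] - star[k + 1] for k in range(len(bs) + 1)])
    return out

def two_way_table(mc, oe, grid):
    """Return {(j, j2): (proportion, EAP, SD)} for every pair of section scores."""
    step = grid[1] - grid[0]
    sums = {}
    for theta in grid:
        phi = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        l_mc = score_likelihoods(mc(theta))
        l_oe = score_likelihoods(oe(theta))
        for j, lm in enumerate(l_mc):
            for j2, lo in enumerate(l_oe):
                w = lm * lo * phi * step          # posterior ordinate times step
                cell = sums.setdefault((j, j2), [0.0, 0.0, 0.0])
                cell[0] += w
                cell[1] += w * theta
                cell[2] += w * theta * theta
    table = {}
    for key, (p, m1, m2) in sums.items():
        eap = m1 / p
        table[key] = (p, eap, math.sqrt(max(m2 / p - eap * eap, 0.0)))
    return table

GRID = [-4.5 + 0.2 * q for q in range(46)]
table = two_way_table(mc_section, oe_section, GRID)
print(table[(0, 0)], table[(3, 9)])
```

Because the multiple-choice and open-ended scores are kept as separate coordinates, no decision about the relative weight of the two kinds of "points" is needed; the cost is a table with one cell per pair of section scores instead of one row per total score.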
This procedure would offer many of the practical advantages of summed scores, and it would preserve the differences in scaled scores that may be associated with very different values of "points" on the multiple-choice and open-ended sections.

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 395-479). Reading MA: Addison-Wesley.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Lord, F. M. (1953). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517-548.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Wesley.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings." Applied Psychological Measurement, 8, 453-461.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611-630.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. Expanded edition, University of Chicago Press, 1980.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17.
Stroud, A. H. (1974). Numerical quadrature and solution of ordinary differential equations. New York: Springer-Verlag.
Thissen, D. (1991). MULTILOG user’s guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software.
Thissen, D., Pommerich, M., & Williams, V. S. L. (1993, June). Some algorithms for computing E(θ|summed score), and the implied score distribution, using item response theory. Paper presented at the meeting of the Psychometric Society, Berkeley CA.
Tierney, L. (1990). Lisp-Stat: An object-oriented environment for statistical computing and dynamic graphics. New York: Wiley.
Walsh, J. E. (1963). Corrections to two papers concerned with binomial events. Sankhya, 25, Series A, 427.
Wilson, D. T., Wood, R., & Gibbons, R. (1991). Testfact: Test scoring, item statistics, and item factor analysis. Chicago: Scientific Software.
Yen, W. M. (1984). Obtaining maximum likelihood trait estimates from number-correct scores for the three-parameter logistic model. Journal of Educational Measurement, 21, 93-111.

Acknowledgments

The research reported here was supported by the North Carolina Department of Public Instruction, in conjunction with the development of the North Carolina End-of-Grade Testing Program. The authors thank Richard Luecht, Robert McKinley, Robert Mislevy, James Ramsay, and Linda Wightman for their help in the course of this work.

Author’s Address

Send requests for reprints or further information to David Thissen, L. L. Thurstone Psychometric Laboratory, University of North Carolina, CB #3270, Davie Hall, Chapel Hill NC 27599-3270, U.S.A.