The six papers in this issue summarize 10 years of theory development, empirical research, and
practical experience in criterion-referenced testing. Much of the theory development has focused on
questions and issues raised by Popham and Husek (1969), who pointed out that much of traditional
psychometric theory did not work well when applied to criterion-referenced tests. The six papers,
taken together, represent an attempt to answer four basic questions:
1. How should the reliability of a criterion-referenced test be measured?
2. How should the number of items needed in a criterion-referenced test be determined?
3. How should criterion-referenced tests be used to make decisions about the people taking them?
4. What kind of evidence should be provided for the validity of a criterion-referenced test?
Attempts to answer these questions have been complicated by the lack of a universally accepted,
unambiguous definition of the term "criterion-referenced test." Glaser’s (1963) article, in which the
term first appeared, defined criterion-referenced measures as those that "depend on an absolute
standard of quality" (p. 519). However, Glaser went on to say that "the standard against which a student’s
performance is compared when measured in this manner is the behavior which defines each
point along the achievement continuum" (p. 519) and that "we need to behaviorally specify minimum
levels of performance..." (p. 520). These two ideas, absolute standards and behavioral test content
specifications, received varying degrees of emphasis from the different individuals who attempted to
develop criterion-referenced tests and to theorize about criterion-referenced testing. As a result, there
are now several different answers to some of the questions that Popham and Husek (1969) raised.