Applied Psychological Measurement, Volume 09, 1985

  • Item
    An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data
    (1985) Ansley, Timothy N.; Forsyth, Robert A.
The purpose of this investigation was to study the nature of the item and ability estimates obtained when the modified three-parameter logistic model is used with two-dimensional data. To examine the effects of two-dimensional data on unidimensional parameter estimates, the relative potency of the two dimensions was systematically varied by changing the correlations between the two ability dimensions. Data sets based on correlations of .0, .3, .6, .9, and .95 were generated for each of four combinations of sample size and test length. Also, for each of these four combinations, five unidimensional data sets were simulated for comparison purposes. Relative to the nature of the unidimensional estimates, it was found that the â value seemed best considered as the average of the true a values. The b̂ value seemed best thought of as an overestimate of the true b₁ values. The θ̂ value seemed best considered as the average of the true ability parameters. Although there was a consistent trend for these relationships to strengthen as the ability dimensions became more highly correlated, there was always a substantial disparity between the magnitudes of these values and of those derived from the unidimensional data. Sample size and test length had very little effect on these relationships.
  • Item
    A minimum chi-square method for developing a common metric in item response theory
    (1985) Divgi, D. R.
The θ scale in item response theory has an arbitrary unit and origin. When a group of items is calibrated twice, estimates from one calibration must be transformed to the metric of the other. A new method is presented for doing so. It is simpler than an earlier method based on test characteristic curves, and it makes more complete use of the available information.
  • Item
    Comparing fit of nonsubsuming probability models
    (1985) Alvord, Gregory; Macready, George B.
    A "mixture" probability model that incorporates two component models defined by nonsubsuming sets of parameters is introduced, and a strategy for using this model in the selection of a preferred component model is developed. Example applications of the suggested strategy are considered for the special case in which the Rasch item response model and a Latent State Mastery model are the component models compared. Simulated data sets generated under each of these models were used to provide example applications of the proposed model selection strategy.
  • Item
    Latent trait item analysis and facet theory--a useful combination.
    (1985) Balla, John R.; McDonald, Roderick P.
    Computer programs for fitting latent trait models to data provide indices of item misfit. An analysis of the consistency of item misfit determination is presented. Two content-equivalent forms of 71 items representing the behavioral domain of arithmetic skills were generated. Each item was defined in terms of its combination of facet elements, and the ith item on each form represented the same selection of facets. The dichotomously scored responses to the two forms were analyzed using the computer programs NOHARM and BICAL. Misfitting items were identified by use of the residual covariances in the case of NOHARM and Total-t and Between-t in the case of BICAL. The consistency of misfit was measured by the extent of agreement in selection of misfitting items across the parallel forms. It was found that the analysis of residual covariances provided a more consistent means of determining item misfit. It was concluded that the use of the Between-t and Total-t indices as a basis for editing items should be viewed cautiously. In addition, misfitting items were grouped according to common facet elements and reasons for misfit were postulated. Thus, the analysis of residual covariances of items defined in terms of their combination of facet elements seems to provide a very satisfactory method of item analysis.
  • Item
    Chance baselines for INDSCAL's goodness-of-fit index
    (1985) Dong, Hei-Ki
For purposes of evaluating INDSCAL’s goodness-of-fit index (R₁), random data were generated for various combinations of N (the number of individuals), ranging from 3 to 30, and n (the number of stimuli), ranging from 10 to 40. Results from INDSCAL analyses of random data were then used to provide an easy-to-use equation for estimating the expected value of R₁ as a function of N, n, and r (the number of dimensions). The equation was highly successful in predicting expected values of R₁ for random data, as indicated by a multiple correlation of .9965.
  • Item
    Full-information item factor analysis: Applications of EAP scores
    (1985) Muraki, Eiji; Engelhard, George, Jr.
    The full-information item factor analysis model proposed by Bock and Aitkin (1981) is described, and some of the characteristics of expected a posteriori (EAP) scores are illustrated. Three simulation studies were conducted to illustrate the model, and an application of full-information item factor analysis to a set of real data is described.
  • Item
    The difficulty of test items that measure more than one ability
    (1985) Reckase, Mark D.
    Many test items require more than one ability to obtain a correct response. This article proposes a multidimensional index of item difficulty that can be used with items of this type. The proposed index describes multidimensional item difficulty as the direction in the multidimensional space in which the item provides the most information and the distance in that direction to the most informative point. The multidimensional difficulty is derived for a particular item response theory model and an example of its application is given using the ACT Mathematics Usage Test.
  • Item
    A simulation study of item bias using a two-parameter item response model
    (1985) McCauley, Cynthia D.; Mendoza, Jorge
Possible underlying causes of item bias were examined using a simulation procedure. Data sets were generated to conform to specified factor structures and mean factor scores. Comparisons between the item parameters of various data sets were made with one data set representing the "majority" group and another data set representing the "minority" group. Results indicated that items that required a secondary ability, on which the two groups differed in mean level, were generally more biased than those items that did not require a secondary ability. Items with different factor structures in the two groups were not consistently identified as more biased than those having similar factor structures. A substantial amount of agreement was found among the bias indices used in the study.
  • Item
    A nonparametric scale analysis of the development of conservation
    (1985) Kingma, Johannes; TenVergert, Elisabeth M.
The purpose of this study was to investigate the development of conservation by using the nonparametric Mokken scale analysis. Subjects were 801 children from kindergarten and primary school Grades 1 and 2 who completed 13 conservation tasks derived from Piaget’s publications. It was shown that some selections (i.e., seven, eight, and nine tasks, respectively, at three successive administrations at three-month intervals) formed strong Mokken scales, which were invariant for different samples at the same point in time of test administration. Furthermore, it was found that during the course of development the number of tasks which fitted on the scale increased. However, some reversals of the relative positions of a small number of tasks were found for the scales at different points in time of test administration. It was concluded that application of nonparametric Mokken scale analysis resulted in a new, but very useful instrument for analyzing the order of acquisition of conservation.
  • Item
    Factors defined by negatively keyed items: The result of careless respondents?
    (1985) Schmitt, Neal; Stults, Daniel M.
    A frequently occurring phenomenon in factor and cluster analysis of personality or attitude scale items is that all or nearly all questionnaire items that are negatively keyed will define a single factor. Although substantive interpretations of these negative factors are usually attempted, this study demonstrates that the negative factor could be produced by a relatively small portion of the respondents who fail to attend to the negative-positive wording of the items. Data were generated using three different correlation matrices, which demonstrated that regardless of data source, when only 10% of the respondents are careless in this fashion, a clearly definable negative factor is generated. Recommendations for instrument development and data editing are presented.
  • Item
    Non-Gramian and singular matrices in maximum likelihood factor analysis
    (1985) Dong, Hei-Ki
    In some cases, a correlation matrix may be singular because of the multicollinearity in data, and it may become non-Gramian because of computational inaccuracies. In such cases, popular methods of factor extraction, such as maximum likelihood factor analysis, image factor analysis, and canonical factor analysis, cannot be used because of computational difficulties. This article provides a simple heuristic procedure for converting such a matrix into a proper matrix, so that maximum likelihood factor analysis may be performed.
  • Item
    A comparison of five methods for estimating the standard error of measurement at specific score levels
    (1985) Feldt, Leonard S.; Steffen, Manfred; Gupta, Naim C.
The Standards for Educational and Psychological Testing (1985) recommended that test publishers provide multiple estimates of the standard error of measurement: one estimate for each of a number of widely spaced score levels. The presumption is that the standard error varies across score levels, and that the interpretation of test scores should take into account the estimate applicable to the specific level of the examinee. This study compared five methods of estimating conditional standard errors. All five of the methods yielded a maximum value close to the middle of the score scale, with a sharp decline occurring near the extremes of the scale. These trends probably characterize the raw score standard error of most standardized achievement and ability tests. Other types of tests, constructed using alternative principles, might well exhibit different trends, however. Two methods of estimation were recommended: an approach based on polynomial smoothing of point estimates suggested by Thorndike (1951) for specific score levels and a modification proposed by Keats (1957) for the error variance derived under the binomial error model of Lord (1955).
  • Item
    A sixty-year perspective on psychological measurement
    (1985) Guilford, J. P.
    Based upon experiences with most kinds of methods of psychological measurement, this article presents comments on a variety of uses, including psychophysics, scaling, testing, and factor analysis. Some difficulties are pointed out, some faults are mentioned, and a variety of applications are discussed, some of them unusual.
  • Item
    Correcting for restriction of range in both X and Y when the unrestricted variances are unknown
    (1985) Alexander, Ralph A.; Hanges, Paul J.; Alliger, George M.
Correction of correlation coefficients that have arisen from range restricted populations is commonly suggested and practiced in research on testing and measurement. Until recently, that research has operated under two important limitations. First, the majority of the research has dealt with range restriction on one variable only, and second, the correction formulas have assumed that the variance of the variable(s) in the unrestricted population was known. This article presents a method for estimating such corrections from the data in the restricted sample and applies the method to a recently developed approximation for restriction on both X and Y. The procedure is evaluated and found to produce sufficiently accurate results to be useful in many practical range restriction settings.
  • Item
    Estimating the validity of a multiple-choice test item having k correct alternatives
    (1985) Wilcox, Rand R.
    In various situations, a multiple-choice test item may have more than one correct alternative, and the goal is to determine how many correct alternatives an examinee actually knows. For a randomly sampled examinee, the validity of an item is defined as the probability of deciding that the examinee knows i correct alternatives, when in fact exactly i correct alternatives are known. This article describes how latent class models can be used to estimate this probability.
  • Item
    Agreement between retrospective accounts of substance use and earlier reported substance use
    (1985) Collins, Linda M.; Graham, John W.; Hansen, William B.; Johnson, C. Anderson
The present study examined agreement between retrospective accounts of substance use and earlier reported substance use in a high school age sample. Three issues were addressed: (1) extent of overall agreement; (2) evidence for the presence of a response-shift bias; and (3) extent to which current use biases recall of substance use. Subjects were 415 high school students who took part in a smoking prevention program. At the last measurement, which took place 2½ years after the pretest, the students were asked to recall pretest use of tobacco, alcohol, and marijuana, and use one year earlier. Results showed an overall tendency for students to recall less use of uncontrolled substances than had been previously reported. For the one controlled substance included in the questionnaire, marijuana, current nonusers tended to recall less use than they had reported at the time, whereas current users tended to recall more use than had been reported. The present study found no evidence for a response-shift bias. It is suggested that the explicitly worded anchors on the response scales helped prevent such a bias. Finally, the results suggest that current use biases recall of past use to a substantial extent, and that this bias affects recall of alcohol use most severely.
  • Item
    The development of a Rasch-type loneliness scale
    (1985) De Jong-Gierveld, Jenny; Kamphuis, Frans
    This paper describes an attempt to construct a measuring instrument for loneliness that meets the criteria of a Rasch scale. Rasch (1960, 1966) proposed a latent trait model for the unidimensional scaling of dichotomous items that does not suffer from the inadequacies of classical approaches. The resulting Rasch scale of this study, which is based on data from 1,201 employed, disabled, and jobless adults, consists of five positive and six negative items. The positive items assess feelings of belongingness, whereas the negative items apply to three separate aspects of missing relationships. The techniques for testing the assumptions underlying the Rasch model are compared with their counterparts from classical test theory, and the implications for the methodology of scale construction are discussed.
  • Item
The analysis of item-ability regressions: An exploratory IRT model fit tool
    (1985) Kingston, Neal M.; Dorans, Neil J.
The use of item-ability regressions (the comparison of the regression of the observed proportion of people answering an item correctly on estimated θ with the estimated item response function) to investigate the psychometric properties of particular item types in a given population was explored using data from four administrations of 10 item types (a total of 806 items) from the Graduate Record Examinations General Test. Although the method does not allow an absolute determination of fit for a latent trait model (in this case, for the three-parameter logistic model), it does show that certain item types consistently fit the model worse than other item types, and it led to and supported a specific hypothesis as to why the model probably did not fit these item types.
  • Item
    On the perceptual salience of features of Chernoff faces for representing multivariate data
    (1985) De Soete, Geert; De Corte, Wilfried
In order to assess the perceptual salience of features of Chernoff faces, a study was conducted in which subjects had to rate the similarity of pairs of Chernoff faces. It was found that the facial features differ considerably in perceptual salience. Highly salient features are the curvature of the mouth, the half-face height, the half-length of the eyes, and the length of the eyebrows, whereas the eccentricity of the lower ellipse of the face, the position of the center of the mouth, the separation and the slant of the eyes, and the height of the brows have little influence on the perception of Chernoff faces.
  • Item
    Empirical tests of scale type for individual ratings
    (1985) Westermann, Rainer
This article describes eight studies that tested empirically the hypothesis that rating procedures lead to interval-scale measurements for each single subject. In order to enhance the probability of obtaining interval scales, subjects made numerical ratings and were deliberately instructed to choose their responses so that the algebraic differences between numbers represented the subjective differences between the corresponding objects with respect to the attribute under study. This approach is based on axiomatic measurement theory. It is exemplified by a study from clinical psychological research pertaining to the subjective fear aroused by each of 160 objects or situations. Any subject’s ratings are regarded as interval-scale measurements of his or her individual degree of fear if the testable axioms of a finite, equally-spaced difference structure are satisfied empirically. These axioms pertain to ordinal judgments on differences, and they are tested empirically by deriving statistical hypotheses and using a refined significance-test method as an error theory. For the eight studies, criteria were chosen primarily to avoid accepting false interval-scale hypotheses, at the expense of relatively high risks of false rejections. Nevertheless, the empirical data allow acceptance of the hypothesis for 54 of the 114 subjects. As a consequence, for nearly half of the subjects, this rating procedure seems to result in interval scales.