Factors Defined by Negatively Keyed Items: The Result of Careless Respondents?

Neal Schmitt, Michigan State University
Daniel M. Stuits, Quaker Oats Company

A frequently occurring phenomenon in factor and cluster analysis of personality or attitude scale items is that all or nearly all questionnaire items that are negatively keyed will define a single factor. Although substantive interpretations of these negative factors are usually attempted, this study demonstrates that the negative factor could be produced by a relatively small portion of the respondents who fail to attend to the negative-positive wording of the items. Data were generated using three different correlation matrices, which demonstrated that regardless of data source, when only 10% of the respondents are careless in this fashion, a clearly definable negative factor is generated. Recommendations for instrument development and data editing are presented.

Most textbooks or publications listing recommendations concerning attitude scale construction include the caveat that questionnaire items include both negatively and positively worded item stems (e.g., Anastasi, 1980; Adkins-Wood, 1961; Thorndike, 1971; Wiggins, 1973). However, there is a relatively large body of literature on response styles, which indicates that these wording changes may make significant differences in the factor structure of scales and the item validities (Bentler, Jackson, & Messick, 1971). Bentler et al. argued convincingly for two different types of acquiescence response styles. Agreement acquiescence results when a person responds positively to all statements in a personality instrument or attitude scale. Acceptance acquiescence occurs when a person considers all personality characteristics or attitude statements as descriptive of him/herself or some object but disagrees with all statements that deny such characteristics.
APPLIED PSYCHOLOGICAL MEASUREMENT Vol. 9, No. 4, December 1985, pp. 367-373 © Copyright 1985 Applied Psychological Measurement Inc. 0146-6216/85/040367-07$1.60
Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

The objective of this article is not to resurrect the debate over types of response styles or even their existence (Block, 1971; Rorer, 1965), but rather to demonstrate that a small portion of respondents who are careless in reading the items may be responsible for the appearance of a factor consisting solely of negatively keyed items. These negatively keyed items may be either polar opposites (happy-sad) or a negation of some trait or descriptor (happy-not happy).

At the outset, it is very important to define what is meant by "careless" in this article. The careless respondent, who is the subject of this article, is not responding randomly. He/she is simply reading a few of the items in a measuring instrument, inferring what it is the items are asking of the respondent, and then responding in like manner to the remainder of the items in the instrument. This means that any item that is phrased inconsistently with the rest of the items in the instrument will elicit a response that is inconsistent with responses to the rest of the item pool and inconsistent with the responder’s real position on the construct being measured.

For example, a student responding to a teacher evaluation instrument with a 5-point Likert-type scale may decide the faculty person is above average as a teacher and may intend to mark 4s on the scale.
Instead of reading the items in the evaluation instrument, the respondent simply marks 4 for all items, including those that express a negative opinion about the faculty member. In analyzing these responses, all negatively keyed items are recoded. The result of this carelessness on the part of the respondent is not random; it is systematic. All negatively keyed items will be positively correlated with each other and negatively correlated with the remaining positively worded items. This type of responding is consistent with Bentler et al.’s (1971) notion of agreement acquiescence.

All paper-and-pencil instruments are subject to this problem. In addition to student evaluations of faculty, the same type of error is possible in performance evaluations of faculty, in performance evaluations used in other contexts, manipulation checks in social-experimental research, personality measures, attitude scales, interest inventories, and survey research. This article illustrates what can happen in factor analytic research on data that include a relatively small portion of respondents who are careless as defined above. The focus on factor analysis is a matter of convenience only; a similar possible problem exists with any questionnaire or self-report measure, whether or not the measure is factor analyzed. Further, it is important to note that this study does not demonstrate that people have responded this way in any previous research; it demonstrates only that this type of careless responding is one feasible explanation of the appearance of this factor. Identification of people who do respond in this fashion is more problematic, though some possibilities are suggested in the discussion section below.

Very frequently, authors reporting factor or cluster analyses of responses to an attitude scale or personality inventory find that a majority of the negatively keyed items (usually a minority of the items in most measures) load on, or define, a single factor.
For example, Schmitt and Coyle (1976) factor analyzed a 74-item questionnaire concerning the reactions of college student applicants to placement interviewers. The second of six applicant reaction factors identified in their study was defined by negative descriptors such as the following: irritable, defensive, used inappropriate words, lost train of thought, explained in unnecessary detail, self-conscious, and so forth. Of 13 items defining this factor, only 1 was positive. Further, only 6 negative items loaded most highly on other factors.

A similar pattern of factor loadings is seen in a study of perceived support for innovation in secondary schools. Siegel and Kaemmerer (1978) evaluated a pool of 525 statements thought to be descriptive of innovative and traditional organizations. Their final three-factor solution included a factor they titled Tolerance of Differences, which included a predominance of negatively keyed items such as "This place seems to be more concerned with the status quo than with change" and "The best way to get along in this organization is to think the way the rest of the group does."

Another example of this phenomenon appears in industrial/organizational and work stress research and involves a measure of role conflict and ambiguity (Rizzo, House, & Lirtzman, 1970). Recently, Tracy and Johnson (1981) have pointed out that all eight items of the role conflict scale are worded to represent stressful or conflict-laden characteristics of a work role. The six role ambiguity items are all worded to represent nonstressful or unambiguous characteristics of the role. The intended meaning of the scales (conflict vs. ambiguity) is totally confounded with the difference in wording indicating either stress (which was labeled role conflict) or comfort (labeled role ambiguity).

A similar effect has been noted in early research on the F-scale (Adorno, Frenkel-Brunswik, Levinson, & Sanford, 1950).
Items consist of relatively strongly worded opinions, most of which express a critical attitude about human nature. When investigators began questioning whether F-scale scores reflected an authoritarian personality or a response style, a reflected F-scale was constructed. Correlations between the original F-scale and this reflected scale were only .20 (Chapman & Campbell, 1957; Messick & Jackson, 1958). Jackson and Messick (1961, 1962) found similar factor analytic results for the MMPI, though subsequent item-reversal studies of the MMPI (in which the original items are reversed) indicated high correlations between the original measures and the reversal measures (Lichtenstein & Bryan, 1965; Rorer & Goldberg, 1965a, 1965b).

In summary, factor analyses of scales with negatively keyed items frequently lead to the identification of a factor defined wholly or mostly by those negatively keyed items. The literature cited alone indicates that this finding is relatively widespread in the sense that it occurs in a variety of research areas. Examples included studies of interview impressions, personality scales, and role ambiguity.

The objective of the present study was to show how a "negative factor" can be produced by a relatively small number of careless respondents who do not notice that some items are the opposite in meaning to the majority of the items. In a series of simulations, the proportion of "careless" respondents and the proportion of negatively keyed items were varied for data generated from three different correlation matrices reflecting different levels of item intercorrelation.
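The mechanism just described can be illustrated with a small simulation. What follows is a minimal sketch and not the data-generation procedure used in this study: it assumes a simple one-factor model in which every item loads .60, and the function name, the seed, and the choice of eight negatively keyed items are hypothetical. It compares the mean correlation among negatively keyed items with their mean correlation with positively keyed items when 0% and 10% of the cases are left unrecoded.

```python
import numpy as np

def simulate(prop_careless, n_cases=400, n_items=30, n_negative=8, seed=0):
    """Return (mean r among negative items, mean r between negative and positive items)."""
    rng = np.random.default_rng(seed)
    negative = np.arange(n_negative)            # hypothetical: first 8 items are negatively keyed
    positive = np.arange(n_negative, n_items)

    # One-factor model (assumption): every item loads .60 on a single trait.
    loading = np.full(n_items, 0.6)
    loading[negative] *= -1                     # negatively keyed items run the other way
    trait = rng.standard_normal((n_cases, 1))
    unique = rng.standard_normal((n_cases, n_items))
    z = trait * loading + unique * np.sqrt(1 - loading**2)

    # Raw 1-7 responses: positive items centered on 5, negative items on 3, SD 1.2.
    means = np.where(np.isin(np.arange(n_items), negative), 3.0, 5.0)
    raw = np.clip(np.trunc(means + 1.2 * z), 1, 7)

    # Recode negative items (x -> 8 - x) for everyone except the "careless" cases.
    n_careless = int(round(prop_careless * n_cases))
    careless = rng.choice(n_cases, size=n_careless, replace=False)
    attentive = np.setdiff1d(np.arange(n_cases), careless)
    scored = raw.copy()
    scored[np.ix_(attentive, negative)] = 8 - raw[np.ix_(attentive, negative)]

    # Average the item intercorrelations within and across keying direction.
    r = np.corrcoef(scored, rowvar=False)
    neg_neg = r[np.ix_(negative, negative)][np.triu_indices(n_negative, k=1)].mean()
    neg_pos = r[np.ix_(negative, positive)].mean()
    return neg_neg, neg_pos

for p in (0.0, 0.10):
    neg_neg, neg_pos = simulate(p)
    print(f"{p:.0%} careless: r(neg, neg) = {neg_neg:.2f}, r(neg, pos) = {neg_pos:.2f}")
```

Under these assumptions the two averages should be about equal when every case is recoded, and the negatively keyed items should correlate noticeably more with each other than with the positively keyed items once a tenth of the cases are left unrecoded. That gap is the raw material from which a separate "negative" factor can emerge.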
Method

Data Generation

To simplify comparisons, three 30-item correlation matrices were selected to serve as the sources of the data which were generated and analyzed. These matrices were chosen so as to represent different levels of item intercorrelation and different substantive content.¹

¹ The three correlation matrices are available from the first author.

The first matrix (ASSMT) represented the intercorrelations of ratings on 15 skill dimensions by two raters in an assessment center (see Schmitt, 1977, for a description of the rating dimensions). The average item intercorrelation across the 30 items was .36; the range of item intercorrelations was from .00 to .82. Principal components analysis yielded seven factors with an eigenvalue greater than 1.0. Eigenvalues for these seven factors were 10.78, 3.42, 2.14, 1.85, 1.51, 1.35, and 1.0. Although use of the eigenvalue criterion would have suggested seven factors, the scree criterion (Cattell, 1966) suggested three factors, as did content considerations in earlier component analyses (Schmitt, 1977).

The second matrix consisted of intercorrelations of responses to 30 items in the Central Life Interest (CLI) measure developed and researched by Dubin and his colleagues (Dubin, 1956; Dubin & Champoux, 1974; Dubin & Goldman, 1972). These 30 items are meant to measure a single factor, but they are dichotomously scored, hence item intercorrelations are relatively low. In this sample, average item intercorrelations were .13; the range of intercorrelations was from -.13 to .39. Eigenvalues for the 11 factors whose eigenvalues were greater than 1.0 were 3.91, 1.86, 1.63, 1.47, 1.32, 1.28, 1.23, 1.11, 1.05, 1.02, and 1.01. Use of the scree criterion, plus the fact that these are items designed to measure a single concept, would have suggested a single factor. Because of the relatively low level of item intercorrelation, many "small" factors were obtained.
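As a worked illustration of the eigenvalue criterion, the following sketch applies the rule to the CLI eigenvalues reported above. Only the eigenvalues and the item count come from the text; the variable names are hypothetical, and because the scree judgment is a visual one, the ratio computed at the end is only a rough stand-in for it.

```python
# Eigenvalues reported above for the CLI matrix (components with eigenvalue > 1.0).
cli_eigenvalues = [3.91, 1.86, 1.63, 1.47, 1.32, 1.28, 1.23, 1.11, 1.05, 1.02, 1.01]
n_items = 30  # a correlation matrix of 30 items has total variance 30

# Eigenvalue-greater-than-1 rule: keep every component that explains more
# variance than a single standardized item.
n_retained = sum(ev > 1.0 for ev in cli_eigenvalues)

# Share of total variance carried by the first component and by all retained ones.
first_share = cli_eigenvalues[0] / n_items
retained_share = sum(cli_eigenvalues) / n_items

# Drop from the first to the second eigenvalue, the kind of "elbow" a scree plot shows.
first_to_second = cli_eigenvalues[0] / cli_eigenvalues[1]

print(n_retained, round(first_share, 3), round(retained_share, 3), round(first_to_second, 2))
```

The eigenvalue rule retains all 11 components even though the first is roughly twice the size of the second and the rest decline slowly, which is consistent with the observation that low item intercorrelations yield many "small" factors.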
The third matrix (SEMSQ) of intercorrelations was generated by responses to the 10 items of the Rosenberg self-esteem measure (Rosenberg, 1965) and the 20 items of the Minnesota Satisfaction Questionnaire (Weiss, Dawis, England, & Lofquist, 1967), which is usually divided into intrinsic and extrinsic satisfaction subscales. Average item intercorrelations were .28; the range of intercorrelations was from -.01 to .71. The eigenvalues of the seven factors with eigenvalues greater than 1.0 were 8.53, 3.25, 1.65, 1.25, 1.18, 1.09, and 1.04. Again, the scree criterion as well as content considerations might have suggested a three-factor solution.

The content of the items that served as the basis of this study was not particularly important, but the matrices were chosen because they represented actual, but relatively diverse, matrices in terms of item intercorrelation.

Data for the study were generated in the following manner for each of the three initial correlation matrices:

1. The complete factor loading matrix (number of factors equals 30) was computed for each matrix.
2. The signs of the factor loadings for 4, 8, or 12 randomly selected items were changed to represent unreflected negatively keyed items.
3. Each of these factor loading matrices was then used as input to the Ohio State Correlated Score Generation Method (Wherry, Naylor, Wherry, & Fallis, 1965), and the "responses" of 400 people were generated. Each "positively" keyed item was given a mean of 5; negatively worded items were given means of 3. All items had standard deviations of 1.2.
Decimals were truncated and response values greater than 7 were recoded to 7; those less than 1 were recoded to 1.

The result was a set of 400 responses to 30 items, each with a 1 to 7 response scale and intercorrelations that were representative of the original correlation matrices.

Data Analysis

All negatively keyed items were recoded for all 400 cases for each of the three basic sets of data, as they normally would be, and principal components analyses were conducted. The factor loading matrices for these analyses should be reflective of what would be obtained if substantively meaningful interpretations were made by all respondents to all items (0% careless results). The eigenvalue criterion was used to determine how many factors to rotate, as would be fairly typical in exploratory factor analyses. Varimax rotation of these factors was used in all analyses. Next, factor analyses were conducted for the same set of data when a randomly selected subset of the cases was left unrecoded. Data matrices based on four different proportions of unrecoded cases (5%, 10%, 15%, and 20%) were analyzed to determine how many careless respondents can create a factor loading matrix in which there is a factor identified primarily or wholly by negatively keyed items.

Dependent Variable

As evidence that these manipulations were creating a factor identified solely by unrecoded items, the number of negatively keyed items that appeared on each factor and the number that appeared on the same factor for each condition were counted. In all cases, the factor loading that was highest determined the placement of a variable on a factor.

Results

The results of the counts of the factor loadings of negatively keyed items are presented in Table 1.
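The counting rule described under Dependent Variable can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the small loading matrix are hypothetical, and "highest" loading is assumed here to mean largest in absolute value. The two counts correspond to the quantities reported in Table 1 as NF and NNS.

```python
import numpy as np

def count_negative_placement(loadings, negative_items):
    """Return (NF, NNS) for a rotated loading matrix with shape (items, factors)."""
    loadings = np.asarray(loadings, dtype=float)
    # Assumption: "highest" loading is taken as the largest absolute loading.
    home_factor = np.abs(loadings).argmax(axis=1)
    negative_homes = home_factor[list(negative_items)]
    factors, counts = np.unique(negative_homes, return_counts=True)
    nf = len(factors)            # number of different factors holding negative items
    nns = int(counts.max())      # most negative items found on any single factor
    return nf, nns

# Hypothetical 6-item, 2-factor loading matrix; items 3, 4, and 5 are negatively keyed.
example = [
    [0.70, 0.10],
    [0.65, 0.20],
    [0.60, 0.05],
    [0.15, 0.72],
    [0.20, 0.68],
    [0.55, 0.30],
]
nf, nns = count_negative_placement(example, negative_items=[3, 4, 5])
print(nf, nns)
```

A solution in which the negative items scatter across factors gives a large NF and a small NNS; a "negative" factor shows up as NF near 1 and NNS near the total number of negatively keyed items.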
As can readily be seen in Table 1 for each data set (ASSMT, CLI, SEMSQ), a clearly identifiable "negative" factor appears when only 10% of the respondents answer as if they failed to notice that some portion of the items were worded inconsistently with the majority of the items. Some clustering of negative items is already present when only 5% of the respondents are "careless," but probably not enough that investigators would recognize the problem. The number of "negatively" keyed items in the item pool did not seem to have much effect on the identification of a negative factor, though with an increase in the number of such items, the negative factor became more prominent in the solution. Originally it was expected that there would be differences across matrices in how easily a negative factor is created by careless responding, but this did not seem to take place. Results were consistent across the three matrices studied.

Space considerations preclude reproduction of all factor matrices for the conditions which were generated, but the results for all three correlation matrices were highly similar.²

² These factor analytic results are available from the first author.

With no "careless" respondents, the negatively keyed items were scattered across all factors (using highest factor loading as a means of defining factors), as would be expected if the respondents were sensitive to the content of the items. This pattern continued to hold when 5% of the cases were not coded appropriately. With 10% "careless" respondents, however, the first factor was typically defined by the "negative" items. With 15% and 20% "careless" respondents, all negatively worded items were found to load on the first factor. As the percentage of "careless" respondents increased from
10% to 15% to 20%, the size of the factor loadings for the negative items increased. Although there were slight variations in this pattern, the results across matrices and number of negatively keyed items were remarkably similar.

Table 1
Number of Factors on Which Negatively Keyed Items Appear and the Largest Number of Negatively Keyed Items on a Single Factor
a NF is the number of different factors on which the negatively keyed items were most highly loaded.
b NNS is the number of negatively keyed items which loaded highest on a single factor. This count was always done on the factor defined by the largest number of negative items.

Conclusions and Recommendations

The results of these analyses have a clear implication for researchers who factor or cluster analyze data in which the wording of items is varied. Such researchers should be highly suspicious of factors loaded primarily with negatively keyed items. Likewise, consumers of this research should question substantive interpretations of such negative factors. The results of this study indicate that, with only 10% of the respondents ignoring the wording of items, a negative factor will appear regardless of the substantive meaning of the items.

What can a researcher do if he/she is concerned about this problem or when he/she recognizes that it is a potential problem in the analysis of item responses? First, questionnaire instructions may include a warning to potential respondents that some questions will be negatively keyed and that they should attend to all items.

Second, researchers should be especially concerned with overall questionnaire length or with a
lengthy set of items that employ the same response format. The temptation to include similar items to increase the internal consistency of a set of items measuring a single construct must be balanced by a concern that respondents will become fatigued or bored when they answer many like-sounding items. This precaution is consistent with research by Trott and Jackson (1967), who found that an acquiescence factor was strongly associated with the speed of presentation of personality items. When items were presented under speeded conditions, the largest factor obtained indicated an almost complete separation between true- and false-keyed scales. With less demand for speed, the acquiescence factor was sixth largest and not as clearly defined. Moreover, the content factors were more easily determined.

Third, researchers should be especially cautious concerning negative factors when responses to questionnaires are "involuntary" or when respondents have some reason to sabotage the research effort. This is certainly possible when the respondents are college sophomores, but it is equally likely in data collection efforts carried out with varying degrees of organizational sponsorship.

All three of these recommendations are qualitative and speculative. Based on available evidence, the exact influence each of these factors has in producing careless responses of the type described in this article cannot be indicated. In this context, it may be useful to experiment with the wording of directions and the length of questionnaires/instruments as well as the serial position of any negatively keyed items. Further, the context in which data are collected could be varied in an effort to assess the effect of context on the presence or absence of "negative" factors.
Fourth, data should be edited in a way that allows unusual response patterns to be detected. For example, each respondent’s data should be examined to find unusual responses. If negative and positive items are recoded so as to be consistent, then a respondent whose primary responses on a 7-point scale are 5 and 6 would be suspect if his/her responses to negatively worded items were 2 and 3. Responses from these individuals would be best deleted prior to any further analyses. A more systematic analysis of these "careless" respondents is possible with use of item response theory (IRT). Latent trait analyses (Lord, 1980; Wright & Stone, 1979) allow the determination of which item responses made by an individual are not well predicted by the IRT model. As a consequence, it is possible to detect unusual responses at the individual level. These unusual responses would be a deviation from those predicted by the IRT model. Because the sample sizes and numbers of items required for IRT analyses are large, however, latent trait parameters may not be obtainable for many instruments.

Finally, editors and reviewers of papers reporting factor analyses in which a negative factor appears should demand that authors consider the possibility that some portion of their respondents were careless and that appropriate editing of data take place.

It bears repetition that all this study demonstrates is what could happen if respondents failed to notice negatively keyed items. Research directed to a determination that respondents actually do respond this way should be conducted. Data editing to identify such respondents would be necessary.

It should be pointed out that this study analyzed randomly generated data based on only three correlation matrices. Other factors may influence the appearance and prominence of a negative factor.
However, the consistency with which the negative factor presents itself, even when the proportion of "careless" examinees is relatively small, indicates that this is a likely explanation of the occurrence of at least some of the reports of negative factors in the published literature. Given the relative frequency with which a negative factor is reported in the literature and the ease with which such a factor is produced, researchers should be especially wary when their factor analyses produce factors that are loaded primarily by negative items. Further, users of questionnaires should also take steps to minimize the problem in the construction of their instruments and the directions which accompany those instruments.

References

Adkins-Wood, D. (1961). Test construction. Columbus OH: Merrill.
Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The authoritarian personality. New York: Harper.
Anastasi, A. (1980). Psychological testing. New York: Macmillan.
Bentler, P. M., Jackson, D. N., & Messick, S. (1971). Identification of content and style: A two-dimensional interpretation of acquiescence. Psychological Bulletin, 76, 186-204.
Block, J. (1971). On further conjectures regarding acquiescence. Psychological Bulletin, 76, 205-210.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Chapman, L. J., & Campbell, D. T. (1957). Response set in the F-scale. Journal of Abnormal and Social Psychology, 54, 129-132.
Dubin, R. (1956). Industrial workers’ worlds: A study of the "central life interests" of industrial workers. Social Problems, 3, 131-142.
Dubin, R., & Champoux, J.
E. (1974). Workers’ central life interests and job performance. Sociology of Work and Occupations, 1, 313-326.
Dubin, R., & Goldman, D. R. (1972). Central life interests of American middle managers and specialists. Journal of Vocational Behavior, 2, 133-141.
Jackson, D. N., & Messick, S. (1961). Acquiescence and desirability as response determinants on the MMPI. Educational and Psychological Measurement, 21, 771-790.
Jackson, D. N., & Messick, S. (1962). Response styles on the MMPI: A comparison of clinical and normal samples. Journal of Abnormal and Social Psychology, 65, 285-299.
Lichtenstein, E., & Bryan, S. H. (1965). Acquiescence and the MMPI: An item-reversal approach. Journal of Abnormal Psychology, 70, 290-294.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale NJ: Erlbaum.
Messick, S., & Jackson, D. N. (1958). The measurement of authoritarian attitudes. Educational and Psychological Measurement, 18, 241-253.
Rizzo, J. R., House, R. J., & Lirtzman, S. I. (1970). Role conflict and ambiguity in complex organizations. Administrative Science Quarterly, 15, 150-163.
Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129-156.
Rorer, L. G., & Goldberg, L. R. (1965a). Acquiescence and the vanishing variance component. Journal of Applied Psychology, 49, 422-430.
Rorer, L. G., & Goldberg, L. R. (1965b). Acquiescence in the MMPI? Educational and Psychological Measurement, 25, 801-817.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton NJ: Princeton University Press.
Schmitt, N. (1977). Interrater agreement in dimensionality and combination of assessment center judgments. Journal of Applied Psychology, 62, 171-176.
Schmitt, N., & Coyle, B. W. (1976). Applicant decisions in the employment interview. Journal of Applied Psychology, 61, 184-192.
Siegel, S. M., & Kaemmerer, W. F. (1978). Measuring the perceived support for innovation in organizations.
Journal of Applied Psychology, 63, 553-562.
Thorndike, R. L. (1971). Educational measurement. Washington DC: American Council on Education.
Tracy, L., & Johnson, T. W. (1981). What do the role conflict and role ambiguity scales measure? Journal of Applied Psychology, 66, 464-469.
Trott, D. J., & Jackson, D. N. (1967). An experimental analysis of acquiescence. Journal of Experimental Research in Personality, 2, 278-288.
Weiss, D. J., Dawis, R. V., England, G. W., & Lofquist, L. H. (1967). Minnesota Studies in Vocational Rehabilitation: Manual for the Minnesota Satisfaction Questionnaire. Minneapolis: University of Minnesota.
Wherry, R. J., Sr., Naylor, J. C., Wherry, R. J., Jr., & Fallis, R. F. (1965). Generating multiple samples of multivariate data with arbitrary population parameters. Psychometrika, 30, 303-313.
Wiggins, J. S. (1973). Personality and prediction. Reading MA: Addison-Wesley.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Author’s Address

Send requests for reprints or further information to Neal Schmitt, Department of Psychology, Michigan State University, East Lansing MI 48824-1117, U.S.A.