It’s Not Insignificant: I-O Psychology’s Dilemma of Nonsignificance
Maura J. Mills and Vivian A. Woo
For years, the meta-analytic design has recognized the importance of nonsignificant results. Thorough meta-analyses include not only published articles on a topic but also unpublished ones. Why? Because, as most I-O psychologists are painfully aware, studies that fail to find significant support for hypothesized relationships are rarely published, particularly in top-tier journals. Therefore, the (published) literature is unjustifiably biased toward findings of statistical significance. In fact, part of the reason we value meta-analyses so highly is that they include this oft-neglected (but equally important) part of the literature. Meta-analysts realize that any review neglecting such research would be insufficient. So what makes the literature and publishing standards as a whole so different? Why do we as a discipline mistakenly equate “nonsignificance” with “insignificance”? As a result of this misplaced assumption, studies yielding nonsignificant results are generally treated as trivial at best and unpublishable at worst. As intellectuals and professionals, we should hold all research to the same logical, comprehensive, and integrative standards that we recognize as the basis of good meta-analyses.
The Value of Nonsignificant Research
The issue at the heart of this paper has at one time or another crossed the mind of every frustrated I-O psychologist who has spent months, perhaps even years, laboring over a study only to have the results yield a p value exceeding .05. The implications of such nonsignificant findings can, of course, be just as meaningful as those of significant findings. Yet too often, regardless of a researcher’s initial belief in any given research project, these significance-pursuing researchers become dejected when they fail to find significant results, likely in large part because they know that the likelihood of publication has plummeted. By the same token, however, researchers have inevitably succumbed to the same fallacy as the publication community; as a result, their views of nonsignificant results are restricted in scope, and they fail to see the forest for the trees. Nonsignificant research holds great, although largely untapped, potential to further both the science and practice of I-O psychology.
Research is the means by which we—as researchers, practitioners, or both—expand our knowledge and inform practice, and the main method of disseminating this knowledge is through publication. In actuality, however, published research is only the tip of the iceberg: Far more lurks beneath the surface, as the predominant exclusion of nonsignificant research findings from publication biases our conclusions and subsequent practice. This leads to what Rosenthal (1979) called the “file drawer problem.” Particularly in meta-analyses, research conclusions drawn from combining a given number of studies could be negated if there remain a certain number of nonsignificant studies out there, unseen. Howard et al. (2009) argue that no matter how many corrections are made by estimating the number of “file drawer” studies that exist on any given topic, conclusions from meta-analyses—although better than not considering unpublished and/or nonsignificant studies at all—will still be inaccurate.
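Rosenthal’s (1979) fail-safe N makes the file drawer problem concrete: It estimates how many unseen null-result studies would have to exist to overturn a meta-analytic conclusion. The following is a minimal sketch of that calculation; the function name and the study z scores are illustrative, not drawn from any cited meta-analysis.

```python
import math

def fail_safe_n(z_scores, z_crit=1.645):
    """Rosenthal's (1979) fail-safe N: the number of additional
    studies averaging z = 0 (null results) that would drag the
    Stouffer combined z below the one-tailed .05 criterion."""
    sum_z = sum(z_scores)
    # The combined z for k studies is sum_z / sqrt(k); adding x null
    # studies yields sum_z / sqrt(k + x). Solving for x at z_crit:
    x = (sum_z / z_crit) ** 2 - len(z_scores)
    return max(0, math.floor(x))

# Five hypothetical studies, each individually just significant:
print(fail_safe_n([1.7, 1.8, 1.9, 2.0, 2.1]))  # → 28
```

Here, only 28 file-drawer studies would suffice to nullify the combined result, which is exactly the kind of fragility Howard et al. (2009) warn that post hoc corrections cannot fully repair.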
The unknown number of nonsignificant studies is a troubling matter. Not only could they affect scientific conclusions, but they could also impact directions for future research. Knowing that a particular research topic or methodology has repeatedly yielded nonsignificance in past studies, future researchers can avoid repeating the same dead-end research, wasting time and resources that could otherwise be allocated elsewhere. The reverse is also possible: Nonsignificant findings could illuminate promising avenues for future researchers, avenues that may otherwise remain unexplored.
Any discussion of the potential of nonsignificant results to guide fruitful paths for future research warrants mention of one of the greatest influences on nonsignificance: sample size. The likelihood of finding significance is heavily impacted by sample size, and garnering enough participants to obtain adequate statistical power can be challenging, particularly in group-level research such as when studying work units or departments. Thus, a single study whose results only slightly fail to meet the .05 threshold for statistical significance may not be viewed as important, but if a similar pattern is established across a number of studies, valuable insights may be revealed, and with them, the potential for significant results should a larger sample be obtained. However, such a possibility is contingent upon awareness of the nonsignificant studies that have already been conducted.
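The sample-size point can be made concrete with a short sketch. Using the Fisher z transformation (a standard normal approximation for testing a correlation), the same modest correlation moves from clearly nonsignificant to clearly significant as n grows. The function name and the specific r and n values are illustrative.

```python
import math

def corr_p_value(r, n):
    """Two-tailed p value for a correlation r observed in a sample
    of size n, via the Fisher z transformation (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

# The same modest correlation, three hypothetical sample sizes:
for n in (60, 150, 400):
    print(n, round(corr_p_value(0.15, n), 3))
```

With r = .15, n = 60 is nowhere near significance, n = 150 falls just short of the .05 threshold, and n = 400 clears it comfortably, even though the underlying effect is identical in all three cases.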
How Nonsignificant Became (Perceived as) Insignificant
This idea of the importance of nonsignificant findings is not new. Indeed, Rosenthal (1979) recognized this issue three full decades ago, and most researchers are frustrated by it on a regular basis. Why, then, do we as a discipline continue to subject ourselves to it in a learned-helplessness sort of way? In this manner, at least, we are our own worst enemy.
Null hypothesis significance testing (NHST) is the origin of the current circumstances surrounding studies yielding nonsignificant results. It is hard to imagine a time when researchers did not run statistical tests of significance when conducting a study. Nevertheless, before the 1940s, very few empirical articles included NHST. However, in a rapid change of pace, by the 1950s approximately 80% of the articles in the top four psychology journals used NHST (Sterling, 1959). There was a similar pattern in the usage of statistical significance testing in articles published in top-tier journals such as the Journal of Applied Psychology. Between 1917 and 1939, approximately one third or fewer of the articles published in this highly ranked journal used NHST, contrasted with over 90% in the 1990s (Hubbard, Parsa, & Luthy, 1997). Today, it is difficult to locate an empirical article in any top journal that does not include NHST.
One major underlying issue is many researchers’ ignorance of the proper use and application of statistics, which has contributed to social scientists placing inflated importance on NHST. Significance testing is a means of assessing the reliability of statistical findings (Gelman & Stern, 2006). Null hypothesis significance testing encourages dichotomous thinking and, as a result, is often misinterpreted. Many I-O psychologists (rightly) critique techniques such as median splits for exactly this type of dichotomous thinking, which places nearly identical results in entirely separate—indeed, opposing—categories. However, despite many researchers’ legitimate critiques of such procedures, significance testing puts us at the mercy of such dichotomous thinking (and subsequent decision making) on a regular basis.
In order to move past our fear of nonsignificant results, we as a discipline must ensure that our institutions of higher education are training future I-O psychologists in appropriate statistical understanding and interpretation. In a shocking study by Oakes (1986), psychology faculty were asked for their interpretations of a p value less than .01, and only 11% responded with the correct interpretation. Further, as an example of the dichotomous thinking facilitated by NHST, Nelson, Rosenthal, and Rosnow (1986) found an abrupt drop in participants’ confidence in a finding as p increased past the .05 level, indicating a binary reject-or-fail-to-reject mindset centered on the most common cutoff. Therefore, although we would like to believe that the majority of researchers understand the potential pitfalls of relying too heavily on NHST, unfortunately this is not always the case. In fact, by continuing to fall victim to publication bias, we make the very same mistake, repeatedly succumbing to the status quo that significance is always best.
By ignoring the limitations of significance testing, we do not give other valuable practices the weight that they deserve. Replication, for instance, is essential to the advancement of science, to the extent that Kline (2004) went so far as to argue that statistical tests are unnecessary, given ample replication of results. Although we do not jump to such a conclusion here, we do argue that significance testing is best used in conjunction with other indicators of research quality and potential for contribution. Although the APA has recognized the importance of this for significant results in that it now requires effect sizes to be included along with statistical significance values, more could be done to encourage I-O psychologists to outwardly recognize, both theoretically and in practice, that there are additional indicators of research quality, usefulness, and publishability for nonsignificant results.
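The value of reporting effect sizes alongside p values is easy to demonstrate: With a large enough sample, even a trivially small standardized mean difference will cross the .05 threshold. The following is a minimal sketch; the function names and numbers are illustrative, and the z test is a normal approximation used for simplicity.

```python
import math

def cohens_d(mean1, mean2, sd_pooled):
    """Standardized mean difference (Cohen's d)."""
    return (mean1 - mean2) / sd_pooled

def z_test_p(mean1, mean2, sd, n_per_group):
    """Two-tailed p for a two-sample z test with equal n and sd."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = (mean1 - mean2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# A trivially small effect (d = 0.05) still crosses the .05
# threshold once each group contains 10,000 observations:
d = cohens_d(100.5, 100.0, 10.0)
p = z_test_p(100.5, 100.0, 10.0, 10000)
```

A reader given only the p value would conclude the effect is real and important; the effect size reveals that it is real but practically negligible, which is precisely the complementary information the APA’s reporting requirement is meant to surface.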
Publish (Significant Results) or Perish
An important issue surrounding the practice of NHST is the resulting publication bias that is rampant in our field. Publication bias—specifically, a bias toward publishing manuscripts yielding significant results as well as those confirming existing theory (Pagell, Kristal, & Blackmon, 2009)—has long been problematic. Such a bias increases the frequency of Type I errors published and ultimately results in a body of literature that is unrepresentative at best and misleading at worst. More than half a century ago, Sterling (1959) documented the low percentage of articles with nonsignificant results in psychological journals. As time has passed, the number of publication outlets has grown, but it has been surpassed by the number of researchers attempting to publish. The end result is the same, but the effect is magnified by the sheer number of competitors vying for the limited space in journals. Using the criterion of statistical significance is an easy way to screen out submitted manuscripts, but such an approach overlooks manuscripts’ holistic quality and potential contribution to the field. As an example, Atkinson, Furlong, and Wampold (1982) asked 101 consulting editors for APA journals to review three manuscripts that differed only by their level of statistical significance and to make a publication recommendation for each. Results indicated that the nonsignificant versions were three times more likely to be rejected than the significant version of the manuscript.
Further examination of the bias against nonsignificant research leads us to the overarching “publish or perish” culture that plagues academia in the United States. Faculty, particularly junior faculty, are under overwhelming pressure to churn out paper after paper in order to gain reappointment, tenure, and promotion. Although the rationale behind this requirement is noble (e.g., that both faculty and student experiences are broadened if faculty are contributors in the field), it remains likely that any given faculty member’s research productivity may be linked more to the happenstance that his or her research yielded significant results than to the amount of time that the faculty member has dedicated to his or her research program or to the quality or promise of that program. According to Fanelli (2010), the more papers published by researchers within the same U.S. state, indicative of a competitive academic environment within that state, the more likely those papers were to contain statistically significant results. Considering such a finding, coupled with the current requirements for prolific publication, it comes as no surprise that researchers are under increasing pressure to conduct studies yielding significant results from the get-go; anything else could be viewed as a waste of time, resources, and effort.
This has in fact been so ingrained in us that we can no longer place blame solely upon the journals for failing to accept papers with nonsignificant results, but as authors we must also now accept some of the blame ourselves. That is, we have so gravely succumbed to this misconception that we actually self-select out of even preparing and submitting a manuscript for review if the research contained therein is largely composed of nonsignificant results. However, if the trivialization of nonsignificant results is so widespread, is there any hope? We believe so.
Lifting the Stigma of Nonsignificance
First and foremost, we must recognize that an overreliance on NHST to the exclusion of other considerations is the underlying problem plaguing nonsignificant research. This is not to say that significance is a non-issue, or that we should forego NHST entirely: Certainly not! Rather, we argue that statistical significance is merely one of many issues that warrant consideration when determining the value of a study. Our discipline has historically foregone other statistical approaches (e.g., the Bayesian approach) that may yield more accurate conclusions about our data and has done so in favor of relying primarily on NHST. In reality, however, we must value all well-designed research, regardless of whether the results are statistically significant.
Urging journals to start including more studies with nonsignificant findings is a deceptively easy solution. Of course, not every unpublished article with nonsignificant results is worthy of publication. Specifically, building upon Rosenthal’s file drawer analogy, Pagell and colleagues (2009) suggested two types of manuscripts with nonsignificant results that should be pulled from the proverbial file drawer and published: studies with results in opposition to existing theory, and studies that, despite methodological or theoretical flaws from which future researchers could learn, nonetheless have scientific importance.
I-O psychology has recently taken a step in the right direction, however, thanks to the Journal of Business and Psychology, which recently issued a call for papers for an upcoming special issue focusing on nonsignificant results. Going a step further, beyond this laudable, single-issue effort to publish nonsignificant results is the Journal of Articles in Support of the Null Hypothesis, an entire journal in the broader psychology field devoted to disseminating such research. This free, online journal may be one potential answer to the publication bias in psychological literature.
However, this and other such journals focusing on nonsignificant findings are generally online and free. This stands in stark contrast to the higher cost and limited space of traditional print journals that have partially contributed to the publication bias against nonsignificant research. This pattern raises the question: Are we placing greater value on significant research findings partly in an attempt to justify the scarcity and cost of the outlets biased toward publishing them? Perhaps we are not getting what we are paying for. It may be time to shed the old misconceptions about the value of free and readily accessible knowledge and to embrace new ways of thinking about our research and its dissemination.
However, although both of these efforts—special issues and specialized journals—are admirable beginnings, the “file drawer” of forgotten research (Rosenthal, 1979) is still overflowing with wisdom to be had. One issue of any journal will never be enough nor will journals dedicated solely to publishing nonsignificant results, as if to render them “separate but equal.” The goal, then, must be for all journals to welcome and encourage manuscripts yielding meaningful nonsignificant results with as much enthusiasm as they do manuscripts yielding significant results and to integrate the two into the same program of publication and knowledge dissemination. Is this an unattainable or unreasonable aim? Surely not. In contemplating the possibility of beginning to publish nonsignificant results, one of the questions we must pose is, “What are the consequences of continuing to not publish such results?”
The devaluation of nonsignificant results has been occurring slowly over the past half-century, a side effect of the pervasive popularity of statistical testing to the exclusion of other determinants of a study’s value. The end result is a scientific literature that is biased by the underrepresentation of manuscripts yielding nonsignificant findings, and a resulting misunderstanding regarding the potential value of such results. Nevertheless, although this issue has to some degree been the subject of ongoing unrest (albeit largely silent) in the field for a number of years, until recently there has been little progress in lifting the stigma of nonsignificance. Recently, occasional publication of nonsignificant findings has been occurring, in both single issues and entire journals dedicated to this type of research. However, despite this small degree of progress as of late, the problem endures, and these stop-gap measures, while valuable, cannot themselves fill the void left by the exclusion of such research from consistent publication in regular issues of journals. As such, there is far to go in lifting the black veil placed over nonsignificant results, and sustainable resolutions to the issue have yet to emerge. As a discipline, we must demand the regular and unbiased inclusion of nonsignificant results, no longer viewing them as necessarily insignificant. The plight of nonsignificant research did not reach its current state overnight, and remedying this issue will take perseverance, dedication, and—for some editors, reviewers, and researchers—a willingness to challenge one’s long-held beliefs about what defines publishable research, all with the ultimate goal of advancing I-O psychology in the right direction.
Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical significance, reviewer evaluations, and the scientific process: Is there a (statistically) significant relationship? Journal of Counseling Psychology, 29, 189–194.
Carver, R. P. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287–292.
Dickersin, K., & Min, Y. (1993). Publication bias: The problem that won’t go away. Annals of the New York Academy of Sciences, 703, 135–148.
Eberly, L. E., & Casella, G. (1999). Bayesian estimation of the number of unseen studies in a meta-analysis. Journal of Official Statistics, 15, 477–494.
Fanelli, D. (2010). Do pressures to publish increase scientists’ bias? An empirical support from US states data. PLoS ONE, 5, e10271.
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60, 328–331.
Howard, G. S., Hill, T. L., Maxwell, S. E., Baptista, T. M., Farias, M. H., Coelho, C., ... Coulter-Kern, R. (2009). What’s wrong with research literatures? And how to make them right. Review of General Psychology, 13, 146–166.
Hubbard, R., Parsa, R. A., & Luthy, M. R. (1997). The spread of statistical significance testing in psychology: The case of the Journal of Applied Psychology, 1917–1994. Theory & Psychology, 7, 545–554.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299–1301.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York, NY: Wiley.
Pagell, M., Kristal, M., & Blackmon, K. (2009). Special topic forum on nonsignificant and contradictory results. Journal of Supply Chain Management, 45, 70.
Rosenthal, R. (1979). The “file drawer problem” and tolerance for null results. Psychological Bulletin, 86, 638–641.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance–or vice versa. Journal of the American Statistical Association, 54, 30–34.