An Interview With Frank L. Schmidt
Editor’s Note: The following is an excerpt from the International Society for Intelligence Research’s (ISIR) Distinguished Contributor interview with Frank Schmidt. The interview was conducted at the 2010 ISIR annual conference before a plenary session.
When you and Jack Hunter began your work, the general conception was that there was little relationship between occupational performance or occupational status and intelligence. That conception has completely changed, such that it is currently hard to believe that anyone ever thought otherwise. What were the main obstacles to changing people’s minds, and what resistance did you encounter along the way?
I think there were at least four sources of resistance.
First, I-O psychologists were being paid to conduct situational validity studies. The concept of generalizable validity for general mental ability (GMA) was a threat to this activity because it shows such studies are unnecessary. This research also showed that situational validity studies did not “work” anyway because their statistical power to detect validity was typically about .50, meaning that only about half of such studies gave the correct answer about the presence of validity. This was also disturbing to many.
Second, many I-Os were wedded to differential aptitude theory and believed that job performance was best predicted by a job-specific weighted combination of specific aptitudes (such as spatial, verbal, numerical, etc.). They were reluctant to accept the evidence that this belief was false (but eventually did).
Third, the idea of the dominance of GMA was a threat to the equalitarian idea that the main difference between people is in their pattern of abilities, not in overall ability. This belief was common among laypeople as well as psychologists.
Fourth, it is very hard, nearly impossible, for mature scientists to change their core theories. Older people in my field found it hard to change. For example, to change meant they would have had to redo all the courses they taught. We were told many were hoping this would just be a fad (like so many other things in psychology), and they could just wait it out. Acceptance of our work greatly increased once these people retired or died off. Max Planck once said “Old scientists never change their theories, they just die and are replaced by young scientists who accept the newer theories.” In our case, this process took about 20 years! The big turnaround occurred when we got the APA Distinguished Scientific Award for Applications of Psychology in 1994.
We found out that it is not true that if you build a better mousetrap, people will beat a path to your door. They will, but only after about 20 years.
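The power figure of about .50 cited above for situational validity studies can be illustrated with a short calculation. This is only a minimal sketch, not the procedure used in the original research: it relies on the Fisher z approximation for a two-tailed test of a zero correlation, and the true validity of .25 and the sample size of 62 are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def correlation_power(true_r, n, alpha=0.05):
    """Approximate power of a two-tailed test of H0: rho = 0.

    Uses the Fisher z transformation, whose standard error is 1 / sqrt(n - 3).
    """
    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true correlation is true_r.
    expected_z = math.atanh(true_r) * math.sqrt(n - 3)
    # Probability of landing in either rejection region.
    return normal.cdf(expected_z - z_critical) + normal.cdf(-expected_z - z_critical)


# Hypothetical inputs: a true validity of .25 studied with 62 people.
print(round(correlation_power(0.25, 62), 2))
```

Under these assumed inputs the power comes out close to .50, so a study of this size would miss a real validity of .25 about half the time.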
What is your definition or theory of intelligence (over and beyond the brief definition you and Hunter give in your articles)? Do you believe that g is sufficient to represent the intelligence domain? What is your opinion of John Carroll’s analysis of the data and his proposed three-stratum model?
We have defined intelligence as the ability to learn. There are many different definitions of intelligence, but they are just different ways of describing the same ability and are all compatible with each other. I have no patience with articles that maintain that because of these minor differences in definitions, there is no agreement on what intelligence is. That is hogwash.
I accept the hierarchical models of the structure of mental abilities—the Carroll 3-stratum model, the Gustafsson model, the Bouchard-Johnson model, and others. My only complaint is about how these models are sometimes interpreted because people get causality reversed. They say that the specific abilities (level-1 tests) combine to create the general aptitudes (e.g., verbal), which then combine to produce the g-factor. In fact, causality goes in the opposite direction. GMA is the main cause of general aptitudes, which are then the main cause of performance on the specific level-1 tests (specific aptitudes).
What is your current point of view on the psychological significance of specific abilities beyond “g”?
The evidence is very strong that specific aptitudes make no contribution to the prediction of complex performances (such as job performance) over and above GMA. The big research project that Jack Hunter did for the Pentagon, as well as the work of Malcolm Ree and some of my own work, shows that the whole burden of prediction of complex real-world performances is borne by the g-factor. In regression equations, adding level-1 or level-2 ability measures does not increase validity over that of GMA alone. Much of the evidence that seems to suggest otherwise is based on sloppy research in which there is no control for the biasing effects of measurement error. In our research, once we controlled for measurement error, no specific aptitude got a nonzero beta weight, but they did before this control. This finding may not hold for simple performances such as adding and subtracting numbers. But these are not the kind of performances that are of practical value and interest. Complex, multifaceted real-world performances like job performance seem to be learned over time based on GMA, in the same way the specific aptitudes like verbal, quantitative, and so on are learned.
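The point about measurement error producing spurious beta weights can be illustrated with a small calculation. The sketch below rests on an assumed model that is not taken from the research described here: performance depends only on true GMA, a specific aptitude merely correlates with true GMA and is measured without error, and the GMA test has less than perfect reliability. The function names and all numbers are hypothetical.

```python
import math


def standardized_betas(r_y1, r_y2, r_12):
    """Standardized regression weights for two predictors, from correlations."""
    denom = 1 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


def aptitude_beta(true_gma_effect, gma_aptitude_r, gma_reliability):
    """Beta weight of a specific aptitude that has NO effect of its own.

    Assumed model (hypothetical): performance depends only on true GMA,
    the aptitude correlates with true GMA and is measured without error,
    and the GMA test has the given reliability.
    """
    attenuation = math.sqrt(gma_reliability)
    r_perf_gma = true_gma_effect * attenuation      # observed GMA validity
    r_perf_apt = true_gma_effect * gma_aptitude_r   # aptitude validity, via GMA only
    r_gma_apt = gma_aptitude_r * attenuation        # observed predictor intercorrelation
    _, beta_aptitude = standardized_betas(r_perf_gma, r_perf_apt, r_gma_apt)
    return beta_aptitude


# Hypothetical values: true GMA effect .50, GMA-aptitude correlation .60.
for reliability in (1.0, 0.9, 0.8):
    print(reliability, round(aptitude_beta(0.5, 0.6, reliability), 3))
```

With perfect reliability the aptitude's weight is zero. As the reliability of the GMA measure drops, the aptitude picks up a positive weight even though, by construction, it contributes nothing of its own.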
The phrase “over time” is important here. Specific mental skills might contribute over and above GMA to individual differences in initial performance if some individuals have these skills going in and others (independent of GMA) do not. But over longer time periods, GMA swamps these initial differences in specific mental skills in the determination of real-world complex performances such as job performance.
Lloyd Humphreys liked spatial ability, perhaps because social class differences are smaller on this indicator variable for GMA than on other indicators such as verbal and quantitative. Spatial ability is a useful indicator variable (among many) for GMA, but Lloyd never presented any empirical evidence to me showing incremental validity over GMA for the prediction of complex real-world performances. He said he had such evidence but never provided any.
I endorse Cattell’s investment theory of ability (although not his gc vs. gf theory). Investment theory holds that the individual’s interests and values determine where each person invests his/her GMA to develop specific aptitudes (which are mental skills). For example, people with technical, scientific, and engineering interests invest GMA in the development of spatial and technical ability, which then become good indicators of GMA for them but which do not per se make any independent contribution to their overall professional achievement and success.
What aspects of your approach to research could be generalized to other areas with profit?
I think my most important contribution is the development of psychometric meta-analysis methods with Jack Hunter. These methods were initially developed for validity generalization purposes but over the last 30 years have spread to many other research areas, not only in I-O psychology and other areas of psychology, but also to applications in economics, political science, wildlife management, medical research, nursing research, and other areas. These methods are very widely used today, not just in the U.S. but around the world. I provide software to implement these methods. Over half the orders are from foreign countries. Usually they say they learned of the software while reading the 2004 Hunter-Schmidt meta-analysis book. About 300 copies of this software package have been distributed, which is a lot for specialized software. By contrast, in retrospect the areas of personnel selection and student selection seem somewhat narrow.
What led you and Jack Hunter to the basic ideas that underlie meta-analysis (e.g., your original work on validity generalization)?
One of my professors at Purdue, Hubert Brogden, had been the research director at the Army Research Institute in Washington, DC. In 1967 in a conversation with him at Purdue, he stated that the military validity estimates were stable across samples. I asked him why this was not true for civilian estimates, and he said “sampling error.” Nine years later in DC (in 1976) sitting in my office, I was looking at one of Ghiselli’s highly variable validity distributions, and I remembered what Brogden had said. Then it occurred to me that you could use the sampling-error formula to estimate how much of the observed variance was due to sampling-error variance. I did some quick calculations on a calculator and found it was 70% or more. I was excited by this and called Jack. He immediately said it was a great idea and wrote a letter saying it was the best idea I had ever had. We then worked out the details of applying this idea and published it in 1977.
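The calculation described here can be sketched in a few lines. This is a minimal "bare-bones" version that assumes the usual large-sample formula for the sampling-error variance of a correlation, (1 - r̄²)² / (N̄ - 1), evaluated at the sample-size-weighted mean correlation. The function name and the validity coefficients below are hypothetical and are not Ghiselli's data.

```python
def sampling_error_share(correlations, sample_sizes):
    """Share of the observed variance in correlations expected from sampling error.

    Returns (weighted mean r, observed variance, sampling-error variance, share).
    """
    total_n = sum(sample_sizes)
    pairs = list(zip(correlations, sample_sizes))
    # Sample-size-weighted mean and variance of the observed correlations.
    mean_r = sum(n * r for r, n in pairs) / total_n
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in pairs) / total_n
    if observed_var == 0:
        raise ValueError("observed correlations do not vary")
    # Expected sampling-error variance at the average sample size.
    mean_n = total_n / len(sample_sizes)
    sampling_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    return mean_r, observed_var, sampling_var, sampling_var / observed_var


# Hypothetical validity coefficients and sample sizes, for illustration only.
rs = [0.05, 0.38, 0.21, 0.44, 0.12, 0.30]
ns = [60, 75, 50, 90, 65, 80]
mean_r, observed_var, sampling_var, share = sampling_error_share(rs, ns)
print(f"mean r = {mean_r:.3f}, share due to sampling error = {share:.0%}")
```

The share can exceed 100% in a particular set of studies, because the observed variance is itself an estimate subject to sampling error.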
Why do you think this idea did not take hold much earlier, either in psychology or other disciplines?
In retrospect, we can see there were some early examples of nascent meta-analysis methods in the 1930s. People like E. L. Thorndike averaged several rs and presented the average as the best estimate. But meta-analysis was not widely adopted until the 1980s.
The decade of the 1980s was approximately the time when research literatures became so voluminous that it was clear a method was needed to make sense of them. This was the effect of the information explosion and the resulting information overload.
Before then, there were at least two reasons why there was no meta-analysis. The first is that there was a failure to appreciate how large sampling error is. Researchers believed that sampling error could be controlled with Ns of 50 or 100. This led to the false belief that you could answer a question with a single study in this sample-size range. Second is the myth of the perfect study, the belief that truth is revealed by the one perfect study and the rest should be thrown out. The prescription was to search for the one perfect study.
Some of my colleagues do not like meta-analysis. They think there is always a perfect or near-perfect experiment or study that can answer any question. Explain why this idea is wrong.
This argument is addressed in some detail in our 2004 MA book. This argument is essentially the same as the myth of the perfect study. Those advocating it usually argue that a single large N study can provide the same precision of estimation as the numerous smaller N studies that go into a meta-analysis. But with a single large N study, there is no way to estimate SD-rho or SD-delta, the variation of the population parameters. This means there is no way to demonstrate external validity. A meta-analysis based on a large number of small studies based on different populations, measurement scales, time periods, and so on, allows one to assess the effects of these methodological differences. If SD-rho is small or zero, one has demonstrated that these factors do not matter and has demonstrated external validity (generalizability of the findings). If SD-rho is not small, then one knows there may be moderators operating and can attempt to identify them. None of this is possible with a single large sample study, no matter how well conducted it is. The specific methods and sample used in a single large sample study can always be challenged as not generalizable. A single large study does not address the many questions related to generalizability of findings.
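A rough sketch of how SD-rho is obtained from a set of studies may help here. It assumes the bare-bones approach of subtracting the expected sampling-error variance from the observed variance of the correlations, and it omits the corrections for measurement error and range restriction that a full psychometric meta-analysis would add. The function names, the 80% interval, and the input values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sd_rho(observed_var, sampling_error_var):
    """Estimated SD of population correlations after removing sampling error.

    A negative difference is treated as zero true variation.
    """
    return math.sqrt(max(observed_var - sampling_error_var, 0.0))


def credibility_interval(mean_rho, sd, level=0.80):
    """Interval expected to contain the given share of population correlations,
    assuming they are roughly normally distributed."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean_rho - z * sd, mean_rho + z * sd


# Hypothetical summary values from a set of studies.
sd = sd_rho(observed_var=0.0150, sampling_error_var=0.0120)
low, high = credibility_interval(0.25, sd)
print(f"SD-rho = {sd:.3f}, 80% credibility interval = ({low:.2f}, {high:.2f})")
```

A single large study supplies one correlation only, so there is no observed variance across studies to feed into this calculation.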
Schmidt and Hunter go together in people’s minds. Could you give us an insider’s perspective on this collaboration, its highs and lows?
I don’t recall any lows. Jack was wonderful to work with, the smartest person I have ever known and a wonderful friend too. Lee J. Cronbach remarked that Jack was the smartest PhD student he’d ever known. Jack critiqued Cronbach’s Generalizability Theory and found an important flaw, and Cronbach accepted this. Students loved Jack, but many faculty were intimidated by him and so did not work with him. I never felt intimidated by him. If I came up with an idea (for example, the idea for VG), he immediately saw the value and then made important contributions to developing and perfecting it. But he did not like administrative or editorial work, so I did almost all of that. For example, I handled journal submissions of all our joint work.
In Cronbach’s (1957) presidential address he made a great deal about the importance of interactions. He followed this up in 1975 in “Beyond the Two Disciplines of Scientific Psychology.” Some believe the search for interactions has been a dismal failure whereas others believe it is the future. Indeed this is now a hot topic in behavior genetics. What is your take on the importance of interactions in science generally?
I think it is important to remember that in the Cronbach (1975) article referred to here, he stated that cumulative knowledge in psychology was not possible, that there were so many complex interactions in psychological processes that we faced an impossible “hall of mirrors.” Well, he was wrong, and applications of meta-analysis in many literatures have shown he was wrong and that cumulative knowledge is possible. Amazing as it seems, Cronbach seems not to have appreciated the ability of sampling error and other artifacts to create illusions of complexity where the underlying reality was actually simple.
In terms of interactions I think we have to distinguish between the detectability of interactions and their substantive scientific importance. In 1978, Jack Hunter and I published an article in Personnel Psychology entitled “Moderator Variables and the Law of Small Numbers” in which we showed that the sample sizes needed to have adequate power to detect interactions are much larger than most people believe—often requiring 10,000 or more people. We also showed that even small amounts of measurement error greatly reduce power to detect interactions. And measurement error ALWAYS exists. So detection is very difficult. If your response is don’t use significance tests and you won’t need to worry about power, then consider that confidence intervals are also wide for interactions unless N is very large.
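One way to see why measurement error is so damaging for interactions is to look at the reliability of the product term that carries the interaction in a regression. The sketch below is not taken from the 1978 article. It assumes two mean-centered, bivariate normal predictors with independent measurement errors, in which case the product's reliability is (rxx·rzz + rxz²) / (1 + rxz²). The function name and the reliabilities are hypothetical.

```python
def product_term_reliability(rel_x, rel_z, r_xz=0.0):
    """Reliability of the product X*Z of two mean-centered predictors.

    Assumes bivariate normal predictors and independent measurement errors.
    r_xz is the observed correlation between the two predictors.
    """
    return (rel_x * rel_z + r_xz ** 2) / (1 + r_xz ** 2)


# Hypothetical reliabilities, for illustration only.
for rel in (0.9, 0.8, 0.7):
    print(rel, round(product_term_reliability(rel, rel), 2))
```

With uncorrelated predictors the product term's reliability is simply the product of the two reliabilities, so two measures that each look respectable can still yield a much noisier interaction term.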
What about substantive importance? Crossover (i.e., disordinal) interactions, if they exist, could have great substantive importance. (These appear as a big “X” when graphed.) But neither Cronbach nor Snow was able to demonstrate any such interactions in the area of aptitudes and learning via different presentation modes. I doubt whether there are many crossover interactions.
Now consider a typical noncrossover interaction. On the graph you see two lines that are not quite parallel. The difference in elevation between the two lines is somewhat larger on one end than on the other. To me, this suggests that what is really important is the main effect, despite the fact that there is an interaction. Possibly because I am an applied psychologist, my preference is to go after the big effects and emphasize these. I think many experimental psychologists study effects that are quite small, but they often don’t realize this because they do not compute any indices of magnitude of the effects, such as d-values or correlation coefficients. They rely solely on statistical significance (p values).
You’ve been very involved in the use of cognitive tests for hiring. What are your ideas about how they fit into the overall hiring scheme? For instance, in some settings, notably public safety but also education and sales, it may be important that the total workforce be perceived as being part of the community they serve. Note that the issue isn’t one of dealing with government-enforced quotas as targets. It’s an issue of evaluating the effectiveness of the total workforce, in the setting in which they are working. Individual competencies may be only part of this. What are your thoughts on this topic?
This is an important question. As we know, there are increasingly stringent legal constraints on affirmative action attempts to increase minority representation. The recent Supreme Court decision involving firefighters is an example of this. But I will ignore this in answering this question and assume a situation in which decision makers have free rein.
There are costs and benefits to both sides of this question, but the problem is that these have never been calibrated and compared, and as a result we do not have the evidence we need to make a rational decision.
Here are the trade-offs. Suppose we consider a police force in a city with a large Black population. If we lower valid selection standards to get more Black police officers, then (a) the police force will look more like the community, and this is believed to have benefits. These benefits are postulated but never measured; they are hypothesized. And (b) the level of competence and performance of the police will be lowered, and because police services are used more heavily in Black areas, Black citizens will be the big losers from this. But we do not have any reliable or valid measures of this effect either, although the DC police case shows how bad this effect can be in the extreme. It is possible to use selection utility models to estimate the decline in performance in SD units, but this may not give us a picture of the real impact on the community.
If we decide to maintain police selection standards, then there is no performance decrement. But the force is mostly White and does not mirror the community in racial makeup. Some believe this leads to bad outcomes. But again, we do not have a reliable assessment of this outcome.
The upshot is that we do not have the information to answer this question. We just have to guess or use intuition—and so people will not agree.
Do you think we should find new ways for assessing g in real-life settings?
The big problem here is the assessment of adult intelligence. GMA is a latent variable that can be assessed only through its effects (verbal ability, general knowledge, etc.), that is, by measuring the areas in which people have “invested” their GMA in developing aptitudes and skills. These effects of GMA are called indicator variables. The problem is that adults differ dramatically in where they invested their GMA, and it is not practical to look in all such possible places, which is required for a perfectly construct valid GMA measure. Phil Ackerman, among others, has made this point.
For example, someone with technical interests will probably invest much GMA in the development of mechanical, spatial, electrical, and similar skills. If your GMA test does not include these domains, this is a construct deficiency for that type of person. But note that this does not imply invalidity. It just means the test is not as valid a measure of GMA as it could be. Fortunately for us, most people have invested a lot of their GMA in the development of verbal and quantitative skills, and so we rely heavily on these two indicators.
The many different possible indicator variables for the latent GMA construct sometimes have different properties. For example, perceptual speed measures favor women, whereas spatial measures favor men. As another example, measures of “fluid” ability (such as the Raven’s test) favor younger individuals, whereas measures of “crystallized” ability (such as vocabulary) favor older people. Measures of “working memory” and “executive attention” are often studied as causes of GMA, when in fact they are just additional indicator variables for GMA. This is probably also true of the “elementary cognitive tasks,” such as choice reaction time, that some hypothesize to be causes of GMA.
Why is personality a worse predictor of job performance than intelligence?
It could be that this is just a fact of nature and that GMA simply dominates personality in determining complex real-world performances. But it could be because self-report personality measures are not very construct valid. Recent research has shown that when personality is measured by the combined ratings of several people who know the focal people, the validities are much higher, sometimes near .50. Of course, some people then argue that ratings by others do not measure “real personality” and that only the individual can validly answer questions about his or her real personality. These critics also point out that the pattern of correlations among personality traits is very different when personality is assessed via ratings by others. This debate is still going on. It is a construct validity question.
If intelligence and educational achievement are highly related, why not use the latter for predicting job performance? It is easier for lay people to accept educational achievement as a valid predictor because it comprises intelligence plus other relevant traits such as zeal, persistence, diligence, and so on.
Grade-point average (GPA) does in fact predict job performance. This has been shown by meta-analysis for both undergraduate GPA and graduate GPA. But the validity is not as high as that of GMA, probably because of problems in measuring academic achievement on the same scale across individuals (who have different levels of rigor in their past schools, in their majors, etc.).
We do know that high school GPA is a slightly better predictor of college performance than the SAT or ACT. We also know that the GRE Advanced score (measuring academic achievement in one’s major) is a better predictor of graduate performance than GRE-V or GRE-Q. It is still true that past performance is the best predictor of future performance.
Cronbach (1957) wrote his famous paper on the two disciplines of scientific psychology over 50 years ago. He argued “Psychology continues to this day to be limited by the dedication of its investigators to one or the other method of inquiry rather than to scientific psychology as a whole.” Have things improved much since Cronbach penned that statement?
I think they have in I-O psychology and educational psychology. In both these areas, there are research studies that take into account both the role of mental ability and the role of training methods. For example, in the training area of I-O, current researchers do not ignore the role of GMA when evaluating training methods.
I think there is an asymmetry here that Cronbach ignored. Differential psychologists do not deny the importance of treatment conditions. They recognize, for example, that training programs and education do have real effects on people. However, many experimentalists and experimental social psychologists flatly deny any scientific value or impact of traits, whether GMA or personality, or whatever. In fact, they often publicly decry what they call “the fallacy of dispositionalism.” This includes prominent people like Philip Zimbardo and Albert Bandura. You are hoping against hope if you think these people are going to produce an integrated psychology.
Has applied psychometrics hit an asymptote in terms of future creative developments?
No. There is much room for improvement:
There is a need to learn to simultaneously control for sampling error and measurement error. Too much research focuses on one or the other but not both. Statisticians focus on sampling error and ignore measurement error. In effect, they assume perfectly reliable measurement, so they can focus only on sampling error. Psychometricians and psychometric textbooks focus on measurement error and ignore sampling error. In effect, they say “Assume a large sample, so we can ignore sampling error and focus only on measurement error.” Simultaneously controlling both sampling error and measurement error is a major contribution of meta-analysis. All data sets have both kinds of error and unless both are addressed research results are distorted.
Psychometricians need to develop a substantive understanding of the different kinds of measurement error and stop just referring to “random error” or “random response error.” Specific factor error and transient measurement error are important, but most psychometricians do not address them and are not interested in the substantive psychological processes that produce these errors. As an example, Le, Schmidt et al. (2010) found r = .91 between organizational commitment and job satisfaction when the correlation is properly corrected for all sources of measurement error. This sort of finding has important implications for construct redundancy and construct proliferation. Some constructs in the literature have been “shown to be distinct” by the expedient of not correcting for measurement error or not correcting fully. As a result, there is a lot of construct redundancy.
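The correction referred to here is, at its core, the classical correction for attenuation: the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below shows only that basic step. The function name and all input values are hypothetical, and they are not the figures from the study cited above. A correction for all sources of measurement error would require reliability estimates that themselves capture random response, transient, and specific factor error.

```python
import math


def correct_for_attenuation(observed_r, rel_x, rel_y):
    """Estimate the correlation between true scores from an observed correlation."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return observed_r / math.sqrt(rel_x * rel_y)


# Hypothetical values: observed r = .60, reliabilities .80 and .75.
print(round(correct_for_attenuation(0.60, 0.80, 0.75), 2))
```

The lower the reliabilities entered, the larger the corrected correlation. Two constructs can therefore look distinct only because the reliabilities used in the correction leave out some sources of error.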
You have written a lot about how to make psychology more rigorous, and a lot of the research out there falls way short of your standards. What effect do you think this has on the likely accuracy of the things we think we know?
I see the following effects of lack of rigor:
1. People conclude that results on any given hypothesis or question vary when in fact they do not. The inconsistency illusion is created by sampling error and low statistical power (typically about .50).
2. People conclude that the relationships that are demonstrated are “small” because they do not correct for the downward biasing effects of the artifacts of measurement error, range restriction, dichotomization, and so on. In fact, these relationships are often not small at all.
3. Also contributing to false conclusions that relationships are small is reliance on indices of “percent variance accounted for,” which is a biased statistic relative to what we want to know. The path coefficient, not percent variance, reveals the causal leverage of one variable on another. Substantial path coefficients can correspond to small percentages of variance accounted for. For example, 9% of variance can be a path coefficient of .30.
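The arithmetic behind the last point is simple enough to show directly. This small sketch assumes the simplest case of a single standardized predictor, where the path coefficient equals the correlation and the percentage of variance accounted for is its square. The function name is hypothetical.

```python
def percent_variance(path_coefficient):
    """Percent of variance accounted for by one standardized predictor."""
    return 100 * path_coefficient ** 2


for path in (0.10, 0.30, 0.50):
    # A one-SD change in the predictor goes with a change of `path` SDs in the outcome.
    print(f"path = {path:.2f}: {percent_variance(path):.0f}% of variance")
```

A path coefficient of .30 means that a one standard deviation difference on the predictor goes with a difference of about three tenths of a standard deviation on the outcome, which looks far less trivial than "9% of variance" suggests.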
What areas do you see as especially fruitful for intelligence research?
Linda Gottfredson’s work on the role of GMA and everyday life tasks, such as understanding bus schedules and filling out Social Security forms. This is an area of opportunity for future research that could lead to better understanding of many social phenomena. I would also include Linda’s work on the role of GMA in personal health management and health outcomes. Medical researchers have repeatedly been puzzled by the fact that even when poor people get exactly the same medical care as middle class people, their health outcomes are still worse. Linda has proposed that this difference is due to average social class differences in GMA. This is an area of research opportunity for differential psychologists. Finally, I would add the revolution in developmental psychology. As the old guards of environmentalism in that field (like Maccoby and Kagan) die off or retire, opportunities are appearing for researchers who study human development from the viewpoint of individual differences in traits and behavior genetics.
Who are your heroes?
My heroes are those who defended the scientific value of traits and dispositions against the arrogant denial of their value by social and experimental psychologists, from the 1960s on. My heroes are those psychologists who kept differential psychology alive in the face of the predictions that it was an anachronism and was dying. These defenders include not only differential psychologists but also people in behavior genetics.
It is gratifying to see that neuroscience researchers now take our psychological research on traits so seriously that they use fMRI brain imaging to seek the brain differences underlying the trait differences that differential psychologists have studied.
And it is also gratifying to see molecular geneticists searching for specific genes underlying the traits elucidated by differential psychologists. I suspect the social and experimental psychologists are gritting their teeth about these developments.