Informed Decisions: Research-Based Practice Notes

Steven G. Rogelberg
Bowling Green State University

Given the incredible amount of time and money that are allocated to training activities each year, it is not surprising that a pervasive interest in training evaluation exists. To that end, I have asked Tanya Andrews and Brian Crewe to author this quarter's Informed Decisions column. Tanya and Brian effectively discuss the current state of training evaluation. Their interviews with practitioners across a number of organizations provide nice insights into how the challenges of training evaluation are being met. If you have any comments/questions concerning this particular column you can contact Tanya (tcastig@bgnet.bgsu.edu). If you have any ideas for future columns or would like to propose authoring a column, please contact me at rogelbe@bgnet.bgsu.edu.

Examining Training Evaluation: Reactions,
Lessons Learned, Current Practices, and Results

Tanya L. Andrews and Brian D. Crewe
Bowling Green State University

Despite its rich history, the area of training and development has recently been the subject of relatively few empirical publications. We found this especially true of the research on training evaluation. Kirkpatrick's (1959, 1960) four levels of evaluation criteria (i.e., reactions, learning, behavior, and results) have dominated the training literature; no other model has emerged as a stronger contender for evaluating training programs. Kirkpatrick's model has been the most popular, most pervasive, and most cited set of evaluation criteria since its inception, and it is cited in virtually every introductory I-O, OD, and OB textbook. We asked ourselves whether this dominance stems from the perceived impossibility of designing a single evaluation procedure for the great diversity of training programs (in terms of content, process, purpose, and generalizability). Despite this concern, the Kirkpatrick criteria can be applied to a wide variety of training programs. So, is Kirkpatrick's model the model for training evaluation? In search of an answer, we consulted the literature and practitioners in the field to determine the current state of training evaluation. We present critiques and empirical evidence concerning Kirkpatrick's model, the results of a small-scale practitioner survey, and future research directions for training evaluation.

Kirkpatrick's (In)Fallible Model of Training Evaluation

Notwithstanding the popularity of Kirkpatrick's model, several authors have commented on the lack of completeness of the model. For example, based on a demonstrated negative relationship between perceptions of training difficulty and subsequent measures of training effectiveness, Warr and Bunce (1995) asserted that perceptions of training difficulty should be a sub-level of Kirkpatrick's reactions criterion. Alliger, Tannenbaum, Bennett, Traver, and Shotland (1997) went further and proposed an augmented model. Specifically, they suggested that (a) reactions should be assessed as to both affect and utility/usefulness, (b) learning should be subdivided into learning that occurs immediately following training and learning that occurs a period of time after training, and (c) behavior should be classified as transfer of training.

Phillips (1996) suggested the addition of a fifth potential criterion to the model: return on investment (ROI). ROI has been used extensively in other organizational functions and has made its way into training and development. ROI, the number of dollars returned to the organization per dollar of training investment, is calculated as (program benefits − program costs) / program costs (Phillips, 1996). The primary purpose of this type of evaluation is to determine whether the value of a training program exceeds its monetary costs. Researchers and practitioners are concerned about using ROI in the training field because (a) assigning a monetary value to subjective benefits data is difficult and (b) utility analysis/ROI should be used to decide between program alternatives, not to justify the use of a program after the fact (Alliger, Tannenbaum, & Bennett, 1996).
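The ROI formula is simple arithmetic, and a short sketch makes the interpretation concrete. The function and dollar figures below are our own illustration, not values from Phillips (1996):

```python
def training_roi(program_benefits: float, program_costs: float) -> float:
    """ROI per Phillips (1996): dollars returned to the organization
    per dollar of training investment."""
    if program_costs <= 0:
        raise ValueError("program costs must be positive")
    return (program_benefits - program_costs) / program_costs

# Hypothetical figures for illustration only: a program costing $50,000
# with estimated benefits of $125,000 returns $1.50 per dollar invested.
print(training_roi(125_000, 50_000))  # → 1.5
```

Note that an ROI of 0 means the program exactly paid for itself; the practitioner concerns quoted later in this column center on how defensible the benefits figure in the numerator really is.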

Empirical Evidence for Kirkpatrick?

The lack of completeness is not the only criticism of Kirkpatrick's model. The model also lacks empirical support for its hierarchical levels and causality. However, contemporary researchers have found some promising results. For example, Mathieu, Tannenbaum, and Salas (1992) concluded that reactions moderate the relationship between motivation and learning such that less motivated trainees may learn if they have positive reactions to the training and motivated trainees may not learn if they have negative reactions to the training. Alliger and Janak (1989) found that learning was moderately related to behavior and results, and that behavior and results were correlated. Therefore, reactions may impact learning, which affects (and theoretically precedes) behavior and results. It is essential to note, however, that these correlational results do not illustrate causality.
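The moderation pattern reported by Mathieu et al. (1992) is conventionally tested with a regression model that includes a motivation × reactions interaction term. The sketch below is our own illustration with hypothetical, noise-free data (the variable names and coefficient values are assumptions, not figures from the study); it shows how a nonzero interaction coefficient is the statistical signature of moderation:

```python
import numpy as np

# Hypothetical standardized scores for six trainees (illustration only).
motivation = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
reactions = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Learning generated from a known model so the fit recovers it exactly:
# learning = 0.5 + 0.3*motivation + 0.2*reactions + 0.4*interaction
learning = 0.5 + 0.3 * motivation + 0.2 * reactions + 0.4 * motivation * reactions

# Design matrix: intercept, main effects, and the interaction term.
X = np.column_stack([np.ones_like(motivation), motivation, reactions,
                     motivation * reactions])
coef, *_ = np.linalg.lstsq(X, learning, rcond=None)

# The interaction coefficient (0.4 here) means the motivation–learning
# slope depends on reactions, i.e., reactions moderate the relationship.
print(np.round(coef, 3))
```

In the hypothesized pattern, the slope relating motivation to learning is steeper for trainees with positive reactions, which is exactly what a positive interaction coefficient encodes.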

Of course, there is conflicting evidence. Alliger et al. (1997) demonstrated that the levels were not correlated. Furthermore, although participants had positive reactions to training and demonstrated learning, Campion and Campion (1987) reported no difference in behavior or results for the trained versus untrained group. Alliger et al. (1996) warned that Kirkpatrick's model should not be taken as the last word in training criteria. Furthermore, Kirkpatrick's assumption of causality was probably mistaken (Alliger & Janak, 1989).

Although more research is needed in training evaluation, researchers may be hindered by the limited evaluation criteria used by organizations. Ralphs and Stephan (1986) reported that 86% of organizations usually evaluate programs using the reactions criterion. Only 15% of organizations regularly utilize pre- and post-test learning measures (Ralphs & Stephan, 1986). Furthermore, only 10% of evaluations measure behavioral change (Tannenbaum & Yukl, 1992). These data, however, are now 8 to 14 years old, which prompted our own survey of current training evaluation practices.

The Practitioner Survey

We conducted semi-structured interviews via telephone/e-mail with nonrandomly selected training representatives from three organizations and four consulting firms: Aeroquip-Vickers, Andersen Worldwide, Davis-Besse Nuclear Station, Development Dimensions International (DDI) Center for Applied Behavioral Research, Lucent Technologies Microelectronics University, Personnel Decisions International (PDI), and an unnamed manufacturing company. Special appreciation goes to our participants: Paul Bernthal, Steve Callender, Brian Crewe, Mike Fillian, Scott Harkey, Dave LaHote, and two participants who chose to remain anonymous.

The representatives were asked, "What criteria/methods does your organization use to evaluate training programs?" The results are organized below according to five categories based on Kirkpatrick's model: (a) reactions criteria, (b) learning criteria, (c) behavior criteria, (d) results criteria, and (e) ROI. Results are summarized in the table and explained at length in the text. Note that the opinions of the participants do not necessarily reflect those of their organizations.

1. Reactions criteria

Written/e-mail/web-based surveys. Trainees rate the effectiveness of the program; the strengths/areas for improvement regarding the content, activities, instructors, and conditions; the clarity of the objectives; and the usefulness of the program. Surveys contain open-ended items. Assessment occurs immediately following the training program. The purpose of the reactions data is to increase the quality of the program. While the six companies report using paper/pencil surveys, Andersen and Lucent also report employing e-mail and web-based surveys, respectively. At Davis-Besse, the reactions of the trainers are also assessed.

Interviews and focus groups. In addition to surveys, Andersen evaluates reactions through interviews and focus groups, though on an infrequent basis.

2. Learning criteria

Written/web-based tests. Davis-Besse, DDI, Lucent, and the manufacturing company administer written tests at the end of each skill training or at the end of the training program to determine knowledge acquisition. Only the learning of hard/technical skills is typically assessed. Lucent reports using web-based tests also. Davis-Besse administers these knowledge tests at least three times per year to assess the maintenance of trained knowledge.

Work samples. Davis-Besse, DDI, and the manufacturing company use work samples to assess knowledge acquisition. The work sample is an observational assessment of either training performance or lab environment job performance.

Simulations. In an attempt to measure soft skills, Davis-Besse operates a plant simulator to determine the acquisition of hard/technical skills as well as interpersonal skills.

3. Behavior criteria

60- to 90-day follow-up reports. Aeroquip-Vickers, DDI, PDI, and the manufacturing company use 60- to 90-day follow-up reports to assess the application of training content to the job. Reports may include measures of knowledge/skill application, performance improvements, obstacles to transfer, and behavioral changes in job performance. Aeroquip-Vickers surveys a random 20% of trainees and their supervisors by telephone using a structured interview format. DDI and the manufacturing company distribute written surveys to trainees and their supervisors. PDI conducts a follow-up session during which the trainees reconvene to discuss the application of training. PDI has also placed participants in small groups to use their trained skills on real issues faced by the trainees on the job.

4. Results criteria

60- to 90-day follow-up reports. During the 60- to 90-day follow-up session, PDI has trainees report the impact of the training in terms of organization results.

Objective data. PDI has also made multiple assessments of objective organization performance data such as revenue, profits, and production before and after the training.

5. ROI

Return on investment figures. DDI reports ROI figures. Lucent attempts to track ROI.

Our Findings in Sum

In summary, given our limited sample size of seven, our conclusions are as follows. Nearly all companies assess trainee reactions via a written survey immediately following the program. Most companies measure the learning of only hard/technical skills through written tests and, to a lesser extent, work samples immediately following the program. Most companies use 60- to 90-day follow-up reports to assess knowledge/skill application, performance improvements, obstacles to transfer, and behavioral changes in job performance. Results criteria tend not to be utilized. ROI is calculated by few companies. Despite our small sample size, we are hopeful that, relative to past reports (Ralphs & Stephan, 1986; Tannenbaum & Yukl, 1992), more organizations now use learning and behavior criteria in their training evaluations.

Regarding methods for training evaluation, the organizations tend to report one-group (nonrandomly selected), posttest-only designs, with immediate and short-term follow-up evaluation via written surveys. As a general practice, companies make little use of electronic technology for training evaluation.

Practitioner Comments

In addition to the structured interview questions, the representatives commented on training evaluation in general. Their comments are as follows:

The definition of training effectiveness becomes the driver of the answer to whether training was effective. A positive change in pre- and post-test measures may look like training was effective, but if a person does not perform the changes on the job, the training was not effective.

Attributing training as the single factor for a positive business performance result is overreaching, especially when other changes are occurring in the organization.

[Training departments] recognize that they can only very rarely prove that training has a cause-effect relationship with specific outcomes. On top of that, most can only control reactions or learning. Although asked to measure training effectiveness, they are often not capable of directly influencing those outcomes.

Some say [evaluation] is not worth trying especially with soft skills training because the time to develop the method and criteria is not value-added time. How can you quantify effectiveness and ensure that it's a reliable measure?

Most clients see a need to both "prove" and "improve" the value of training. They need to demonstrate that their efforts make a difference for the organization's bottom line. This means collecting data for strictly marketing purposes. All of an organization's systems must demonstrate value if they hope to receive funding. On the other hand, most training departments really want to improve their training implementations and facilitate transfer.

The main problem is that by adopting the [Kirkpatrick] approach, [clients] may be answering questions no one has asked and may be missing critical issues expressed by their internal customers. The best approach is to start with the customer of the evaluation.

It's too difficult to apply lab methods/criteria in the real world with uncontrolled and changing conditions.

We use a model to determine if training is necessary. If you show before training that it is appropriate and necessary, then you don't need ROI. Companies that believe in training seldom go through the ROI procedures. With ROI, you deal in monopoly money anyway—it's all subjective.

It becomes a credibility factor when you try to assert ROI because there are more unknown than known variables.

If there is a very strong emphasis on proving bottom line impact or ROI, the training department has not done a very good job of gaining buy-in from internal customers. If the training fits with the organizational strategy, is supported by management, and fits with the job needs, then why shouldn't it make a difference?

Where Do We Go From Here?

What is in store for the future of training evaluation? Based on past research and current practices, we assert that future training evaluation research will address two points: the impact of learning technologies and the evaluation of training effectiveness.

1. The impact of learning technologies. Learning technologies are best described as electronic technology used to deliver information and facilitate the development of knowledge and skills (Bassi, Cheney, & Van Buren, 1997). Advancements in presentational methods (e.g., virtual reality, multimedia, interactive TV, and computer-based training) and distributive methods (e.g., e-mail, the world wide web, intranets, CD-ROMs, and satellite TV) offer promise for facilitating evaluation. Currently, in training programs, 50% of organizations use computer-based training via CD-ROM, 31% use the internet/world wide web, 21% use the intranet/organization's internal computer network, and 20% use satellite/broadcast TV ("Industry Report," 1998). Although only 10% of training was delivered by learning technologies in 1997, that figure is expected to triple by the year 2000 (Bassi et al., 1997). The distributive advancements further organizations' ability to offer training evaluation via personal computers, allowing trainees to retrieve and complete evaluation materials when and where they are ready. In addition, current intranet-based training programs allow for the administration, analysis, feedback, and storage of reactions and learning criteria during the training program (Bassi et al., 1997). Researchers should assess the effectiveness and utility/ROI of learning technologies in training and evaluation versus traditional instructor-led training and written evaluation.

2. The evaluation of training effectiveness. Trainers cannot reasonably be held accountable for training effectiveness. Although they may be able to influence reactions and learning criteria, they have virtually no control over the behavior and results criteria that determine the effectiveness of a program. A substantial body of research places the burden of training effectiveness on an organization's transfer climate/continuous learning culture and the personality of the trainee (Baldwin & Ford, 1988; Noe, 1986; Tesluk, Farr, Mathieu, & Vance, 1995; Tracey, Tannenbaum, & Kavanagh, 1995). Researchers should determine, confirm, and model the personal and workplace factors that inhibit or facilitate training effectiveness. Longitudinal investigations may be especially helpful.

We began with a discussion of Kirkpatrick's pervasive training evaluation model, including criticisms and empirical evidence. Through a small-scale practitioner survey, we determined that most companies may now be using the first three levels of Kirkpatrick's criteria to evaluate training programs. Finally, we proposed two directions for training evaluation research. We hope that this article will spark interest in the area and spur further research and development in training evaluation.


References

Alliger, G. M., & Janak, E. A. (1989). Kirkpatrick's levels of training criteria: Thirty years later. Personnel Psychology, 42, 331–342.

Alliger, G. M., Tannenbaum, S. I., & Bennett, W., Jr. (1996). A comparison and integration of three training evaluation approaches: Effectiveness, utility, and anticipatory evaluation of training. Interim technical report for period September 1993 to August 1995. Brooks Air Force Base, TX: Armstrong Laboratory, Air Force Materiel Command.

Alliger, G. M., Tannenbaum, S. I., Bennett, W., Jr., Traver, H., & Shotland, A. (1997). A meta-analysis of the relations among training criteria. Personnel Psychology, 50, 341–358.

Baldwin, T. T., & Ford, J. K. (1988). Transfer of training: A review and directions for future research. Personnel Psychology, 41, 63–105.

Bassi, L. J., Cheney, S., & Van Buren, M. (1997). Training industry trends 1997: An annual look at trends. Training and Development, 51, 46–59.

Campion, M. A., & Campion, J. E. (1987). Evaluation of an interviewee skills training program in a natural field experiment. Personnel Psychology, 40, 675–691.

Kirkpatrick, D. L. (1959, 1960). Techniques for evaluating training programs. Journal of the American Society of Training Directors, 13, 28–32.

Mathieu, J. E., Tannenbaum, S. I., & Salas, E. (1992). Influences of individual and situational characteristics on measures of training effectiveness. Academy of Management Journal, 35, 828–847.

Noe, R. A. (1986). Trainees' attributes and attitudes: Neglected influences on training effectiveness. Academy of Management Review, 11, 736–749.

Phillips, J. J. (1996). ROI: The search for best practices. Training and Development, 50, 42–47.

Ralphs, L. T., & Stephan, E. (1986). HRD in the Fortune 500. Training and Development Journal, 40, 69–76.

Tannenbaum, S. I., & Yukl, G. (1992). Training and development in work organizations. Annual Review of Psychology, 43, 399–441.

Tesluk, P. E., Farr, J. L., Mathieu, J. E., & Vance, R. J. (1995). Generalization of employee involvement training to the job setting: Individual and situational effects. Personnel Psychology, 48, 607–632.

Tracey, J. B., Tannenbaum, S. I., & Kavanagh, M. J. (1995). Applying trained skills on the job: The importance of the work environment. Journal of Applied Psychology, 80, 239–252.

Training Magazine's industry report 1998: A snapshot of employer-sponsored training in the United States. (1998). Training, 35, 43–76.

Warr, P., & Bunce, D. (1995). Trainee characteristics and the outcomes of open learning. Personnel Psychology, 48, 347–375.

April 1999