Informed Decisions: Research-Based Practice Notes
Steven G. Rogelberg
Bowling Green State University
Given the incredible amount of time and money that are allocated to
training activities each year, it is not surprising that a pervasive interest in training
evaluation exists. To that end, I have asked Tanya Andrews and Brian Crewe to
author this quarter's Informed Decisions column. Tanya and Brian effectively discuss the
current state of training evaluation. Their interviews with practitioners across a number
of organizations provide nice insights into how the challenges of training evaluation are
being met. If you have any comments/questions concerning this particular column you can
contact Tanya (email@example.com). If you
have any ideas for future columns or would like to propose authoring a column, please
contact me at firstname.lastname@example.org.
Examining Training Evaluation: Reactions,
Lessons Learned, Current Practices, and Results
Tanya L. Andrews and Brian D. Crewe
Bowling Green State University
Despite its rich history, the area of training and development has
recently been the subject of relatively few empirical publications. We found this
especially to be true in the research on training evaluation. Kirkpatrick's (1959, 1960)
four levels of evaluation criteria (i.e., reactions, learning, behavior, and results) have
dominated the training literature. No other model has emerged as a stronger contender for
evaluating training programs. Kirkpatrick's (1959, 1960) model of training evaluation has
been the most popular, most pervasive, and most cited evaluation criteria since its
inception. It is probably cited in every I-O, OD, and OB introductory textbook. We asked
ourselves whether this dominance reflects the perceived impossibility of designing a
single evaluation procedure for the great diversity of training programs (in terms of
content, process, purpose, and generalizability). Despite this concern, the
Kirkpatrick criteria can be applied to a wide variety of training programs. So, is
Kirkpatrick's model the model for training evaluation? In search of an
answer, we consulted the literature and practitioners in the field to determine the
current state of training evaluation. We will present critiques and empirical evidence
concerning Kirkpatrick's model, results of a small-scale practitioner survey, and future
research directions for training evaluation.
Kirkpatrick's (In)Fallible Model of Training Evaluation
Notwithstanding the popularity of Kirkpatrick's model, several authors
have commented on the lack of completeness of the model. For example, based on a
demonstrated negative relationship between perceptions of training difficulty and
subsequent measures of training effectiveness, Warr and Bunce (1995) asserted that
perceptions of training difficulty should be a sub-level of Kirkpatrick's reactions
criterion. Alliger, Tannenbaum, Bennett, Traver, and Shotland (1997) went further and
proposed an augmented model. Specifically, they suggested that (a) reactions should be
assessed in terms of both affect and utility/usefulness, (b) learning should be subdivided into
learning assessed immediately after training and learning retained some time
after training, and (c) behavior should be reconceptualized as transfer of training.
Phillips (1996) suggested the addition of a fifth potential criterion
to the model: return on investment (ROI). ROI has been used extensively in other
organizational functions and has made its way into training and development. ROI, the
number of dollars returned to the organization per dollar of training investment, is
calculated as (program benefits - program costs) / program costs (Phillips, 1996). The
primary purpose of this type of evaluation is to determine whether the value of a training
program exceeds its monetary costs. Researchers and practitioners are concerned about
using ROI in the training field because (a) assigning a monetary value to subjective
benefits data is difficult and (b) utility analysis/ROI should be used to decide between
program alternatives, not to justify the use of a program after the fact (Alliger,
Tannenbaum, & Bennett, 1996).
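To make the arithmetic concrete, here is a minimal sketch of the ROI computation in Python. The benefit and cost figures are entirely hypothetical; as the critics note, estimating the dollar value of benefits is the subjective part in practice.

    # Minimal sketch of Phillips's (1996) ROI formula.
    # All dollar figures are hypothetical, for illustration only.

    def roi(program_benefits: float, program_costs: float) -> float:
        """Dollars returned per dollar of training investment."""
        return (program_benefits - program_costs) / program_costs

    costs = 50_000.0     # hypothetical: design, delivery, and trainee time
    benefits = 80_000.0  # hypothetical: estimated dollar value of performance gains

    print(f"ROI = {roi(benefits, costs):.2f}")  # 0.60, i.e., $0.60 gained per $1 invested

An ROI of 0.60 means the program returned its costs plus 60 cents on the dollar; a value below zero would mean the estimated benefits did not cover the costs.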
Empirical Evidence for Kirkpatrick?
The lack of completeness is not the only criticism of Kirkpatrick's
model. The model also lacks empirical support for its hierarchical levels and causality.
However, contemporary researchers have found some promising results. For example, Mathieu,
Tannenbaum, and Salas (1992) concluded that reactions moderate the relationship between
motivation and learning such that less motivated trainees may learn if they have positive
reactions to the training and motivated trainees may not learn if they have negative
reactions to the training. Alliger and Janak (1989) found that learning was moderately
related to behavior and results, and that behavior and results were correlated. Therefore,
reactions may impact learning, which affects (and theoretically precedes) behavior and
results. It is essential to note, however, that these correlational results do not establish the causal ordering that the model implies.
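For readers who wish to probe such a moderation effect in their own data, the following sketch fits the interaction model implied by Mathieu et al. (1992) to simulated scores; the data, effect sizes, and variable names are our own illustrative assumptions, not theirs.

    # Hedged sketch: testing whether reactions moderate the motivation-learning
    # relationship, in the spirit of Mathieu et al. (1992). Data are simulated.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    motivation = rng.normal(size=n)
    reactions = rng.normal(size=n)
    # Simulate learning scores with a motivation x reactions interaction built in.
    learning = (0.2 * motivation + 0.3 * reactions
                + 0.4 * motivation * reactions + rng.normal(size=n))

    # Design matrix: intercept, both main effects, and the interaction term.
    X = np.column_stack([np.ones(n), motivation, reactions,
                         motivation * reactions])
    coefs, *_ = np.linalg.lstsq(X, learning, rcond=None)
    print(f"interaction coefficient: {coefs[3]:.2f}")

A reliably nonzero interaction coefficient is the statistical signature of moderation: the slope relating motivation to learning changes with the level of reactions.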
Of course, there is conflicting evidence. Alliger et al. (1997)
demonstrated that the levels were not correlated. Furthermore, although participants had
positive reactions to training and demonstrated learning, Campion and Campion (1987)
reported no difference in behavior or results for the trained versus untrained group.
Alliger et al. (1996) warned that Kirkpatrick's model should not be taken as the last word
in training criteria. Furthermore, Kirkpatrick's assumption of causality was probably
mistaken (Alliger & Janak, 1989).
Although more research is needed in training evaluation, researchers
may be hindered by the limited evaluation criteria used by organizations. Ralphs and
Stephan (1986) reported that 86% of organizations usually evaluate programs by using the
reactions criterion. Only 15% of organizations regularly utilize pre- and post-test
learning measures (Ralphs & Stephan, 1986). Furthermore, only 10% of evaluations
measure behavioral change (Tannenbaum & Yukl, 1992). These data, however, are now 8 to 14
years old, which prompted us to conduct our own survey of current training evaluation practices.
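As an illustration of what even a simple pre- and post-test learning measure involves, the sketch below runs a paired t test on simulated knowledge-test scores; the sample size, score scale, and gains are hypothetical assumptions of ours.

    # Hedged sketch of a pre/post learning evaluation on simulated test scores.
    # Every number here is hypothetical; a real evaluation would use trainee data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    pre = rng.normal(loc=60, scale=10, size=30)        # pre-training test scores
    post = pre + rng.normal(loc=8, scale=5, size=30)   # post-training test scores

    t, p = stats.ttest_rel(post, pre)  # paired t test: same trainees, two occasions
    print(f"mean gain = {np.mean(post - pre):.1f} points, t = {t:.2f}, p = {p:.4f}")

Because the same trainees are measured twice, the paired test is the appropriate comparison; without a control group, of course, even a significant gain cannot rule out non-training explanations.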
The Practitioner Survey
We conducted semi-structured interviews via telephone/e-mail with
non-randomly selected training representatives from three organizations and four
consulting firms: Aeroquip-Vickers, Andersen Worldwide, Davis-Besse Nuclear Station,
Developmental Dimensions International (DDI) Center for Applied Behavioral Research,
Lucent Technologies Microelectronics University, Personnel Decisions International (PDI),
and an unnamed manufacturing company. Special appreciation goes to our participants: Paul
Bernthal, Steve Callender, Brian Crewe, Mike Fillian, Scott Harkey, Dave
LaHote, and two participants who chose to remain anonymous.
The representatives were asked, "What criteria/methods does
your organization use to evaluate training programs?" The results are organized
below according to five categories based on Kirkpatrick's model: (a) the reactions criterion,
(b) the learning criterion, (c) the behavior criterion, (d) the results criterion, and (e) ROI.
Results are explained at length in the text below. Note that the
opinions of the participants do not necessarily reflect those of their organizations.
1. Reactions criteria
Written/e-mail/web-based surveys. Trainees rate the
effectiveness of the program; the strengths/areas for improvement regarding the content,
activities, instructors, and conditions; the clarity of the objectives; and the usefulness
of the program. Surveys contain open-ended items. Assessment occurs immediately following
the training program. The purpose of the reactions data is to increase the quality of the
program. While six of the seven companies report using paper-and-pencil surveys, Andersen and Lucent
also report employing e-mail and web-based surveys, respectively. At Davis-Besse, the
reactions of the trainers are also assessed.
Interviews and focus groups. In addition to surveys, Andersen
evaluates reactions through interviews and focus groups, though on an infrequent basis.
2. Learning criteria
Written/web-based tests. Davis-Besse, DDI, Lucent, and the
manufacturing company administer written tests at the end of each skill training or at the
end of the training program to determine knowledge acquisition. Only the learning of
hard/technical skills is typically assessed. Lucent reports using web-based tests also.
Davis-Besse administers these knowledge tests at least three times per year to assess the
maintenance of trained knowledge.
Work samples. Davis-Besse, DDI, and the manufacturing company
use work samples to assess knowledge acquisition. The work sample is an observational
assessment of either training performance or lab environment job performance.
Simulations. In an effort to capture soft skills as well, Davis-Besse
operates a plant simulator to assess the acquisition of interpersonal as well as
hard/technical skills.
3. Behavior criteria
60- to 90-day follow-up reports. Aeroquip-Vickers, DDI, PDI,
and the manufacturing company use 60- to 90-day follow-up reports to assess the
application of training content to the job. Reports may include measures of
knowledge/skill application, performance improvements, obstacles to transfer, and
behavioral changes in job performance. Aeroquip-Vickers surveys a random 20% of trainees
and their supervisors via the telephone with a structured interview format. DDI and the
manufacturing company distribute written surveys to trainees and their supervisors. PDI
conducts a follow-up session during which the trainees reconvene for discussion regarding
the application of training. PDI has also placed participants in small groups to use
their trained skills on real issues faced by the trainees on the job.
4. Results criteria
60- to 90-day follow-up reports. During the 60- to 90-day
follow-up session, PDI has trainees report the impact of the training in terms of results for the organization.
Objective data. PDI has also made multiple assessments of
objective organization performance data such as revenue, profits, and production before
and after the training.
Return on investment figures. DDI reports ROI figures. Lucent
attempts to track ROI.
Our Findings in Sum
In summary, given our limited sample size of seven, our conclusions are
as follows. Nearly all companies assess trainee reactions via a written survey
immediately following the program. Most companies measure the learning of only
hard/technical skills through written tests and, to a lesser extent, work samples
immediately following the program. Most companies use 60- to 90-day follow-up reports to
assess knowledge/skill application, performance improvements, obstacles to transfer, and behavioral
changes in job performance. Results criteria tend not to be utilized. ROI
is calculated by few companies. Despite our small sample size, we are hopeful that,
relative to past reports (Ralphs & Stephan, 1986; Tannenbaum & Yukl, 1992), more
organizations now use learning and behavior criteria in their training evaluations.
Regarding the methods for training evaluation, the organizations tend
to report a one-group (without random selection), posttest-only design, with immediate
and short-term follow-up evaluation via written surveys. As a general practice, companies
make little use of electronic technology for training evaluation.
In addition to the structured interview questions, the representatives
commented on training evaluation in general. Their comments are as follows:
The definition of training effectiveness becomes the driver of the
answer to whether training was effective. A positive change in pre- and post-test measures
may look like training was effective, but if a person does not perform the changes on the
job, the training was not effective.
Attributing a positive business performance result to training as the single factor
is overreaching, especially when other changes are occurring in the organization.
[Training departments] recognize that they can only very rarely prove
that training has a cause-effect relationship with specific outcomes. On top of that, most
can only control reactions or learning. Although asked to measure training effectiveness,
they are often not capable of directly influencing those outcomes.
Some say [evaluation] is not worth trying especially with soft skills
training because the time to develop the method and criteria is not value-added time. How
can you quantify effectiveness and ensure that it's a reliable measure?
Most clients see a need to both "prove" and
"improve" the value of training. They need to demonstrate that their efforts
make a difference for the organization's bottom line. This means collecting data
strictly for marketing purposes. All of an organization's systems must demonstrate value if they
hope to receive funding. On the other hand, most training departments really want to
improve their training implementations and facilitate transfer.
The main problem is that by adopting the [Kirkpatrick] approach,
[clients] may be answering questions no one has asked and may be missing critical issues
expressed by their internal customers. The best approach is to start with the customer of the training and the questions that customer is actually asking.
It's too difficult to apply lab methods/criteria in the real world with
uncontrolled and changing conditions.
We use a model to determine if training is necessary. If you show
before training that it is appropriate and necessary, then you don't need ROI. Companies
that believe in training seldom go through the ROI procedures. With ROI, you deal in
Monopoly money anyway; it's all subjective.
It becomes a credibility factor when you try to assert ROI because
there are more unknown than known variables.
If there is a very strong emphasis on proving bottom line impact or
ROI, the training department has not done a very good job of gaining buy-in from internal
customers. If the training fits with the organizational strategy, is supported by
management, and fits with the job needs, then why shouldn't it make a difference?
Where Do We Go From Here?
What is in store for the future of training evaluation? Based on past
research and current practices, we assert that future training evaluation research will
address two points: the impact of learning technologies and the evaluation of training effectiveness.
1. The impact of learning technologies. Learning technologies
are best described as electronic technology used to deliver information and facilitate the
development of knowledge and skills (Bassi, Cheney, & Van Buren, 1997). Advancements
in presentational methods (e.g., virtual reality, multimedia, interactive TV, and
computer-based training) and distributive methods (e.g., e-mail, the world wide web, intranets,
CD-ROMs, and satellite TV) offer promise for facilitating evaluation. Currently, in
training programs, 50% of organizations use computer-based training via CD-ROM, 31% use
the internet/world wide web, 21% use the intranet/organization's internal computer
network, and 20% use satellite/broadcast TV ("Industry Report," 1998). Although
only 10% of training was delivered by learning technologies in 1997, that figure is
expected to triple by the year 2000 (Bassi et al., 1997). These distributive advancements
extend an organization's ability to offer training evaluation via personal computers,
allowing trainees to retrieve and complete evaluation materials when and where they are
ready. In addition, current intranet-based training programs allow for the administration,
analysis, feedback, and storage of reactions and learning criteria during the training
program (Bassi et al., 1997). Researchers should assess the effectiveness and utility/ROI
of learning technologies in training and evaluation versus traditional instructor-led
training and written evaluation.
2. The evaluation of training effectiveness. Trainers cannot reasonably be held
accountable for training effectiveness. Although they may be able to influence the
reactions and learning criteria, they have virtually no control over the behavior and
results criteria that determine the effectiveness of a program. A plethora of research
places the burden of training effectiveness on an organization's transfer
climate/continuous learning culture and the personality of the trainee (Baldwin &
Ford, 1988; Noe, 1986; Tesluk, Farr, Mathieu, & Vance, 1995; Tracey, Tannenbaum, &
Kavanagh, 1995). Researchers should determine, confirm, and model the personal and
workplace factors that inhibit or facilitate training effectiveness. Longitudinal
investigations may be especially helpful.
We began with a discussion of Kirkpatrick's pervasive training
evaluation model, including criticisms and empirical evidence. Through a small-scale
practitioner survey, we found that most companies now appear to use the first three
levels of Kirkpatrick's criteria to evaluate training programs. Finally, we proposed two
directions for training evaluation research. We hope that this article will spark interest
in the area and spur further research and development in training evaluation.
References
Alliger, G. M., & Janak, E. A. (1989). Kirkpatrick's levels of training criteria:
Thirty years later. Personnel Psychology, 42, 331-342.
Alliger, G. M., Tannenbaum, S. I., & Bennett, W., Jr. (1996). A comparison and
integration of three training evaluation approaches: Effectiveness, utility, and
anticipatory evaluation of training. Interim technical report for period September 1993
to August 1995. Brooks Air Force Base, TX: Armstrong Laboratory, Air Force Materiel Command.
Alliger, G. M., Tannenbaum, S. I., Bennett, W., Jr., Traver, H., & Shotland, A.
(1997). A meta-analysis of the relations among training criteria. Personnel Psychology,
50, 341-358.
Baldwin, T. T., & Ford, J. K. (1988). Transfer of training: A review and directions
for future research. Personnel Psychology, 41, 63-105.
Bassi, L. J., Cheney, S., & Van Buren, M. (1997). Training industry trends 1997: An
annual look at trends. Training and Development, 51, 46-59.
Campion, M. A., & Campion, J. E. (1987). Evaluation of an interviewee skills
training program in a natural field experiment. Personnel Psychology, 40, 675-691.
Kirkpatrick, D. L. (1959, 1960). Techniques for evaluating training programs. Journal
of the American Society of Training Directors, 13, 28-32.
Mathieu, J. E., Tannenbaum, S. I., & Salas, E. (1992). Influences of individual and
situational characteristics on measures of training effectiveness. Academy of
Management Journal, 35, 828-847.
Noe, R. A. (1986). Trainees' attributes and attitudes: Neglected influences on training
effectiveness. Academy of Management Review, 11, 736-749.
Phillips, J. J. (1996). ROI: The search for best practices. Training and
Development, 50, 42-47.
Ralphs, L. T., & Stephan, E. (1986). HRD in the Fortune 500. Training and
Development Journal, 40, 69-76.
Tannenbaum, S. I., & Yukl, G. (1992). Training and development in work
organizations. Annual Review of Psychology, 43, 399-441.
Tesluk, P. E., Farr, J. L., Mathieu, J. E., & Vance, R. J. (1995). Generalization
of employee involvement training to the job setting: Individual and situational effects.
Personnel Psychology, 48, 607-632.
Tracey, J. B., Tannenbaum, S. I., & Kavanagh, M. J. (1995). Applying trained skills on
the job: The importance of the work environment. Journal of Applied Psychology, 80, 239-252.
Training Magazine's industry report 1998: A snapshot of employer-sponsored training in
the United States. (1998). Training, 35, 43-76.
Warr, P., & Bunce, D. (1995). Trainee characteristics and the outcomes of open
learning. Personnel Psychology, 48, 347-375.