The Security of Employment Testing: Practices That Keep Pace With Evolving Organizational Demands and Technology Innovations
Tracy Kantrowitz and Sara Gutierrez
Industrial-organizational (I-O) psychologists face unique challenges when determining the security of tests used in employee selection processes. Consumers of employment testing often need tests to be available anytime, anywhere. In attempts to keep pace with changes in technology, the economy, and business strategy, I-O psychologists have responded to calls to construct testing programs in line with the evolving needs of organizations. Whereas employment testing once resembled educational and certification testing (i.e., proctored, offline group testing, a finite number of administrations), testing in contemporary organizations often takes the form of on-demand, unproctored, online testing with immediate score feedback.
Testing professionals have responded to the evolution of employment testing by conducting research to support the use of tests in the manner organizations desire to use them. Studies on equivalence, norms, validity, and reliability are routinely done, but far less attention has been paid to cheating detection or identifying score anomalies. The effects of a compromised employment test may be far reaching, as the validity for a compromised test comes into question.
In this paper, we (a) highlight employment test security challenges; (b) review best practices for mitigating security risks associated with contemporary uses of employment tests by building a program based on protection, investigation, and enforcement; and (c) discuss future trends in employment testing and how they intersect with test security.
Contemporary Uses of Employment Testing
Employment testing is increasingly characterized as unproctored Internet testing (UIT). The benefits of UIT are clear to organizations, with decreased cost being a primary driver. Companies no longer have to pay for test proctors or computer resources to oversee testing. Nonetheless, unproctored testing carries substantial risks that center around test security. Although we recognize the presence of risks, we believe it is more productive to focus on a research agenda that informs the value, appropriateness, and limitations of unproctored testing.
Research on the risks of unproctored testing has started to accumulate. UIT scores have been found to be relatively stable over time (Beaty et al., 2006). In addition, validation studies using UIT test data show validity for various performance metrics. In a meta-analysis (Beaty, Nye, Borneman, Kantrowitz, Drasgow, & Grauer, 2011) of accumulated validity data on the same noncognitive tests used in proctored and unproctored environments, results showed that the validity of tests used in unproctored environments was on par with that from proctored environments. The research literature to date is encouraging that the potential risks associated with UIT may not substantially affect tests’ psychometric properties.
Building a Test Security Program
UIT can come in many forms (ITC, 2006), and the forms present increasing levels of security concern. With supervised testing, there is a level of direct supervision over test-taking conditions. Administrators log in candidates and confirm the test has been properly administered. In controlled mode, a test is administered remotely and made available only to known test takers. Such tests require candidates to log in using a password and username provided to them and often operate on a one-time-only basis. Finally, open testing involves no human supervision of the testing session, with little-to-no registration required of the candidate. We believe that most employment testing is characterized as controlled (for instance, for those organizations that conduct confirmation/verification testing) or open, as many testing programs are conducted exclusively unproctored. This provides an important lens through which to consider test security programs that aspire to mitigate risks associated with unproctored testing.
Tests are increasingly accessible and the conditions under which candidates test are increasingly variable. The need for robust security processes has never been greater to maintain the integrity of professionally developed and validated assessments, testing companies’ intellectual capital, and the value derived by companies who use such assessments.
A forward-looking test security program should account for many factors, including (a) the reputational element associated with safeguarding test content and protecting the company brands of test producers and consumers, (b) psychometric considerations associated with building more cheat-resistant tests that capitalize on available technology, (c) enforcing test security breaches through legal means, and (d) having financial support to continuously (or frequently) create new test content to replace content that has been exposed.
These considerations feed into what the testing industry considers to be the pillars of a robust test security program (see, e.g., ATP Test Security Survey Report, 2012), which include programs focused on protection, investigation, and enforcement. The next section describes activities, methods, and recommendations in each area for increasing the security of contemporary employment testing systems.
Protection: Preventing Test Content From Compromise
Protection describes proactive efforts to prevent test content from being compromised. Efforts in this area can be characterized as designing more cheat-resistant assessments or designing processes to minimize the opportunity for cheating. Each is discussed in more detail below.
Building a More Secure Test
Tests vary in their level of security. An unproctored, fixed-form, cognitive ability test is far more susceptible to cheating than a variable length, computer adaptive assessment. Several factors associated with cheating susceptibility are discussed below.
Test Delivery Method
Methods such as computer adaptive testing (CAT) and linear-on-the-fly testing (LOFT) are based on advances in psychometric theory (namely, item response theory) and have been shown to have superior measurement precision (i.e., reliability) compared to traditional fixed-form counterparts based on classical test theory, which is a critical consideration when making decisions about candidates’ qualifications and competencies (Kantrowitz, Fetzer, & Dawson, 2011). Improved test security is yet another key benefit of these testing approaches, as large item banks support the use of these assessments. Although candidates may be presented with 20 test questions in a given testing session, an item bank of 300 or more questions may exist behind the scenes. In the case of CAT and LOFT, instances of a test question being “leaked” (while treated with the utmost concern) tend not to have a material impact on the integrity of tests because question banks consist of hundreds of questions. CAT and LOFT tests are clearly more secure compared to their fixed-form counterparts, in which candidates uniformly experience the same questions. Furthermore, within a CAT, item exposure can be controlled by setting parameters on the frequency with which items should be administered.
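The exposure-control idea described above can be sketched in a few lines. The sketch below is an illustrative simplification, not an operational algorithm such as Sympson-Hetter exposure control: it combines a hard cap on each item's administration rate with a "randomesque" pick from the most informative remaining items. All function and parameter names are hypothetical.

```python
import random

def select_item(items, exposure_counts, tests_delivered, max_rate=0.2, k=5):
    """Pick the next CAT item with simple exposure control (illustrative).

    items: list of (item_id, information) pairs, with information
        computed at the candidate's current ability estimate.
    exposure_counts: item_id -> number of prior administrations.
    Items already administered in more than max_rate of sessions are
    ineligible; among the rest, one of the k most informative items is
    chosen at random, so no single item dominates delivery.
    """
    eligible = [
        (iid, info) for iid, info in items
        if tests_delivered == 0
        or exposure_counts.get(iid, 0) / tests_delivered < max_rate
    ]
    if not eligible:  # fall back if the cap would exclude everything
        eligible = list(items)
    top_k = sorted(eligible, key=lambda x: x[1], reverse=True)[:k]
    return random.choice(top_k)[0]
```

With k = 1 the rule collapses to maximum-information selection among uncapped items; larger k trades a little measurement precision for lower item exposure.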
Questions with objectively right or wrong answers (typically, cognitive ability and knowledge tests) are traditionally viewed as more susceptible to cheating. Such questions can be practiced if preknowledge of questions is obtained before the live testing event. These questions are also more susceptible to cheating because a proxy test taker whose known (or suspected) ability level exceeds that of the candidate can be obtained to complete a test on a candidate’s behalf. In contrast, noncognitive assessments tend to not include an objectively right answer. It is relatively harder for a candidate to determine the “right” answer to a personality item, for instance.
Performance tests have objectively right answers but are more difficult to cheat on. In these tests, a candidate has to demonstrate his/her proficiency for a given skill. In a typing test, for example, the “right” answer involves keying information accurately and quickly. Unlike cognitive and knowledge tests, it is difficult to cheat on a test like this by obtaining the test information before a live testing session. If a candidate is exposed to a performance test in advance and attempts to learn the answers, this may be construed more as legitimate “practice” than “cheating.” Nonetheless, the potential for cheating exists by having a proxy with better skill in the particular area to be tested sit the test on a candidate’s behalf.
The presence of a test timer typically offsets some of the potential for cheating. Having limited time to find a right answer or incorporating speed into a test score makes it more challenging for examinees to cheat by using outside resources. In contrast, unlimited time to complete a test may open up the possibility that a candidate seeks outside resources (information, people) to assist in completing the test. Similarly, getting more credit for working faster (and generally, more accurately) means that using outside resources will likely slow a candidate down, resulting in a lower test score.
The security of a test’s response scale goes hand in hand with the security of the question type. This raises the notion of applicant faking on employment assessments. Faking in some forms (e.g., blatant extreme responding; Landers, Sackett, & Tuzinski, 2011) may be construed as cheating especially if an intervention (e.g., coaching) leads candidates to use a more extreme response pattern on personality tests.
Biodata is a unique form of noncognitive assessment that asks about past experiences as they may inform future behavior. Such experiences could include job-relevant experiences that bear on the role to which a candidate is applying (e.g., number of sales awards won in previous sales jobs), attitudes about organizations based on past work experiences, or simply facts about one’s history. In many cases, the response scale for biodata items tends to be noncontinuous as response options are developed in such a way to maximally distinguish people, even if this is not transparent to the candidate. Furthermore, the scoring of biodata often relies on empirical keying of the response options to the criterion it is intended to predict, which further obscures the most socially desirable response option.
Situational judgment tests are yet other examples of test types with a distinctive response scale. In a construct-oriented approach (Ployhart, 1999), response options represent different levels of the underlying construct being assessed. In a more criterion-centric approach, various viable response options are developed to represent plausible actions that could be taken, with one or more responses leading to a higher score. If scored empirically, the transparency and cheating potential may be reduced with this test type.
Instructions provided to candidates can be powerful tools in deterring cheating. They provide an opportunity to establish and communicate a clear assessment contract with the candidate. Instructions can inform test takers that cheating behavior is detectable, they can stipulate that a penalty or invalidation of scores may result as a consequence of cheating, they can describe both the detection and consequence of cheating, they can appeal to a candidate’s sense of right and wrong, or they can appeal to a test taker’s interest in being properly fit for a job in which they can succeed. Research and theory have suggested that instructions that involve statements about both cheating detection and cheating consequences are likely to be efficacious (Pace & Borman, 2006).
Even well-designed tests that attempt to deter cheating may be susceptible to security issues. In truly open testing conditions, candidates can work together to gain preknowledge of test questions and “practice” in live testing events to identify a response pattern that earns a passing score. In addition, the threat of proxy test taking exists with unproctored testing unless candidates’ identities are verified.
To mitigate these concerns, attention must be paid to the testing process. In this section, we discuss several process considerations that can enhance the security of an employment testing program.
If cheating is suspected or detected in a given organization, instructions can easily be modified to meet a particular company’s testing needs, the level of cheating risk it is willing to assume, and the consequences of cheating it may wish to communicate and/or enforce.
Using Technology to Enhance the Test Security Process
Several technology-based methods exist to increase an online test’s security by limiting its exposure. Having a single point of entry for candidates to access a test helps increase test security. A link sent by a recruiter via e-mail to a candidate is relatively more secure than a static link posted permanently to a job board. Similarly, utilizing single-use links that permit only one candidate to test can help make a test less vulnerable to compromise through widespread sharing of content. Other methods of technology-enhanced test security include test item randomization to mitigate the threat of candidates obtaining an “answer key” with a string of numbers/letters to recite when responding to questions.
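The single-use-link mechanism can be sketched simply: issue an unguessable token bound to one invitation, and invalidate it on first use. This is a minimal illustration, not a description of any vendor's implementation; the class, method names, and URL are all hypothetical.

```python
import secrets

class TokenStore:
    """Minimal sketch of issuing and redeeming single-use test links."""

    def __init__(self):
        self._tokens = {}  # token -> candidate_id, for unused tokens only

    def issue_link(self, candidate_id, base_url="https://example.test/assess"):
        # A long random token makes the link unguessable and ties it
        # to a single invitation.
        token = secrets.token_urlsafe(32)
        self._tokens[token] = candidate_id
        return f"{base_url}?t={token}"

    def redeem(self, token):
        # Pop the token so the link cannot be reused or shared after
        # its first use; returns None for unknown or spent tokens.
        return self._tokens.pop(token, None)

store = TokenStore()
link = store.issue_link("cand-001")
token = link.split("t=")[1]
store.redeem(token)  # first use returns "cand-001"; any later use returns None
```

A production system would also expire unused tokens and log redemption attempts, but the core security property is the same: a shared or reposted link is worthless after one session.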
Confirmation or verification testing is perhaps the single most effective process enhancement for increasing a testing program’s security. No other method authenticates the individual’s identity to ensure that the person completing a test is the same person applying to a job. Confirmation testing generally involves a two-stage testing process, consisting of an initial, unproctored test followed by a second on-site/proctored assessment in which the candidate’s identity can be verified. Confirmation testing can be implemented such that every candidate sits for a confirmation test, or it can be used randomly so that the incidence of cheating can be monitored and/or the simple threat of confirmation testing makes would-be cheaters think twice.
The mechanics of confirmation testing can work in several ways. Traditionally, scores on confirmation tests are compared to the unproctored test. If scores vary substantially, differ more than would be expected due to chance, and/or fall outside a particular confidence interval or standard error of measurement, they would be flagged for suspected cheating. Organizations then decide how to proceed, with options including administering a parallel full-length test again or disqualifying a candidate from further consideration.
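The chance-difference flag described above can be expressed with the classical standard error of the difference between two scores. The sketch below assumes, for simplicity, that the unproctored and proctored forms share the same standard deviation and reliability; the function name and cutoff are illustrative, not a prescribed operational rule.

```python
import math

def flag_score_change(uit_score, verify_score, sd, reliability, z_crit=1.96):
    """Flag a retest score drop larger than measurement error predicts.

    Under classical test theory, with equal SDs and reliabilities for
    the two forms, the standard error of the difference between two
    scores is SD * sqrt(2 * (1 - reliability)). A verified (proctored)
    score falling far below the unproctored score is the pattern of
    interest for suspected cheating.
    """
    se_diff = sd * math.sqrt(2 * (1 - reliability))
    z = (uit_score - verify_score) / se_diff
    return z > z_crit  # True -> the drop exceeds chance; investigate

# With SD = 15 and reliability = .90, SEdiff is about 6.7 points, so a
# 25-point drop (85 -> 60) is flagged while a 5-point drop (85 -> 80) is not.
flag_score_change(85, 60, sd=15, reliability=0.90)  # -> True
flag_score_change(85, 80, sd=15, reliability=0.90)  # -> False
```

A flag from a screen like this is a trigger for the follow-up options the text describes (parallel retest, disqualification review), not proof of cheating on its own.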
Other methods of confirmation testing mitigate the ambiguity that can result from more traditional methods of confirmation testing. Fetzer and Grelle (2010) describe an approach that uses a candidate’s final score from an unproctored test as the starting point of the confirmation test. Honest candidates who sit for the confirmation test will quickly converge on a reliable score and the test will terminate. Dishonest candidates may have a longer test session as more items are administered to arrive at a reliable and accurate final score. For all test takers, the final score on the proctored test is considered the score of record that should be used in employment decision making.
The choice to implement confirmation testing often boils down to the impact of making hiring decisions on the basis of inaccurate information (Fetzer & Grelle, 2010). In other words, organizations must understand the consequences of hiring someone who may not possess the level of skill or ability that is represented by his/her UIT score. Considerations such as job complexity, applicant flow, and level of risk to the organization should be weighed.
Proctored Only Testing
Testing is largely moving to unproctored environments, but there may be legitimate needs to keep tests better protected through proctored only testing. Similar to the risk/benefit decision that must be weighed with confirmation testing, companies should decide whether the cost of making hiring decisions on the basis of inaccurate information is substantial enough to restrict testing to proctored only environments. This decision can be based on several factors, including: (a) knowledge or suspicion that cheating is occurring, (b) the extent to which a test can be cheated, (c) high desirability of a job, which may lead to more cheating due to greater competition, (d) testing in locations more susceptible to cheating through information sharing based on cultural norms, and (e) a job whose responsibilities, role in the company’s essential functioning, and/or potential for damage by the individual in the role necessitate highly accurate information.
Proctored testing should be done correctly to be effective. As has been noted in previous papers (Beaty et al., 2009; Drasgow, Nye, Guo, & Tay, 2009), the notion of proctored testing can be more variable than the name implies. Contemporary uses of proctored testing can involve anything from testing centers with staff that monitor and note testing behavior to administering a test down the hall from a company receptionist who is proctoring a test while also greeting visitors and responding to calls and e-mails.
Investigation: Tracking Down Security Breaches
Investigation refers to reactive activities to identify potential breaches of test content. Activities in this area include methods of identifying test breaches and conducting analyses on test data to identify anomalies or trends indicative of compromise.
Identifying potential breaches to the security of an employment testing program is increasingly difficult in the age of online, on-demand, unsupervised testing. Breaches can more readily be identified under more “traditional” testing models, where qualitative and quantitative bits of information can be obtained to triangulate on cheating attempts. Seating charts, response patterns, and testing locations can all be pieces of evidence in an investigation to determine if cheating occurred during a single test event. The results of such an investigation can include withholding or invalidating scores.
The challenge is far greater with online testing. Information about testing location is virtually unknown as is candidate identity, what illicit information they might be using on a test, and if candidates had preknowledge of the test questions.
An established web patrol program is a powerful method for mitigating threats associated with unsupervised assessments. Especially in the social media era, candidates may seek answers to test questions, ask for information about companies’ hiring processes and assessments, post test questions for personal gain and/or profit, or request proxy test-taking services. These threats to test security can be mitigated through a robust web patrol program whereby test details are queried and searched, and “hits” are investigated to determine any intellectual property infringement and/or improper activity.
Cheating by using online tools has become more sophisticated than simply posting or finding screen shots of test questions. Social media has enabled active discussions by candidates looking to cheat and/or compromise a test. Bids are placed on auction sites for proxy test taking, fee-based test-taking services have sprung up on the Internet, and documents containing test questions are shared on social networking/file sharing sites. Testing companies are well served by having “moles” patrol social media outlets to investigate the incidence of cheating, determine damages associated with cheating, and identify vulnerabilities in testing processes, with the ultimate goals of taking legal action against individuals caught cheating and of deterring cheating by making it well known that such outlets are patrolled for such purposes.
Although information monitoring is critical to containing test security breaches, monitoring the voluminous amount of information on the web is a daunting task. Constructing a data warehousing system with automated analyses that can be leveraged to identify data anomalies is a far more manageable method of detecting test security breaches. Data forensics is a large and sweeping term and can be defined differently based on need. As noted by Beaty et al. (2009), methods for identifying suspicious data could include examining repeated response patterns, changes in means and pass rates, changes in response latencies, and individual score decreases in proctored follow-up testing.
A well-constructed data forensics program may not only help to identify score anomalies but may also inform testing programs about the best way to handle such issues based on their severity (e.g., discontinue the test altogether, rotate items). Analyses will likely be exploratory in nature, conducted with the intent of learning about potential test exposure, and also used to substantiate specific concerns or testing irregularities with the intent of combining findings with other information (e.g., web patrol reports). With sufficient partnership between psychometricians and database specialists, it is possible to design a program that automates as many of the forensic analyses as possible. Additionally, testing programs may benefit from prioritizing “high risk” content (which can be defined in many ways, such as development costs, usage/exposure, vulnerability based on how tests are implemented, and/or test properties) if resources for conducting data forensics analyses are minimal.
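One of the automated analyses described above, screening for upward drift in weekly mean scores, can be sketched as follows. This is a deliberately minimal illustration: each week's mean is compared with the mean and SD of all earlier weeks, and a large positive z-score is one pattern consistent with content exposure. An operational forensics program would track many indicators (pass rates, response latencies, repeated response patterns) in the same spirit; the function name and threshold are hypothetical.

```python
from statistics import mean, stdev

def flag_anomalous_weeks(weekly_means, z_crit=2.0):
    """Return indices of weeks whose mean score jumps above baseline.

    weekly_means: chronological list of weekly mean test scores.
    Each week (from the third onward, so the baseline has at least two
    points) is compared against the mean and SD of all prior weeks;
    weeks exceeding z_crit standard deviations above baseline are
    flagged for investigation.
    """
    flags = []
    for i in range(2, len(weekly_means)):
        baseline = weekly_means[:i]
        sd = stdev(baseline)
        if sd == 0:  # a flat baseline gives no scale for a z-score
            continue
        z = (weekly_means[i] - mean(baseline)) / sd
        if z > z_crit:
            flags.append(i)
    return flags
```

For example, a score series of 50, 51, 49, 50 followed by a week averaging 58 would flag only the final week. As with confirmation-test flags, a hit here justifies a closer look (e.g., cross-referencing web patrol reports), not an automatic conclusion of compromise.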
When designing a data forensics program, it is important to keep in mind the types of questions that the results of the forensic analyses will be used to answer. These questions may differ depending on the testing industry and purpose of the assessment content. For example, in the personnel selection context where testing is often given online, on-demand, with immediate reporting, the end goal may not be to necessarily catch individual cheaters. It may be that changes in the validity of test scores due to cheating or compromised content are the main cause for concern.
Enforcement: Taking Action on Cheating
Enforcement involves the steps and methods taken to “correct” a situation involving cheating or compromise once it is discovered. Enforcement is also critical to containing any content leak and keeping it from spreading further. The sooner compromised content is removed, the less chance it has of being viewed, copied, or shared with others. Establishing robust processes related to enforcement involves close partnership with intellectual property attorneys and/or relying on established intellectual property law (e.g., the Digital Millennium Copyright Act; DMCA).
Efforts in this area largely involve sending cease and desist and/or DMCA notices to offending parties. The vast majority of content security incidents are quickly and easily handled via notices submitted to website administrators using the DMCA guidelines. In isolated cases, escalated takedown notices from attorneys may be needed to have the content removed.
For issues not involving copyright infringement, specific cases need to be discussed with legal counsel to determine the extent of damages done, the point at which action would be taken, and what information is required to take action.
Looking Ahead and Planning for the Future
What does the future hold for test security? As testing takes on new and different forms and technology advances well beyond the confines of a personal computer, is it reasonable to think that test publishers can maintain and monitor the security of tests?
New testing platforms will bring new frontiers of test security. We see mobile technology infiltrating everything we do and we are on the cusp of using mobile technology to deliver tests. Organizations increasingly seek the capability to deliver tests via mobile devices to meet growing interest from prospective candidates in completing online application processes via mobile devices and tablets. Likewise, companies need to ensure that test scores obtained across devices (i.e., PC, mobile device) are equivalent.
Increasing the accessibility of preemployment tests brings a number of benefits to organizations and candidates but presents a number of potential challenges, including increased exposure of tests, more diversity in test-taking environments, and potential changes to the quality of the test-taking experience. Programmatic research can help organizations understand the risks these challenges pose to preemployment testing. We view testing via mobile devices as an extension of research on UIT, which provides a promising outlook regarding the feasibility of mobile testing, but critical research is required to establish its appropriateness.
As noted previously, attempts to gain preknowledge of test questions have evolved beyond simple keyword queries on Internet search tools. Candidates are starting to interact on the web to figure out ways of colluding and collaborating to earn passing test scores. We have seen instances of proxy test-taking businesses, and we anticipate this kind of service to rise in demand and prevalence.
New Technology to Prevent Cheating
As technology presents new avenues for candidates to cheat, it also presents new avenues for testing providers to mitigate cheating. New techniques like remote proctoring can be used to increase security associated with UIT. Remote proctoring is an increasingly easy way to virtually monitor test sessions of examinees completing tests in remote environments. Webcams built into examinees’ computers and/or provided by an organization and/or remote proctoring company can be used to monitor testing behavior (e.g., surveillance for the use of any illicit test materials) and/or to verify picture identification (in which a candidate has to present a valid form of photo identification via webcam). Webcams can be shipped to candidates inexpensively, so this represents a viable method of proctoring to increase a program’s security.
Novel Statistics for Detecting Cheating
New statistical techniques are being developed and implemented to detect the presence of cheating in test data. Parametric and nonparametric tests are being considered, as is IRT, for detecting cheating. At the same time, classic indicators of cheating (e.g., the K index of identical incorrect responding; Holland, 1996) are being considered or reconsidered for online test data.
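The logic behind identical-incorrect-response indicators can be sketched with a simplified pairwise screen. To be clear, this is an illustration in the spirit of the K index, not Holland's (1996) formulation, which conditions the chance-match rate on examinee and item properties rather than using a single fixed rate. Among the items one examinee (the putative "copier") got wrong, the sketch counts exact matches with another examinee's wrong answers and computes the binomial probability of at least that much agreement by chance.

```python
from math import comb

def wrong_match_pvalue(source, copier, key, p_chance=0.25):
    """Screen a pair of examinees for excess identical wrong answers.

    source, copier, key: equal-length strings of response letters,
    with `key` holding the correct answers. p_chance is an assumed
    fixed probability that two independent wrong answers coincide
    (a simplification; real indices estimate this per item/examinee).
    Returns P(X >= observed matches) for X ~ Binomial(n, p_chance),
    where n is the number of items the copier answered incorrectly.
    """
    wrong_items = [i for i, (c, k) in enumerate(zip(copier, key)) if c != k]
    matches = sum(1 for i in wrong_items
                  if source[i] != key[i] and source[i] == copier[i])
    n = len(wrong_items)
    # Upper-tail binomial probability: small values suggest agreement
    # beyond chance and warrant further investigation.
    return sum(comb(n, x) * p_chance ** x * (1 - p_chance) ** (n - x)
               for x in range(matches, n + 1))
```

As with the other screens described in this paper, a small p-value from such an index is grounds for investigation alongside other evidence (testing times, locations, web patrol findings), never a standalone verdict.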
In addition, new techniques including embedded verification exist to determine the consistency of scores from new/protected item banks with items that have been exposed. This involves “poisoning” an item pool by releasing items with the express purpose of comparing scores on these items to those from a protected item bank. This can be a valuable way of determining preknowledge of test questions, ways/locations in which candidates may obtain questions in advance, and detecting the usage of braindump sites.
References
Association of Test Publishers. (2012). ATP test security survey report. Washington, DC: Author.
Beaty, J. C., Dawson, C. R., Fallaw, S. S., & Kantrowitz, T. M. (2009). Recovering the scientist–practitioner model: How I-Os should respond to UIT. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 58–63.
Beaty, J. C., Grauer, E., & Davis, J. (2006, May). Unproctored Internet testing: Important questions and empirical answers. In J. C. Beaty (Chair), Unproctored Internet testing: What do the data say? Practitioner forum conducted at the 21st Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
Beaty, J. C., Nye, C., Borneman, M., Kantrowitz, T. M., Drasgow, F., & Grauer, E. (2011). Proctored versus unproctored Internet tests: Are unproctored tests as predictive of job performance? International Journal of Selection and Assessment, 19, 1–10.
Drasgow, F., Nye, C. D., Guo, J., & Tay, L. (2009). Cheating on proctored tests: The other side of the unproctored debate. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 46–48.
Fetzer, M., & Grelle, D. (2010). PreVisor ConVerge: The best practice for unproctored/unsupervised Internet testing. White paper. Alpharetta, GA: SHL.
Holland, P. W. (1996). Assessing unusual agreement between the incorrect answers of two examinees using the K-index: Statistical theory and empirical support (ETS Research Report No. 96-7). Princeton, NJ: Educational Testing Service.
ITC. (2006). International guidelines on computer-based and Internet-delivered testing. International Journal of Testing, 6, 143–171.
Kantrowitz, T. M., Fetzer, M. S., & Dawson, C. R. (2011). Computer adaptive testing (CAT): A faster, smarter, and more secure approach to pre-employment testing. Journal of Business and Psychology, 26, 227–232.
Landers, R. N., Sackett, P. R., & Tuzinski, K. A. (2011). Retesting after initial failure, coaching rumors, and warnings against faking in online personality measures for selection. Journal of Applied Psychology, 96, 202–210.
Pace, V., & Borman, W. (2006). The use of warnings to discourage faking on noncognitive inventories. In M. Peterson & R. Griffith (Eds.), A closer examination of applicant faking behavior (pp. 281–302). Charlotte, NC: Information Age.
Pearlman, K. (2009). Unproctored Internet testing: Practical, legal, and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 14–19.
Ployhart, R. (1999). Integrating personality with situational judgment for the prediction of customer service performance. Doctoral dissertation, Michigan State University.