Jenny Baker / Friday, March 24, 2023 / Categories: TIP, 2023, 604

Max. Classroom Capacity: ChatGPT Shakes Up I-O Psych Education…
Loren J. Naidoo, California State University, Northridge

Dear readers,

On January 23, 2023, Microsoft announced that it was investing in OpenAI, the creators of ChatGPT (reportedly to the tune of $10 billion USD). In the weeks prior to this announcement, ChatGPT had already received considerable media coverage in outlets such as the New York Times, Bloomberg, The Wall Street Journal, and The Atlantic. ChatGPT provides an interface in which you can ask an AI chatbot a question or assign it a task; in most cases, you will receive a relevant response.1 I firmly believe that it is important for educators to understand what ChatGPT can and cannot do, both because it is a tool already being used by our students, which has important implications for how we conceive of, develop, and administer assessments of learning, and because it is potentially a tool that we may all use to become better or more efficient students and educators. In my prior Max. Classroom Capacity columns on this topic (on robograding and robowriting), I was fairly skeptical of AI. Although I am not lighting my torch and sharpening my pitchfork yet, ChatGPT has changed my perspective on AI.

Caveats: Below I describe my attempts to understand how ChatGPT may be used in psychology and management classes and provide some conclusions and ideas for moving forward. Please note that my views are based on a considerable amount of time spent testing ChatGPT with materials from my own classes. However, they are NOT the result of carefully controlled, systematic, and exhaustive peer-reviewed research, nor of in-depth technical knowledge of the inner mechanics of AI in general or ChatGPT in particular.
I tried to use ChatGPT as a student or instructor would and evaluated the results without any particular agenda.2 I encourage you to replicate and go beyond what I’ve done with your own materials and come to your own conclusions (which I would love to hear!). Finally, sorry for all of the footnotes, which I nonetheless encourage you to read.3

Still Bad at Writing Research Papers

Back in the January 2022 issue of TIP4 I discussed the likelihood of students using AI to write academic papers by trying out two newly developed apps based on the GPT-3 AI5 platform. My conclusion at that time was that those GPT-3-based apps were not very good at writing academic papers, especially research papers. Although ChatGPT looks like an improvement over those earlier apps, most of my critiques still apply. ChatGPT can produce essay content that looks reasonable but is neither specific nor reliably accurate. For example, I asked ChatGPT to generate a paragraph about research on employee recognition, a literature with which I am quite familiar, using APA style citations and a full reference list. Like its predecessors, it generated relevant text with citations that looked plausible but very clearly do not exist: four out of the five references it generated were fake. Similarly, when I asked ChatGPT to generate a list of peer-reviewed journal articles on the same research topic, it responded: “I'm sorry, I am not able to generate a list of peer-reviewed journal articles as I am not connected to a database. However, here are some examples of peer-reviewed journal articles.” Ignoring its own excuse, ChatGPT then generated the same list of references as before.
When I asked it which of the articles in the reference list actually exist, confusingly, it first apologized, said there were inaccuracies in the references that it had provided, and then relisted each reference, stating that each “does exist.” When I asked for a web link to one specific article that I knew doesn’t exist, it responded:

I apologize, as a language model AI, I don't have the capability to provide you with direct links to journal articles. However, you can access the journal article "ZThe relationship between employee recognition and organizational performance" by J.W. ZLee and Y.J. ZLee,6 published in the ZInternational Journal of Human Resource Management (2001), by searching for the title in a library database or by searching for the journal title "ZInternational Journal of Human Resource Management" and the volume and issue number "12(2)" in an academic search engine such as Google Scholar or JSTOR.

Out of an abundance of caution, I followed the steps and reconfirmed that the article does not exist.7 Next, I stated that this article does not exist and asked why it had generated it. ChatGPT again apologized for the confusion but stuck to its guns, insisting that the article does exist, though it did acknowledge that it doesn’t have “the capability to check the existence of every single reference” (or, apparently, even just one reference!). In fact, ChatGPT struggled to generate an accurate quote from any specific source, especially academic ones. Similarly, when I asked it to identify the author of a specific article published in The Atlantic, it identified an author of many Atlantic articles, but not of the article in question. In conclusion, you can’t trust ChatGPT on the details! Like its predecessors, ChatGPT is able to produce student writing that superficially looks credible but will likely be filled with inaccuracies, especially as more specific details are queried.8 This is not an indictment of ChatGPT.
It is an important limitation that users should be aware of, including our students. One implication is that if you are concerned about students using ChatGPT as a means of cheating on a paper assignment, then (a) use specific prompts, such as requiring an evaluation of research articles or other specific sources, and (b) be prepared to check specific references for accuracy, because ChatGPT-generated papers will likely contain either references to nonexistent articles or false characterizations of the content of existing articles. More on grading/cheating later.

Surprisingly Good at Taking Exams

I was curious to see how ChatGPT would perform on an exam that I wrote for a leadership development class. Sometimes I teach this class in a fully online asynchronous format with exams administered remotely via Canvas, so the possibility that students could use ChatGPT to take this exam is practically relevant. The exam has a set of multiple-choice (MC) questions, followed by a set of short answer/essay (SA/E) questions. I wrote the MC questions from scratch in such a way as to test the application of concepts rather than identification of definitions, and so on. They aren’t from a textbook question bank and are not available on the internet, as far as I have been able to determine. Here’s an example of the style of many of the MC questions (this is not a question from the actual exam):

One study discussed in Tversky and Kahneman (1981) involved asking people the odds that they would hypothetically pay for a $10 theater ticket after they had either (i) lost their ticket which they had already bought for $10 or (ii) lost a $10 bill. What did they find and how did they explain it?

a. People were less likely to pay after losing their ticket (vs. after losing a $10 bill) because the psychological cost of the play was $20 rather than $10
b. People were more likely to pay for a second ticket after losing theirs (vs.
after losing a $10 bill) because they were more committed to seeing the play
c. People were more likely to pay for a second ticket after losing theirs (vs. after losing a $10 bill) because they experienced greater psychological dissonance
d. People were equally likely to pay in either scenario because the $10 loss was the same9

Similarly, the SA/E questions were meant to require students to apply concepts to hypothetical scenarios. Here’s an example of the style of question (also not a question from the actual exam):

Imagine that you are a high-level HR manager at a large advertising agency. Most of your employees are involved in highly creative, artistic work. The agency is understaffed and experiencing relatively high turnover. Workload is high, deadlines are short, and people are working long hours. You have heard rumors that employees are feeling high levels of stress, and that some of them are even getting burned out. You would like to administer an organizational survey to measure employee stress and burnout, but upper management is resistant to paying for it. They aren't convinced that stress and burnout are worth worrying about. How would you make the case to upper management that stress and burnout are important and should be considered? Please write an e-mail to upper management to convince them.

I submitted every exam question to ChatGPT. I noted the answers it gave to MC questions and graded the SA/E answers based on the rubrics that I had developed. I also searched through past student SA/E answers to find those that were similar to ChatGPT’s as a reliability check. ChatGPT scored 16/20 (80%) on MC questions, which is just above the historical median of 78.5% for this exam across the last ~400 students who have taken it. It struggled more on questions that asked about specific ideas or findings from journal articles. Interestingly, if asked, it will also provide a rationale for its answer.
ChatGPT’s responses to the SA/E questions were even more impressive, with a total score of 56/60 (93%). ChatGPT’s overall exam score was 88%, which put it in the 81st percentile of my students, historically. Yikes!

What Does All of This Mean?

If these results are reliable10 and generalize to other, similar exams, then it seems clear that we cannot use such exams and expect exam performance to reflect student learning unless we implement security measures that prevent students from using ChatGPT (or similar apps) to answer the questions. Some potential security measures include proctoring software that monitors students and prevents them from using other programs during an exam administered on a computer. However, this might not prevent students from using ChatGPT during the exam on a different electronic device (e.g., their phone). Additionally, there are applications that claim to detect whether text was created by AI versus a human. I tried the first free AI detector that I could find. I have no idea how it works or whether others exist that work better. This one seemed to do a fairly good job of distinguishing between AI- and human-authored text, especially with longer responses (see Table 1). With shorter responses, it mistook AI text for human text, and vice versa. This makes intuitive sense, as recognizable patterns presumably emerge only with a sufficient amount of data. Telling students that their work will be submitted to AI-detecting software may discourage them from using AI. However, students could also submit AI-generated content to such detectors and change a few words until their answer no longer looks like AI-generated content (much like “hacking” Turnitin.com11). Students could also completely rewrite AI-generated responses.
Although rewriting AI-generated content would be difficult to do during a timed exam, students could easily “plagiarize” from AI in this way on writing assignments, which would be very difficult to detect.12 Also, there is likely to be an arms race between AI writers and AI detectors; I would bet on the AI writers winning. I suppose one might simply revert to using old-school, in-person, paper-and-pencil exams, which, as a solution, has a certain elegance to it. However, as I wrote last quarter, I’d rather spend my energy finding new and better ways to teach than engaging in a war on cheating.

Table 1
Analysis of Writer.com AI Detector

Author   # Words   Estimated % of human-written content
AI       123       3%
AI       123       0%
AI       39        27%
AI       15        77%
Human    191       100%
Human    85        99%
Human    40        95%
Human    17        52%13

Preventing the use of AI will not only become increasingly difficult from a technological standpoint; it will also appear increasingly unreasonable and out of touch as all of us become more accustomed to using AI to write (e.g., some form of ChatGPT may eventually be integrated into MS Word and Outlook). If you’ve ever griped about a student’s terrible writing, then, in some sense, ChatGPT is your dream come true: if you let students use it, you may never have to read a terribly written paper again. This is a remarkable innovation!14

True to my I-O psychology roots, I have mostly talked about implications for assessment validity. However, AI raises much bigger questions as well. If AI can do a credible job of generating grammatically correct, readable, and relevant text in response to a query, then perhaps (a) it is pointless to teach students how to write independently (i.e., without being able to collaborate with AI), (b) independent writing is no longer an important competency for most workplaces of the future, and (c) we should shift toward other means of assessing student knowledge and building student competencies that AI cannot (yet?)
perform (e.g., oral presentations, demonstrations, discussions). These arguments might sound ridiculous right now, but it is easy to envision a near future in which most writing is a collaboration between humans and AI. Your students are able to use AI to help write their exams and papers right now. ChatGPT helped me write a title for this column.15 The way we talk about AI now is very different from just 5 years ago. I don’t want to sound alarmist, but we need to rethink some things.

One clear implication mentioned earlier is that perhaps we shouldn’t teach students to write anymore. However, an important skill that many of us teach I-O psych students is to critically evaluate theory and findings from specific research studies in order to formulate conclusions and ideas for future research. ChatGPT struggles with questions about content from single, specific sources. Thus, although ChatGPT can write plausible, general content at a high school/early undergraduate level, at the moment it seems unable to meet the more stringent requirements of writing an undergraduate research paper, thesis, or dissertation. In other words, AI may raise the floor for writing quality, but it may also lower the ceiling. However, if we outsource most student writing to AI and forgo teaching students how to write, how will we prepare students to write at a level appropriate for undergraduate research, and ready them for research-intensive master’s and doctoral programs? I don’t have a great answer to this. Paper assignments could start with an AI-generated first draft distributed to the class, with students required to edit the draft while tracking their changes. For research papers, students could use AI to generate a first draft and spend the bulk of their time fact-checking against the relevant academic literature and revising the paper accordingly.
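That fact-checking step could even be partially scripted. Below is a minimal sketch, not a tool from this column: it assumes the public Crossref REST API (api.crossref.org, which is real and free) and a hypothetical citation title pulled from an AI-generated draft; the 0.85 similarity threshold is an arbitrary choice of mine, not a validated cutoff.

```python
from __future__ import annotations

import difflib
import json
import urllib.parse
import urllib.request


def title_similarity(a: str, b: str) -> float:
    """Rough 0-1 similarity between two titles, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_crossref_match(title: str) -> str | None:
    """Return the closest indexed title from Crossref, or None if lookup fails."""
    query = urllib.parse.urlencode({"query.bibliographic": title, "rows": "1"})
    url = f"https://api.crossref.org/works?{query}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            items = json.load(resp)["message"]["items"]
        return items[0]["title"][0] if items and items[0].get("title") else None
    except (OSError, ValueError, KeyError, IndexError):
        return None  # network problems or unexpected payloads -> no match


def looks_fabricated(cited_title: str, matched_title: str | None,
                     threshold: float = 0.85) -> bool:
    """Flag a citation whose title doesn't closely match anything indexed."""
    if matched_title is None:
        return True  # nothing found at all -- worth a manual check
    return title_similarity(cited_title, matched_title) < threshold


# Hypothetical citation string from an AI-generated draft (illustration only)
cited = "The relationship between employee recognition and organizational performance"
verdict = "suspicious" if looks_fabricated(cited, best_crossref_match(cited)) else "plausible"
```

A flag here is only a prompt for human review: Crossref does not index everything, so a non-match is not proof of fabrication, and a close match does not guarantee the draft characterized the article accurately.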
Maybe intensive, advanced writing courses will be necessary for those students interested in pursuing graduate study or careers in academia. Alternatively, maybe we just wait a few years until AI can write a research paper too!

One alternative to writing-focused teaching and assessment is to concentrate on students creating products. For example, students in a psychometrics class may be tasked with creating a self-report survey measure of personality for use in a personnel selection context. ChatGPT is capable of generating such a survey measure, though I have found that many of the items suffer from well-known psychometric limitations (e.g., double-barreled wording). In this assignment, students may use AI or any other relevant tools (short of plagiarizing others’ work16) and are evaluated on the quality of the end product. One might also require students to explain and defend their product, perhaps in an oral presentation, as a means of assessing their knowledge. However, if AI is eventually able to create the product without any help from humans, then maybe we shouldn’t teach that anymore, because no one will hire our students to create a product that AI can create for free.

A slightly different approach would be to focus on students demonstrating relevant behaviors, like an assessment approach to evaluating student learning. For example, in an industrial psych or leadership class, rather than writing an essay about how to effectively deliver performance feedback, students must demonstrate this behavior in a business simulation exercise and are assessed on how well they execute it. These are two simple ideas for how to adjust to ChatGPT in the short term. However, the reality is that as a field we need to engage in discussions about the future of I-O psychology practice and education given these new (and likely future) developments in AI.
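Incidentally, even the double-barreled-wording problem mentioned above can be screened for mechanically. This is a toy heuristic of my own, not an established psychometric tool; the cue words and the sample items are assumptions for illustration, and a flagged item still needs human judgment ("and" is not always double-barreled).

```python
# Conjunctions that often join two distinct propositions in one survey item
DOUBLE_BARREL_CUES = (" and ", " as well as ", " or ", " while ", " but ")


def is_double_barreled(item: str) -> bool:
    """Crude screen: flag items that appear to ask about two things at once."""
    text = f" {item.lower().strip()} "
    return any(cue in text for cue in DOUBLE_BARREL_CUES)


def screen_items(items: list[str]) -> list[str]:
    """Return the subset of items flagged for human review."""
    return [item for item in items if is_double_barreled(item)]


# Hypothetical AI-generated personality items (illustration only)
items = [
    "I am organized and punctual.",                      # two traits in one item
    "I remain calm under pressure.",
    "I enjoy meeting new people as well as leading them.",
]
flagged = screen_items(items)
```

In a product-focused assignment, a screen like this could be one small part of the students' quality-control toolkit, with the real work being the judgment calls it cannot make.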
We need to experiment with AI tools in our research and classrooms to better understand their potential uses—there are amazing opportunities here to develop innovative practices that transform education! Moving forward, I plan to encourage students to use ChatGPT for my writing assignments to better understand how they use AI tools, what changes I need to make to rubrics and other aspects of assessment, and what students learn from such assignments.

I believe that in the years to come we will be repeatedly confronted with evidence that AI can do more and more of the tasks that we consider core components of higher education. Many people are already questioning the value of higher education at this moment, for a variety of reasons that have nothing to do with automation. Educators have been discussing how to make students “robot proof” for almost a decade now. Practicing I-O psychologists perform many tasks that AI cannot presently do, and maybe never will be able to, such as delivering training, coaching and mentoring, facilitating employee communication and collaboration, designing and conducting research, and providing ethical and legal guidance.17 These recent developments in AI, including ChatGPT, do not just force us to rethink how we assess learning; they force us to rethink everything that we are trying to accomplish as educators. We cannot provide value to our students if our academic programs focus on developing competencies that are no longer valued, prepare students for jobs that will not exist, or fail to incorporate the newest technologies (e.g., ChatGPT) into our classrooms. Welcome to the future!

Notes

1 For more info on how it works, please see my prior Max. Classroom Capacity column on robowriting.
2 I have no financial relationship with OpenAI, writer.com, or Monkeylearn.
3 Footnotes are used primarily to obscure from my editor the fact that I’ve trampled all over my word limit. Sorry Adriane! [Editor’s note: I’m on to you!
Note to self: Loren needs a footnote limit.]
4 The year before, in the January 2021 issue, I wrote a Max. Classroom Capacity column on using AI to grade exams. I also recently tested whether ChatGPT is a helpful tool for grading. The short answer is that I don’t think it is. I wish I had the space to go into more detail—if you’re interested, feel free to e-mail me.
5 I use the term “AI” loosely; I understand that some experts do not consider ChatGPT an AI.
6 Extraneous “Z”s added to limit the proliferation of fake references online. Some tech writers call this propensity for AI to be convinced of something that isn’t so “AI hallucination.”
7 In fairness, I did manage to get ChatGPT to write a paragraph about employee recognition in which all of the citations existed. However, as far as I could determine, ChatGPT’s descriptions of the articles and findings were completely inaccurate, as employee recognition was not part of any of them.
8 In contrast, ChatGPT seems able to generate plausible and somewhat novel, nonspecific research ideas. It also was able to generate a reasonable, if not too detailed, research design and methodology to develop a new self-report measure of employee experiences, describing the design (mixed methods), methodology (literature review, item generation, pilot testing, scale validation, interviews), and analyses (thematic analysis of interview data, CFA, correlations).
9 Although not part of the exam, I asked ChatGPT to answer this question as well. It answered b. Although answer b is plausible, it does not reflect what Tversky and Kahneman actually wrote in their 1981 article (i.e., answer c).
10 I subsequently resubmitted each exam question to ChatGPT and found similar, but not identical, results. One MC question it answered incorrectly the first time it got right the second time, and vice versa for another question.
ChatGPT performed slightly worse on the short answer/essay part the second time, providing some very similar answers and others that were substantially different.
11 I was going to link to a website that details how to cheat Turnitin but was too nauseated to go through with it. Such websites are not hard to find.
12 I have colleagues who take samples of student writing at the beginning of a class and use them to evaluate the academic integrity of students’ writing on subsequent written assignments.
13 I cannot rule out whether this student’s response was written with the help of AI, but I think it’s very unlikely.
14 Also intriguing: If you give ChatGPT a list of the ingredients you have on hand, it will generate a recipe for you using those ingredients. Does it work? The proof is in the chia seeds, nutritional yeast, canned black beans, Tapatio, and pickled onions pudding… (I’m too scared to eat robo food yet!)
15 Query: Please evaluate my current title for the article abstract that I showed you earlier and propose an improved version: "ChatGPT--OK, Maybe We Have to Rethink Some Things" Answer: The current title "ChatGPT--OK, Maybe We Have to Rethink Some Things" effectively captures the author's shift in perspective toward ChatGPT and AI, but it could be more engaging and attention grabbing. An improved version could be: "ChatGPT Shakes Up Education: Rethinking Assessments and Efficiency" This title is shorter and more direct, it highlights the impact of ChatGPT on education, and it gives the reader an idea of what to expect in the article. It also gives a sense of urgency, making it more appealing to read.
16 I and many others have discussed the interesting questions concerning AI authorship and citation.
17 I asked ChatGPT to generate a list of things I-O psychologists do that are unlikely to be replaced by AI. All of these items are from its answer.