Crash Course in I-O Technology
Richard N. Landers
A Crash Course in Natural Language Processing
This issue, we’ll be exploring another concept core to data science called natural language processing (NLP). Many people assume that NLP is a particular analysis, as if you open up a dataset and “apply NLP” to it. But NLP is in reality an entire field of study attempting to explore and understand how humans interpret language and, in turn, how computers can mimic that interpretation. Once a computer “understands” language, you can run a lot of targeted analyses on that language to address your research questions—or in the language of data science, you can apply a variety of different algorithms to develop insights.
NLP is often confused with the closely related concepts of text mining and text analytics. Text analytics refers to any of a wide family of approaches used to derive meaning and develop useful insights from text data. Text mining is a bit more specific and involves any systematic application of algorithms to break down text into meaningful chunks and explore the interrelationships between those chunks. For example, you might count all the times any particular word appears in an open-ended response item and then create a word cloud, with the assumption that words appearing more often reflect more common concerns among the people responding to that item than words appearing less often. Text mining and text analytics are thus human processes; they are actions you take.
In contrast, the term NLP refers to a variety of algorithms used in both text analytics and text mining. If you go practice text mining, you will employ some number of NLP algorithms. Predictive machine learning algorithms like those I described previously are most often sorted by the types of data they best analyze (e.g., best algorithms to predict categorical data from continuous data when sample sizes are small). In contrast, algorithms involved in NLP are best sorted by purpose. This is because text analytics projects involve the execution of multiple algorithms in a particular sequence, and each of these algorithms has a different purpose.
As a reminder from last time, an algorithm is simply any procedure executable by a computer that takes a specific input, processes it somehow, and produces an output based upon an articulated set of steps that the computer follows. When SPSS takes your dataset (input), iterates over it to compute a slope and intercept (process), and then spits out tables (output), it is executing a regression algorithm. That algorithm will be slightly different from the regression algorithm executed in R, because the specific steps taken by the computer will likely differ slightly. Because both ultimately produce the same slope and intercept, the underlying statistical analysis is the same even though the algorithms themselves are different. This is a crucial distinction.
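To make the distinction between an algorithm and an analysis concrete, here is a minimal sketch in base R. The data are invented for illustration, and the second approach is simply the textbook covariance formula, not a claim about what SPSS does internally. Both sets of steps arrive at the same intercept and slope.

```r
# Toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Algorithm 1: R's built-in lm(), which solves least squares via a QR decomposition
fit_lm <- coef(lm(y ~ x))

# Algorithm 2: textbook formulas based on the covariance and variance
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
fit_manual <- c(intercept, slope)

# Same analysis, different steps: the estimates agree within floating-point error
print(unname(fit_lm))
print(fit_manual)
print(isTRUE(all.equal(unname(fit_lm), fit_manual)))
```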
When sorting through the various algorithms utilized in NLP, I find it most useful to distinguish between four major steps: (1) data wrangling, (2) preprocessing, (3) dataset generation, and (4) analysis, which I summarize in Table 1. I’ll cover each of these steps in turn.
Table 1. Steps Taken (and Types of Algorithms) Involved in Natural Language Processing
Data wrangling, also called data munging, refers to the systematic cleaning and combining of data sources into something that can be understood by the algorithms you want to apply to it next. In the case of NLP, the ultimate goal of data wrangling is a dataset containing both text and other variables of interest associated with that text. Data wrangling should be a familiar problem to any I-O practitioner or any academic who has worked with people management software. If you’ve ever tried to get turnover data from Kronos, performance data from Peoplesoft, and survey data you collected in Qualtrics all into a single file that R or SPSS can read, you have been data wrangling already.
Commonly, data wrangling involving text starts with some sort of interesting text out in the wild, wherever that text happens to be located, and ends with a dataset containing a column of plain text responses and associated variables. In practice, it is best to design data collection efforts to prevent excessive data wrangling if at all possible. For example, text responses should be stored in CSV files, not in PDFs. If the software that collects the data can produce a dataset in this format directly, it should. This sort of foresight and planning will minimize headaches (and therefore time costs) later.
NLP algorithms are designed to apply to a corpus, which is essentially a database containing all of the raw text you’re trying to analyze plus any associated meta-data (i.e., those other variables I described earlier). That means this particular data format is the end result of data wrangling. For example, if you were looking at open-ended survey responses, you might include the respondent’s identity, or their department, or any other such information in the corpus alongside the response itself. Most importantly, each corpus contains all of the text data you later want to analyze as a single unit. For I-Os, this often means all responses to a single open-ended survey question.
Generating a corpus is generally quite easy because whatever program or R library you eventually use for NLP will create it for you, as long as you can clearly specify where the texts you’re interested in are located. For example, if you plan to use the NLP algorithms contained in the R library tm, one line of code can convert a vector of text into a corpus:
corpusVar = VCorpus(VectorSource(dataset$textVar))
From an I-O perspective, this may seem confusing. If it’s just the same text I had before, why does it need to be a “corpus”? Why can’t I just run NLP analyses on a variable containing text? The most straightforward answer is that NLP algorithms expect a certain data type and that data type is “corpus.” In simple NLP, there is no functional difference between corpora and variables containing text. In more advanced NLP, this may not be true; meta-data like sentiment, author, or any other piece of information might be added to the corpus and associated with each text, either algorithmically or by hand (this process is called annotation and creates an annotated corpus). In cutting-edge NLP, these annotations may even include information like sentence structure, grammar, and other properties of the text. Long ago, before computer processing of text was common, corpora used by linguists would often be physical collections of documents, so a lot of the terminology and thinking about corpora comes from that era.
Once you have a corpus to work with, you can apply preprocessing algorithms. The precise approach to take during preprocessing is different depending upon the NLP approach you’ll ultimately be taking, so I’ll briefly review the two major types.
First, the most common approach to NLP involves “bag of words representation.” What this means is that all linguistic information contained in a sentence except for the words themselves is thrown out. Word order? Useless. Grammar? Who cares! Thus, preprocessing with a bag of words representation involves altering the corpus such that every word it contains is meaningful. The precise meaning of “meaningful” in this context is left to the NLP practitioner. I’ll describe this in more detail in a few paragraphs.
Second, the less common but potentially far more powerful approach to NLP involves “semantic representation.” In this approach, no information is thrown out. Tenses, word order, grammar, synonyms, phrases, clauses, and so on all are considered important pieces of the NLP puzzle.
So what leads people to choose a bag of words or semantic representation? The answer is simple: It’s a trade-off, with processing and statistical power on one side and explanatory power on the other. In terms of processing, NLP is one of the most computationally intensive types of analysis you are likely to use because the interpretation of language is incredibly complex. If you’re literate and read a lot, you likely take for granted just how complex a skill reading actually is. Our brains make reading seem easy, but remember that you’ve been feeding “training datasets” to your brain for decades; R and Python do not have that advantage. If you want to teach a naïve computer how to read, you’ll need to do it from scratch.
Imagine what this might involve in English. For any given sentence, there are a lot of things to which to pay attention: individual words, their denotations, and their connotations; multipart words that have unique meanings beyond their component parts, like “industrial and organizational psychology”; tenses and conjugations; subject-verb agreement; phrases and clauses, along with the conjunction words, prepositions, and punctuation that combine and organize them. This list could continue for a full page, and each of these concepts would itself require lengthy and complex explanation.
Whatever they end up looking like, once we have all the component parts of some text laid out, what do we do with them? The one thing that computers can do more effectively than humans is raw, systematic, exhaustive processing, so the most straightforward approach is to create one variable for each concept. The use of “clean” in a sentence becomes a variable. The use of “cleaned” is another. The use of “cleaner” is a third. In the paragraph before this one, I used 84 words, 20 bits of punctuation, 9 capital letters, 8 verbs in various forms, 5 sentence fragments, and so on. So now, the computer just needs to see all of that content replicated numerous times in different contexts—books, articles, websites, speech, and so on—until it has a sufficiently large sample to start trying to infer meaning.
As you’ve probably caught on, the processing time required to do this and the statistical power required for it to be replicable would simply be incredible. This is a core reason that we don’t have true artificial intelligence yet. It’s just not computationally feasible with current technology. This is not to say that semantic representation is impossibly far in the future. Researchers are constantly developing new ways to simplify semantic representation of text in order to make it more within the realm of the possible given modern technology. Remember that about 55 years ago, humanity managed to get to the moon with far less technology than is currently used to run Internet-connected light bulbs. New algorithms are being developed to accomplish NLP with semantic representation using much less processing power than currently required—but we’ve still got a long way to go.
In fact, most people interested in semantic representation these days take a shortcut in teaching computers how to read by relying on commercial algorithms developed by companies that have thrown an incredible amount of resources at the problem already. These algorithms already know how to read, at least as well as any computer currently can. One of the most common that you might have heard about before is IBM Watson, but similar products are available from Microsoft, Google, and Facebook. If you want to take advantage of these systems (often marketed as “AI,” but remember my warning about that earlier), you typically need to pay a fee based upon the amount of processing time you need. These systems also have the processing power to implement a more complex type of machine learning called deep learning, which involves the application of a concept called neural networks. In neural network modeling, variables are defined recursively, such that information discovered later in dataset exploration can be used to revise relationships discovered earlier, and individual nodes in the network can have multiple, complex relationships with other nodes. For example, when the algorithm encounters the word “psychology” the first time, it has relatively little information about it. When it encounters it the hundredth time, it will have likely figured out that psychology is a noun and is used in similar contexts to other “ology” words. When it encounters it the ten thousandth time, it may even be able to generate its own sentences using the word psychology in ways like humans would.
As a result of the additional complexity and cost, virtually everyone casually applying NLP algorithms these days uses a bag of words representation in which each word (or group of words) that is viewed as linguistically and analytically meaningful essentially becomes a count (0 for absent, 1+ for present) variable in a dataset. In the previous paragraph, about 257 distinct words were used, so if you were interested in creating a dataset based upon this article using a bag of words representation, that paragraph would contribute 257 variables, each of which would contain a count (e.g., “you” would have value 4). As more paragraphs were added to the dataset, so would more variables, adding a lot of zeros each time.
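As a small illustration of what “each word becomes a count variable” means, the base R sketch below counts the distinct words in one short, invented sentence. It skips all preprocessing except lower-casing, so treat it as a toy and not as a full bag of words pipeline.

```r
# One short text, invented here only as an illustration
text <- "The fox chases the hen and the hen runs"

# Lower-case the text and split it on whitespace to get individual words
words <- strsplit(tolower(text), "\\s+")[[1]]

# Count how often each distinct word occurs; word order is discarded
bag <- table(words)
print(bag)
```

Each name in `bag` would become one variable in the dataset, and each count would become that variable's value for this text.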
Given this approach, the goal of preprocessing before applying a bag-of-words model is to ensure that each variable is meaningful both in itself (a content validity question) and in comparison to other variables (a discriminant validity question). So what does this involve specifically? Here are some commonly applied bag-of-words preprocessing algorithms:
- Changing all words to lower case so that words like “Table” and “table” end up being considered the same variable.
- Changing abbreviations to their full words, so that words like “Mr.” and “Mister” are considered the same.
- Changing contractions to their original versions, so that a word like “it’s” is remade into “it is.”
- Stemming words, so that words like “contacted,” “contacts,” and “contacting” are considered the same (in a stemmed corpus, these would all become “contact”).
- Removing all numbers and punctuation.
- Removing functional words (sometimes called stop words) that are generally not considered linguistically meaningful, like “what,” “his,” “they’ll,” and forms of the verb to be. See Figure 1.
- Removing any words that are uniquely problematic, such as hashtags from Twitter data.
Figure 1. Default stop words in R detected by the tm package.
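The preprocessing steps listed above can be sketched in base R without any packages. Everything here is an assumption made for illustration: `preprocess()` is a hypothetical helper, the stop word list is a tiny hand-picked one (not the tm default shown in Figure 1), only one abbreviation and one contraction are handled, and the suffix stripping is a crude stand-in for a real stemmer.

```r
# Hypothetical helper: a rough, base-R-only version of the preprocessing steps
preprocess <- function(text, stop_words) {
  text <- tolower(text)                              # "Table" and "table" now match
  text <- gsub("it's", "it is", text, fixed = TRUE)  # expand one contraction
  text <- gsub("mr.", "mister", text, fixed = TRUE)  # expand one abbreviation
  text <- gsub("[[:digit:]]+", " ", text)            # remove numbers
  text <- gsub("[[:punct:]]+", " ", text)            # remove punctuation
  words <- strsplit(trimws(text), "\\s+")[[1]]       # split into words
  words <- words[!(words %in% stop_words)]           # drop stop words
  # Crude suffix stripping as a stand-in for a real stemming algorithm
  sub("(ing|ed|s)$", "", words)
}

# A tiny illustrative stop word list (not the tm default list)
my_stop_words <- c("the", "a", "an", "is", "it", "and", "has", "he")

cleaned <- preprocess("Mr. Smith contacted 3 clients; he is contacting the others.",
                      my_stop_words)
print(cleaned)
```

Note how “contacted” and “contacting” collapse into the same token, which is exactly what stemming is meant to achieve. A real project would rely on tested implementations of each step, because crude rules like these mangle many ordinary words.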
Importantly, there is no single set of preprocessing algorithms that is applicable to all situations. For example, if you were applying a bag of words approach to tweets, you might want to keep or exclude hashtags, depending upon your NLP goals. It is up to the NLP practitioner to decide what will be meaningful or not for any particular application. One of my interviewees for this column described this as more of an “engineering” approach than a “scientific” one, which I found a very helpful way to think about it. Most decisions made in NLP are made on the spot to solve specific problems, not because a well-validated and complete prior literature suggests an ideal course of action. In many cases, NLP practitioners will fiddle with their preprocessing algorithms until they end up with the result they want, and this is not frowned upon in the slightest within the data science community.
From here forward, I’m going to ignore semantic representations and focus on bag of words models. As I described earlier, semantic representation adds a mind-boggling degree of complexity. What that means in practice is that if you’re trying NLP yourself without the help of a computer scientist, you’re probably going to stick to bag of words. So that’s the example I’m going to provide you!
In a bag of words approach, once preprocessing is complete, it’s time to actually create the dataset itself. This process is called tokenization. In practice, it involves the assignment of identifier numbers to each unique word contained within a corpus. Then, the tokenization algorithm will count how many times each word appears in each text, repeating this process for each identifier. Finally, it returns a document-term matrix (DTM), which is essentially a dataset with documents as cases and words as variables.
For example, the following three sentences create the DTM that appears in Table 2: (1) The fox chases the hen. (2) The hen lays an egg. (3) The fox eats the egg.
Table 2. Simple Document-Term Matrix
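For readers who want to build such a matrix by hand, here is a minimal base R sketch using the three example sentences. It assumes lower-casing and punctuation removal but no stop word removal or stemming, so its columns may differ from Table 2 if different preprocessing choices were made there.

```r
docs <- c("The fox chases the hen.",
          "The hen lays an egg.",
          "The fox eats the egg.")

# Lower-case, strip punctuation, and split each document into words
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")

# The vocabulary is every unique word, in order of first appearance
vocab <- unique(unlist(tokens))

# Count each vocabulary word in each document: documents as rows, words as columns
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)
print(dtm)
```

Even in this tiny example, most cells are zero for any word that is not shared across sentences, which is the pattern that becomes extreme in real DTMs.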
These datasets tend to have high variable-to-case ratios, although this problem gets better when “big data” is involved. For example, a 500-case dataset of open-ended responses could easily identify 50,000 unique words used (and therefore variables).
Importantly, up to this point, I’ve only been describing individual words becoming variables. For example, “I love industrial and organizational psychology” would become 6 variables. This is called a unigram tokenization. However, to capture additional nuance, NLP practitioners might add two-word phrases (bigrams), three-word phrases (trigrams), up to any size n (n-grams). Tokenizers typically do this by examining the text for any two-word or three-word phrase that appears more than once in the dataset. So for example, if I applied a four-gram tokenizer to the articles in TIP, “industrial and organizational psychology” would likely show up as a variable in my dataset. This is often viewed as a small step toward semantic representation still achievable within the processing requirements of NLP using a bag of words representation.
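To show what unigrams, bigrams, and larger n-grams look like, here is a short base R sketch. `make_ngrams()` is a hypothetical helper written for this illustration; unlike the tokenizers described above, it returns every run of n consecutive words and does not filter for phrases that appear more than once.

```r
# Hypothetical helper: return all n-grams (runs of n consecutive words) in a text
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "I love industrial and organizational psychology"
unigrams <- make_ngrams(sentence, 1)   # six single words
bigrams <- make_ngrams(sentence, 2)    # five two-word phrases
fourgrams <- make_ngrams(sentence, 4)  # three four-word phrases

print(bigrams)
print(fourgrams)
```

In a DTM, each of these phrases would be treated as one more variable alongside the single words.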
Finally, we can actually learn something from this text! Analysis of DTMs can be conducted with two broad families of techniques: visualization or predictive machine learning. You have probably already seen visualization of text data in the form of word clouds or dendrograms. These are just what they appear to be: visual representations of word (or n-gram) frequency.
If you’re taking a predictive machine learning approach, there are many more options. One common approach is sentiment analysis, which involves the assignment of sentiment values to each term in the DTM and the calculation of an overall sentiment score for each document. In most cases, this is done by cross-referencing a previous data collection effort where sentiment of individual words was determined; these datasets are called lexica (or lexical corpora). For example, it’s not uncommon to take all the most common words in a DTM and ask workers on Amazon Mechanical Turk to rate them for sentiment with questions like, “Is this word positive, neutral, or negative?” Once you have mean ratings of sentiment from MTurk workers for each word in your dataset, a mean sentiment level for each document can be calculated. There are also freely available lexica, such as one commonly used lexicon of about 6,800 words, although your corpus may contain words that don’t appear in your lexicon. Sentiment scores have been used to predict a wide variety of outcomes, such as stock values and box office performance of films.
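A lexicon-based sentiment score can be sketched in a few lines of base R. The lexicon, its ratings, and the two responses below are all invented for illustration; `score_sentiment()` is a hypothetical helper that averages the ratings of whichever words in a document appear in the lexicon.

```r
# Hypothetical mini-lexicon: mean sentiment ratings on a -1 to 1 scale (invented values)
lexicon <- c(great = 1, helpful = 0.5, slow = -0.5, terrible = -1)

# Two invented open-ended responses
docs <- c("Great team and helpful manager",
          "Terrible schedule and slow promotions")

score_sentiment <- function(doc, lexicon) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(doc)), "\\s+")[[1]]
  rated <- lexicon[words[words %in% names(lexicon)]]  # words not in the lexicon are ignored
  if (length(rated) == 0) return(NA_real_)            # no rated words, so no score
  mean(rated)
}

scores <- vapply(docs, score_sentiment, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(scores)
```

Words missing from the lexicon simply drop out of the average, which is why lexicon coverage matters so much for this kind of analysis.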
Another common approach is to use the DTM itself to predict other variables directly. The key difference between this and sentiment analysis is that sentiment analysis predicts outcomes based upon the affective content of the words (loosely defined) whereas direct prediction is from word (or n-gram) frequencies. Commonly this is done with naïve Bayes classifiers for categorical outcomes or ridge regression for continuous ones. In both cases, it’s important to use n-fold cross-validation, which takes multiple random subsamples of documents to build the predictive model and then tests each of those models on the remaining nonsubsampled data. Otherwise, overfitting is exceptionally common (as you’d expect in any analysis where the number of variables exceeds the number of cases by an order of magnitude or more!). For more on predictive machine learning algorithms, including ones that can be used in NLP contexts, I recommend consulting Putka, Beatty, and Reeder (in press).
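The sketch below shows the mechanics of cross-validation with ridge regression in base R, using simulated counts as a stand-in for a DTM with more variables than cases. The data, the helper functions `ridge_fit()` and `ridge_predict()`, the choice of five folds, and the fixed penalty of 1 are all assumptions for illustration; in a real project the penalty would be tuned and a tested package would normally do this work.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-in for a DTM: 30 documents, 60 word-count variables (invented data)
n_docs <- 30
n_terms <- 60
X <- matrix(rpois(n_docs * n_terms, lambda = 0.5), nrow = n_docs)
y <- 2 * X[, 1] - X[, 2] + rnorm(n_docs)  # outcome driven by two "words" plus noise

# Ridge regression via its closed-form solution (intercept handled by centering)
ridge_fit <- function(X, y, lambda) {
  x_means <- colMeans(X)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
                crossprod(Xc, y - mean(y)))
  list(beta = beta, x_means = x_means, y_mean = mean(y))
}
ridge_predict <- function(fit, X) {
  drop(sweep(X, 2, fit$x_means) %*% fit$beta) + fit$y_mean
}

# Assign every document to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n_docs))

# Train on k - 1 folds, test on the held-out fold, and repeat for each fold
fold_rmse <- vapply(seq_len(k), function(i) {
  test <- fold == i
  fit <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda = 1)
  pred <- ridge_predict(fit, X[test, , drop = FALSE])
  sqrt(mean((y[test] - pred)^2))
}, numeric(1))

print(round(fold_rmse, 2))
print(mean(fold_rmse))
```

Because every prediction is made for documents the model never saw during fitting, the averaged error is a more honest estimate of how the model would perform on new text than the fit to the training data.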
What often frustrates the I-O psychologists I talked to about NLP is that these analyses generally don’t tell you what the predictive algorithm was picking up on to achieve its success rate. You don’t get a list of beta weights; instead, you create a brand new algorithm that you can load new data into and which will spit out predicted values on the basis of your original analysis. In many ways, it’s like using a regression formula to predict y’ without ever knowing what specific terms were calculated for the intercept and slopes. Thus, NLP is not terribly useful as a theory-development technique, at least as it currently exists; you may never actually know why it worked even if it worked spectacularly well. Some researchers have tried to address this by applying an intermediate step of unsupervised machine learning to extract meaningful word clusters (sometimes called thematic extraction; for an example of this, see Campion, Campion, Campion & Reider, 2016), but this does not really solve the problem; it simply creates word clusters that are also difficult to interpret.
Let’s See It in Action
In my Crash Course on Machine Learning, I didn’t provide a demonstration video in R for several reasons, but most importantly because there was not a lot to demonstrate, just a handful of commands with relatively straightforward interpretations. In NLP, the issues you face become a lot more complicated, and there are many more steps involved. So to address that, I’m going to provide you with a high-level overview here in TIP but leave the details and the demonstration part to the video. So if you want to see NLP in action, I’d recommend watching this video immediately after reading this section.
In this practical exploration of NLP, I’m going to step you through a miniaturized version of an NLP project, much like the demonstration I would do when teaching data science to I-O PhD students who don’t know much programming (yet). Specifically, we’re going to take a dataset that contains a binary variable (self-reported gender) that we want to predict from another variable that contains only text.
As I described earlier, NLP involves four steps. In Step 1, data wrangling, we grab a dataset (in this case, from a tab-delimited dataset which you can download here) and create a corpus using the tm library:
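# Grab data from tab-delimited text file
# (header, quote, and stringsAsFactors settings are assumptions; adjust to your file)
dataset <- read.table("data.tab.txt",
                      sep = "\t", header = TRUE,
                      quote = "", stringsAsFactors = FALSE)
# Call libraries we'll need
library(tm)    # corpus handling and preprocessing
library(qdap)  # replace_abbreviation() and replace_contraction()
# Create base corpus
myVectorSource <- VectorSource(dataset$Email)
myCorpus <- VCorpus(myVectorSource)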
In Step 2, preprocessing, we apply algorithms to the text data in the corpus to ensure our bag of words is meaningful. In this case, I replace abbreviations with their nonabbreviated equivalents, replace contractions the same way, remove all numbers, remove all punctuation, change everything to lower case, remove all functional words, and then stem all words that can be stemmed. Any line that doesn’t call content_transformer() is from tm, whereas all the others (except tolower, which is a core R function) are from the qdap package.
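# Expand abbreviations and contractions (qdap functions)
myCorpus <- tm_map(myCorpus, content_transformer(replace_abbreviation))
myCorpus <- tm_map(myCorpus, content_transformer(replace_contraction))
# Remove numbers and punctuation, then convert to lower case
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# Remove functional (stop) words and collapse leftover whitespace
myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))
myCorpus <- tm_map(myCorpus, stripWhitespace)
# Re-wrap each document as plain text before stemming (PlainTextDocument is a tm function)
myCorpus <- tm_map(myCorpus, content_transformer(PlainTextDocument))
# Stem all words that can be stemmed
myCorpus <- tm_map(myCorpus, stemDocument, language="english")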
In Step 3, dataset generation, we tokenize the resulting dataset with unigrams. Compared to all the work we just did, this is easy.
# Create document-term matrix (DTM)
DTM <- DocumentTermMatrix(myCorpus)
In Step 4, we finally get to analyze the resulting dataset. First, we do some basic visualization by looking at a list of the most common words.
# Some basic visualization
DTM.matrix <- as.matrix(DTM)
freq <- colSums(DTM.matrix)
sortfreq <- sort(freq, decreasing = TRUE)
# Show the 10 most common (stemmed) words
head(sortfreq, 10)
We can also create a word cloud quite easily.
Second, we apply a predictive machine learning algorithm to develop a new algorithm that will predict the variable we wanted to predict. This involves a little more data wrangling, because we need to recombine the original criterion with our new NLP dataset. I’ll be using the caret package, which I wrote about last issue.
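# Create a dataset with the DTM
dtm.matrix <- as.matrix(DTM)
# Column 2 of the original data is the criterion; train() needs a factor to classify
sml_dataset <- data.frame(gender=factor(dataset[,2]), dtm.matrix)
# Some machine learning
library(caret)
# Further train() arguments (e.g., method, trControl) can be added here;
# with none supplied, caret falls back to its default method
gender_model <- train(gender ~ .,
                      data = sml_dataset)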
We could then use this model to make new predictions on new text in another dataset. Importantly, this example uses a very small dataset (much too small to trust!) and doesn’t apply n-fold cross-validation, both of which would need to be implemented to have any faith in the resulting predictions.
So Who Should Learn Natural Language Processing?
As I mentioned earlier, semantic processing is extraordinarily complicated and realistically requires that you purchase processing time on commercial NLP platforms. So if you’re just starting out, stick to bag of words. It is more than enough for most applications—think about the difference between classical test theory and item response theory (IRT). Sure, IRT gives you a lot more information about your tests and their questions and therefore capabilities like adaptive testing, but it also imposes a lot more restrictions and requirements on the type of data with which you start. Just like you wouldn’t apply IRT when n = 40, you do not want to apply semantic processing without access to the billion-plus-case datasets that provide a starting point for your own analyses. Classical test theory works just fine for most common applications of psychometrics, as does bag of words for most common applications of NLP.
So that means the real questions are “who should learn bag of words NLP?” and “who should hire a computer scientist to do semantic NLP?” I’ll tackle these one at a time.
Bag of words NLP is useful if you ever try to interpret text data. Too often, we do this holistically. For example, we download a gigantic list of comments and just read them, assuming our brains are sophisticated enough to extract key concepts and ideas from a brief perusal. But as anyone who’s done content analysis before knows, it’s very easy to fool yourself when trying to interpret text data. Basic NLP using a bag of words representation is an excellent toolkit to try to prevent that from happening. Don’t drop holistic interpretation entirely, because bag of words misses nuance, sometimes completely. But bag of words does provide an alternative perspective on that holistic understanding. So in practice, I’d recommend reading your comments, applying NLP, and seeing if the interpretations you come at from both perspectives agree. If not, you have some more analysis to do. If so, you can be more confident in what you’ve discovered.
In I-O psychology, people who’ve dabbled in NLP have generally stuck to bag of words. For example, Rudolph and Zacher (2015) were interested in differences in affect toward different generations and the impact of this on workplace relationships, so in pursuit of this, one part of their analysis involved collecting a random sample of Twitter postings and subjecting those posts to sentiment analysis. In that, they found that tweets about Millennials generally were the most positive, followed by Baby Boomers, followed by Generation X. Campion et al. (2016) utilized a bag of words model when analyzing text collected during selection as implemented in the text modeling tools available within SPSS, allowing the software to make most of the preprocessing and theme extraction decisions on their behalf.
In contrast to these situations, semantic representation is most useful when you really need to get into the nitty-gritty details of language in order to predict the phenomena you’re interested in predicting. This is strongly reminiscent of the bandwidth-fidelity dilemma (Ones & Viswesvaran, 1996) in that narrow criteria are best with narrow predictors and broad criteria with broad predictors. However, the goalposts have moved a bit. To be “narrow” at the level of “verb conjugation matters,” you need to be facing a problem that requires an incredible level of specificity.
In I-O, I found only one person, Matt Barney, founder and CEO of LeaderAmp, who has been applying semantic processing to I-O problems, and frankly, I was astounded by what he said. Specifically, his company is developing a technology to have people speak into a voice-to-text system (like Apple’s Siri) and get immediate feedback on how persuasive they are being. Such a system converts the words people speak into text and then applies NLP to that text, all in real time, in order to provide immediate feedback. That opens the door to phenomenally powerful assessment systems, which in turn enables incredible automated selection and training systems that I imagine most I-Os haven’t even dreamed of before. Only semantic representation enables this level of specificity, and this is the likely future of many I-O processes. Even so, the number of years or decades before we’re realistically able to do that at scale is still anyone’s guess.
To Learn More
I hope by now, you’re clear on at least one point. If you want to learn NLP, start with bag of words representations and build from there. Bag of words NLP is well within the grasp of any I-O with even a basic understanding of R or Python, and it’s something you can learn the basics of in just a few hours. Assuming you already know R, I suggest DataCamp’s Text Mining course, which walks you through everything required quite quickly. Even if you want to tackle semantic representations, bag of words is the foundation you’ll build that upon.
With basic bag of words down, you can start exploring additional packages in R. I suggest starting with n-grams (RWeka, tokenizers), more complex stemming (SnowballC), lexical diversity (koRpus), and sentiment (tidytext, textir), then moving to basic semantic evaluations within bag of words (lsa, RTextTools) and topic modeling (text2vec).
When you’ve got all that down and are ready for the full force of modern semantic representation, which often goes by the names deep learning or neural network NLP, a few people suggested Stanford’s course. A simpler version of this course has previously existed in MOOC form, but if you can’t find it now, you can view all the lectures from the Stanford course itself as YouTube videos. Just be aware that this is only a path I’d recommend taking if you really want to deeply, fundamentally understand how deep learning works. Be ready for a lot of calculus.
If you don’t want to get into that, I suggest learning by doing. Microsoft has a few convenient R packages for accessing their deep learning platform (mscsweblm4r or the more useful mscstexta4r), which I’d recommend exploring in combination with their tutorials, although you’ll need to convert instructions from other languages into R commands. If you’re willing to abandon R for Python, you’ll find your learning options open up considerably.
I’ll end with an important point. Most of the tutorials you’ll find are designed for people seeking to use NLP in web applications. So there is often a lot of material within NLP courses you’ll come across that, as an I-O, is completely irrelevant to you. You don’t need to consult a cloud-based API in real-time as people use your website in order to create customized content; you have a stable dataset that you want information about right now. Keep that in mind as you look for learning material.
That’s it for the fourth edition of Crash Course! If you have any questions, suggestions, or recommendations about NLP or Crash Course, I’d love to hear from you (firstname.lastname@example.org; @rnlanders).
References
Campion, M. C., Campion, M. A., Campion, E. D., & Reider, M. H. (2016). Initial investigation into computer scoring of candidate essays for personnel selection. Journal of Applied Psychology, 101, 958-975.
Ones, D. S., & Viswesvaran, C. (1996). Bandwidth-fidelity dilemma in personality measurement for personnel selection. Journal of Organizational Behavior, 17, 609-626.
Putka, D. J., Beatty, A. S., & Reeder, M. C. (in press). Modern prediction methods: New perspectives on a common problem. Organizational Research Methods.
Rudolph, C. W., & Zacher, H. (2015). Intergenerational perceptions and conflicts in multi-age and multigenerational work environments. In L. Finkelstein, D. Truxillo, F. Fraccaroli, & R. Kanfer (Eds.), Facing the challenges of a multi-age workforce: A use-inspired approach (pp. 253-282). New York, NY: Routledge.