A Crash Course in
Machine Learning

Crash Course in IO Technology
Richard N. Landers
Old Dominion University


This issue, I’ll be exploring a concept you’ve undoubtedly heard of but probably don’t know much about: machine learning. To many IO psychologists, the term machine learning describes some complex, unknown, and potentially unknowable “black box” analytic procedures. You often hear the term “data mining” thrown around in a dismissive fashion. But I hope after you read this article, you’ll realize that the term machine learning in fact refers to a variety of analytic techniques that are either identical to or extensions of techniques that most IO psychologists already know. In fact, as you dig into machine learning, you start to realize that there is an entire parallel vocabulary to refer to many concepts that IOs already use.
One of the most common terms you’ll come across in machine learning is training dataset (or training set). My initial approach to understanding the term was to deconstruct it: “I know what a dataset is, so what makes it a training type of dataset?” The answer to that question? It is a dataset that contains both predictors and outcomes. In IO terms, you would call this a dataset. In fact, the reason data scientists call IO datasets training datasets is to distinguish them from two datasets we don’t normally have, test datasets and validation datasets, which along with the training set are typically randomly selected cases from a larger source dataset.
Let’s work through an example. Imagine we’re conducting a simple concurrent validation study, so we collect a dataset containing personality predictors and a job performance outcome. In IO, we might then run an ordinary least squares regression analysis to derive some regression weights. At that point, we’d essentially be finished, because we could then use those regression weights to predict future job performance of anyone who applies for the job. But a data scientist would likely have split the original dataset into three pieces, with roughly 50% of cases in the training dataset, 25% in the test dataset, and 25% in the validation dataset. That way, a model is built on the training dataset using a particular algorithm, the model is tested and tweaked on the test dataset, and then it is finally applied to the validation dataset as a way of verifying that the tweaked model was not overfitted (i.e., took advantage of chance variation to inflate its apparent accuracy). As IOs, we might use terms like split-half cross-validation to describe this sort of process, but we would probably simply call everything with variables and cases “a dataset.”
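As a rough illustration of the three-way split described above, here is a minimal sketch in base R. It uses R's built-in randu dataset (400 cases) purely as a stand-in for a real validation study, and the seed and object names are arbitrary choices made for this example.

```r
# Sketch of a 50/25/25 split into training, test, and validation sets.
data(randu)                  # built-in 400-case dataset, used only as a stand-in
set.seed(2016)               # arbitrary seed so the random split is reproducible

n_cases  <- nrow(randu)
shuffled <- sample(n_cases)  # random ordering of the row numbers 1..n_cases

n_train <- floor(0.50 * n_cases)
n_test  <- floor(0.25 * n_cases)

# Carve the shuffled row numbers into three non-overlapping groups
train_rows      <- shuffled[1:n_train]
test_rows       <- shuffled[(n_train + 1):(n_train + n_test)]
validation_rows <- shuffled[(n_train + n_test + 1):n_cases]

training_set   <- randu[train_rows, ]
test_set       <- randu[test_rows, ]
validation_set <- randu[validation_rows, ]

# Each case lands in exactly one of the three sets
c(training   = nrow(training_set),
  test       = nrow(test_set),
  validation = nrow(validation_set))
```

With 400 cases, this yields 200 training cases and 100 cases in each of the other two sets.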
This sort of disconnect in language is quite common and makes machine learning seem a lot more confusing than it actually should be. For example, another unnecessarily confusing term is algorithm. Fundamentally, an algorithm is a highly specific step-by-step process that turns input into output. This is actually a familiar concept. Let’s say for example that I want a computer program to add all the numbers in a set together and then divide that sum by the count of those numbers. This algorithm would process a dataset containing [3, 4, 6] and output 4.333. As an IO, you would probably call this process “calculating a mean.” But to a computer scientist, this is a step-by-step process that they must instruct a computer to follow: an algorithm. If a computer executes a regression analysis, it is executing an algorithm; or more specifically, although ordinary least squares regression is a statistical procedure, the specific way a computer goes about doing it is an algorithm. Thus, every statistical approach SPSS can do involves the execution of an algorithm, from calculating means to modeling in AMOS. So the key to understanding machine learning is to realize that there are many algorithms available that you’ve never heard of before, that try to convert datasets into meaningful information in ways that you’ve not previously considered, and that may in fact produce more useful, more practical, or more accurate answers than any of the algorithms you have now. Right now you’re using principal components analysis (algorithm) or maximum likelihood (algorithm) to determine the factor structure of variables in your dataset, but perhaps you should consider isomap or spectral embedding (algorithms with funny names) instead?
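To make the idea concrete, here is a minimal sketch in R of the mean-calculating algorithm described above, written out as explicit steps. The function name mean_algorithm is made up for this illustration; in practice you would simply call R's built-in mean().

```r
# A "mean" written out as an explicit step-by-step algorithm.
# mean_algorithm is a hypothetical name used only for this illustration.
mean_algorithm <- function(numbers) {
  total <- 0
  for (value in numbers) {
    total <- total + value      # Step 1: add all the numbers together
  }
  total / length(numbers)       # Step 2: divide the sum by the count
}

mean_algorithm(c(3, 4, 6))      # returns 4.333333
```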
So if there are more algorithms available than what I use now, what are they? Unfortunately for data science dabblers like us, that’s a moving target. New algorithms are being developed all the time. For that reason, traditional academic journals are not particularly useful to understanding data science, because the publication process is simply too slow, a problem we see in IO in only limited contexts. Even by the time conference proceedings are published, sometimes less than 3 months after submission, the knowledge they contain may be out of date. So if you want to stay truly current on machine learning, you need to follow not only conferences but also online discussion boards and comment sections in online code repositories.
Fortunately for your sanity, staying completely current about machine learning is not necessary unless you’re trying to create a self-driving car. As an IO, if you need machine learning, you probably want to stick to “tried and true” methods, and that list isn’t terribly long. One general distinction among such machine learning algorithms you should know is between supervised and unsupervised algorithms. In supervised learning, the algorithm has something specific it is trying to predict accurately. Thus, because it is predicting a known criterion from predictors, ordinary least squares regression is a supervised machine learning algorithm. It’s called “learning” because the computer can develop the regression formula on its own and then use the formula it learned to predict new values in new datasets, and it’s called “supervised” because you’re telling it the right answers, at least initially. In unsupervised learning, the algorithm is trying to detect patterns and develop meaning automatically. Thus, because you don’t know the extracted components ahead of time, principal components analysis is an unsupervised machine learning algorithm. Same analyses, different words; yet IO only scratches the surface of both types, because there are hundreds of machine learning algorithms available.
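The supervised/unsupervised distinction can be sketched with two base R functions most IOs have already met. This is only an illustration: it uses the built-in randu dataset as a stand-in, and the new_cases values are invented for the example.

```r
data(randu)  # built-in dataset with three variables: x, y, and z

# Supervised: y is the known criterion, so the algorithm is told the right answers
supervised_fit <- lm(y ~ x + z, data = randu)

# What was "learned" (the regression weights) can then predict y for new cases
new_cases <- data.frame(x = c(0.2, 0.8), z = c(0.5, 0.5))  # hypothetical new data
predict(supervised_fit, newdata = new_cases)

# Unsupervised: no criterion is supplied; the algorithm extracts components
# from the pattern of relationships among the variables on its own
unsupervised_fit <- prcomp(randu[, c("x", "y", "z")], scale. = TRUE)
summary(unsupervised_fit)  # proportion of variance explained by each component
```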
To help people find the particular algorithm they need for a particular application, the web application shown in Figure 1 is commonly shared in the same spirit as those “statistical test decision trees” you see at the end of statistics textbooks and provides the most common and fundamental algorithms used in machine learning. It doesn’t have everything you might need, but it does include the ones most people need. If you decided to learn about machine learning in Python, you could even click on each green box to learn about how to execute it.
Figure 1. A screenshot of the scikit-learn algorithm cheat sheet. Software available here.
As you can see, one way to categorize machine learning algorithms is by their four most basic purposes in data science: regression, classification, clustering, and dimension reduction. These should all be pretty familiar. Regression-type algorithms involve the prediction of continuous variables from other continuous variables, a cousin of the ordinary least squares regression we know. Classification algorithms involve the prediction of nominal variables from continuous variables, a cousin of logistic regression, although many algorithms enable any number of categories in the criterion. Clustering algorithms have the greatest overlap with cluster analytic techniques we already use; you’ll even find K-Means clustering in SPSS. Dimensionality reduction algorithms involve the extraction of patterns of responses among multiple variables. In fact, the “PCA” you see above is principal components analysis, the algorithm you’ve probably already used for factor analysis many times before. In short, these algorithms are not altogether different from the algorithms you already know; they’re just “improved.”
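For the two families that may be less routine for IOs, classification and clustering, here is a minimal base R sketch using familiar tools: logistic regression and K-Means. The built-in iris dataset is an assumption chosen only because it ships with R; it has nothing to do with IO data, and the seed and object names are arbitrary.

```r
data(iris)     # built-in dataset: four continuous measures and a nominal Species
set.seed(1)    # arbitrary seed; K-Means starts from random cluster centers

# Classification: predict a nominal variable from continuous variables.
# Logistic regression handles two categories, so one species is dropped here.
two_species <- droplevels(subset(iris, Species != "setosa"))
classifier  <- glm(Species ~ Petal.Length + Petal.Width,
                   data = two_species, family = binomial)
predicted_class <- ifelse(predict(classifier, type = "response") > 0.5,
                          "virginica", "versicolor")
accuracy <- mean(predicted_class == two_species$Species)  # in-sample accuracy
accuracy

# Clustering: group cases without telling the algorithm the true categories
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(cluster = clusters$cluster, species = iris$Species)
```

Note that the accuracy here is computed on the same cases used to fit the model, so it is likely to be optimistic.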
So what does “improved” mean? Pieces of that diagram that might help you answer this question are the repeated questions asking if N > 10,000 or N > 100,000. Those are rather large sample sizes for IO, and the high required sample sizes are a key clue to understanding why these algorithms were developed and are different from the ones we already use. To illustrate, let’s consider a popular machine learning application area, natural language processing, which refers to the use of machine learning to predict quantitative variables from text data. The simplest types of natural language processing involve the breakdown of text into its component parts (this is part of a procedure called preprocessing) and then submission of the resulting data into a regression-type algorithm. There are many specific natural language processing algorithms, so let’s focus on an example of a very simple one. In this very simple algorithm applied to a project predicting job satisfaction from an open-ended job satisfaction question, the presence or absence of each word in all text responses is coded as a 0 or 1. So for example, if Person 1 says “I love my job,” then four variables are created (i.e., I, love, my, job), all with the value 1. If Person 2 says, “I hate my job,” then one additional variable is created for the new hate variable, but Person 2 has a 1 for hate and 0 for love whereas Person 1 has a 1 for love but a 0 for hate. Thus, an entire dataset can be constructed from text data. If across all 1,500 surveys there are 2,000 different words used, you thus create a 2,001-variable dataset: 2,000 dummy-coded word variables and 1 outcome.
If you remember your graduate statistics class, you will recognize the problem immediately. It’s probably not a good idea to predict a DV from 2,000 predictors (and that doesn’t even include interactions!) with only 1,500 cases, especially with the statistical approaches common to IO. But if you have a hundred thousand cases, suddenly this model works a little better, even if you’re not quite sure why.
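The presence/absence word coding described above can be sketched in a few lines of base R using the two example responses. This is a bare-bones illustration rather than how a dedicated text-mining package would do it, and the object names are invented for the example.

```r
responses <- c("I love my job", "I hate my job")

# Preprocessing: lowercase each response and break it into its component words
tokens <- strsplit(tolower(responses), "[^a-z]+")

# Every distinct word across all responses becomes one variable
vocabulary <- unique(unlist(tokens))

# Code each word as 1 (present) or 0 (absent) for each person
word_matrix <- t(sapply(tokens, function(words) as.integer(vocabulary %in% words)))
colnames(word_matrix) <- vocabulary
rownames(word_matrix) <- c("Person 1", "Person 2")

word_matrix
#          i love my job hate
# Person 1 1    1  1   1    0
# Person 2 1    0  1   1    1
```

Two short responses already produce five variables, which is why the number of columns grows so quickly with real survey text.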
Of course, this is a significant simplification. As I mentioned earlier, algorithms are being improved all the time, so modern natural language processing algorithms also may consider word order, sentence structure, word sentiment, synonyms/antonyms, parts of speech, and so forth, which of course means even more variables. This is why the algorithms that are popular in data science tend to take advantage of big data, at least if you want your results to be replicable, and why many (although not all) big data applications really are fundamentally different from the theory-driven analytic approaches currently common to IO psychology (see Campion, Campion, Campion, and Reider [in press] for an example of how natural language processing can be applied to IO problems). Even ordinary regression is not quite the same, as data scientists will often include additional optimizations (commonly called tuning parameters) to improve model fit and/or prediction post hoc (which is why cross-validating from a distinct training dataset to validation and test datasets is so important).
The willingness to consider this sort of “add-on” procedure represents an often underlying point of contention when IOs and data scientists try to work together, even if they don’t realize it at the time. Computer scientists and data scientists don’t think about theory the way IOs do and often consider a lot of what we call theory to be a distraction at best and a waste of time at worst. For example, we might think of theory surrounding employee engagement and job performance, then use psychometrics to guide our decisions regarding the selection of appropriate measures of the particular variables we’re interested in, and then create a research study to test these relationships in a carefully selected sample. In the context of natural language processing, theory instead addresses the question, “What is the most efficient way to predict other variables from text data?” The researchers working on natural language processing don’t particularly care what the other variables actually are or represent any more than you care about how to calculate probabilities from the cumulative distribution function of the normal curve, despite doing exactly that every time you ask for a p-value from SPSS or R. It’s just not part of their job description. Instead, they refer to their own theory and conclude, “adding well-considered optimization procedures after regression increases prediction parsimoniously.” It’s not that they don’t know why a particular optimization procedure helped prediction; it’s that they don’t (usually) care. Or more specifically, they don’t care unless that procedure can be changed to improve prediction further (i.e., an extension of theory). One way to test your own position on this is to answer the following question: Do we need a theory describing how and why every word in the English language relates to every other word before we try to extract meaning from text?
For IOs, all of that means we don’t necessarily need to know all of the mathematics underlying a machine learning algorithm to understand how that algorithm works in a general sense and to judge whether or not to trust it with a particular application. The more you know about an algorithm, the less likely it is that you will make a silly mistake and overfit your model without realizing it, but that is not unique to machine learning. Similarly, knowing the formulas and having a working understanding of how ordinary least squares regression works helps avoid inappropriate, inaccurate regression modeling. But critically, both in machine learning and in all of our statistical procedures, there is a distinct point of diminishing returns.
For machine learning, you really just need to know enough to know when you’re in over your head. So if you’re learning machine learning to apply the concepts you learn to a real project, always remember that there is a point in the distance where it will be time to give up and seek a professional data scientist with a PhD in mathematics. But there’s quite a lot you can do yourself before you get to that point, if you ever do.
Let’s See It in Action
So, normally in this section, I would show you a video of myself using the technology, but I decided not to do that for machine learning, because no one wants to watch a video of me writing code for 10 minutes. Instead, I’ll show you a couple of examples in R, which you can execute yourself, demonstrating how to execute machine learning algorithms. If you haven’t used R before, you might consider checking out my previous Crash Course on R. But to be honest, if you’re just getting into machine learning and don’t know R already, you’re better off learning Python instead of R for a few reasons, most importantly (a) because Python has a much easier time handling very large datasets and (b) because Python is much better designed as a programming language than R, which makes it easier to learn and apply. But because many more IOs know R than Python, I’ll stick to R for this demonstration.
I’m first going to run a simple ordinary least squares regression analysis on a small dataset.
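data(randu)  # load the built-in randu dataset
regress_model <- lm(formula = y ~ x + z, data = randu)  # OLS regression of y on x and z
summary(regress_model)  # print coefficients and fit statistics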
If you’ve used R before, this is easy. The first line is a command to draw up the “randu” dataset from R’s built-in library of datasets. This creates a data frame called “randu” containing a 400-case dataset with three variables: x, y, and z. These variables were each pseudorandomly created, so we won’t expect to find any relationships between them.
The second line invokes the linear model function to execute an ordinary least squares multiple regression analysis, y on x and z, and stores it in a new variable called regress_model.
The third line prints up summary statistics about the model. We’ll focus on this particular output:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.54092    0.03897  13.882   <2e-16 ***
x           -0.04633    0.05166  -0.897     0.37
z           -0.06336    0.05276  -1.201     0.23

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2936 on 397 degrees of freedom
Multiple R-squared: 0.005961, Adjusted R-squared: 0.000953
F-statistic: 1.19 on 2 and 397 DF, p-value: 0.3052
From that output, we can conclude that y’ = 0.541 - 0.046x - 0.063z, and that R^2 is a stunning .006. Now let’s try the same thing using a machine learning toolkit called caret.
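install.packages("caret")  # one-time install; skip if caret is already installed
library(caret)  # load caret into memory
trained_model <- train(y ~ x + z, data = randu, method = "lm")  # same OLS model, fit through caret
summary(trained_model)  # summary of the final fitted model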
Here, the first line installs the caret package if you don’t have it already, and the second line loads it into memory. The third line actually runs a function from caret called train, which as you might guess is used to train a model and stores the result in a variable called trained_model. We specify the same regression model as before (y ~ x + z), the dataset (randu), and the machine learning algorithm to apply (in this case, the same linear modeling algorithm we used before). The fourth line brings up a summary that looks awfully familiar:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.54092    0.03897  13.882   <2e-16 ***
x           -0.04633    0.05166  -0.897     0.37
z           -0.06336    0.05276  -1.201     0.23

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2936 on 397 degrees of freedom
Multiple R-squared: 0.005961, Adjusted R-squared: 0.000953
F-statistic: 1.19 on 2 and 397 DF, p-value: 0.3052
Congrats! A minute ago you were an ordinary IO psychologist running vanilla ordinary least squares regression but now you’re a data scientist applying a supervised machine learning algorithm. Add that to your vita immediately!
As you hopefully noticed, the output from those two approaches was completely identical. As I mentioned earlier, machine learning with an ordinary least squares regression algorithm is the same as ordinary least squares regression the way we learned it in graduate school. So why bother with caret?
The answer is that the caret framework then makes it easy to add and test tuning parameters, to change algorithms to examine varying levels of fit between them, and to run models much more complex than ordinary least squares regression.
So let’s go back to Figure 1. If we follow the path given our data (predicting quantities with less than a hundred thousand cases), we find ourselves at the “few features should be important” decision point on the right (an IO might phrase this as “do you care about parsimony?”). Let’s say that we don’t, which points us to “SVR [support vector regression] (kernel='linear').” Sounds fancy.
How do we do this with caret? It’s surprisingly simple. We only change the method parameter. Everything else is automatic.
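svr_model <- train(y ~ x + z, data = randu, method = "svmLinear3")  # linear support vector regression (uses the LiblineaR package)
svr_model  # print resampling results across tuning parameters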
Note that summary() does something very different with the result of this analysis, so instead we ask for the svr_model variable we created to get meaningful information (at least, meaningful to an IO!). Here’s what it returns:
L2 Regularized Support Vector Machine (dual) with Linear Kernel
400 samples
  2 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 400, 400, 400, 400, 400, 400, ...
Resampling results across tuning parameters:
  cost  Loss  RMSE       Rsquared
  0.25  L1    0.2982899  0.009883891
  0.25  L2    0.2984533  0.005443532
  0.50  L1    0.2980380  0.008861607
  0.50  L2    0.2987358  0.006089387
  1.00  L1    0.2978985  0.007793522
  1.00  L2    0.2987444  0.006030348
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were cost = 1 and Loss = L1.
As you can see, the R^2 in the best-tuned model has increased by roughly 30% (an increase from .006 to .008). Importantly, you may see different values when you run this yourself, because a random number seed is used to start this procedure and only 25 bootstraps were used. Additionally, both modeling and tuning were done on the same dataset, so we’d expect these values to be a bit inflated. So to deal with that, you would normally at this point cross-validate all of the models you developed on the training dataset on a test dataset to see which was most accurate when predicting y in data the models were not trained on. There are in fact ways to do this automatically using caret. In your cross-validation of these data, you would likely discover that this model works no better than our original ordinary least squares approach, as expected when modeling random numbers. But if you’re working through these examples with your own data, perhaps you’ll discover something else.
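One way the hold-out comparison described above might look in caret is sketched below. It assumes the caret and LiblineaR packages are already installed; the 75/25 split, the seed, and the object names are arbitrary choices made for this illustration, and caret offers other resampling options that are not shown here.

```r
library(caret)   # assumes caret and LiblineaR are already installed

data(randu)
set.seed(2016)   # arbitrary seed so the split and resampling are reproducible

# Hold out 25% of cases that neither model sees during training or tuning
in_training  <- createDataPartition(randu$y, p = 0.75, list = FALSE)
training_set <- randu[in_training, ]
test_set     <- randu[-in_training, ]

# Train both algorithms on the training cases only
ols_model <- train(y ~ x + z, data = training_set, method = "lm")
svr_model <- train(y ~ x + z, data = training_set, method = "svmLinear3")

# Compare predictive accuracy (RMSE, R-squared, MAE) on the held-out cases
ols_accuracy <- postResample(pred = predict(ols_model, newdata = test_set),
                             obs  = test_set$y)
svr_accuracy <- postResample(pred = predict(svr_model, newdata = test_set),
                             obs  = test_set$y)
rbind(OLS = ols_accuracy, SVR = svr_accuracy)
```

Because the variables in randu are pseudorandom, neither row of the resulting table should be expected to show meaningful prediction.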
Because machine learning algorithms often don’t produce interpretable, usable, or at least practical formulas, the next step is to ask R to use the results of the support vector regression we just ran (i.e., what the machine learned, itself a new algorithm) to predict values in a new dataset. Because we only have our training set available, we’ll just use our newly developed algorithm to predict y in that dataset for this demonstration:
predicted <- predict(svr_model, randu)  # predicted y for each case in randu
predicted  # print the predicted values
If you want more information about how a particular algorithm works, you will want to start on this page, which lists all the algorithms available within caret. There, you’ll see that svmLinear3 came from the LiblineaR package, and the command ??liblinear in R will lead you to help pages explaining what it is and how it works plus references to the academic publications that support it.
Although this just barely scratches the surface of machine learning, I hope you can see that it is not fundamentally different from the statistical approaches you’ve already been using. Most critically, I hope you can see that there’s no great barrier to diving in to test a few algorithms and see what they do with your own datasets.
So Who Should Learn Machine Learning?
There’s currently a lot of interest in machine learning, and according to the practitioners I talked to, it’s popping up throughout the IO world. Despite this, I found several of these practitioners either unwilling or unable to talk about how they were personally using it as a result of real or perceived employer restrictions. As one anonymous interviewee put it, “A lot of people see [machine learning] as their ‘secret sauce,’ which is funny, because no one can recreate your model without your data, and the information for making your own model is out there already.” The mindset described is a little silly, because as I described and demonstrated above, machine learning algorithms are fundamentally just different algorithms from the ones we already use. I imagine the current level of secrecy as similar to the use of statistics in business in the old days: “Don’t tell them we use multiple regression!!” Yet this is the world in which we now find ourselves.
Undoubtedly, the best uses of machine learning are when you are conducting exploratory or essentially exploratory analyses on large datasets. If you have more than 10,000 cases in any dataset, you have an excellent opportunity to see how differently these newer machine learning algorithms perform in comparison to traditional algorithms you typically use. Additionally, if you have mountains of text data, there’s a great deal of potential to predict other variables from those data. If you never see a dataset with more than 500 cases, you’re probably not going to find much here that you couldn’t already do with the standard IO toolkit. But in general, among those that had explored machine learning, I found a great deal of enthusiasm:
Although classic IO methods have their place and machine learning methods also have downsides, IOs need to be more proactive about exploring these methods and creating guidance for how to utilize them in the IO space. Adverse impact and validity are of course critical concerns, but the way we ensure that these concerns are heard is to make sure we have a seat at the table by getting involved in the use of these methods and guiding best practices.
Like this practitioner, I think it’s important that we as a field don’t bury our heads in the sand. The best way to avoid that is to work directly with data scientists, and a little knowledge of machine learning goes a long way in this sort of collaboration. As an anonymous practitioner told me, by gaining some basic machine learning skills, “I have brought the IO perspective on what constructs we should be investigating, how to clean the data to get our raw data closer to those constructs, and interpreting the results through the IO lens. We often work on a project at the same time and in some cases either of us could handle a task and in others one of us will work within our area of expertise.” It is through these sorts of demonstrations of mutual competence that trust is created, and trust is the first step to the collaborative, interdisciplinary efforts that will produce the greatest value for organizations.
As an applied example, Matt Barney, founder and CEO of LeaderAmp, describes how he has woven together machine learning and IO to bring value to his consultancy:
[We use] machine learning to do assessment in a novel way that avoids the sexism, racism and atheoretical problems that others famously have (e.g. Google). Our early alpha release can use Natural Language ML to unobtrusively assess people with Cialdini's principles of persuasion, and then gives them specific feedback using his science. Our approach blends a Rasch Measurement approach that ensures the lack of DIF/DFT, with an approach I've developed originally for virtual worlds, and naturalistic rating situations called Inverted Computer Adaptive Measurement.
To Learn More
By a wide margin, the resource most recommended to dive into machine learning is a Coursera MOOC by Stanford professor Andrew Ng, which you can find here. It covers machine learning concepts in a fairly friendly way, primarily with videos interspersed with short quizzes. However, there are two aspects of this MOOC that might make it unattractive to a general IO audience. First, he does not hold your hand at all with regard to mathematics; there is a “brief review” of linear algebra, and then you are thrown into the deep end. Second, the primary statistical programs used are Octave and MATLAB, which are not mainstream choices in the IO community.
As an alternative to that, for those of you familiar with R or Python, I’d recommend a trial by fire; specifically, I suggest opening up the largest dataset you have available and trying to apply some machine learning algorithms to see what happens. Unfortunately, if you don’t already know at least one statistical programming language, you’re going to have a difficult time getting into machine learning. If that’s you, I’d recommend learning R first, because that’ll be useful to an IO like you regardless of whether or not you continue pursuing machine learning.
If you don’t have a good dataset to practice on, that’s less of a problem. I recommend trying DataCite, which has a search tool to locate datasets, or alternatively Kaggle.
Once you have a dataset to play with and software to use, tutorials on machine learning abound. Google will turn up hundreds. If you’re using R, I suggest starting with Chapter 1 (the only free one) of this course developed by the creator of the caret package. Alternatively, if you just want to get down to business, this tutorial is short and sweet, and you will get a very good window into the mindset of a data scientist when approaching a dataset for analysis.
If you’re the Python sort (if you don’t know Python, get started here), you’re going to become very familiar with the scikit-learn library, which contains everything you’re likely to need. With that installed, you might start with this tutorial provided in that library’s documentation.
Conclusion
That’s it for the third edition of Crash Course! If you have any questions, suggestions, or recommendations about machine learning or Crash Course, I’d love to hear from you (rnlanders@odu.edu; @rnlanders).
Reference
Campion, M. C., Campion, M. A., Campion, E. D., & Reider, M. H. (in press). Initial investigation into computer scoring of candidate essays for personnel selection. Journal of Applied Psychology. Advance online publication. http://dx.doi.org/10.1037/apl0000108