Meredith Turner / Thursday, September 21, 2017


Crash Course in I-O Technology: A Crash Course in Web Scraping and APIs

Richard N. Landers, Old Dominion University

This issue, we’ll be building on the ideas in my Crash Course on the Internet to understand two concepts that are very similar in purpose but very different in execution: web scraping and application programming interfaces (APIs). If you haven’t read the Internet article yet, I recommend you do so first, or some of the concepts I talk about here will not make a whole lot of sense. Both approaches concern harvesting data from the internet algorithmically, but they are used in different circumstances.

The first of these approaches, “web scraping,” is labeled by an evocative term. It sounds a bit like you’re taking a knife to the web and pulling the gunk off. In practice, web scraping is a fairly complex engineering task, an algorithmic targeting and harvesting of specific, desirable data. The goal of scraping is to convert unstructured data from the Internet into a (structured) dataset. To be clear, scraping does nothing that you couldn’t theoretically do by hand. You could individually open webpages and copy/paste key pieces of information in Excel or SPSS one at a time. The advantage web scraping brings over by-hand approaches is automaticity, which increases both speed and accuracy. For example, in one project, I scraped roughly N = 100,000 cases with near-perfect accuracy in about 8 hours, a task that would probably take undergraduate research assistants a few years with, I’d imagine, quite a bit less accuracy.

From a technical standpoint, web scraping thus involves quite a few steps:

1. Identify a data source that meets your needs (more on this later).
2. Determine how the data that you want are represented in that data source. If there is no API, you’ll be scraping.
3. Develop an algorithm to harvest those data (called a “scraper”).
4. Develop an algorithm to identify and traverse all the webpages in your data source that contain those data (called a “crawler”).
5. Run a few strategically selected test cases to be sure your algorithms are correctly written given your needs and revise until perfect.
6. Execute the two algorithms, crawling as many pages as you think contain data and scraping each one for the data it contains, building your final dataset.
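To make the scraper/crawler distinction in the steps above concrete, here is a minimal sketch in base R. It is an illustration under loud assumptions, not a production scraper: the "website" is a made-up set of three HTML strings held in memory (so nothing is downloaded), the page names and HTML layout are invented, and extract_all(), scrape_page(), and crawl() are hypothetical helper names. A real project would download pages and would likely use a proper HTML parser instead of regular expressions.

```r
# A toy "website" held in memory so the sketch runs without network access.
# Page names, links, and HTML layout are all made up for illustration.
site <- list(
  "index.html" = '<a href="p1.html">One</a> <a href="p2.html">Two</a>',
  "p1.html"    = '<h1 class="name">Ada</h1><span class="score">4.5</span> <a href="p2.html">next</a>',
  "p2.html"    = '<h1 class="name">Grace</h1><span class="score">3.9</span> <a href="index.html">home</a>'
)

# Helper: return the first capture group of `pattern` for every match in `html`.
extract_all <- function(html, pattern) {
  m <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  sub(pattern, "\\1", m, perl = TRUE)
}

# Scraper: turn one page's unstructured HTML into one structured row,
# or return NULL if the page does not contain the target data.
scrape_page <- function(url, html) {
  name  <- extract_all(html, '<h1 class="name">([^<]*)</h1>')
  score <- extract_all(html, '<span class="score">([^<]*)</span>')
  if (length(name) == 0 || length(score) == 0) return(NULL)
  data.frame(url = url, name = name[1], score = as.numeric(score[1]),
             stringsAsFactors = FALSE)
}

# Crawler: visit pages breadth-first, following links and skipping repeats.
crawl <- function(start, site) {
  queue <- start
  visited <- character(0)
  rows <- list()
  while (length(queue) > 0) {
    url <- queue[1]
    queue <- queue[-1]
    if (url %in% visited || is.null(site[[url]])) next
    visited <- c(visited, url)
    html <- site[[url]]
    row <- scrape_page(url, html)
    if (!is.null(row)) rows[[length(rows) + 1]] <- row
    links <- extract_all(html, '<a href="([^"]*)"')
    queue <- c(queue, setdiff(links, visited))
  }
  do.call(rbind, rows)  # stack the per-page rows into one dataset
}

scraped <- crawl("index.html", site)
print(scraped)
```

The crawler decides which pages to visit and the scraper decides what to keep from each one; keeping them as separate functions makes it easier to test each on a few pages before running both in full.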

The technical aspects of this procedure are in stark contrast to how you go about harvesting data from APIs. So, what is an API? APIs are data gateways, carefully controlled points of entry into website databases. You can issue specific commands to an API and get data back, depending upon what data you can access and what format the creator of the API chose to put it in.

It’s important at this point to be clear that APIs are not designed for you, the academic or organizational researcher. APIs are data gateways, designed for web applications to be able to talk to each other. If you’ve ever seen a “Like” button on a website that isn’t facebook.com, the reason is that that website is using the Facebook API to bring Facebook functionality into itself. This defines the audience for APIs: programmers that want to integrate functionality from another website into their own, in real time. Remember from my last Crash Course that your web browser makes requests in sequence based upon instructions embedded in the first HTML file you download. Thus, the programmer’s job is to ensure that their webpage creates an appropriately formatted API request given your identifying information and then returns something meaningful for you to see. As a researcher, you are essentially coopting resources created for this purpose and instead harvesting the data for yourself.

Given that, it can seem paradoxical that API access is more clearly legal than web scraping. The reason is that although you may not access an API for its originally intended purpose, you are still accessing data that the API creator says you are permitted to access. When you scrape from the web, this isn’t necessarily true; you can grab anything you can see in a web browser and throw it into a dataset, and it’s not always clearly legal to do that.

Thus, legality is a major advantage of using an API over web scraping. A second major advantage is the amount of work involved. APIs are providers of structured data. You ask for data; the API provides it, in a nice, clean, well-organized data file. If there is no API, there is no provision; you are taking data, and it’s up to you to figure out what to do with it once you get it.

That leads to a somewhat different workflow when you’re working with APIs:

1. Identify the data source that meets your need (still getting to this later).
2. Determine how the data that you want are represented in that data source. If there is an API, do not scrape, because scraping is much harder than writing API requests.
3. Read the documentation for the API, completely. Become familiar with request formats, data output formats, and parameters.
4. Write a few test API queries in your web browser (or in a web-based tool given by the API provider for testing purposes).
5. Develop an algorithm to generate all the API queries you will need.
6. Run all those queries and build a dataset with the results.

In both web scraping and using API calls, you must identify an appropriate data source, develop some code to grab data and run a few test cases, and then run it in full. So, we’ll take each step in turn.

Let’s See It in Action

Identify an Appropriate Data Source

The first step in both approaches is identifying an appropriate data source, and this is ultimately an external validity question. How do you know that the data you’re scraping or requesting is from a population that will generalize to the population you want to know about?

This is a complicated question. Tara Behrend and I tackled this new-sources-of-participants-vs.-external-validity issue a few years ago in an Industrial and Organizational Psychology Perspectives article that I recommend you check out for context. But in brief, Landers and Behrend (2015) suggested that what you need to think through are issues of range restriction and omitted variables bias (sometimes called endogeneity). I will briefly walk through both. First, if your research question concerns the effect of X on Y but the population you’re interested in only includes some possible values of X or Y, you have a range restriction problem. For example, if you wanted to know about the effect of online bullying (X) on online participatory behavior (Y), you might not want to use YouTube, because YouTube has a reputation for high intensity bullying behavior in its comment sections, potentially restricting the type of people that would be willing to comment on it in the first place (range restriction in Y). Second, if your research question concerns the effect of X on Y but it is plausible that Z is driving that relationship, you have omitted variables bias. For example, let’s imagine that we’re interested in estimating the effect of time wasting on social media (X) on job performance (Y) among I-O psychologists using Twitter content. Only somewhere around 4% of SIOP members have Twitter accounts, and it seems likely that there are systematic differences between twittering I-Os and nontwittering I-Os for many variables, like technological fluency, age, and personality traits. Whether we discover an effect or not for social media use, it’s difficult to reasonably generalize that finding to “all I-O psychologists” unless we can make a theoretical case that none of these differences are likely to covary with both social media and job performance. Instead, we might consider sampling Facebook for I-Os, because about 75% of the US population uses Facebook regularly. This doesn’t eliminate the possibility of an endogeneity problem, but it is much reduced simply by a change in sampling strategy.
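If range restriction and omitted variables bias feel abstract, a small simulation can help. The sketch below uses only base R and entirely made-up data: the effect sizes, the sample size, the seed, and the "above-median Y" selection rule are all assumptions chosen for illustration, not estimates from any real platform.

```r
set.seed(42)  # arbitrary seed so the run is reproducible
n <- 10000    # arbitrary simulated sample size

# --- Range restriction ---
# In the full population, x has a true positive relationship with y.
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
r_full <- cor(x, y)

# Suppose the data source only contains people with above-median y
# (for example, only people willing to comment at all).
keep <- y > median(y)
r_restricted <- cor(x[keep], y[keep])

# --- Omitted variables bias ---
# z (say, technological fluency) drives both x2 and y2;
# x2 has no effect of its own on y2.
z  <- rnorm(n)
x2 <- z + rnorm(n)
y2 <- z + rnorm(n)
b_naive    <- unname(coef(lm(y2 ~ x2))["x2"])      # z omitted from the model
b_adjusted <- unname(coef(lm(y2 ~ x2 + z))["x2"])  # z included in the model

print(round(c(r_full = r_full, r_restricted = r_restricted,
              b_naive = b_naive, b_adjusted = b_adjusted), 2))
```

In this setup, the correlation computed on the restricted sample should come out noticeably smaller than the full-population correlation, and the naive slope for x2 should look substantial even though x2 has no effect once z is accounted for. A data source can produce either problem without any mistake in your code.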

In the case of web scraping and API access, this is made somewhat more complicated because of the technical constraints of these procedures. Specifically, not only are there sampling concerns related to who uses the platform, but there will be additional concerns related to your access to those people. In the big data special issue of Psychological Methods, Landers, Brusso, Cavanaugh, and Collmus (2016) developed this idea into something called a data source theory, which is essentially a formal statement of all your assumptions about your data source that must be true to meaningfully test your hypotheses. For example, let’s continue trying to figure out if I-O psychologist use of social media is related to job performance. We would need a data source that, at a minimum:

1. Has an essentially random sample of I-O psychologists (or more specifically, missingness occurs randomly in relation to job performance and social media usage)
2. Gives us access to everyone in that sample’s social media content so that we can quantify it

Facebook is a good option given #1 above for the reasons I already laid out. But #2 is trickier; Facebook only gives you access to data that you would normally have access to using facebook.com directly. In practice, that means you have access to (a) open groups, (b) closed and secret groups of which you are a member, (c) public content on pages, (d) user information that has been shared with you according to privacy settings, and (e) your own profile. In general, if it could appear in your news feed, you can access it. But is there any I-O psychologist on Facebook that has access to a random sample of all I-O psychologists on Facebook based upon their privacy settings? Probably not, and that might rule out Facebook as a data source.

Importantly, you must be willing to give up on addressing a particular hypothesis or research question using scraped or API data if you ultimately do not find support for your data source theory. Internet-based data collection like this is a lot like an observational study; you need to build a theoretical case for why looking at this particular group of Internet users can be used to draw meaningful conclusions.

In the case of organizational research questions, this might be a little easier; your population might be “people we can reasonably recruit using Facebook/Twitter/LinkedIn,” which by definition will only include people that are already on those social media platforms. But be sure to state that formally first; don’t let it be an unnoticed assumption.

Run a Few Test Cases and Write Some Code

Writing a scraper and crawler is a complicated, iterative process with many roadblocks. It is something better delivered in workshop format than a column, so I’m not going to go into it here. Instead, I’ll show you how to grab data from Facebook in R. It’s much easier than you think.

First, you need to create what’s called a “token.” Tokens are a type of passphrase used to identify you when you access an API. That means you’ll generate a token via the Facebook webpage, then you’ll copy/paste that token into API requests as an identifier. To get your token, go to https://developers.facebook.com/tools/explorer/ and click on “Get Token/Get User Access Token.” Accept default permissions, and you’ll see a very, very long text passcode appear. That is your token. It will remain valid for about two hours. If you click on it, it will auto-highlight so that you can copy/paste it elsewhere.

This webpage you’ve accessed is called the “Graph API Explorer,” and it essentially allows you to “test” API requests before using them elsewhere. As I described above, this is really designed for web programmers, but we’re going to use it instead. So before heading to R, you’re going to spend some time here “crafting” your API request. To understand what these requests should look like, you’ll need to eventually read the Facebook API Documentation. But I’m going to give you a few examples for now.

Let’s say I want to grab all the posts from the official SIOP Facebook page. That page is public, so my Facebook account (and yours) has access to it. According to the documentation, I need a unique identifier number for the page to pull up its posts using the API. To get that number, paste this into the Graph API Explorer to the right of “GET”:

search?q=Society for Industrial and Organizational Psychology&type=page

This is where I hope you have already read my article on the Internet, because I’m going to skip a lot of background here and just tell you the new stuff. In addition to the server call, what you’re seeing is a command that looks like a document (“search”), the query operator (?), and then a series of variable/value pairs (variable 1 is called q with value “Society for Industrial and Organizational Psychology”; variable 2 is called type with value “page”), separated by &. This is a generic format for data sent to a server by URL called a GET request. When you execute it (hit Submit), you’ll see a bunch of output from your search. You can also get this same output by going to the following web address:

https://graph.facebook.com/search?q=Society for Industrial and Organizational Psychology&type=page&access_token=xxxx

Except: you must replace xxxx with the token copy/pasted from the Facebook Graph Explorer. If you do this correctly, you’ll see the same output in your web browser that you see in the Graph Explorer.
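If you would rather assemble that address in code than by hand, the sketch below shows one way to do it in base R. The helper name build_get_url() is hypothetical (it is not part of any package), and "xxxx" is still a placeholder for your own token. One detail the hand-typed version hides: spaces are not valid in a URL, so each value is percent-encoded with URLencode() from the utils package, which turns every space into %20. Your browser does this for you silently.

```r
# Build a GET request URL from a base address and named parameters.
# build_get_url() is a hypothetical helper written for this illustration.
build_get_url <- function(base, params) {
  # Percent-encode each value so spaces and symbols survive in a URL
  encoded <- vapply(params,
                    function(v) URLencode(as.character(v), reserved = TRUE),
                    character(1))
  # Join each variable to its value with "=", then join the pairs with "&"
  pairs <- paste0(names(params), "=", encoded)
  paste0(base, "?", paste(pairs, collapse = "&"))
}

url <- build_get_url(
  "https://graph.facebook.com/search",
  list(q = "Society for Industrial and Organizational Psychology",
       type = "page",
       access_token = "xxxx")  # placeholder; replace with your own token
)
print(url)
```

Keeping the parameters in a named list means a different search term or request type only requires changing one value, not editing a long string.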

This is how all API requests are made in practice; a request is sent to the API webpage, and data are returned. Servers then parse that return data and do something with it. When a web programmer integrates Facebook access into a webpage, what the programmer is really doing is constructing a URL, sending it, retrieving the data, and parsing it into appropriate output to send to your browser.

For now, notice that you have a bunch of ID numbers in that output (i.e., everything that matched your search term), but the one you want is the first one: 115024651712. Go back to the Graph API Explorer and replace your entire query string with this new API request:

115024651712/feed

You’ll get back data, and it should match the current output of the group page itself. Every word, photo, like, share, and so on is accessible somehow via this API. If you can see it on Facebook, you can get it algorithmically. But once we have figured out exactly what we want, how do we get that into a dataset? That’s why we need R.

Quick quiz: what would the URL request for this same data look like? Try working it out for yourself before looking ahead.

https://graph.facebook.com/115024651712/feed?access_token=xxxx

Switching over to R makes this even easier than it might otherwise be because there is already an R package that will create Facebook API queries for you, send them, and then convert them into a data frame format. It won’t grab literally anything (so for certain types of data, you might still need to manually create API requests), but it will do quite a lot. In this case, use the following very simple code to grab the same dataset we were just looking at, replacing “xxxx” with your access token:

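# Install the package once, then load it in each session
install.packages("Rfacebook")
library(Rfacebook)
# Paste the access token copied from the Graph API Explorer
token <- 'xxxx'
# Request up to 100 posts from the SIOP page, identified by its page ID
siop_df <- getPage("115024651712", token, n=100)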

You’ll see it send requests for 25 posts at a time—because that’s the maximum possible size for a single API request per the Facebook documentation—and at the end you’ll have a dataset stored in siop_df containing all the data you saw in the Graph Explorer. Done.
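The batching is simple arithmetic, sketched below in base R. This illustrates only how a request for n posts divides into batches of at most 25; it is not how the Rfacebook package is implemented, and batch_sizes() is a hypothetical helper name.

```r
# Split a request for `n` posts into batches of at most `page_size` posts.
# This illustrates the arithmetic only; it is not how Rfacebook is implemented.
batch_sizes <- function(n, page_size = 25) {
  full <- n %/% page_size   # number of complete batches
  rest <- n %% page_size    # posts left over for a final partial batch
  c(rep(page_size, full), if (rest > 0) rest)
}

print(batch_sizes(100))  # four batches of 25
print(batch_sizes(110))  # four batches of 25, then one of 10
```

Knowing the number of batches in advance is handy for estimating how long a large collection will take and how many requests it will use.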

Run Your Code in Full

In most cases, your final goal will not be to collect a single group’s or page’s posts but to make many API requests across many contexts. For example, you might want to grab every post from every page or group mentioning the word “I-O psychology.” In that case, you would need to algorithmically construct each intermediary dataset with appropriate API calls. For example, you might grab a list of group IDs from one series of API requests and store the results in a data frame; next, you might iterate over the data frame you just created to create a new data frame containing content from all the groups, one at a time. At this point, you are only bounded by your skill with R.
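The two-stage pattern just described can be sketched as follows. To keep the example runnable without a token or network access, the API call is replaced by a stand-in: fetch_posts() is a hypothetical function that fabricates two example rows per group, and the group IDs are made up. In a real script, that function body is where a package call or a hand-built API request would go.

```r
# Stand-in for a real API call: returns a small data frame of posts for one group.
# fetch_posts() and its output columns are hypothetical; the rows are fabricated
# placeholders, not real Facebook data.
fetch_posts <- function(group_id) {
  data.frame(group_id = group_id,
             post_id  = paste0(group_id, "_", 1:2),
             message  = paste("Example post", 1:2, "from group", group_id),
             stringsAsFactors = FALSE)
}

# Step 1: a data frame of group IDs, as if returned by an earlier search request.
groups <- data.frame(id = c("111", "222", "333"), stringsAsFactors = FALSE)

# Step 2: request each group's posts in turn; a failed request is skipped
# with a warning instead of stopping the whole run.
results <- lapply(groups$id, function(id) {
  tryCatch(fetch_posts(id),
           error = function(e) {
             warning("Skipping ", id, ": ", conditionMessage(e))
             NULL
           })
})

# Step 3: stack the per-group data frames into one dataset (NULLs are ignored).
all_posts <- do.call(rbind, results)
print(all_posts)
```

Wrapping each request in tryCatch() is a design choice worth keeping in real runs: with hundreds of requests, one deleted group or expired token should not cost you everything collected so far.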

So Who Should Learn About Web Scraping and API Access?

Frankly, if you have ever tried to copy/paste data from a webpage for any reason, you would do well to learn a little about web scraping and APIs. A general rule among computer programmers is that if you ever need to repeat the same action more than twice, you should abstract it. For example, if you find you need to copy/paste three or more times from a webpage or set of webpages that share similar characteristics, you should instead develop one single computer program to do that for you as many times as you need it to. (In general, this is also a good rule to live by when using R.)

The big new frontier for web scraping in applied I-O psychology appears to be in recruitment and selection; these techniques allow for you to automatically, with no user input, harvest information about people and their activities for inclusion in predictive models. That’s enormously popular, as long as you don’t care too much about constructs. This might ultimately make scraping more appropriate for recruitment, as a first-round candidate-identification screen-in tool, more so than as a selection-oriented screen-out tool. However, advances in predictive modeling might change that over the next few years; we can already get some personality data out of Facebook likes (Kosinski, Stillwell, & Graepel, 2013).

When hunting for interviews, much like what happened with my articles on machine learning and natural language processing, I ran into several people willing to tell me that their organizations were using scraping and API access to collect data but unwilling to go on the record in any detail about it. Instead, the people more excited to talk were ones who had used it in published research projects or as fun side projects.

Scraping is often tied closely to natural language processing, so many of the same people using NLP obtained that data using scraping. For example, Rudolph and Zacher (2015), who were the researchers that I described before looking at differences in affect toward different generations and the impact of this on workplace relationships using Twitter data, did not pull that Twitter data by hand; they used APIs. When I asked Cort Rudolph how they did it, he reported using the twitteR package, saying “it was very easy to scrape the data directly into an R dataframe… I remember thinking to myself ‘That was easy!’ when we got the data dumped originally.” Reflecting on the experience, he recommended I-Os explore the idea of grabbing data from the internet this way, “especially when paired with more advanced text analytics procedures.” He learned to do this simply by reading R library documentation; in most cases, as demonstrated above, downloading and converting data from a single API request into a data frame using a pre-existing R library requires less than five lines of code.

One of the most fun examples of API use comes from a Twitter bot called @DrCattell. Before explaining this, a little context for non-Twitter users: it’s common to create “fake” accounts on Twitter for entertainment-related purposes, and @DrCattell is an example of this. Another in the I-O space is @HugoMunsterberg, whose profile describes him as the “Recently reanimated founding father of Industrial-Organizational Psychology,” and he has a personality to match.

Love to be the fly on the wall as an #IOPsych academic tells client to scrap an 8-week long selection validation effort because p=.06! pic.twitter.com/hlrjmFgpTt
— Hugo Munsterberg (@HugoMunsterberg) August 12, 2017

@DrCattell is a bit different, though, in that in addition to being tweeted from by a real human being, @DrCattell also constantly scrapes things said by I-O psychologists on Twitter, runs them through natural language processing algorithms, and then posts new messages reconstructing “what an I-O would say.” However, this also sometimes leads to some hard truths:

IOs are not linguistically unique or very interesting?
— Bot-IO Says (@DrCattell) August 17, 2017

To Learn More

To learn more about web scraping, I recommend my own materials, because I’ve given several workshops on the topic already! You can find them at http://scraping.tntlab.org. To be honest, basic text scraping is doable in R, but once you need a crawler, or if you want to scrape files like images or videos, you are best off moving to Python. R just does not handle files very well, and it does not always scale to “big data” scope very easily either. To try out Python scraping, I’d recommend following the tutorial we put together for Landers, Brusso, Cavanaugh, and Collmus (2016), available at http://rlanders.net/scrapy/

If you just want to access APIs, you can do this in R much more easily. The first technical step is always to read the company’s API documentation. Remember that APIs are essentially computer programs written by the data host. That data host has an interest in providing good documentation because they want people to access their data and integrate it into their websites. You benefit from this; extensive documentation is available on almost every API you might want. You can generally just Google “company name API documentation” and find what you need. Regardless, here are some APIs to try along with R packages designed for each:

Once you have read the documentation, remember that the general technical approach is always the same. You’ll need to figure out how to craft an appropriate GET request, you’ll need to get authentication from the company and then pass credentials via that request somehow, and you’ll need to reformat the data once you receive it into something readable. If you can solve those three problems, you’ve built a dataset, and these R packages automate most of the steps involved!

Conclusion

That’s it for the sixth edition of Crash Course! If you have any questions, suggestions, or recommendations about scraping and APIs or Crash Course, I’d love to hear from you (rnlanders@odu.edu; @rnlanders).

References

Kosinski, M., Stillwell, D. & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110, 5802-5805.

Landers, R. N. & Behrend, T. S. (2015). An inconvenient truth: Arbitrary distinctions between organizations, Mechanical Turk, and other convenience samples. Industrial and Organizational Psychology: Perspectives on Science and Practice, 8, 142-164.

Landers, R. N., Brusso, R. C., Cavanaugh, K. J. & Collmus, A. B. (2016). A primer on theory-driven web scraping: Automatic extraction of big data from the internet for use in psychological research. Psychological Methods, 21, 475-492.

Rudolph, C. W. & Zacher, H. (2015). Intergenerational perceptions and conflicts in multi-age and multigenerational work environments. In L. Finkelstein, D. Truxillo, F. Fraccaroli & R. Kanfer (Eds.), Facing the challenges of a multi-age workforce: A use-inspired approach (pp. 253-282). New York, NY: Routledge.
