A Crash Course in I-O Technology: A Crash Course on the Internet
Richard N. Landers
This issue, we’ll be taking a step back to explore something you probably don’t think about a whole lot: How does the Internet work? In my experience, I-Os tend to treat the Internet a lot like they treat their car: a tool to get from A to B. When something goes wrong, and it always eventually will, you don’t usually bother to figure out why; you just call your tow truck to drag it over to someone who will do that for you. In the case of the Internet, you might call your IT person or your Internet service provider if your access becomes slower than you expect it to be or goes out, but otherwise, you tend to ignore little problems. When’s the last time you called someone because you couldn’t use a webpage the way you thought you should be able to? Never, that’s when.
The problem with ignoring the complexities of how the Internet functions is that these days, the quality of our data depends upon the functionality of and interactions between many different Internet subsystems that we don’t usually understand. That has a variety of subtle—and sometimes not-so-subtle—impacts. For example, did you know that it’s literally impossible to design a web page that looks the same no matter who views it? I mean “literally” literally—it really is impossible. That has significant implications for both science and practice in I-O psychology. In science, you can never be sure that an Internet-delivered experiment or survey looks the same to one person as it does to another. In practice, you can never truly be sure that one job candidate’s assessment looks the same as another’s. But why? Just how bad can those differences be? Well, that’s why understanding how the Internet works is so important, and there are many such insights you’ll begin to collect once you start paying attention.
Understanding how the Internet works is also a foundational skill on which many other, more specific skillsets are built, and those skillsets are becoming increasingly relevant to I-O psychology research and practice. For example, if you want to scrape web pages for observational data, capturing Internet behaviors in a naturalistic study, you’d better be sure you understand what a web page is before you try. If you want to be able to request meaningful changes from your IT team in terms of how webpages are displayed, you’d better be able to speak their language or you will never end up with the changes you want. So, in both my graduate-level data science course and in the professional workshops I teach, a basic primer on “what is the Internet” almost always comes early. This article will be a bit shorter than those treatments, but I still think you’ll find it useful. It’ll also allow me to cover a few slightly deeper topics in a couple of future articles I’m planning for this column.
The Physical Internet
So, what is the Internet? Physically speaking, the Internet is a huge number of computers (current estimate: 70 billion by 2020) that are interconnected via some sort of data transfer conduit: copper wire, fiber optic cabling, wireless radio, or satellite transmitter/receiver. Any Internet-accessible computer is part of that system through either a network card that is physically connected to a wire or a wireless network radio that talks to an access point that is physically connected to a wire. Thus, in a sense, the Internet is a spider web of copper and fiber optic wiring that is literally worldwide. For any computer yours can talk to via the Internet, you could physically follow the wireless access points and wires starting at your computer and end up at that computer. Computers talk to each other across this spider web using an agreed-upon set of standardized rules called transmission control protocol and Internet protocol (TCP/IP). Each segment of that network has a theoretical maximum capacity and speed, based on the thickness of and material used in the wires involved, as well as the processing speed of the computers on either end. With current infrastructure, requests can traverse the United States in less than a quarter of a second. Since the information itself travels at the speed of light in its medium (e.g., the speed of data transfer in fiber optic cabling is the speed of the light traveling through the type of glass or plastic used in that cabling), the slowdown from instantaneous is mostly caused by the computers in the middle of the path from requestor to requestee.
Request Routing and Encryption
When your web browser sends a request for a webpage, your request is first transferred to your local access point (in this context, called a node), which is asked to forward your request to the next closest node to your request’s destination. This node might be an actual computer or it could be a computing device, such as an Internet router. These requests are sent using TCP/IP rules, which specify how information is formatted to be transmitted and received over the Internet, but also with a second layer of rules on top called hypertext transfer protocol (HTTP) that contains additional rules for how to transmit webpages and related documents over TCP/IP. Each node then repeats this process until your request gets to where it’s supposed to go, at which point the response to your request is sent back using the same route in reverse. For example, from my home, any request I send to access http://www.siop.org goes through over 30 separate nodes (i.e., computers) before it gets there. This route is generally the fastest possible path between all those computers, and the segments in that journey are called hops. Importantly, the computer at the end of each hop gets the full text sent with my request. So, any text or files I submit using a form on a webpage get received in full by each of those computers along the way.
That leads to two important points in terms of network security:
- Intercepting requests at a node and sending back misleading information is a common hacking technique called a man-in-the-middle attack.
- The only way to avoid sending readable copies of your data to all the nodes you need to pass through is to encrypt your data first.
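To make the "readable copies" point concrete, here is a minimal Python sketch of the text an unencrypted web form submission might contain. The host name, path, and form fields are made up for illustration, and real browsers add many more header lines than this; treat it as a rough picture of the format, not an exact capture.

```python
from urllib.parse import urlencode


def build_plain_http_post(host: str, path: str, form: dict) -> str:
    """Return the text of a bare-bones, unencrypted HTTP/1.1 form submission."""
    # Form fields are packed into one line, e.g. name=Pat+Doe&answer=...
    body = urlencode(form)
    lines = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",
        "",  # blank line separates the headers from the body
        body,
    ]
    return "\r\n".join(lines)


# Hypothetical survey submission; every value below is illustrative.
request_text = build_plain_http_post(
    "www.example.org",
    "/survey/submit",
    {"name": "Pat Doe", "answer": "strongly agree"},
)
print(request_text)
```

Everything printed here, including the name and the answer, is ordinary text. Without encryption, that is what each node along the route would be able to read.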
Encryption can be explained with a simple example. Say you wanted to send the message, “TIP is amazing!” to your friend but you didn’t want anyone else to read it, so you invented the incredibly clever encryption technique of shifting all letters forward one position (wrapping z back around to a and leaving punctuation alone), making your message “UJQ jt bnbajoh!”. You send instructions on how to decrypt ahead of time to the person you’re trying to send your message to, and then later you hand your note to the person sitting next to you and ask them to pass it person-to-person to your intended target. None of the people passing the note can read it, but the person you’re sending it to can, because they have the decryption instructions. This is how HTTPS works (sometimes called “secure HTTP” or “secure browsing”) except that the encryption system is quite a lot more complicated and thus harder to crack. HTTPS is particularly important to e-commerce, because, for example, it’s relatively trivial for someone operating a node to write some code that grabs anything passing through that looks like a credit card number.
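For the curious, the letter-shifting idea can be sketched in a few lines of Python. This is only the toy cipher from the example, under the assumption that z wraps around to a and that spaces and punctuation are left untouched; it has nothing in common with the mathematics HTTPS actually uses. The function name `shift_letters` is my own.

```python
import string


def shift_letters(message: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    result = []
    for ch in message:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            # Spaces and punctuation pass through unchanged.
            result.append(ch)
    return "".join(result)


secret = shift_letters("TIP is amazing!", 1)
print(secret)                     # UJQ jt bnbajoh!
print(shift_letters(secret, -1))  # TIP is amazing!
```

The "decryption instructions" here are simply the shift amount: anyone who knows it can reverse the process, and anyone who doesn't sees gibberish (at least until they try all 25 possible shifts, which is why real encryption is far more elaborate).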
The way HTTPS works also means that if you don’t have the right decryption instructions (called a certificate), you can’t use HTTPS with a computer expecting you to follow those instructions. This is what leads to those “security certificate error!” messages you see in your web browser from time to time. Without a valid certificate, your browser can’t confirm who is really on the other end of the connection, so anyone in the middle could still theoretically snoop on what you’re doing. They probably won’t, most of the time, because usually no one cares what you do on the Internet, but that’s not always the case. To facilitate HTTPS connections, your web browser will verify the certificates it receives with a recognized, third-party, trusted certificate authority. This is why you sometimes get errors about a certain HTTPS website being untrusted; there’s a problem with its certificate based upon what was expected when you asked the certificate authority for verification.
How does your computer figure out where each request is going in the first place? Every device attached to the Internet has its own address using IP notation (as a reminder, IP is part of TCP/IP). There are two major versions of IP in use today: IPv4, which looks something like 18.104.22.168, and IPv6, which looks something like aaaa:bbbb:cccc:dddd:eeee:ffff:1111:2222. IPv6 is a successor to IPv4, designed because we ran out of numbers to use in IPv4. Each of the 4 numbers in an IPv4 address can take a value from 0 to 255, so there were only a total of about 4.3 billion possible addresses. IPv6 uses hexadecimal digits (values 0 to 15, represented as 0 to 9 and then a to f) and has 8 positions of 4 digits each, creating 3.4 x 10^38 possible addresses, which is the equivalent of 48 billion addresses for each atom in the human body. So, we won’t be running out again any time soon.
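The address arithmetic is easy to check for yourself. Below is a small Python sketch using the standard `ipaddress` module and the two example addresses from the paragraph above. The figure of roughly 7 x 10^27 atoms in a human body is a commonly quoted approximation that I am assuming here; it is not a precise value.

```python
import ipaddress

# IPv4: four numbers, each with 256 possible values (0 to 255).
ipv4_total = 256 ** 4

# IPv6: 8 positions x 4 hexadecimal digits = 32 digits, 16 values each.
ipv6_total = 16 ** 32

print(f"{ipv4_total:,}")    # 4,294,967,296
print(f"{ipv6_total:.2e}")  # 3.40e+38

# Assumption: about 7 x 10^27 atoms in a human body (rough estimate).
ATOMS_PER_BODY = 7e27
per_atom = ipv6_total / ATOMS_PER_BODY
print(f"{per_atom / 1e9:.1f} billion addresses per atom")

# The standard library can tell the two address styles apart.
v4 = ipaddress.ip_address("18.104.22.168")
v6 = ipaddress.ip_address("aaaa:bbbb:cccc:dddd:eeee:ffff:1111:2222")
print(v4.version, v6.version)  # 4 6
```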
When you send a request to a domain name, like http://www.siop.org, you are first sending a request to a domain name server (also called a nameserver) that translates “www.siop.org” into an IP address. Nameservers are managed and operated by Internet service providers, like Verizon or AT&T. The IP address retrieved from the nameserver represents the physical address of the computer your request is trying to find. The domain name itself is just for human readability; it doesn’t mean anything special, and you can register your own .com domain name for less than $20 per year.
Among domain names, only the last two pieces generally have unique IP addresses. For example, http://mysite.siop.org and http://special.siop.org and http://special.mysite.siop.org are all most likely the same physical computer. Because many people don’t realize this, hackers often use dressed up domain names to pretend to be other sites. For example, a hacker might send a link to http://www.siop.org.hack.me/ to fool you into thinking you were going to siop.org, whereas in reality, you were going to hack.me. The information after the / only specifies what folder and file you will access plus, optionally, a few special instructions.
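Here is a minimal Python sketch of the "read the last two pieces" habit, applied to the example addresses above. The helper name `last_two_labels` is my own, and the rule is deliberately naive: it would give the wrong answer for domains registered under multi-part endings such as .co.uk, so treat it as an illustration rather than a security tool.

```python
from urllib.parse import urlsplit


def last_two_labels(url: str) -> str:
    """Return the last two dot-separated pieces of a URL's host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


examples = [
    "http://mysite.siop.org",
    "http://special.mysite.siop.org",
    "http://www.siop.org.hack.me/",
]
for url in examples:
    print(url, "->", last_two_labels(url))
```

The first two addresses come back as siop.org, while the third comes back as hack.me, no matter how much "siop.org" has been stuffed in front of it.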
Is there anything special about this computer you’re requesting webpages from? Not really. Most servers, called servers simply because they service requests like the one I was just describing, are computers just like your desktop. These days, they typically run Linux or Windows. You can turn your own computer into a server just by installing some software to serve something. Commercial servers are usually a bit better at this than your desktop would be though, because they are designed for the sole purpose of serving. That mostly just means a fast CPU, a lot of memory, and no or limited graphics capabilities. Advanced data centers (another word for “lots of computers working together in a single building”) spread out some of the processing requirements by spreading the load to multiple computers. For example, your web request might be one of 10,000 per minute being processed, so the initial computer receiving the request will forward it to other computers to spread the workload. This sort of approach goes by many names, but a common one is cluster computing.
So, what does that server do when it gets your request? That’s easy: it looks up the file you asked for on its hard drive or in memory and sends it back to you. For typical Internet traffic, the first request received is for a hypertext markup language (HTML) file, which is computer code that explains to your web browser (e.g., Chrome, Edge, Safari) the structure and formatting of the webpage you asked for using the rules specified in the current HTML standards. Your web browser will read that file and harvest all the resources needed to display it to you, like images, videos, external code, and adware requests. New requests will be sent for all those resources; most of those requests typically go to the original server and a minority will go to other servers.
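To give a feel for what "harvesting resources" means, here is a rough Python sketch that reads a tiny, made-up HTML file and lists the extra files a browser would need to request. The sample page, the class name `ResourceFinder`, and the short list of tags it checks are all my own simplifications; real browsers look at many more tags and attributes than this.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect addresses of extra files referenced by an HTML document."""

    # Simplified assumption: only these tag/attribute pairs are checked.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "video": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(value)


# A made-up page: one stylesheet, one outside script, one image.
sample_html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="https://ads.example.com/track.js"></script>
</head><body>
<h1>Hello</h1>
<img src="/images/logo.png">
</body></html>
"""

finder = ResourceFinder()
finder.feed(sample_html)
print(finder.resources)
# ['/style.css', 'https://ads.example.com/track.js', '/images/logo.png']
```

Addresses that start with / would be requested from the original server, whereas the one starting with https://ads.example.com would trigger a request to a different server entirely.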
Because there’s a physical computer on the other end of your request, that computer can be overwhelmed with such requests. This is in fact the approach taken in another type of hacking called a denial-of-service (DoS) attack. In this sort of hack, a computer hammers a server with an enormous number of requests without processing any replies. This causes the server’s CPU usage to spike, creating a backlog of fake requests, and ultimately preventing it from serving any other, legitimate requests. However, DoS attacks are relatively easy to defend against, because once you realize all the requests are coming from a single IP address, you can simply refuse to answer any requests from that IP address. This led to the development of distributed denial-of-service (DDoS) attacks, which take advantage of networks of hacked computers, often called botnets. DDoS attacks are often used to ransom servers: pay up or we won’t let anyone use your webpage. In 2016, DDoS attacks affected numerous websites including those of five Russian banks, the Rio Olympics, the US presidential campaigns, and Dyn. Dyn was particularly noteworthy because Dyn is a domain name service provider; the sustained DDoS attack on Dyn, the largest in history to that point, caused many other websites to become inaccessible, including Etsy, Spotify and Twitter. Such attacks are likely to continue to get worse, which is why cybersecurity has suddenly become so important to the world economy.
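The "refuse to answer that IP address" defense, and the reason distributing the attack defeats it, can be sketched in Python as follows. The class name `SimpleBlocker`, the threshold of 3 requests, and the IP addresses are all invented for illustration; real servers use far more sophisticated rules (time windows, traffic patterns, and so on).

```python
from collections import Counter


class SimpleBlocker:
    """Refuse requests from any IP address that exceeds a request limit."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.counts = Counter()

    def allow(self, ip: str) -> bool:
        self.counts[ip] += 1
        return self.counts[ip] <= self.max_requests


blocker = SimpleBlocker(max_requests=3)

# One attacking computer sending 5 requests gets cut off after 3.
single_attacker = [blocker.allow("203.0.113.9") for _ in range(5)]
print(single_attacker)  # [True, True, True, False, False]

# 100 hacked computers sending 3 requests each all slip under the limit.
botnet_ips = [f"198.51.100.{n}" for n in range(100)]
botnet = [blocker.allow(ip) for ip in botnet_ips for _ in range(3)]
print(sum(botnet), "of", len(botnet), "botnet requests allowed")
```

The single attacker is stopped quickly, but the botnet delivers 300 requests without any one address looking suspicious, which is exactly why distributed attacks are so much harder to defend against.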
Rendering a Webpage in a Browser
As your web browser collects resources—and there are many potential delays as this happens—it tries to visually build the webpage, a process called rendering. Put that image there, this block of text here, format this area in bold, and so on. At this point, everything is static. Your web browser has sent some requests for documents, and once it received those documents, it began to construct a visual representation of the webpage. Importantly, the web browser has complete control of this. Whatever you send it, a person can use their web browser to turn on and off anything on the page. This sort of inline modification of what you see in relation to what you were supposed to see is how many browser plugins function, such as ad blockers. Even though the HTML is written so that web browsers will by default send requests for advertisements, these requests are blocked by the ad blocking plugin.
Why You Can’t Ever Be Sure Webpages Look the Same Across Computers
Here’s a good place to return to our original problem. Can you guess why it’s impossible to design a webpage so that it looks the same to everyone? The short answer is that it’s up to the web browser software to decide how to render the webpage, yet every web browser has little quirks that change how it displays webpages, and every user has different plugins and addons installed that change even that. In the days of the early web, back in the late 1990s and early 2000s, the quirks were a nightmare for web designers. Netscape Navigator could render webpages completely differently from MS Internet Explorer, so many web designers would build parallel versions of every website intended for different web browsers. These days, most web browsers will render the same HTML essentially the same, assuming no plugins are installed or user settings specified that change it. But there can still be slight differences, most especially in relation to fonts and spacing.
A big challenge is JavaScript (JS), the programming language that makes webpages interactive. Unlike the raw HTML that’s sent by the server, JS is client-side code. This means that it’s ultimately interpreted and executed by the person’s web browser. This has several important implications:
- People looking at your webpage can see your raw code. You can’t hide things in JS code. For example, if your selection assessment provides real-time feedback, and possibly even if it doesn’t, a job candidate may be able to reverse engineer the JS to figure out what the correct answers are.
- Execution times may differ, such that you can’t really trust reaction time data using JS. If a webpage renders faster on one computer than another, those times will differ for reasons completely unrelated to trait or state individual differences. Trusting online reaction time data will always be a judgment of “good enough.”
- JS can be turned off entirely, so you can never assume people have it turned on. For pages that require JS functionality, developers need to implement a test that only allows people to proceed if JS is turned on and functioning properly.
Putting It All Together
With all that background, let’s look at all of it at once. Going to the address https://siop.org/tip/jan17/crash.aspx implies your web browser will engage in the following exchange:
- YOUR BROWSER: Sends a request to your domain name server (provided by your Internet service provider) to figure out the IP address of siop.org
- DNS SERVER: Resolves the IP address and sends it back to you
- YOUR BROWSER: Sends a request to siop.org asking for its certificate.
- SIOP.ORG SERVER: Sends back its certificate.
- YOUR BROWSER: Sends a request to the certificate authority to verify the content of the certificate.
- CERTIFICATE AUTHORITY SERVER: Sends back confirmation that the certificate is legitimate.
- YOUR BROWSER: Sends an acknowledgment to siop.org that all future communication will be encrypted and asks for the contents of /tip/jan17/crash.aspx
- SIOP.ORG SERVER: Confirms that encryption will be used, and then goes into the tip folder, then into the jan17 folder, grabs crash.aspx and sends it back encrypted
- YOUR BROWSER: Interprets crash.aspx as an HTML file, harvesting all the other resources it will need. Sends requests back to siop.org for each image on that page. Also sends requests to any other servers needed, starting back at Step 1.
- SIOP.ORG SERVER: Sends back additional requested files.
- OTHER SERVERS: Send back requested files, managing encryption as requested (or not).
- YOUR BROWSER: Renders all of the information it gathers as it gathers it. Also executes any JS as soon as it gets it. For any future AJAX requests (requests that JS sends after the page has loaded), starts this process over from Step 1.
All of this happens behind the scenes without you realizing it, in a matter of seconds. That’s just how user friendly the web has become. But it means that every one of these steps has the potential to introduce both unintentional error and enable hacking, if not executed properly.
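If it helps, here is a short Python sketch that splits the address from the walkthrough above into the pieces each step relies on: the protocol (which decides whether encryption is used), the server name (which gets looked up), and the folders and file requested. The function name `describe_request` and the dictionary layout are my own choices for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def describe_request(url: str) -> dict:
    """Break a web address into the pieces used during a request."""
    parts = urlsplit(url)
    path = PurePosixPath(parts.path)
    return {
        "encrypted": parts.scheme == "https",    # https means encryption is used
        "server": parts.hostname,                # name sent to the nameserver
        "folders": list(path.parent.parts[1:]),  # skip the leading "/"
        "file": path.name,
    }


info = describe_request("https://siop.org/tip/jan17/crash.aspx")
print(info)
# {'encrypted': True, 'server': 'siop.org',
#  'folders': ['tip', 'jan17'], 'file': 'crash.aspx'}
```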
Let’s See It in Action
We can easily peek a bit under the hood of the Internet with tools provided (for free) by Google Chrome. If you don’t have Chrome already, download it here and then try each of the following to get a sense of the sorts of things going on behind the scenes of Internet browsing:
- Go to http://rlanders.net, right-click on any text you see, and select Inspect. (You can do this on any webpage, but some of the examples below will be easier to see on that page.)
- A panel will open. You can see the HTML code in the Elements tab. Try hovering your mouse over each element, and it will highlight where that piece of the HTML file is being rendered. Try to reverse-engineer some of the code—you’ll find that in many cases, it’s easier than you think.
- Next, click on the Console tab. This is where JS is being executed. Type the following, press Enter, and see what happens: alert("Hello world!")
- Next, click on the Sources tab. This is a list of all the requests that were sent when you requested this webpage. Drill down to see the raw content of any file that was requested.
- Next, click on the Network tab. This won’t have anything on it yet, so click the Refresh button (or on a PC, press F5). You will be shown a timeline depicting how long each request took. Web developers use this to identify bottlenecks that are slowing down their websites.
- Finally, click on the Performance tab and press Refresh again. This will build a profile that shows everything your web browser did on your behalf, over time. This will show not only how long it takes to process each request (which is what the Network tab showed you) but also how long each individual object on the webpage took to render and how much time was spent on each and in what order. In the case of JS, it’ll even show you how long each part of the code took to execute.
- Bonus points 1: If you’re on Windows, press Windows Key + R to bring up the Run dialog. Then type: tracert siop.org. You’ll see the full route of any request you send to siop.org as well as three estimates per hop on the amount of time it takes for requests from your computer to get to that hop. If you’re getting slow Internet speeds, this is an effective way to figure out where the problem is. Is it your router (first hop), your Internet service provider (look for hops with the name of your Internet provider, like Verizon or AT&T), the website you’re trying to access (the last hop), or somewhere in between? Importantly, sometimes speed estimates (called pings) are blocked by particular hops, so you may just see asterisks for those hops instead.
So Who Should Learn About the Internet?
These days? Everyone! The Internet is so fundamental to everything that we do as researchers and practitioners that you are doing yourself a grave disservice if you don’t try to at least lightly familiarize yourself with how it works. Plus at some point, you’ll say something that makes your IT person stare at you with that look—you know the one—the one that says, “I can’t believe I’m forced to help people like you.”
To Learn More
- HTML and CSS: https://www.codecademy.com/learn/learn-html-css (basic web design)
- jQuery: https://www.codecademy.com/learn/jquery (advanced client-side web programming)
- PHP: https://www.codecademy.com/learn/php (basic server-side programming)
An advantage to courses like those on Codecademy is real-time feedback; you code in your web browser, and your work is scored algorithmically less than a second after you submit it. As all I-Os know these days, there’s nothing better for changing your behavior than relevant, immediate feedback.
Even if you won’t be programming yourself, I left out a lot of valuable information in this overview. So, if you’d like to know more, and especially if you’re personally responsible for communicating between I-O and IT in your organization, you will want more “big picture” instruction, staying around the level of depth covered here. For that purpose, you might consider starting with this MOOC on Coursera.
That’s it for the fifth edition of Crash Course! If you have any questions, suggestions, or recommendations about the Internet or Crash Course, I’d love to hear from you (email@example.com; @rnlanders).