The Leading Edge: Using Excel to Clean and Prepare Data
R. Jason Weiss
Development Dimensions International
Robert J. Townsend
California State University, Fullerton
Cleaning data and preparing it for analysis is one of those thankless jobs that is dull, laborious, and painstaking, no matter which way you slice it. The cost of a mistake is considerable, too, as you will discover if you try to report an observed F of 317. We think the burden can be greatly reduced with some help from our old friend, Excel. It's true that many of us already use Excel to clean and prepare data for analysis, but our sense is that few people leverage Excel's considerable strengths in a systematic way. In this article, we describe a power user's approach to cleaning and preparing your data with Excel. We suggest a phased approach that produces analysis-ready data without destroying the original dataset. We'll also look at ways to document your dataset so that it will make sense when reviewed at a later point, or by other people. We conclude with a note about a presentation at the upcoming SIOP conference that needs your input!
A Phased Approach to Data Preparation
Why is cleaning and preparing data such a pain? Part of the problem is the lack of an easy, sensible, and common process. Another is the fact that people rarely document their datasets effectively: how often have you looked at a folder with three slightly different files, all named some variant of Final Project Data.dat, and wondered which one was the real dataset you used for your analyses 6 months ago? A third source of frustration, which follows from the second, is that it's easy to lose your place if you get distracted or have to correct a mistake you made several changes earlier.
"Oh, nuts! Did I just undo the recoding of those reverse-coded items? Better start over, just to be sure."
You and your data deserve better. We certainly won't say that the process we propose is ideal or necessarily suitable for all circumstances. We do feel, though, that it reflects some of the best practices we've discovered over many combined years of working with data in Excel. Plus, it's a process you can use consistently, which helps you in two ways: First, it replaces the need to reinvent the wheel every time you work with data. Second, once you know the process, you can quickly understand any data file created with it. Let's start with a brief look at the main steps in the process:
1. Create the data file. We will use several worksheets within a single Excel file to represent our data at each major stage of the process, from our initial raw data through several stages of transformation to the final, analysis-ready dataset.
2. Clean the data. In this stage, we remove any elements we don't want to leave in our dataset, such as
duplicate entries, out-of-range data, and extraneous characters. The outcome is a clean set of raw data.
3. Process the data. The processing stage is where we prepare the cleaned raw data for analysis through parsing, recoding, reformatting, and other actions.
4. Create an analysis-ready copy of the data. Here, we copy the final set of data for import into a statistics package.
5. Document the data. Finally, we add any necessary documentation to the data file so that the actions taken on the data are clear when the file is revisited by others or at a later date.
Step 1. Create the Data File
We recommend creating separate worksheets in an Excel data file for each logical step in the data cleaning process. This has a number of benefits. First, the original data and all transformations are preserved, so it does not require much effort to back up a step. Second, the worksheet labels make clear the main differences between the worksheets. Finally, you never have to play detective to figure out the differences between multiple files containing what look like the same data.
Following are the worksheets we will use:
Original Data. This worksheet contains the data as originally captured or entered. No actions will be taken on this worksheet except to copy the data to the next sheet, where we will clean and process it. This worksheet, then, exists solely to maintain a pristine copy of the base dataset.
Interim Data. This starts out as a copy of the original data, which is then cleaned and processed to produce analysis-ready data.
Final Data. This sheet contains a literal copy of the columns and rows of data you plan to use in your analyses. We discuss below why it is necessary to have a separate worksheet for your final data.
Setting up your worksheets is simple. By default, Excel opens with three worksheets available. Additional worksheets can be added to the workbook by selecting
Insert|Worksheet. The default names for the worksheets are Sheet1, Sheet2, and so forth. You can change these by double-clicking on the worksheet names on the tabs at the bottom of the screen, or by clicking
Format|Sheet|Rename. Note that Excel sometimes abbreviates worksheet as simply
sheet. Don't worry, the two terms are synonymous.
A quick word of advice on naming your worksheets: Excel permits spaces in worksheet names, but these become onerous in functions that refer to cells across worksheets. We suggest following a convention of using upper- and lowercase letters to suggest word separation in worksheet names. For example,
OriginalData is a perfectly legible worksheet name. Alternately, use underscore characters for spaces, for example,
Processed_Data. With that said, we will maintain spaces within worksheet names as a means of maintaining readability through the remainder of this article.
We will use the example of a standard data cleaning task in which we manipulate a single data file, displayed as Exhibit 1. The file we are using is based on fictional data culled from the equally fictional Weiss Circus Clown Selection Test-Revised, which has swelled to 10 items, two of which are reverse coded (see Weiss, 2004b for more information on the original WCCST). The file is available for download at
if you wish to follow along with the dataset.
Exhibit 1. Sample Dataset.
Step 2. Clean the Data
The goal of the data cleaning process is to preserve meaningful data while removing elements that may impede our ability to run the analyses effectively or otherwise affect the quality of the statistics that result. Candidates for removal include duplicate records, extraneous characters within cells, or out-of-range values. Note that we will be acting directly on the data during this phase, though we can always hit the undo button if we make a mistake. The first step is to copy the data over from the Original Data sheet to the Interim Data sheet. To do so, highlight all cells in the Original Data sheet directly or by typing CTRL+A on your keyboard. Copy the data and paste it into the Interim Data sheet.
Before we start cleaning, we will assign each row in the spreadsheet an ID number. This way, if we delete a row (presumably because it's a duplicate), we can tell from the gaps in ID numbers where the deleted rows were. It's a pretty simple process to create an ID: Insert a column to the left of column A and input the number 1 into the cell next to the first row of data (cell A2). Highlight the cells that need to be numbered, click
Edit|Fill|Series, input a step value of 1, and hit OK. In our example, the cells will now be numbered from 1 to 11. Now we're ready to proceed with the cleaning.
Manage duplicate records. A common step in preparing data for a stats program is searching for and removing duplicate entries. Our strategy is to create a key that uniquely identifies each person in the dataset; we will then sort the data based on this key and check if the key shows up in adjacent rows. We create the key by copying the values of the
First Name, MI, Last Name, and Street Address cells and concatenating them into one cell. Happily, Excel's CONCATENATE() function does all of the hard work for us. In Row 2 of the first empty column, enter the following formula:
=CONCATENATE(B2,C2,D2,E2). (Remember that the ID numbers we inserted now occupy column A, shifting the name and address fields one column to the right.) Now that we have our key, we need to search for duplicates. First, we need to sort by our keys. Click on
Data|Sort and select the column containing the key (Column R in our example). With the data sorted, we can then proceed to check if the key shows up in adjacent rows. In Row 2 of the first empty column, enter
=EXACT(R2,R3) and copy the formula down to the remaining cells. The EXACT() function returns
TRUE if the values it is given are identical, and FALSE otherwise.
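For readers who also script their cleaning outside Excel, the key-and-sort logic above can be sketched in a few lines of Python. The field names and sample records here are hypothetical stand-ins for our dataset, not the actual exhibit data:

```python
# Flag duplicate records by building a concatenated key per row,
# sorting on it, and comparing each key to its neighbor -- the same
# logic as CONCATENATE() followed by EXACT() on adjacent rows.
rows = [
    {"first": "Kay", "mi": "L", "last": "Rodriguez", "street": "12 Elm St"},
    {"first": "Bob", "mi": "Q", "last": "Smith", "street": "9 Oak Ave"},
    {"first": "Kay", "mi": "L", "last": "Rodriguez", "street": "12 Elm St"},
]

def dedup_key(row):
    """Concatenate the identifying fields into a single key."""
    return row["first"] + row["mi"] + row["last"] + row["street"]

rows.sort(key=dedup_key)
# A pair is flagged True when a key exactly matches the next row's key.
flags = [dedup_key(a) == dedup_key(b) for a, b in zip(rows, rows[1:])]
print(flags)  # -> [False, True]: the two Rodriguez rows are duplicates
```

As in the spreadsheet version, the sort guarantees that duplicates land in adjacent rows, so a single pass over neighboring pairs finds them all.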
Here's a power user's hint. In a large dataset, it can become tiresome to locate all the
TRUE cells. A simple way around this is to automatically format all the
TRUE cells to a different color. Select the test column and click on Format|Conditional Formatting. Arrange the drop-down boxes to read
Cell Value Is Equal To TRUE. Select the Format button, followed by the
Patterns tab, then choose the color to highlight the cell if the function value is true. As Exhibit 2 shows, the conditional formatting has highlighted a duplicate entry for Kay Rodriguez.
Exhibit 2. Spreadsheet after searching for duplicates
Strip out undesirable characters. Often, our data have undesirable characters that are useful for visually displaying the information but can trip up statistical analyses. Consider the phone number column, for example. We want to remove the unwanted periods, dashes, and parentheses so that every phone number contains numbers only. An easy way to do this is to use Excel's
Find/Replace functionality. Start by highlighting Column F, which contains the phone numbers. Next, select
Edit|Replace and click the Options button for the advanced view. Enter a dash in the
Find what field, leave the Replace with field blank, and make sure that the
Match entire cell contents box is unchecked. When you hit the Replace all button, all dashes will be removed. Follow the same process to remove spaces, periods, parentheses, and so forth.
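The same stripping can be done in one pass with a regular expression rather than one Find/Replace per character. A minimal Python sketch (the sample numbers are invented):

```python
import re

# Strip everything but digits from a phone number -- the batch
# equivalent of the repeated Find/Replace passes described above.
def clean_phone(raw):
    return re.sub(r"\D", "", raw)  # \D matches any non-digit character

print(clean_phone("(412) 555-1234"))  # -> "4125551234"
print(clean_phone("412.555.1234"))    # -> "4125551234"
```

Because the pattern removes any non-digit, you need not enumerate dashes, periods, parentheses, and spaces separately.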
If you need to repeat this cleaning process often, you can record a macro to take the drudgework out of it. See Weiss (2004a) for more information on macros. To record your macro, start by selecting a cell in the column that has characters that need to be replaced. Click
Tools|Macro|Record Macro and select the shortcut key that you would like to use; in our example we will use
CTRL+E. After you click OK, a small toolbar will appear with two buttons,
Stop and Relative Reference. The Stop button stops the macro recorder. The
Relative Reference button requires some explanation. When the Relative Reference button is selected, the macro will start relative to the currently active cell. If it is not selected, macros will always begin from the same absolute position on the worksheet. In this example, we want to select relative references so that we don't end up cleaning the same column of data every time we invoke the macro. The macro recorder captures all of your activities within Excel until you click the
Stop button. Simply follow the find/replace process outlined above for each character you wish to remove, and then press the
Stop button on the macro toolbar at the end. You can then use CTRL+E
to run the macro anytime you need to clean extraneous characters from your data.
Locate out-of-range values. Ensuring that your data are in the correct range is critical. Because the WCCST-R items range in value from 1 to 5, observing a 6 in the dataset indicates real cause for concern. In the case of telephone numbers, 10-digit telephone numbers are useful; 9- or 11-digit numbers require further attention and possibly a review of the source data. One easy way to find telephone numbers with too many or too few digits is to use the
LEN() function to count the number of characters. Enter =LEN(F2) into Row 2 of a new column and copy the formula down the column. Next, use conditional formatting as described above to highlight out-of-range phone numbers for further attention.
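Both range checks translate directly into simple predicates. The following Python sketch is illustrative only; the thresholds mirror our example (10-digit phone numbers, items scored 1-5):

```python
# Two simple range checks, in the spirit of =LEN(F2) plus conditional
# formatting: flag phone numbers that are not 10 digits, and item
# responses outside the valid 1-5 WCCST-R range.
def bad_phone_length(phone):
    return len(phone) != 10  # LEN()-style check on a cleaned number

def out_of_range(value, low=1, high=5):
    return not (low <= value <= high)

print(bad_phone_length("412555123"))  # 9 digits -> True
print(out_of_range(6))                # -> True
print(out_of_range(3))                # -> False
```

Rows flagged True are the ones that merit a look back at the source data.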
Step 3. Process the Data
Our main goal in this stage is to refine the data for our eventual statistical analysis. We will illustrate how to parse data from one column into several others, and how to recode and reformat data for consistency. This is, of course, just the tip of the iceberg; there is a vast array of activities you might undertake when processing data. We intend merely to illustrate some of the more compelling possibilities that Excel enables.
Parse data. There are a number of ways to parse the data in one cell and return the output to others. Possibly the simplest is Excels built-in parsing wizard, which can be found under
Data|Text to Columns. The wizard splits cell data at delimiters that you specify, such as commas, tabs, or spaces, and puts the output into separate cells. Consider, for example, the
City, State ZIP data. We parse this into separate columns by running the wizard and selecting the comma as a delimiter to separate the city name from the state and ZIP code. Next we run the wizard again and specify the space as a delimiter within the state and ZIP code column output by the first wizard. It may have occurred to you that we could try running the wizard just once and having it parse on both the space and comma. This would work for most data. However, if the data includes city names with internal spaces (e.g., Los Angeles), each component will get its own column. The two-part process is a step more laborious but also more effective.
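The two-pass idea — comma first, then space — can be sketched in Python as follows (the sample address is invented):

```python
# Two-pass parse of a "City, State ZIP" field: split on the comma
# first so city names with internal spaces (e.g., Los Angeles)
# survive intact, then split the remainder on the space.
def parse_city_state_zip(field):
    city, rest = field.split(",", 1)          # pass 1: comma delimiter
    state, zip_code = rest.strip().split(" ", 1)  # pass 2: space delimiter
    return city.strip(), state, zip_code

print(parse_city_state_zip("Los Angeles, CA 90001"))
# -> ('Los Angeles', 'CA', '90001')
```

Splitting on both delimiters at once would have broken "Los Angeles" into two columns, which is exactly the pitfall the two-pass wizard approach avoids.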
Outside of the wizard, Excel has a number of functions that let you parse characters directly. For example, the
LEFT(), RIGHT(), and MID() functions return a specified number of characters from a target. The difference between them is in where they begin counting, and in which direction. The
LEFT() and RIGHT() functions pull the left-most and right-most n characters from a target, respectively. The
MID() function starts at a point you specify within a target and returns the next n characters. For example, if we wanted to pull the area code information out of a phone number, we would enter the following formula into a new cell:
=LEFT(F2, 3). This formula would then extract the left-most three characters from cell F2.
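For comparison, string slicing gives the same three behaviors. This sketch defines left(), right(), and mid() as hypothetical Python analogues of the Excel functions:

```python
# String-slicing analogues of Excel's LEFT(), RIGHT(), and MID().
def left(text, n):
    return text[:n]

def right(text, n):
    return text[-n:]

def mid(text, start, n):
    # Excel's MID() is 1-based, so shift the start position by one.
    return text[start - 1:start - 1 + n]

phone = "4125551234"
print(left(phone, 3))    # area code -> "412"
print(mid(phone, 4, 3))  # exchange  -> "555"
print(right(phone, 4))   # line      -> "1234"
```

The only subtlety is MID()'s 1-based start position, which the slice arithmetic accounts for.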
Recode data. Excel does not have anything built expressly for the purpose of recoding data, such as SPSS's Recode command. However, if your data are simply reverse coded, you can write a quick and easy formula to realign the data. Consider, for example, WCCST-R Item 3, which ranges from 1 to 5 and is reverse coded. Entering the formula =6-J2 in Row 2 of the first available column and copying the formula down to the rest of the rows does the trick nicely, turning 5s to 1s, and so forth. Well, it works in all the rows except for Row 2, where there is some missing data and the formula would return an undesirable value of 6. To protect against this, we need to use a slightly more complex formula, as follows:
=IF(ISNUMBER(J2),6-J2,NA()). This formula uses the IF() function, which tests a condition (ISNUMBER(J2): is the value in cell J2 a number?), and returns the reverse-coded value (6-J2) if the condition is true, or an N/A error if it is not. When imported into SPSS, the
N/A error is interpreted as a missing value. Similar use of the IF() function could help ensure that the area code is only extracted from phone numbers that consist of 10 digits, as in the parsing example above.
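The guard-then-recode pattern looks much the same in Python. This is a minimal sketch, with None standing in for Excel's #N/A error:

```python
# Reverse-code a 1-5 item, returning None for missing data --
# the same guard as =IF(ISNUMBER(J2),6-J2,NA()).
def reverse_code(value, max_plus_one=6):
    if isinstance(value, (int, float)):
        return max_plus_one - value
    return None  # stands in for Excel's #N/A

print(reverse_code(5))   # -> 1
print(reverse_code(1))   # -> 5
print(reverse_code(""))  # -> None (missing response)
```

Without the type check, a blank cell would silently recode to 6, the very out-of-range value the IF() wrapper is there to prevent.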
Compute new values. Most readers should be familiar with functions like SUM()
or AVERAGE(), which return the sum and average of a range of cells, respectively. There are many other functions that can be leveraged to populate new variables, from the simple
MIN() and MAX(), which return the minimum and maximum values within a range of cells, to the somewhat more complex
PERCENTRANK(), which returns the percentage rank of a value within a larger range of cells. It would simply take too long to visit all of the functions, and so we suggest again that you take some time to explore them using the Excel function wizard, accessible by selecting
Insert | Function, and/or by locating a good reference on Excel functions. We list several at the end of this document.
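As a taste of what such computed variables look like in code, here is a rough Python analogue of PERCENTRANK(), alongside the built-in min() and max(). This implements one common definition of percentage rank — the share of non-tied values falling below x — and is a sketch, not Excel's exact algorithm:

```python
# A rough analogue of Excel's PERCENTRANK(): the proportion of the
# other values that fall below x (ties excluded from both counts).
def percent_rank(values, x):
    below = sum(1 for v in values if v < x)
    above = sum(1 for v in values if v > x)
    return below / (below + above)

scores = [1, 2, 3, 4, 5]
print(min(scores), max(scores))  # MIN()/MAX() analogues -> 1 5
print(percent_rank(scores, 4))   # -> 0.75
```

For scores with no ties, a value beating 3 of its 4 peers ranks at 0.75, matching the intuition behind a percentile-style statistic.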
Reformat data. One way in which Excel can ease annoyance is by helping you impose a consistent format on your data. Consider text case, for example. SPSS is case sensitive and will therefore understand variations of the abbreviation for Pennsylvania (PA, Pa, and pa) as three different values. Excel's
UPPER() function takes care of this handily by converting values to all upper case. There is also a
LOWER() function, which works as you might expect, and a PROPER() function that capitalizes each word. A particularly useful text function is
TRIM(), which removes all spaces except single spaces between words; note how it would be useful for the Street Address column in our sample file.
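Each of these text functions has a near-exact Python counterpart. A minimal sketch, using invented sample values:

```python
# Python analogues of Excel's text-formatting functions.
def upper(text):
    """UPPER(): convert to all upper case."""
    return text.upper()

def trim(text):
    """TRIM(): drop leading/trailing spaces, collapse runs to one."""
    return " ".join(text.split())

def proper(text):
    """PROPER(): capitalize the first letter of each word."""
    return text.title()

print(upper("pa"))                # -> "PA"
print(trim("  123   Main  St "))  # -> "123 Main St"
print(proper("los angeles"))      # -> "Los Angeles"
```

Normalizing case before analysis collapses PA, Pa, and pa into a single value, which is exactly what a case-sensitive package like SPSS needs.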
Step 4. Create an Analysis-Ready Copy of the Data
Now that we've done the heavy work, it remains for us merely to copy our final dataset from the Interim Data worksheet to the Final Data worksheet. There are several reasons we don't simply attempt to copy the data straight from the Interim Data worksheet into a statistics package. First, and most important, we've written a lot of formulae, yet the output of formulae is often not readable by statistics packages when they try to import data. In plain English, if you try to import data produced by formulae, you will more likely than not end up with blank entries. A second reason for creating a copy of the data is that the processing step typically produces a number of additional columns of data that we might not want to preserve in our analysis dataset. For example, parsing out city, state, and ZIP code information in our example above produced several redundant columns. It is better to avoid confusion by leaving out these apparently redundant variables and focusing only on those that belong in the final copy of the data.
Copying the data to the Final Data worksheet is a straightforward task. First, make sure that the rows of data in the
Interim Data worksheet are sorted in the order you want them (if there is such an order). If you need to re-sort them, see our instructions above. Next, select the columns of data that you wish to copy. Activate the
Final Data worksheet and select the first cell in the column where you would like to paste the data. Click on
Edit | Paste Special, select the Values radio button, and click
OK. Excel will paste only the final values of the copied cells. This means that there is no link between the copied and pasted cells: if you change the original cells on the Interim Data worksheet, nothing will change on the Final Data worksheet. Keep following this copy-and-paste process until you have completed the
Final Data worksheet to your satisfaction.
Step 5. Document the Data
There are several ways to document your data. Some are implicit: formulae, especially simple formulae, make it fairly clear how their results were derived. More complex formulae often require further explanation. Following are our recommendations for easy ways to document your dataset. Quite honestly, Excel makes documentation so easy that there is no reason to have undocumented data.
File-level documentation. For general information, consider using the file properties page, accessible via
File | Properties. The Summary tab offers a number of useful fields for capturing information, including a large field for general comments. The
Custom tab includes a number of specific fields, such as Project, Date Completed, and
Checked by. Importantly, this information always travels with the file, so there is no risk associated with multiple pieces of documentation getting separated.
Cell comments. You can add notes to cells by selecting Insert | Comment
or by showing the Reviewing toolbar. Cell comments are separate from the data within the cells and have no influence on any computations. Further, they can be shown or hidden per your preference. We recommend you add comments to the variable name cells (usually the first row of a worksheet) to document the computation or formatting actions taken. You could also use cell comments to flag redundant rows of data omitted from the final dataset, or to annotate the source of the data copied to each column of the
Final Data worksheet. Another handy feature is the ability to print out your comments with the rest of the dataset. To configure your file to print comments, click on
File | Page Setup, activate the Sheet tab, and make your selection from the
Comments drop-down box.
Cell shading. One easy and intuitive way to document your data is to color code it according to a coding scheme. For example, you could indicate columns on the
Interim Data worksheet that were copied to the Final Data worksheet by coloring them green. Columns of data that were superseded by others could be shaded gray, such as the
City, State, ZIP column that was parsed into its basic elements. Along with cell comments and file-level documentation, cell shading makes it easy to have a dataset that anyone can review at any time and quickly understand.
Weiss, R. J. (2004a, July). Leading Edge: Programming Excel macros.
TIP: The Industrial-Organizational Psychologist, 42, 127-134.
Weiss, R. J. (2004b, April). Leading Edge: Using Excel forms.
TIP: The Industrial-Organizational Psychologist, 41, 61-69.
The data file we used as our example in this article is available at
and should offer a good start to those who are interested in our approach to cleaning and preparing data.
Space precluded us from offering more information on the power of Excel functions, but we do have several books to recommend on the subject:
Rubin, J., & Jelen, B. (2003). Mr. Excel on Excel: Excel 97, 2000, 2002.
Uniontown, OH: Holy Macro! Books.
Walkenbach, J. (2001). Excel 2002 formulas. New York: M&T Books.
Questions, Comments, Suggestions?
If you have questions or comments about this article or suggestions for future editions of this column, please don't hesitate to e-mail
firstname.lastname@example.org. We regret, of course, that we can't offer technical help on Excel or other applications. However, there are many free online resources where other users are eager to assist you, and they often answer questions surprisingly quickly. Check out one popular site at
http://www.mrexcel.com/board2/. Happy computing!