Introduction to Data Cleaning with the Tidyverse

knitr::opts_chunk$set(echo = TRUE, eval=TRUE, include=TRUE)

If your file management system looks like this (no guilt, we have all been there), this is suggestive of some rather ad hoc approaches to data cleaning. As you work with your data, you will discover errors (this is a universal experience). How do you fix those?

(comic from PHD Comics)

Newer scientists are likely to use point-and-click software like spreadsheets. The problem with this approach is that there is no record of your exact actions, so you save a new copy of the data set under a new name. And on it goes. There are better ways to fix data errors: a transparent, scripted process that avoids these multiplying data sets, or at least clarifies what distinguishes one data set from another.

Introduction

Equip everyone, regardless of means, to participate in the global economy that rewards data literacy.

—RStudio mission statement

The Tidyverse is a collection of packages designed to help data users work more efficiently with their data. At the time of writing, the Tidyverse consists of over 25 packages with diverse functions that range from connecting to Google Sheets to working with databases.

Most Tidyverse functions have equivalent functions in base R (that is, the set of functions that load automatically when R is launched). The difference is that tidyverse functions are often faster, the syntax may be clearer and easier to use (no guarantee, though), and most importantly, they all follow a similar structure and give consistent results across packages (within limits; this depends on the nature of the function).

Technical note: the Tidyverse uses “non-standard evaluation” (NSE) principles in its functions. NSE is a complicated topic, but in brief, it means that tidyverse functions follow different rules than base R functions. Tidyverse functions are easier to use, but if you are used to base R functions, it will take some effort to reorient your R programming brain. Just a friendly warning before we dive in!
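A tiny illustration of the difference, using a made-up data frame: base R requires you to reference the data frame and column explicitly, while a tidyverse function lets you use the bare column name.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# base R: spell out the data frame and the column
df[df$x > 1, ]

# dplyr: refer to the column x by its bare name (non-standard evaluation)
filter(df, x > 1)
```

Both calls return the rows where x is greater than 1.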

Tidyverse Packages

Abbreviated list of packages useful for data cleaning

As mentioned, the Tidyverse is a collection of many packages. Below is a table of packages that are particularly useful for data cleaning. This tutorial does not address all of them, but they are worth checking out, depending on your data analytic needs.

Package    Purpose
dplyr      main package for organizing data
forcats    for managing categorical data
janitor    not part of the Tidyverse, but has useful data cleaning functions
lubridate  for manipulating dates
purrr      advanced broadcasting of functions across data
readr      for reading in tabular data
readxl     for reading in MS Excel sheets
stringr    for managing "strings" (character variables)
tidyr      to reshape data and handle missing values

Honorable mention: data.table, which provides advanced handling functions for data.frames. While not part of the tidyverse officially, it is very efficient.

Packages Used in This Tutorial

library(readr)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

These messages indicate when there are identical function names across different packages. The last package loaded will override functions from previously loaded packages when such conflicts arise, so the order in which packages are loaded can matter. However, any function can be accessed by adding a prefix of the package name and two colons to the function call:

base::intersect(...)

Load Data

read_csv() vs. read.csv()

How do the functions read.csv() (from base R) and read_csv() (from readr) differ in how they read in data?

For this tutorial, 3 data sets will be used: “nass_corn”, “nass_barley”, and “nass_cotton”. These are historical data sets from the National Agricultural Statistics Service that indicate the total acreage and yield of each crop for each U.S. state and year from 1866 to 2011. They can be accessed with the package ‘agridat’.

These data sets were originally obtained from ‘agridat’ as shown below and saved to separate files for this tutorial.

library(agridat) # a package of agricultural data sets
data("nass.corn")
data("nass.barley")
data("nass.cotton")

write.csv(nass.corn, "nass_corn.csv", row.names = F)
write.csv(nass.barley, "nass_barley.csv", row.names = F)
write.csv(nass.cotton, "nass_cotton.csv", row.names = F)

When using RStudio’s point-and-click interface to load tabular data, it will typically call the function read_csv() for CSV files and read_delim() for text files. read_csv() is a wrapper for read_delim() with different default arguments.

These functions and their arguments can be used from the command line as well.

nass_barley <- read.csv("nass_barley.csv") 
nass_cotton <- read_csv("nass_cotton.csv")
## Rows: 2338 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (3): year, acres, yield
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# contrast functions read.csv and read_csv
str(nass_barley)
## 'data.frame':	4839 obs. of  4 variables:
##  $ year : int  1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ state: chr  "Connecticut" "Illinois" "Indiana" "Iowa" ...
##  $ acres: num  1000 96000 11000 66000 2000 10000 34000 7000 21000 20000 ...
##  $ yield: num  22.5 23.4 23 22 23 23.5 21.5 25.5 26 26 ...
str(nass_cotton)
## spec_tbl_df [2,338 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year : num [1:2338] 1866 1866 1866 1866 1866 ...
##  $ state: chr [1:2338] "Alabama" "Arkansas" "Florida" "Georgia" ...
##  $ acres: num [1:2338] 977000 489000 155000 895000 1020000 ...
##  $ yield: num [1:2338] 120 198 123 122 57 86 124 116 102 121 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   state = col_character(),
##   ..   acres = col_double(),
##   ..   yield = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

A Bit on the Tibble

A tibble is an object type created for the Tidyverse. This is the object type imported by ‘readr’ and returned by many tidyverse functions. The tibble is like a data frame with a few additional rules. Tibbles have a simplified print method: if you type the object name, only the first 10 rows are printed and the columns are truncated to fit your screen. If you type the object name of a traditional data frame, potentially thousands of lines are printed!

To create a new tibble, you can use the function tribble() to build a tibble by rows, the function tibble() to build one column-wise, or as_tibble() to coerce an existing object to a tibble.
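A minimal sketch of those three constructors (the values here are made up for illustration):

```r
library(tibble)

# build row-by-row with tribble()
t1 <- tribble(
  ~state,   ~yield,
  "Idaho",      80,
  "Oregon",     74
)

# build column-wise with tibble()
t2 <- tibble(state = c("Idaho", "Oregon"), yield = c(80, 74))

# coerce an existing data frame with as_tibble()
t3 <- as_tibble(data.frame(state = c("Idaho", "Oregon"), yield = c(80, 74)))

all.equal(t1, t2)
```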

There are a few aspects of tibbles that are fundamentally different from a standard data frame. Tibbles do not handle row names well and many tidyverse functions discard that information. This is often just fine - row names are computationally expensive to set and maintain. Also, it can be difficult to filter and sort data based on row names. However, many other (non-tidyverse) package functions rely on the rownames attribute, so be aware of this behavior when using tidyverse functions on data sets.

Function details

  • the most noticeable difference between read.csv() and read_csv() is that read.csv()’s default behavior is to replace all spaces and special characters in column headers with a “.”, while read_csv() keeps all that information. Additionally, read_csv() tends to handle dates a bit better than read.csv().

  • read_delim() and its wrappers are faster than the base versions (e.g. read.csv()); however, the difference will not be noticeable for small files (< 10 MB)

  • read_delim() offers additional flexibility, such as removing leading and trailing white space (" hello" and "hello ") and automatically deleting empty rows (both occur by default)

  • specific values for missing data other than “NA” (e.g. “-9”, “.”) can be set with the argument na = ...

  • the function can be sped up by setting values for the col_types = ... argument (e.g. character, numeric, date, et cetera):

nass_corn <- read_csv("nass_corn.csv", col_types = cols(col_integer(), col_character(), col_double(), col_double()))
str(nass_corn)
## spec_tbl_df [6,381 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year : int [1:6381] 1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ state: chr [1:6381] "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ acres: num [1:6381] 1050000 280000 42000 57000 200000 ...
##  $ yield: num [1:6381] 9 18 28 34 23 9 6 29 36.5 32 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_integer(),
##   ..   state = col_character(),
##   ..   acres = col_double(),
##   ..   yield = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Note: if you have truly huge data, consider the fread() function in ‘data.table’, storing the data in a database and using ‘dbplyr’ to access it, or using the ‘arrow’ package.
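To demonstrate the na = argument mentioned above in a self-contained way, readr (version 2.0 and later) can read literal text wrapped in I(); the missing-data codes “-9” and “.” below are made up:

```r
library(readr)

# treat "-9" and "." as missing values while reading
dat <- read_csv(I("state,yield\nIdaho,80\nOregon,-9\nWashington,."),
                na = c("NA", "-9", "."),
                show_col_types = FALSE)

dat$yield  # 80, NA, NA
```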

Data selection and subsetting

Using ‘dplyr’

Column selection

The select() function is used for column selection and has the general structure: select(data, variables_to_select).

Select only the columns that you want

barley_acres <- select(nass_barley, state, year, acres)
str(barley_acres)
## 'data.frame':	4839 obs. of  3 variables:
##  $ state: chr  "Connecticut" "Illinois" "Indiana" "Iowa" ...
##  $ year : int  1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ acres: num  1000 96000 11000 66000 2000 10000 34000 7000 21000 20000 ...
head(barley_acres)
##         state year acres
## 1 Connecticut 1866  1000
## 2    Illinois 1866 96000
## 3     Indiana 1866 11000
## 4        Iowa 1866 66000
## 5      Kansas 1866  2000
## 6    Kentucky 1866 10000

Select the columns you do not want:

barley_yield <- select(nass_barley, -acres)
str(barley_yield)
## 'data.frame':	4839 obs. of  3 variables:
##  $ year : int  1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ state: chr  "Connecticut" "Illinois" "Indiana" "Iowa" ...
##  $ yield: num  22.5 23.4 23 22 23 23.5 21.5 25.5 26 26 ...
barley_yield2 <- select(nass_barley, -(c(acres, yield))) # multiple columns
str(barley_yield2)
## 'data.frame':	4839 obs. of  2 variables:
##  $ year : int  1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ state: chr  "Connecticut" "Illinois" "Indiana" "Iowa" ...

Sorting data

The arrange() function is used for sorting and follows this structure: arrange(data, variables_to_sort_by)

Sort across one column:

corn_year <- arrange(nass_corn, state)
head(corn_year)
## # A tibble: 6 × 4
##    year state     acres yield
##   <int> <chr>     <dbl> <dbl>
## 1  1866 Alabama 1050000   9  
## 2  1867 Alabama 1400000  13  
## 3  1868 Alabama 1420000  12.5
## 4  1869 Alabama 1430000  12  
## 5  1870 Alabama 1330000  13.5
## 6  1871 Alabama 1480000  11

Sort across several columns

corn_year_yield <- arrange(nass_corn, year, yield)
head(corn_year_yield)
## # A tibble: 6 × 4
##    year state            acres yield
##   <int> <chr>            <dbl> <dbl>
## 1  1866 Georgia        1770000   6  
## 2  1866 South Carolina  950000   6.5
## 3  1866 Alabama        1050000   9  
## 4  1866 Florida         125000   9  
## 5  1866 Mississippi     800000   9.5
## 6  1866 North Carolina 1675000   9.5

Sort in reverse order

corn_year_yield_rev <- arrange(nass_corn, desc(year), desc(yield))
head(corn_year_yield_rev)
## # A tibble: 6 × 4
##    year state       acres yield
##   <int> <chr>       <dbl> <dbl>
## 1  2011 Washington 125000   225
## 2  2011 Oregon      51000   215
## 3  2011 California 150000   185
## 4  2011 Idaho      120000   185
## 5  2011 Arizona     32000   180
## 6  2011 New Mexico  43000   180

Row selection

Based on their position or index (where the first row = 1):

corn_first100 <- slice(nass_corn, 1:100) # grab the first 100 rows
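dplyr (1.0 and later) also provides slice helpers for common row selections; a quick sketch on made-up data:

```r
library(dplyr)

df <- tibble::tibble(state = c("A", "B", "C"), yield = c(10, 30, 20))

slice_head(df, n = 2)        # first two rows
slice_max(df, yield, n = 1)  # the single highest-yield row
```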

Filtering Data

(actually a type of row selection)

Filtering enables users to select all columns of data based on a condition evaluated within rows. The filter() function has the formulation: filter(data, condition_for_filter).

Filter based on one variable and exact match

corn_AZ <- filter(nass_corn, state == "Arizona")
head(corn_AZ)
## # A tibble: 6 × 4
##    year state   acres yield
##   <int> <chr>   <dbl> <dbl>
## 1  1882 Arizona  3000    21
## 2  1883 Arizona  3000    20
## 3  1884 Arizona  3000    21
## 4  1885 Arizona  3000    22
## 5  1886 Arizona  3000    22
## 6  1887 Arizona  3000    19

Filter for matching several items:

barley_pnw <- filter(nass_barley, state %in% c("Idaho", "Oregon", "Washington"))
head(barley_pnw)
##   year  state acres yield
## 1 1869 Oregon  6000  35.0
## 2 1870 Oregon  7000  30.5
## 3 1871 Oregon  8000  28.5
## 4 1872 Oregon 11000  30.5
## 5 1873 Oregon 15000  33.0
## 6 1874 Oregon 17000  29.0

Filter by numerical cut-off:

(first, look at summary of data)

summary(nass_cotton$acres)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      100   118000   550000  1206414  1431500 17749000
nass_cotton_high <- filter(nass_cotton, acres > 550000) # acreage above the median

Logical operators in R:

symbol  meaning
<       less than
<=      less than or equal to
>       greater than
>=      greater than or equal to
==      exactly equal to
!=      not equal to
!x      not x
x | y   x OR y
x & y   x AND y

Combine multiple filters

(and plot the output)

corn_Iowa_prewar <- filter(nass_corn, year < 1950 & year >= 1930 & state == "Iowa")
with(corn_Iowa_prewar, plot(year, yield, col = "dodgerblue", pch = 16, type = "b"))

Data transformations

The mutate() function to define new variables

Mutate functions have the formulation: mutate(data, new_variable_name = data_transformation). The new variable is named on the left, and the right side provides the operation that creates it. Many operations are allowed within a single mutate() call:

barley_HI <- mutate(nass_barley, 
                    harvest_index = acres/yield,
                    Year = as.factor(year),
                    Is_high = acres > median(acres),
                    crop = "barley", 
                    relative_yld = yield/mean(yield))

head(barley_HI)
##   year       state acres yield harvest_index Year Is_high   crop relative_yld
## 1 1866 Connecticut  1000  22.5      44.44444 1866   FALSE barley    0.6494957
## 2 1866    Illinois 96000  23.4    4102.56410 1866    TRUE barley    0.6754755
## 3 1866     Indiana 11000  23.0     478.26087 1866   FALSE barley    0.6639290
## 4 1866        Iowa 66000  22.0    3000.00000 1866    TRUE barley    0.6350625
## 5 1866      Kansas  2000  23.0      86.95652 1866   FALSE barley    0.6639290
## 6 1866    Kentucky 10000  23.5     425.53191 1866   FALSE barley    0.6783622

Renaming variables

The rename() function renames columns using this notation: rename(dataframe, newname = "oldname"):

barley_HI <- rename(barley_HI, high_acres = "Is_high")
str(barley_HI)
## 'data.frame':	4839 obs. of  9 variables:
##  $ year         : int  1866 1866 1866 1866 1866 1866 1866 1866 1866 1866 ...
##  $ state        : chr  "Connecticut" "Illinois" "Indiana" "Iowa" ...
##  $ acres        : num  1000 96000 11000 66000 2000 10000 34000 7000 21000 20000 ...
##  $ yield        : num  22.5 23.4 23 22 23 23.5 21.5 25.5 26 26 ...
##  $ harvest_index: num  44.4 4102.6 478.3 3000 87 ...
##  $ Year         : Factor w/ 146 levels "1866","1867",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ high_acres   : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
##  $ crop         : chr  "barley" "barley" "barley" "barley" ...
##  $ relative_yld : num  0.649 0.675 0.664 0.635 0.664 ...

Splitting and Merging Columns

Combine multiple columns into one with the function unite(). Column data are pasted together as text, not added as numbers. The columns used to make the new variable are automatically discarded (this behavior can be prevented with the argument remove = F).

unite() functions have this formulation: unite(data, new_var_name, vars_to_select).

cotton_nass_new <- unite(nass_cotton, "ID", state, year)
head(cotton_nass_new)
## # A tibble: 6 × 3
##   ID                 acres yield
##   <chr>              <dbl> <dbl>
## 1 Alabama_1866      977000   120
## 2 Arkansas_1866     489000   198
## 3 Florida_1866      155000   123
## 4 Georgia_1866      895000   122
## 5 Louisiana_1866   1020000    57
## 6 Mississippi_1866 1668000    86

The opposite function, separate(), has this structure: separate(data, col_to_split, names_of_new_cols)

cotton_nass_new2 <- separate(cotton_nass_new, 
                             col = ID, 
                             into = c("STATE", "YEAR"), 
                             sep = "_", 
                             remove = F)
head(cotton_nass_new2)
## # A tibble: 6 × 5
##   ID               STATE       YEAR    acres yield
##   <chr>            <chr>       <chr>   <dbl> <dbl>
## 1 Alabama_1866     Alabama     1866   977000   120
## 2 Arkansas_1866    Arkansas    1866   489000   198
## 3 Florida_1866     Florida     1866   155000   123
## 4 Georgia_1866     Georgia     1866   895000   122
## 5 Louisiana_1866   Louisiana   1866  1020000    57
## 6 Mississippi_1866 Mississippi 1866  1668000    86

These last two functions are often quite useful - for pulling apart complicated ID columns, or for creating a unique identifier that can later be used for merging two data sets together.

Merging data sets with ‘dplyr’

Merging different data sets by a common variable is a major advantage of using scripted software. Doing this manually in spreadsheet programs is nearly impossible without introducing errors. There are 6 different merging or joining functions, which differ in what data they keep and what they discard.

Standard join functions

Standard join functions merge two data frames and are first and foremost defined by the rows they return. They use a common ID, which can be automatically detected or explicitly declared, to match observations in the first data frame with observations in the second and return a merged data frame. In most cases, all columns from both data frames are returned.

The general formulation for join functions: type_join(x_data, y_data, ID_key_variable)

Know Your Joins


The column(s) used for the ID are only included once in the merged data frame; that is, the duplicated key column is dropped. However, if there are other (non-ID) columns with duplicated names across the data sets, the duplicates from the first data set listed are given a “.x” suffix and those from the second data set listed are given a “.y” suffix.

1. The full_join():

  • always the safest option for avoiding data removal (everything is returned)

  • in this example, the full join will produce a match when both the year and state of an observation in one data set match those in the other; non-matches are appended to the resulting object.

  • Note: for the rest of the joins, which employ a filtering step, I recommend you always use a single unique identifier (called a “key”) for the joins. You can use multiple keys, but a single unique key is the clearest way to join data without generating duplicate observations.

  • When there is no match, the rows from each data set are still returned and filled with NA for columns from the other data set.

# first create a unique identifier variable
cotton_nass_new <- unite(nass_cotton, "ID", state, year)
barley_nass_new <- unite(nass_barley, "ID", state, year)

# equivalent joins - this equivalence only holds for full_join()
cotton_barley_all <- full_join(nass_cotton, nass_barley, by = c("year", "state"))
cotton_barley_all2 <- full_join(cotton_nass_new, barley_nass_new, by = "ID")

# compare dimensions after the joins - what is kept?
dim(cotton_barley_all)
## [1] 5935    6
dim(nass_cotton)
## [1] 2338    4
dim(nass_barley)
## [1] 4839    4

2. The inner_join():

This function only keeps what is common between data sets:

cotton_barley_inner <- inner_join(cotton_nass_new, barley_nass_new, by = "ID")
dim(cotton_barley_inner)
## [1] 1242    5
head(cotton_barley_inner)
## # A tibble: 6 × 5
##   ID             acres.x yield.x acres.y yield.y
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>
## 1 Tennessee_1866  649000     121    4000      23
## 2 Texas_1866      490000     325    3000      16
## 3 Tennessee_1867  611000     137    4000      19
## 4 Texas_1867      491000     195    2000      20
## 5 Tennessee_1868  520000     170    4000      18
## 6 Texas_1868      581000     197    2000      22
Several duplicated column names resulted in “.x” being appended to the duplicate columns from the cotton data set and “.y” to those from the barley data set. These can be renamed to something more informative:
colnames(cotton_barley_inner)[2:5] <- c("cotton_acres", "cotton_yield", "barley_acres", "barley_yield")
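Alternatively, more informative suffixes can be requested directly in the join with the suffix = argument; a sketch with toy data:

```r
library(dplyr)

cotton <- tibble::tibble(ID = c("a", "b"), acres = c(1, 2))
barley <- tibble::tibble(ID = c("a", "b"), acres = c(3, 4))

# duplicate column names get the supplied suffixes instead of ".x"/".y"
joined <- inner_join(cotton, barley, by = "ID",
                     suffix = c("_cotton", "_barley"))
names(joined)  # "ID" "acres_cotton" "acres_barley"
```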

3. The left_join():

This function matches all rows for observations found in the tibble listed on the left:

cotton_barley_left <- left_join(cotton_nass_new, barley_nass_new, by = "ID")
dim(cotton_barley_left)
## [1] 2338    5

4. The right_join():

This function matches all rows for observations found in the tibble listed on the right:

cotton_barley_right <- right_join(cotton_nass_new, barley_nass_new, by = "ID")
dim(cotton_barley_right)
## [1] 4839    5

5. The semi_join():

This function is like inner_join() (only rows with keys in both data sets are kept) except it does not keep any data from the data set listed second:

cotton_barley_semi <- semi_join(cotton_nass_new, barley_nass_new, by = "ID")
dim(cotton_barley_semi)
## [1] 1242    3
str(cotton_barley_semi)
## tibble [1,242 × 3] (S3: tbl_df/tbl/data.frame)
##  $ ID   : chr [1:1242] "Tennessee_1866" "Texas_1866" "Tennessee_1867" "Texas_1867" ...
##  $ acres: num [1:1242] 649000 490000 611000 491000 520000 581000 502000 720000 536000 922000 ...
##  $ yield: num [1:1242] 121 325 137 195 170 197 145 195 203 255 ...
str(cotton_barley_inner)
## tibble [1,242 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID          : chr [1:1242] "Tennessee_1866" "Texas_1866" "Tennessee_1867" "Texas_1867" ...
##  $ cotton_acres: num [1:1242] 649000 490000 611000 491000 520000 581000 502000 720000 536000 922000 ...
##  $ cotton_yield: num [1:1242] 121 325 137 195 170 197 145 195 203 255 ...
##  $ barley_acres: num [1:1242] 4000 3000 4000 2000 4000 2000 4000 2000 4000 2000 ...
##  $ barley_yield: num [1:1242] 23 16 19 20 18 22 19 22 18.5 20.5 ...

6. The anti_join():

Observations from the first data set that have no match in the second data set are kept:

cotton_barley_anti <- anti_join(cotton_nass_new, barley_nass_new, by = "ID")
dim(cotton_barley_anti)
## [1] 1096    3

Question: What happens when there are multiple matches for an identifier when using full_join(), left_join() or right_join()?


Answer: Each time there is a match between the keys of the two data frames, a new row is added for the matched data - effectively, all pairwise matches are added to the data set. There may be circumstances when this is advantageous but, usually, it is an unwelcome surprise when a full_join() results in a new data set several magnitudes larger than the two input data sets combined.
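The multiplication is easy to see with toy data: two rows in each table that share one key produce four rows after the join.

```r
library(dplyr)

x <- tibble::tibble(ID = c("a", "a"), x_val = 1:2)
y <- tibble::tibble(ID = c("a", "a"), y_val = 3:4)

# every row of x matches every row of y: 2 x 2 = 4 rows
# (recent dplyr versions warn about this many-to-many relationship)
merged <- full_join(x, y, by = "ID")
nrow(merged)  # 4
```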

Set operations

These are different from standard joins. They compare across vectors (not data.frames/tibbles) and return what has been specified in the statement. Duplicates are always discarded. The purpose of set operations is to determine common or uncommon observations, not merge two different data sets.

Common set operations

# states with barley data but no cotton data
setdiff(nass_barley$state, nass_cotton$state)
##  [1] "Connecticut"   "Indiana"       "Iowa"          "Maine"        
##  [5] "Massachusetts" "Michigan"      "Minnesota"     "Nebraska"     
##  [9] "New Hampshire" "New York"      "Ohio"          "Pennsylvania" 
## [13] "Rhode Island"  "Vermont"       "Wisconsin"     "Oregon"       
## [17] "Colorado"      "Idaho"         "Montana"       "North Dakota" 
## [21] "South Dakota"  "Utah"          "Washington"    "Maryland"     
## [25] "Wyoming"       "New Jersey"    "West Virginia" "Delaware"     
## [29] "Alaska"
# states with cotton data, but no barley data
setdiff(nass_cotton$state, nass_barley$state)
## [1] "Florida"     "Louisiana"   "Otherstates"
# states that grow both
intersect(nass_cotton$state, nass_barley$state)
##  [1] "Alabama"        "Arkansas"       "Georgia"        "Mississippi"   
##  [5] "North Carolina" "South Carolina" "Tennessee"      "Texas"         
##  [9] "Missouri"       "Virginia"       "Oklahoma"       "California"    
## [13] "Arizona"        "New Mexico"     "Illinois"       "Kansas"        
## [17] "Kentucky"       "Nevada"

Reshaping data

Occasionally, you will have data in one arrangement (e.g. wide) that needs to be transformed into the long format. Perhaps this is needed for graphing, to compare particular data points, or because another R package requires it. Doing this manually is time-consuming and error-prone. There are functions in tidyr that can do this very efficiently.

Same data, two formats

Two common approaches to reshaping data:

  1. ‘tidyr’ (popular, fast)
  2. ‘data.table’ (flexible, very fast with large data sets)
reshaping action  tidyr function   data.table function
long to wide      pivot_wider()    dcast()
wide to long      pivot_longer()   melt()

Note: the tidyr functions pivot_wider() and pivot_longer() replaced spread() and gather().

The tidyr approach:

Reshaping from long to wide

# use the "barley_pnw" object created earlier (barley data from ID, OR and WA)
nass_barley_pnw_wide <- pivot_wider(barley_pnw,
                                id_cols = year, 
                                names_from = state,
                                values_from = yield)

head(nass_barley_pnw_wide)
## # A tibble: 6 × 4
##    year Oregon Idaho Washington
##   <int>  <dbl> <dbl>      <dbl>
## 1  1869   35      NA         NA
## 2  1870   30.5    NA         NA
## 3  1871   28.5    NA         NA
## 4  1872   30.5    NA         NA
## 5  1873   33      NA         NA
## 6  1874   29      NA         NA
sample_n(nass_barley_pnw_wide, 5)
## # A tibble: 5 × 4
##    year Oregon Idaho Washington
##   <int>  <dbl> <dbl>      <dbl>
## 1  1869   35      NA       NA  
## 2  1911   23      35       32  
## 3  1995   76      80       72  
## 4  1956   37.5    33       35  
## 5  1926   24      31       30.5

There is also a values_fill = argument to fill in missing combinations when you know what those values should be. The default behavior is to fill with NA.
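A sketch of values_fill = on made-up data, filling a missing state-year combination with 0 instead of NA:

```r
library(tidyr)

long <- tibble::tibble(
  year  = c(2000, 2000, 2001),
  state = c("Idaho", "Oregon", "Idaho"),
  yield = c(80, 74, 82)
)

# Oregon has no 2001 record; fill that cell with 0 rather than NA
wide <- pivot_wider(long, id_cols = year, names_from = state,
                    values_from = yield, values_fill = 0)
wide
```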

Reshaping multiple variables

# use the "barley_pnw" object created earlier (barley data from ID, OR and WA)
nass_barley_pnw_wide2 <- pivot_wider(barley_pnw,
                                id_cols = year, 
                                names_from = state,
                                values_from = c(yield, acres))

sample_n(nass_barley_pnw_wide2, 5)
## # A tibble: 5 × 7
##    year yield_Oregon yield_Idaho yield_Washington acres_Oregon acres_I…¹ acres…²
##   <int>        <dbl>       <dbl>            <dbl>        <dbl>     <dbl>   <dbl>
## 1  2004         73          92               70          66000    650000  245000
## 2  1945         29.5        37               35         257000    307000  125000
## 3  1960         36          28.5             36.5       457000    492000  661000
## 4  1914         23.5        32.5             33          82000    133000  187000
## 5  1946         34          34               37         278000    276000   90000
## # … with abbreviated variable names ¹​acres_Idaho, ²​acres_Washington

Reshaping from wide to long

nass_barley_pnw_long <- pivot_longer(nass_barley_pnw_wide,
                                     cols = 2:ncol(nass_barley_pnw_wide),
                                     names_to = "State",
                                     values_to = "Yield", 
                                     values_drop_na = T)

head(nass_barley_pnw_long)
## # A tibble: 6 × 3
##    year State  Yield
##   <int> <chr>  <dbl>
## 1  1869 Oregon  35  
## 2  1870 Oregon  30.5
## 3  1871 Oregon  28.5
## 4  1872 Oregon  30.5
## 5  1873 Oregon  33  
## 6  1874 Oregon  29
dim(nass_barley_pnw_long)
## [1] 403   3
dim(barley_pnw) # dimension of the original object
## [1] 403   4

The data.table approach

Thus far, the package ‘data.table’ has not been explored in any depth (and it still won’t be, for space considerations). Nevertheless, ‘data.table’ is useful for manipulating large tables; it is extremely efficient (often even faster than tidyverse functions).

Reshaping from long to wide:

We will use the previously created data set “cotton_barley_inner” from the inner join of the barley and cotton data sets.

# ID variable needs to be separated into State and year
cotton_barley_inner2 <- separate(cotton_barley_inner, ID, c("State", "Year"), sep = "_")
head(cotton_barley_inner2)
## # A tibble: 6 × 6
##   State     Year  cotton_acres cotton_yield barley_acres barley_yield
##   <chr>     <chr>        <dbl>        <dbl>        <dbl>        <dbl>
## 1 Tennessee 1866        649000          121         4000           23
## 2 Texas     1866        490000          325         3000           16
## 3 Tennessee 1867        611000          137         4000           19
## 4 Texas     1867        491000          195         2000           20
## 5 Tennessee 1868        520000          170         4000           18
## 6 Texas     1868        581000          197         2000           22

Reshape cotton acreage, using State as the ID variable:

cotton_acres_wide <- dcast(setDT(cotton_barley_inner2), State ~ Year, value.var = "cotton_acres")

Reshaping with multiple variables

Multiple ID variables can be specified on the left side of the tilde and/or multiple column categories can be specified on right side. Furthermore, multiple columns (the “value.var”) can be specified to fill the cells.

Reshape the cotton data, using State as the ID variable and both cotton acres and cotton yield to fill the cells:

cotton_wide <- dcast(setDT(cotton_barley_inner2), State ~ Year, value.var = c("cotton_acres", "cotton_yield"))

sample(colnames(cotton_wide), 30)
##  [1] "cotton_acres_1971" "cotton_acres_1898" "cotton_yield_1998"
##  [4] "cotton_acres_1945" "cotton_yield_1947" "cotton_yield_1911"
##  [7] "cotton_acres_1952" "cotton_yield_1944" "cotton_acres_1896"
## [10] "cotton_yield_1965" "cotton_yield_1945" "cotton_yield_1970"
## [13] "cotton_yield_2005" "cotton_yield_1957" "cotton_yield_1967"
## [16] "cotton_acres_1954" "cotton_yield_1898" "cotton_acres_1963"
## [19] "cotton_yield_1932" "cotton_acres_1926" "cotton_yield_1868"
## [22] "cotton_acres_1979" "cotton_yield_1908" "cotton_acres_1991"
## [25] "cotton_acres_1972" "cotton_acres_1878" "cotton_acres_1883"
## [28] "cotton_acres_1918" "cotton_yield_1921" "cotton_yield_1879"

Implementing a summary function in the reshape

  • if there is more than one observation for each cell, a summary function such as mean can be applied. If no function is specified, the number of observations for that variable combination is used.
cotton_wide_meanYield <- dcast(setDT(cotton_barley_inner2), cotton_acres ~ Year, value.var = "cotton_yield", fun = mean, fill = 0)

sample_n(cotton_wide_meanYield , 5)[,c(1, 100:106)]
##    cotton_acres 1964 1965 1966 1967 1968 1969 1970
## 1:       571000    0    0    0    0    0    0    0
## 2:       420000    0    0    0    0    0    0    0
## 3:      3242000    0    0    0    0    0    0    0
## 4:       363000    0    0    0    0    0    0    0
## 5:     10212000    0    0    0    0    0    0    0
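The default counting behavior is easiest to see on a toy (made-up) example; the data below are hypothetical. With duplicate observations per cell and no aggregation function, dcast fills cells with the observation count (and issues a warning); supplying `fun.aggregate = mean` summarises the duplicates instead:

```r
library(data.table)

# Toy data (hypothetical): group "a" has two observations in 2020
toy <- data.table(grp  = c("a", "a", "b"),
                  year = c(2020, 2020, 2020),
                  val  = c(1, 2, 3))

# No fun.aggregate: cells hold the observation count (a warning is issued)
counts <- dcast(toy, grp ~ year, value.var = "val")

# fun.aggregate = mean: cells hold the mean of the duplicate values
means <- dcast(toy, grp ~ year, value.var = "val", fun.aggregate = mean)
```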

Reshaping from wide to long

cotton_acres_long <- melt(setDT(cotton_acres_wide), 
                          id.vars = "State", 
                          measure.vars = 2:ncol(cotton_acres_wide), 
                          variable.name = "YEAR", 
                          value.name = "ACRES")

sample_n(cotton_acres_long, 5)
##       State YEAR   ACRES
## 1:   Nevada 1969    2300
## 2: Kentucky 1981      NA
## 3: Arkansas 1905      NA
## 4: Arkansas 1921      NA
## 5: Arkansas 1965 1205000
cotton_long <- melt(setDT(cotton_wide), 
                          id.vars = "State", 
                          measure.vars = list(2:147,148:ncol(cotton_wide)), 
                          variable.name = "YEAR", 
                          value.name = c("ACRES", "YIELD"), na.rm = TRUE)

sample_n(cotton_long, 5)
##             State YEAR  ACRES YIELD
## 1:       Kentucky  106   4300   573
## 2:         Kansas  130   2600   185
## 3:     New Mexico   72 159000   490
## 4: North Carolina   74 737000   296
## 5:       Missouri   93 295000   446

Grouping and summarising data

  • The function group_by() provides a handy tool for breaking data into groups based on a common factor.
  • The function summarise() often follows a grouping command. It will apply a summary function to a data set or a grouped data set.
  • These all rely on the idea of piping, or chaining operations together, using the pipe operator %>%, rather than (1) a complicated nesting arrangement (see below) or (2) a large number of intermediate objects that you don’t actually need.

Example: group the cotton data set by year and output the mean yield for each year.

cotton_annual_mean <- group_by(nass_cotton, year) %>% summarise(mean_yield = mean(yield, na.rm = TRUE))
with(cotton_annual_mean, plot(year, mean_yield, type = "l", col = "magenta", lwd = 1.5))

Example: Determine which states had the highest yields each year, and plot the highest yields over time:

corn_highest_yielders <- nass_corn %>% mutate(HI = yield/acres) %>% 
  group_by(year) %>% slice(which.max(HI)) %>% 
  select(year, state, yield, HI) %>% mutate(state = as.character(state)) 

corn_highest_yielders %>% group_by(state) %>% tally(sort = TRUE) 
## # A tibble: 11 × 2
##    state            n
##    <chr>        <int>
##  1 Montana         27
##  2 Arizona         25
##  3 Nevada          24
##  4 Oregon          20
##  5 Rhode Island    20
##  6 Vermont         12
##  7 Utah             8
##  8 Maine            5
##  9 Wyoming          3
## 10 Connecticut      1
## 11 Idaho            1
with(corn_highest_yielders, plot(year, yield, type = "l", col = "blue", lwd = 1.5, main = "harvest index upper limit"))

Other Useful Functions

Not every amazing function in dplyr and tidyr is covered in this tutorial. Here are a few other functions you may find helpful when preparing and analyzing data.

Package Function Purpose
dplyr pull() extracts a single column and returns it as a vector
dplyr n() counts observations
dplyr n_distinct() counts the number of unique observations
dplyr distinct() keeps only unique observations
dplyr starts_with() used within select() to choose columns whose names start with a given string
tidyr drop_na() removes rows with any missing data
tidyr fill() fills in missing values with whatever occurs before them (column-wise); its behavior depends on row order
dplyr relocate() moves a column from one position to another (very helpful when working with a very wide data set)
tibble rownames_to_column() places row names as the first column of a tibble or data frame
tidyr replace_na() replaces missing values with a user-defined value
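A quick sketch of a few of these functions on a toy (made-up) tibble; the data below are hypothetical:

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one duplicated row and some missing values
toy <- tibble(state = c("Idaho", "Idaho", "Oregon", NA),
              yield = c(100, 100, NA, 90))

deduped  <- distinct(toy)                     # drops the duplicated Idaho row
complete <- drop_na(toy)                      # keeps only rows with no missing data
filled   <- replace_na(toy, list(yield = 0))  # missing yields become 0
yields   <- pull(toy, yield)                  # a plain numeric vector
</imports>
```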

Other Resources

vignette("dplyr")
Statistical Programs
Statistical Consultation
