Tidy Data
What It Is and How To Do It
David I. Crabtree
2022-07-18
1 / 19

What is Tidy Data?Consider these two dataframes. Which one looks easier to work with?2 / 19

What is Tidy Data?Consider these two dataframes. Which one looks easier to work with?Frame 1
## # A tibble: 6 x 3
##   country      year    pr
##   <chr>       <dbl> <dbl>
## 1 Afghanistan  1973     4
## 2 Afghanistan  1974     1
## 3 Afghanistan  1975     1
## 4 Afghanistan  1976     1
## 5 Afghanistan  1977     1
## 6 Afghanistan  1978     2
2 / 19

What is Tidy Data?Consider these two dataframes. Which one looks easier to work with?Frame 1
## # A tibble: 6 x 3
##   country      year    pr
##   <chr>       <dbl> <dbl>
## 1 Afghanistan  1973     4
## 2 Afghanistan  1974     1
## 3 Afghanistan  1975     1
## 4 Afghanistan  1976     1
## 5 Afghanistan  1977     1
## 6 Afghanistan  1978     2
Frame 2
## # A tibble: 6 x 3
##   `Year(s) Under Review` `1972` `1973`
##   <chr>                  <chr>  <chr> 
## 1 <NA>                   PR     PR    
## 2 Afghanistan            4      7     
## 3 Albania                7      7     
## 4 Algeria                6      6     
## 5 Andorra                4      4     
## 6 Angola                 <NA>   <NA>
2 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

Frame 1

## # A tibble: 6 x 3
##   country      year    pr
##   <chr>       <dbl> <dbl>
## 1 Afghanistan  1973     4
## 2 Afghanistan  1974     1
## 3 Afghanistan  1975     1
## 4 Afghanistan  1976     1
## 5 Afghanistan  1977     1
## 6 Afghanistan  1978     2

Frame 2

## # A tibble: 6 x 3
##   `Year(s) Under Review` `1972` `1973`
##   <chr>                  <chr>  <chr> 
## 1 <NA>                   PR     PR    
## 2 Afghanistan            4      7     
## 3 Albania                7      7     
## 4 Algeria                6      6     
## 5 Andorra                4      4     
## 6 Angola                 <NA>   <NA>

They might seem similar at first glance, but the frame on the left will be significantly easier to work with. In the following slides I'll explain why.

2 / 19

What is Tidy Data?

There are three rules that all Tidy Data must follow:

Each variable gets its own column.
Each observation gets its own row.
Each value gets its own cell.

Every tidy dataset will follow these rules.

3 / 19

What is Tidy Data?

There are three rules that all Tidy Data must follow:

Each variable gets its own column.
Each observation gets its own row.
Each value gets its own cell.

Every tidy dataset will follow these rules.

Consider an example

head(starwars[1:4])

## # A tibble: 6 x 4
##   name           height  mass hair_color 
##   <chr>           <int> <dbl> <chr>      
## 1 Luke Skywalker    172    77 blond      
## 2 C-3PO             167    75 <NA>       
## 3 R2-D2              96    32 <NA>       
## 4 Darth Vader       202   136 none       
## 5 Leia Organa       150    49 brown      
## 6 Owen Lars         178   120 brown, grey

3 / 19

What is Tidy Data: Which of these tables is tidy?table1[1:4,]

## # A tibble: 4 x 4
##   country      year cases population
##   <chr>       <int> <int>      <int>
## 1 Afghanistan  1999   745   19987071
## 2 Afghanistan  2000  2666   20595360
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898
table2[1:4,]

## # A tibble: 4 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## 4 Afghanistan  2000 population 20595360
table3[1:4,]

## # A tibble: 4 x 3
##   country      year rate           
##   <chr>       <int> <chr>          
## 1 Afghanistan  1999 745/19987071   
## 2 Afghanistan  2000 2666/20595360  
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
Example continues on next slide
4 / 19

What is Tidy Data: Which of these datasets is tidy?table4a[1:3,] # cases

## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b[1:3,]  # population

## # A tibble: 3 x 3
##   country         `1999`     `2000`
##   <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
5 / 19

What is Tidy Data?Table 1 is the only table that is tidy.6 / 19

What is Tidy Data?

Table 1 is the only table that is tidy.

table1[1:4,]

## # A tibble: 4 x 4
##   country      year cases population
##   <chr>       <int> <int>      <int>
## 1 Afghanistan  1999   745   19987071
## 2 Afghanistan  2000  2666   20595360
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898

6 / 19

What is Tidy Data?

Table 1 is the only table that is tidy.

table1[1:4,]

## # A tibble: 4 x 4
##   country      year cases population
##   <chr>       <int> <int>      <int>
## 1 Afghanistan  1999   745   19987071
## 2 Afghanistan  2000  2666   20595360
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898

Note why this data is tidy

(1) Every variable has its own column.
(2) Every observation has its own row.
(3) Every value has its own cell.

6 / 19

Tidying Data: How to Do It7 / 19

Tidying DataSo what do we do if data has broken one (or more!) of the rules of tidy data?8 / 19

Tidying DataSo what do we do if data has broken one (or more!) of the rules of tidy data?Well, we bring it back into compliance!8 / 19

Tidying Data

So what do we do if data has broken one (or more!) of the rules of tidy data?

Well, we bring it back into compliance!

In order to tidy - or wrangle - the data, we will use the following functions from the dplyr package.

pivot_longer()
pivot_wider()
separate()
unite()

8 / 19

Tidying Data: pivot_longer()When data has observations that should be columns, use pivot_longer().9 / 19

Tidying Data: `pivot_longer()`

When data has observations that should be columns, use `pivot_longer()`.

Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.

Consider an example of untidy data from the previous slides

9 / 19

Tidying Data: `pivot_longer()`

When data has observations that should be columns, use `pivot_longer()`.

## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

9 / 19

Tidying Data: `pivot_longer()`

When data has observations that should be columns, use `pivot_longer()`.

## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

1999 and 2000 are not variable names; they're values of an unlisted variable - year. Thus, they should be in their own column. The pivot_longer() function achieves this. To do so, it requires three main arguments: which columns are to be moved (cols), what should be the new column that these ex-columns are stored in (names_to), and what should be the new column that the previously stored values go to (values_to)?

Try to solve this yourself before proceeding.

9 / 19

Tidying Data: pivot_longer()10 / 19

Tidying Data: pivot_longer()table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
10 / 19

Tidying Data: `pivot_longer()`

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

pivot_longer(data = table4a,
             cols = c(`1999`, `2000`),
             names_to = "year",
             values_to = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

10 / 19

Tidying Data: pivot_wider()When data has columns that should be observations, use pivot_wider().11 / 19

Tidying Data: `pivot_wider()`

When data has columns that should be observations, use `pivot_wider()`.

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

11 / 19

Tidying Data: `pivot_wider()`

When data has columns that should be observations, use `pivot_wider()`.

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

## # A tibble: 4 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## 4 Afghanistan  2000 population 20595360

11 / 19

Tidying Data: `pivot_wider()`

When data has columns that should be observations, use `pivot_wider()`.

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

## # A tibble: 4 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## 4 Afghanistan  2000 population 20595360

Note that in this table, the unit of analysis is no longer "the case count of a given country in a given year," like it was in our tidy dataframe. Now, the unit of analysis is "a given type of count" in a given country in a given year. Wow - that's a loy of "given"s! And it is precisely why Afghanistan is listed four times here!

There is a row for Afghanistan's cases in 1999, a row for its population in 1999, a row for its cases in 2000, a row for its population in 2000, and so on. To tidy it, we need to move rows to be columns.

11 / 19

Tidying Data: pivot_wider()12 / 19

Tidying Data: pivot_wider()table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
12 / 19

Tidying Data: `pivot_wider()`

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

pivot_wider(data = table2,
            names_from = "type",
            values_from = "count")

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

12 / 19

Tidying Data: separate()When data has cells with more than one value, use separate().13 / 19

Tidying Data: `separate()`

When data has cells with more than one value, use `separate()`.

Usually, this will involve the creation of a new column to store the new values from the over-filled cell.

Consider table3 from earlier:

13 / 19

Tidying Data: `separate()`

When data has cells with more than one value, use `separate()`.

Usually, this will involve the creation of a new column to store the new values from the over-filled cell.

Consider table3 from earlier:

## # A tibble: 4 x 3
##   country      year rate           
##   <chr>       <int> <chr>          
## 1 Afghanistan  1999 745/19987071   
## 2 Afghanistan  2000 2666/20595360  
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898

Note that the cells in the rate column actually contain two values: the case count first, then the population count. This violates the third rule of tidy data. To tidy it, each value should have its own cell.

Can you guess what the new columns will be called?

13 / 19

Tidying Data: separate()14 / 19

Tidying Data: separate()table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
14 / 19

Tidying Data: `separate()`

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

separate(data = table3,
         col = "rate",
         into = c("cases", "population"),
         sep = "/")

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

14 / 19

Tidying Data: unite()When data has cells with less than one value, use unite().15 / 19

Tidying Data: `unite()`

When data has cells with less than one value, use `unite()`.

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.

Consider table5:

15 / 19

Tidying Data: `unite()`

When data has cells with less than one value, use `unite()`.

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.

Consider table5:

## # A tibble: 4 x 4
##   country     century year  rate           
##   <chr>       <chr>   <chr> <chr>          
## 1 Afghanistan 19      99    745/19987071   
## 2 Afghanistan 20      00    2666/20595360  
## 3 Brazil      19      99    37737/172006362
## 4 Brazil      20      00    80488/174504898

15 / 19

Tidying Data: `unite()`

When data has cells with less than one value, use `unite()`.

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.

Consider table5:

## # A tibble: 4 x 4
##   country     century year  rate           
##   <chr>       <chr>   <chr> <chr>          
## 1 Afghanistan 19      99    745/19987071   
## 2 Afghanistan 20      00    2666/20595360  
## 3 Brazil      19      99    37737/172006362
## 4 Brazil      20      00    80488/174504898

Note that a meaningful value of year is obscurred by the columns century and year. Instead of having Afghanistan in 1999, we have Afghanistan in the 99th year of the 19th century (... which, come to think of it, would be 1899!). Unite() fixes this by combining the two columns

Can you guess what the new column will be called?

15 / 19

Tidying Data: unite()16 / 19

Tidying Data: unite()table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
16 / 19

Tidying Data: `unite()`

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

unite(data = table5,
      col = "year",
      century, year)

## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

Note that the new year column is not formatted correctly. This is because it is now separated with a "_". We can fix this with the sep = argument.

16 / 19

Tidying Data: `unite()`

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

unite(data = table5,
      col = "year",
      century, year,
      sep = "")

## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

17 / 19

Conclusion

Tidy data follows three core principles: every variable gets its own column, every observation gets its own row, every value gets its own cell.

To clean our data into tidy data, we can use the following functions from the dplyr package:

pivot_longer() when observations are columns
pivot_wider() when columns are observations
separate() when a cell has more than 1 value
unite() when a cell has less than 1 meaningful value

References

Wickam, Hadley and Garrett Grolemund, R for Data Science.

18 / 19

Questions?Email me!crabtreed@uchicago.edu19 / 19

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help