+ - 0:00:00
Notes for current slide
Notes for next slide

Tidy Data

What It Is and How To Do It

David I. Crabtree

2022-07-18

1 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

2 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

Frame 1
## # A tibble: 6 x 3
## country year pr
## <chr> <dbl> <dbl>
## 1 Afghanistan 1973 4
## 2 Afghanistan 1974 1
## 3 Afghanistan 1975 1
## 4 Afghanistan 1976 1
## 5 Afghanistan 1977 1
## 6 Afghanistan 1978 2
2 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

Frame 1
## # A tibble: 6 x 3
## country year pr
## <chr> <dbl> <dbl>
## 1 Afghanistan 1973 4
## 2 Afghanistan 1974 1
## 3 Afghanistan 1975 1
## 4 Afghanistan 1976 1
## 5 Afghanistan 1977 1
## 6 Afghanistan 1978 2
Frame 2
## # A tibble: 6 x 3
## `Year(s) Under Review` `1972` `1973`
## <chr> <chr> <chr>
## 1 <NA> PR PR
## 2 Afghanistan 4 7
## 3 Albania 7 7
## 4 Algeria 6 6
## 5 Andorra 4 4
## 6 Angola <NA> <NA>
2 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

Frame 1
## # A tibble: 6 x 3
## country year pr
## <chr> <dbl> <dbl>
## 1 Afghanistan 1973 4
## 2 Afghanistan 1974 1
## 3 Afghanistan 1975 1
## 4 Afghanistan 1976 1
## 5 Afghanistan 1977 1
## 6 Afghanistan 1978 2
Frame 2
## # A tibble: 6 x 3
## `Year(s) Under Review` `1972` `1973`
## <chr> <chr> <chr>
## 1 <NA> PR PR
## 2 Afghanistan 4 7
## 3 Albania 7 7
## 4 Algeria 6 6
## 5 Andorra 4 4
## 6 Angola <NA> <NA>


They might seem similar at first glance, but the frame on the left will be significantly easier to work with. In the following slides I'll explain why.

2 / 19

What is Tidy Data?

There are three rules that all Tidy Data must follow:
  1. Each variable gets its own column.
  2. Each observation gets its own row.
  3. Each value gets its own cell.

Every tidy dataset will follow these rules.

3 / 19

What is Tidy Data?

There are three rules that all Tidy Data must follow:
  1. Each variable gets its own column.
  2. Each observation gets its own row.
  3. Each value gets its own cell.

Every tidy dataset will follow these rules.

Consider an example

head(starwars[1:4])
## # A tibble: 6 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 C-3PO 167 75 <NA>
## 3 R2-D2 96 32 <NA>
## 4 Darth Vader 202 136 none
## 5 Leia Organa 150 49 brown
## 6 Owen Lars 178 120 brown, grey
3 / 19

What is Tidy Data: Which of these tables is tidy?

table1[1:4,]
## # A tibble: 4 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
table2[1:4,]
## # A tibble: 4 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
table3[1:4,]
## # A tibble: 4 x 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898

Example continues on next slide

4 / 19

What is Tidy Data: Which of these datasets is tidy?

table4a[1:3,] # cases
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b[1:3,] # population
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
5 / 19

What is Tidy Data?

Table 1 is the only table that is tidy.

6 / 19

What is Tidy Data?

Table 1 is the only table that is tidy.

table1[1:4,]
## # A tibble: 4 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
6 / 19

What is Tidy Data?

Table 1 is the only table that is tidy.

table1[1:4,]
## # A tibble: 4 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898

Note why this data is tidy

(1) Every variable has its own column.
(2) Every observation has its own row.
(3) Every value has its own cell.

6 / 19

Tidying Data: How to Do It

7 / 19

Tidying Data

So what do we do if data has broken one (or more!) of the rules of tidy data?

8 / 19

Tidying Data

So what do we do if data has broken one (or more!) of the rules of tidy data?

Well, we bring it back into compliance!

8 / 19

Tidying Data

So what do we do if data has broken one (or more!) of the rules of tidy data?

Well, we bring it back into compliance!

In order to tidy - or wrangle - the data, we will use the following functions from the dplyr package.

  • pivot_longer()
  • pivot_wider()
  • separate()
  • unite()
8 / 19

Tidying Data: pivot_longer()

When data has observations that should be columns, use pivot_longer().

9 / 19

Tidying Data: pivot_longer()

When data has observations that should be columns, use pivot_longer().

Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.

Consider an example of untidy data from the previous slides

9 / 19

Tidying Data: pivot_longer()

When data has observations that should be columns, use pivot_longer().

Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.

Consider an example of untidy data from the previous slides

## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
9 / 19

Tidying Data: pivot_longer()

When data has observations that should be columns, use pivot_longer().

Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.

Consider an example of untidy data from the previous slides

## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

1999 and 2000 are not variable names; they're values of an unlisted variable - year. Thus, they should be in their own column. The pivot_longer() function achieves this. To do so, it requires three main arguments: which columns are to be moved (cols), what should be the new column that these ex-columns are stored in (names_to), and what should be the new column that the previously stored values go to (values_to)?

Try to solve this yourself before proceeding.

9 / 19

Tidying Data: pivot_longer()

10 / 19

Tidying Data: pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
10 / 19

Tidying Data: pivot_longer()

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
pivot_longer(data = table4a,
cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

10 / 19

Tidying Data: pivot_wider()

When data has columns that should be observations, use pivot_wider().

11 / 19

Tidying Data: pivot_wider()

When data has columns that should be observations, use pivot_wider().

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

11 / 19

Tidying Data: pivot_wider()

When data has columns that should be observations, use pivot_wider().

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

## # A tibble: 4 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
11 / 19

Tidying Data: pivot_wider()

When data has columns that should be observations, use pivot_wider().

Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.

Consider another example we've looked at previously:

## # A tibble: 4 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360

Note that in this table, the unit of analysis is no longer "the case count of a given country in a given year," like it was in our tidy dataframe. Now, the unit of analysis is "a given type of count" in a given country in a given year. Wow - that's a loy of "given"s! And it is precisely why Afghanistan is listed four times here!

There is a row for Afghanistan's cases in 1999, a row for its population in 1999, a row for its cases in 2000, a row for its population in 2000, and so on. To tidy it, we need to move rows to be columns.

11 / 19

Tidying Data: pivot_wider()

12 / 19

Tidying Data: pivot_wider()

table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
12 / 19

Tidying Data: pivot_wider()

table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
pivot_wider(data = table2,
names_from = "type",
values_from = "count")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

12 / 19

Tidying Data: separate()

When data has cells with more than one value, use separate().

13 / 19

Tidying Data: separate()

When data has cells with more than one value, use separate().

Usually, this will involve the creation of a new column to store the new values from the over-filled cell.


Consider table3 from earlier:

13 / 19

Tidying Data: separate()

When data has cells with more than one value, use separate().

Usually, this will involve the creation of a new column to store the new values from the over-filled cell.


Consider table3 from earlier:

## # A tibble: 4 x 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898

Note that the cells in the rate column actually contain two values: the case count first, then the population count. This violates the third rule of tidy data. To tidy it, each value should have its own cell.

Can you guess what the new columns will be called?

13 / 19

Tidying Data: separate()

14 / 19

Tidying Data: separate()

table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
14 / 19

Tidying Data: separate()

table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
separate(data = table3,
col = "rate",
into = c("cases", "population"),
sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

14 / 19

Tidying Data: unite()

When data has cells with less than one value, use unite().

15 / 19

Tidying Data: unite()

When data has cells with less than one value, use unite().

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.


Consider table5:

15 / 19

Tidying Data: unite()

When data has cells with less than one value, use unite().

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.


Consider table5:

## # A tibble: 4 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
15 / 19

Tidying Data: unite()

When data has cells with less than one value, use unite().

This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.


Consider table5:

## # A tibble: 4 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898

Note that a meaningful value of year is obscurred by the columns century and year. Instead of having Afghanistan in 1999, we have Afghanistan in the 99th year of the 19th century (... which, come to think of it, would be 1899!). Unite() fixes this by combining the two columns

Can you guess what the new column will be called?

15 / 19

Tidying Data: unite()

16 / 19

Tidying Data: unite()

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
16 / 19

Tidying Data: unite()

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite(data = table5,
col = "year",
century, year)
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583

Note that the new year column is not formatted correctly. This is because it is now separated with a "_". We can fix this with the sep = argument.

16 / 19

Tidying Data: unite()

table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite(data = table5,
col = "year",
century, year,
sep = "")
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583

Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.

The unit of analysis is the case count of a given country in a given year.

17 / 19

Conclusion

Tidy data follows three core principles: every variable gets its own column, every observation gets its own row, every value gets its own cell.

To clean our data into tidy data, we can use the following functions from the dplyr package:

  • pivot_longer() when observations are columns
  • pivot_wider() when columns are observations
  • separate() when a cell has more than 1 value
  • unite() when a cell has less than 1 meaningful value


References

Wickam, Hadley and Garrett Grolemund, R for Data Science.

18 / 19

Questions?

Email me!

crabtreed@uchicago.edu

19 / 19

What is Tidy Data?

Consider these two dataframes. Which one looks easier to work with?

2 / 19
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow