## # A tibble: 6 x 3## country year pr## <chr> <dbl> <dbl>## 1 Afghanistan 1973 4## 2 Afghanistan 1974 1## 3 Afghanistan 1975 1## 4 Afghanistan 1976 1## 5 Afghanistan 1977 1## 6 Afghanistan 1978 2
## # A tibble: 6 x 3## country year pr## <chr> <dbl> <dbl>## 1 Afghanistan 1973 4## 2 Afghanistan 1974 1## 3 Afghanistan 1975 1## 4 Afghanistan 1976 1## 5 Afghanistan 1977 1## 6 Afghanistan 1978 2
## # A tibble: 6 x 3## `Year(s) Under Review` `1972` `1973`## <chr> <chr> <chr> ## 1 <NA> PR PR ## 2 Afghanistan 4 7 ## 3 Albania 7 7 ## 4 Algeria 6 6 ## 5 Andorra 4 4 ## 6 Angola <NA> <NA>
## # A tibble: 6 x 3## country year pr## <chr> <dbl> <dbl>## 1 Afghanistan 1973 4## 2 Afghanistan 1974 1## 3 Afghanistan 1975 1## 4 Afghanistan 1976 1## 5 Afghanistan 1977 1## 6 Afghanistan 1978 2
## # A tibble: 6 x 3## `Year(s) Under Review` `1972` `1973`## <chr> <chr> <chr> ## 1 <NA> PR PR ## 2 Afghanistan 4 7 ## 3 Albania 7 7 ## 4 Algeria 6 6 ## 5 Andorra 4 4 ## 6 Angola <NA> <NA>
They might seem similar at first glance, but the frame on the left will be significantly easier to work with. In the following slides I'll explain why.
Every tidy dataset will follow these rules.
Every tidy dataset will follow these rules.
head(starwars[1:4])
## # A tibble: 6 x 4## name height mass hair_color ## <chr> <int> <dbl> <chr> ## 1 Luke Skywalker 172 77 blond ## 2 C-3PO 167 75 <NA> ## 3 R2-D2 96 32 <NA> ## 4 Darth Vader 202 136 none ## 5 Leia Organa 150 49 brown ## 6 Owen Lars 178 120 brown, grey
table1[1:4,]
## # A tibble: 4 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898
table2[1:4,]
## # A tibble: 4 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360
table3[1:4,]
## # A tibble: 4 x 3## country year rate ## <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362## 4 Brazil 2000 80488/174504898
table4a[1:3,] # cases
## # A tibble: 3 x 3## country `1999` `2000`## <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
table4b[1:3,] # population
## # A tibble: 3 x 3## country `1999` `2000`## <chr> <int> <int>## 1 Afghanistan 19987071 20595360## 2 Brazil 172006362 174504898## 3 China 1272915272 1280428583
table1[1:4,]
## # A tibble: 4 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898
table1[1:4,]
## # A tibble: 4 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898
(1) Every variable has its own column.
(2) Every observation has its own row.
(3) Every value has its own cell.
In order to tidy - or wrangle - the data, we will use the following functions from the dplyr
package.
pivot_longer()
pivot_wider()
separate()
unite()
pivot_longer()
pivot_longer()
.pivot_longer()
pivot_longer()
.Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.
Consider an example of untidy data from the previous slides
pivot_longer()
pivot_longer()
.Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.
Consider an example of untidy data from the previous slides
## # A tibble: 3 x 3## country `1999` `2000`## <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
pivot_longer()
pivot_longer()
.Usually, this means that the dataset has column names that are not names of variables but values of variables; this violates the first rule of tidy data.
Consider an example of untidy data from the previous slides
## # A tibble: 3 x 3## country `1999` `2000`## <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
1999 and 2000 are not variable names; they're values of an unlisted variable - year. Thus, they should be in their own column. The pivot_longer()
function achieves this. To do so, it requires three main arguments: which columns are to be moved (cols
), what should be the new column that these ex-columns are stored in (names_to
), and what should be the new column that the previously stored values go to (values_to
)?
pivot_longer()
pivot_longer()
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
pivot_longer()
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
pivot_longer(data = table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3## country year cases## <chr> <chr> <int>## 1 Afghanistan 1999 745## 2 Afghanistan 2000 2666## 3 Brazil 1999 37737## 4 Brazil 2000 80488## 5 China 1999 212258## 6 China 2000 213766
Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.
The unit of analysis is the case count of a given country in a given year.
pivot_wider()
pivot_wider()
.pivot_wider()
pivot_wider()
.Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.
Consider another example we've looked at previously:
pivot_wider()
pivot_wider()
.Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.
Consider another example we've looked at previously:
## # A tibble: 4 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360
pivot_wider()
pivot_wider()
.Usually, this will be because one observation is spread across multiple rows instead of just one; this violates the second rule of tidy data.
Consider another example we've looked at previously:
## # A tibble: 4 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360
Note that in this table, the unit of analysis is no longer "the case count of a given country in a given year," like it was in our tidy dataframe. Now, the unit of analysis is "a given type of count" in a given country in a given year. Wow - that's a loy of "given"s! And it is precisely why Afghanistan is listed four times here!
There is a row for Afghanistan's cases in 1999, a row for its population in 1999, a row for its cases in 2000, a row for its population in 2000, and so on. To tidy it, we need to move rows to be columns.
pivot_wider()
pivot_wider()
table2
## # A tibble: 12 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
pivot_wider()
table2
## # A tibble: 12 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272## 11 China 2000 cases 213766## 12 China 2000 population 1280428583
pivot_wider(data = table2, names_from = "type", values_from = "count")
## # A tibble: 6 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.
The unit of analysis is the case count of a given country in a given year.
separate()
separate()
.separate()
separate()
.Usually, this will involve the creation of a new column to store the new values from the over-filled cell.
Consider table3
from earlier:
separate()
separate()
.Usually, this will involve the creation of a new column to store the new values from the over-filled cell.
Consider table3
from earlier:
## # A tibble: 4 x 3## country year rate ## <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362## 4 Brazil 2000 80488/174504898
Note that the cells in the rate column actually contain two values: the case count first, then the population count. This violates the third rule of tidy data. To tidy it, each value should have its own cell.
Can you guess what the new columns will be called?
separate()
separate()
table3
## # A tibble: 6 x 3## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
separate()
table3
## # A tibble: 6 x 3## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
separate(data = table3, col = "rate", into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4## country year cases population## <chr> <int> <chr> <chr> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.
The unit of analysis is the case count of a given country in a given year.
unite()
unite()
.unite()
unite()
.This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.
Consider table5
:
unite()
unite()
.This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.
Consider table5
:
## # A tibble: 4 x 4## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362## 4 Brazil 20 00 80488/174504898
unite()
unite()
.This is rare, but it occurs when a meaningful value is spread across multiple columns. This violates the third rule of tidy data.
Consider table5
:
## # A tibble: 4 x 4## country century year rate ## <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362## 4 Brazil 20 00 80488/174504898
Note that a meaningful value of year is obscurred by the columns century
and year
. Instead of having Afghanistan in 1999, we have Afghanistan in the 99th year of the 19th century (... which, come to think of it, would be 1899!). Unite()
fixes this by combining the two columns
Can you guess what the new column will be called?
unite()
unite()
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite()
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite(data = table5, col = "year", century, year)
## # A tibble: 6 x 3## country year rate ## <chr> <chr> <chr> ## 1 Afghanistan 19_99 745/19987071 ## 2 Afghanistan 20_00 2666/20595360 ## 3 Brazil 19_99 37737/172006362 ## 4 Brazil 20_00 80488/174504898 ## 5 China 19_99 212258/1272915272## 6 China 20_00 213766/1280428583
Note that the new year column is not formatted correctly. This is because it is now separated with a "_". We can fix this with the sep =
argument.
unite()
table5
## # A tibble: 6 x 4## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272## 6 China 20 00 213766/1280428583
unite(data = table5, col = "year", century, year, sep = "")
## # A tibble: 6 x 3## country year rate ## <chr> <chr> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272## 6 China 2000 213766/1280428583
Note that the resulting dataframe is now tidy: Each variable in its own column, each observation in its own row, each value in its own cell.
The unit of analysis is the case count of a given country in a given year.
Tidy data follows three core principles: every variable gets its own column, every observation gets its own row, every value gets its own cell.
To clean our data into tidy data, we can use the following functions from the dplyr
package:
pivot_longer()
when observations are columnspivot_wider()
when columns are observationsseparate()
when a cell has more than 1 valueunite()
when a cell has less than 1 meaningful valueWickam, Hadley and Garrett Grolemund, R for Data Science.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |