generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 5
/
01_data-preparation.Rmd
287 lines (181 loc) · 8.4 KB
/
01_data-preparation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# Data Preparation
```{r 01-01, include=FALSE}
data(Salaries, package="carData")
```
**Learning objectives:**
- How to import data from external sources, such as text files, excel spreadsheets, statistical packages, and databases
- How to clean data so that its in a useful format prior to visualizing it
## Importing data
- We'll use the [Salaries](https://rkabacoff.github.io/datavis/Data.html#Salaries) dataset to illustrate that R can import data from many different sources.
- It is provided in other formats (i.e. `.csv`, `.txt`, `.xsls`) in [Appendix A](https://rkabacoff.github.io/datavis/Data.html) of the book so that you can practice importing data.
- The dataset describes the 9 month academic salaries of 397 college professors at a single institution in 2008-2009.
## Text files
The [`readr`](https://www.rdocumentation.org/packages/readr/versions/0.1.1/topics/read_delim) package provides functions for importing delimited text files into R data frames.
```{r 01-02, eval=FALSE}
library(readr)
# import data from a comma delimited file
Salaries <- read_csv("data/salaries.csv")
# import data from a tab delimited file
Salaries <- read_tsv("data/salaries.txt")
```
## Excel spreadsheets
The [`readxl`](https://www.rdocumentation.org/packages/readxl/versions/1.4.2) package can import data from Excel workbooks. Both xls and xlsx formats are supported.
Since workbooks can have more than one worksheet, you can specify the one you want with the `sheet` option. The default is `sheet=1`.
```{r 01-03, eval=FALSE}
library(readxl)
# import data from an Excel workbook
Salaries <- read_excel("data/salaries.xlsx", sheet=1)
```
## Stastistical packages
The [`haven`](https://www.rdocumentation.org/packages/haven/versions/2.5.2) package provides functions for importing data from a variety of statistical packages.
```{r 01-04, eval=FALSE}
library(haven)
# import data from Stata
Salaries <- read_dta("data/salaries.dta")
# import data from SPSS
Salaries <- read_sav("data/salaries.sav")
# import data from SAS
Salaries <- read_sas("data/salaries.sas7bdat")
```
## Databases
Depending on the database containing the data, the following packages can help:
- [`RODBC`](https://www.rdocumentation.org/packages/RODBC/versions/0.8-3/topics/RODBC)
- [`RMySQL`](https://www.rdocumentation.org/packages/RMySQL/versions/0.10.25)
- [`ROracle`](https://www.rdocumentation.org/packages/ROracle/versions/1.3-1.1)
- [`RPostgreSQL`](https://www.rdocumentation.org/packages/RPostgreSQL/versions/0.7-5)
- [`RSQLite`](https://www.rdocumentation.org/packages/RSQLite/versions/2.3.0)
- [`RMongo`](https://www.rdocumentation.org/packages/RMongo/versions/0.0.25)
In the newest versions of RStudio, you can use the *Connections pane* to quickly access the data stored in database management systems.
## Cleaning data
- Cleaning data can be the most time consuming part of any data analysis.
- Two of the quickest and easiest to learn approaches to clean data are using the [`dplyr`](https://dplyr.tidyverse.org/) and [`tidyr`](https://tidyr.tidyverse.org/) package.
```{r, 01-05, echo=FALSE}
knitr::include_graphics("images/dplyr.png")
```
## Selecting variables
The `dplyr::select` function allows you to limit your dataset to specified variables (columns).
```{r 01-06, eval=FALSE}
library(dplyr)
data("starwars")
# keep the variables name, height, and gender
newdata <- select(starwars, name, height, gender)
# keep the variables name and all variables
# between mass and species inclusive
newdata <- select(starwars, name, mass:species)
# keep all variables except birth_year and gender
newdata <- select(starwars, -birth_year, -gender)
```
## Selecting observations
The `dplyr::filter` function allows you to limit your dataset to observations (rows) meeting a specific criteria. Multiple criteria can be combined with the `&` (AND) and `|` (OR) symbols.
```{r 01-07, message=FALSE, warning=FALSE}
library(dplyr)
data("starwars")
# select females
newdata <- filter(starwars,
gender == "feminine")
# select females that are from Alderaan
newdata <- filter(starwars,
gender == "feminine" &
homeworld == "Alderaan")
```
## Creating/Recoding variables
The `dplyr::mutate` function allows you to create new variables or transform existing ones.
```{r 01-08}
library(dplyr)
data("starwars")
# convert height in centimeters to inches,
# and mass in kilograms to pounds
newdata <- mutate(starwars,
height = height * 0.394,
mass = mass * 2.205)
```
##
The `ifelse` function (part of base R) can be used for recoding data. The format is `ifelse(test, return if TRUE, return if FALSE)`.
```{r 01-09}
# if height is greater than 180
# then heightcat = "tall",
# otherwise heightcat = "short"
newdata <- mutate(starwars,
heightcat = ifelse(height > 180,
"tall",
"short"))
# set heights greater than 200 or less than 75 to missing
newdata <- mutate(starwars,
height = ifelse(height < 75 | height > 200,
NA, height))
```
## Summarizing data
The `dplyr::summarize` function can be used to reduce multiple values down to a single value (such as a mean).
```{r 01-10}
library(dplyr)
# calculate mean height and mass
newdata <- summarize(starwars,
mean_ht = mean(height, na.rm=TRUE),
mean_mass = mean(mass, na.rm=TRUE))
newdata
```
##
The `dplyr::summarize` is often used in conjunction with the `group_by` function, to calculate statistics by group.
```{r 01-12}
# calculate mean height and weight by gender
newdata <- group_by(starwars, gender)
newdata <- summarize(newdata,
mean_ht = mean(height, na.rm=TRUE),
mean_wt = mean(mass, na.rm=TRUE))
newdata
```
## Using pipes
- Tidyverse packages like `dplyr` and `tidyr` allow you to write code using the pipe ` %>% ` operator.
- The ` %>% ` operator passes the result on the left to the first parameter of the function on the right.
```{r 01-13}
library(dplyr)
# calculate the mean height for women by species
newdata <- starwars %>%
filter(gender == "feminine") %>%
group_by(species) %>%
summarize(mean_ht = mean(height, na.rm = TRUE))
newdata
```
## Reshaping data
**NOTE:** tidyr 1.0.0 introduced `pivot_longer()` and `pivot_wider()`, replacing the older `spread()` and `gather()` functions.
```{r 01-14}
library(tidyr)
table1 #tidy dataset
```
## `pivot_wider()`
```{r 01-15}
glimpse(table2)
table2 %>%
pivot_wider(names_from = type, values_from = count)
```
## `pivot_longer()`
```{r 01-16}
glimpse(table4a)
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
## Missing Data
There are three basic approaches to dealing with missing data: *feature selection*, *listwise deletion*, and *imputation*.
- In **feature selection**, you delete variables (columns) that contain too many missing values.
- **Listwise deletion** involves deleting observations (rows) that contain missing values on any of the variables of interest.
- **Imputation** involves replacing missing values with “reasonable” guesses about what the values would have been if they had not been missing. There are several approaches, as detailed in such packages as `VIM`, `mice`, `Amelia` and `missForest`.
## Resources
- [Dplyr Cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)
- [Tidyr Cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf)
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/PQ62v8j8rRM")`
<details>
<summary> Meeting chat log </summary>
```
00:16:13 Lydia Gibson: https://r4ds.github.io/bookclub-datavisr/
00:16:22 Lydia Gibson: https://rkabacoff.github.io/datavis/
00:16:32 Lydia Gibson: https://github.com/r4ds/bookclub-datavisr
00:20:16 Lydia Gibson: https://docs.google.com/spreadsheets/d/1yrXUdZ95upU3kISocqinvaDh-U2JK3mnjhtj4Fr27H8/edit#gid=0
00:20:20 Ryan Spjut: stepping away be back in a few minutes
00:38:26 Lydia Gibson: https://www.rdocumentation.org/packages/readr/versions/0.1.1/topics/read_delim
00:38:38 Lydia Gibson: https://www.rdocumentation.org/packages/readxl/versions/1.4.2
00:57:02 Lydia Gibson: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
00:57:13 Lydia Gibson: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf
```
</details>