# Reshaping and Combining Data Sets
## Ordering and Sorting Data
Sorting a data set, in ascending order, say, is a common task. You might need to do it because
1. ordering and ranking are commonly done in *nonparametric statistics*,
2. you want to inspect the most "extreme" observations in a data set,
3. it's a pre-processing step before generating visualizations.
In R, it all starts with `vector`s. There are two common functions you should know: `sort()` and `order()`. `sort()` returns the sorted *data*, while `order()` returns the *order indexes*.
```{r, collapse = TRUE}
sillyData <- rnorm(5)
print(sillyData)
sort(sillyData)
order(sillyData)
```
`order()` is useful if you're sorting a data frame by a particular column. Below, we inspect the top 5 most expensive cars in an example data set [@sas_cars]. Notice that we need to clean up the `MSRP` column (a `character` vector) a little first. We use the function `gsub()` to find patterns in the text and replace them with the empty string.
```{r}
carData <- read.csv("data/cars.csv")
noDollarSignMSRP <- gsub("$", "", carData$MSRP, fixed = TRUE)
carData$cleanMSRP <- as.numeric(gsub(",", "", noDollarSignMSRP,
fixed = TRUE))
rowIndices <- order(carData$cleanMSRP, decreasing = TRUE)[1:5]
carData[rowIndices,c("Make", "Model", "MSRP", "cleanMSRP")]
```
In Python, Numpy has [`np.argsort()`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) and [`np.sort()`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html).
```{python, collapse = TRUE}
import numpy as np
silly_data = np.random.normal(size=5)
print(silly_data)
print( np.sort(silly_data) )
np.argsort(silly_data)
```
For Pandas\index{Pandas}' `DataFrame`s, most of the functions I find useful are methods attached to the `DataFrame` class. That means that, as long as something is inside a `DataFrame`, you can use dot notation.
```{python, collapse = TRUE}
import pandas as pd
car_data = pd.read_csv("data/cars.csv")
car_data['no_dlr_msrp'] = car_data['MSRP'].str.replace("$", "",
regex = False)
no_commas = car_data['no_dlr_msrp'].str.replace(",","")
car_data['clean_MSRP'] = no_commas.astype(float)
car_data = car_data.sort_values(by='clean_MSRP', ascending = False)
car_data[["Make", "Model", "MSRP", "clean_MSRP"]].head(5)
```
Pandas' `DataFrame`s and `Series` have a [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method. We use this to remove dollar signs and commas from the MSRP column. Note that we had to access the `.str` attribute of the `Series` column before we used it. After the string was processed, we converted it to a `Series` of `float`s with the `astype()` method.
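As an aside, the two replacement steps could be collapsed into one by switching on regular expressions. The following is just a sketch of an alternative, not something the example above requires; the pattern `[$,]` matches either a dollar sign or a comma.
```{python, eval = FALSE}
# a one-step alternative: strip dollar signs and commas together,
# then convert the cleaned strings to floats
car_data['clean_MSRP'] = (car_data['MSRP']
                          .str.replace(r'[$,]', '', regex=True)
                          .astype(float))
```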
Finally, sorting the overall data frame could have been done with the same approach as the code we used in R (i.e. raw subsetting by row indexes), but there is a built-in method called [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) that will do it for us.
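If you're curious, here is a rough sketch of that index-based approach in Pandas: `np.argsort()` plays the role of R's `order()`, and `.iloc` does the row subsetting. This is only meant to illustrate the equivalence, and it assumes the `car_data` object (with its `clean_MSRP` column) from above.
```{python, eval = FALSE}
import numpy as np
# negate the prices so that argsort's ascending order becomes descending,
# then keep the first five row indexes
row_indices = np.argsort(-car_data['clean_MSRP'].to_numpy())[:5]
car_data.iloc[row_indices][["Make", "Model", "MSRP", "clean_MSRP"]]
```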
## Stacking Data Sets and Placing Them Shoulder to Shoulder
Stacking data sets on top of each other is a common task. You might need to do it if
1. you need to add a new row (or many rows) to a data frame,
2. you need to recombine data sets (e.g. recombine a train/test split), or
3. you're creating a matrix in a step-by-step way.
In R, this can be done with `rbind()` (short for "row bind"). Consider the following example that makes use of GIS data queried from [@albemarle_county_gis_web] and cleaned with code from [@clay_ford].
```{r, collapse = TRUE, R.options = list(width = 65)}
realEstate <- read.csv("data/albemarle_real_estate.csv")
train <- realEstate[-1,]
test <- realEstate[1,]
str(rbind(test, train), strict.width = "cut")
sum(rbind(test, train) != realEstate)
```
The above example was with `data.frame`s. `rbind()` works on `matrix` objects, too.
```{r, collapse=TRUE}
rbind(matrix(1,nrow = 2, ncol = 3),
matrix(2,nrow = 2, ncol = 3))
```
In Python, you can stack data frames with [`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html). It has a lot of options, so feel free to peruse them. You can also replace the call to `pd.concat()` below with [`test.append(train)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html), although be aware that `.append()` is deprecated in newer versions of Pandas. Consider the example below that uses the Albemarle County real estate data [@albemarle_county_gis_web] [@clay_ford].
```{python, collapse = TRUE}
import pandas as pd
real_estate = pd.read_csv("data/albemarle_real_estate.csv")
train = real_estate.iloc[1:,]
test = real_estate.iloc[[0],] # need the extra brackets!
stacked = pd.concat([test,train], axis=0)
stacked.iloc[:3,:3]
(stacked != real_estate).sum().sum()
```
Take note of the extra square brackets when we create `test`. If you use `real_estate.iloc[0,]` instead, it will return a `Series` with all the elements coerced to the same type, and this won't `pd.concat()` properly with the rest of the data!
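A quick way to convince yourself of this is to compare the types returned by the two subsetting styles. This little sketch only inspects types, reusing the `real_estate` object from above.
```{python, eval = FALSE}
# single brackets return a Series (all elements coerced to one type);
# double brackets return a one-row DataFrame (column types preserved)
print(type(real_estate.iloc[0,]))    # a Series
print(type(real_estate.iloc[[0],]))  # a DataFrame
```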
## Merging or Joining Data Sets
If you have two different data sets that provide information about the same experimental units, you can put the two data sets together using a **`merge`** (aka **`join`**) operation.\index{merge!merging in R}\index{join|see{merge}} In R, you can use the [`merge()` function](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge). In Python, you can use the [`.merge()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas-dataframe-merge).
Merging (or joining) data sets is not the same as placing them shoulder to shoulder. Placing data sets shoulder to shoulder will not reorder the rows of your data, and the operation requires that both input data sets have the same number of rows to start with. Merging, on the other hand, takes care to match rows together in an intelligent way, and it can handle missing or duplicate matches. In both cases the resulting data set is wider, but with merging, the output might end up containing either more or fewer rows.
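To make the contrast concrete, here is a minimal sketch of placing two data frames shoulder to shoulder; the tiny data frames and their column names are made up for illustration. With `pd.concat()`, passing `axis=1` places columns side by side, and rows are aligned on their index labels rather than matched on the values in any column.
```{python, eval = FALSE}
import pandas as pd
left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [4, 5, 6]})
# axis=1 means "place shoulder to shoulder" instead of stacking;
# rows are aligned by index, with no intelligent matching
pd.concat([left, right], axis=1)
```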
Here's a clarifying example of a merge. Suppose you have two sets of supposedly anonymized data about individual accounts on some online platforms.
```{r, collapse = TRUE}
# in R
baby1 <- read.csv("data/baby1.csv", stringsAsFactors = FALSE)
baby2 <- read.csv("data/baby2.csv", stringsAsFactors = FALSE)
head(baby1)
head(baby2)
```
```{python, echo = FALSE}
import pandas as pd
baby1 = pd.read_csv("data/baby1.csv")
baby2 = pd.read_csv("data/baby2.csv")
```
The first thing you need to ask yourself is *"which column is the unique identifier that is shared between these two data sets?"* In our case, they both have an "identification number" column. However, these two data sets are coming from different online platforms, and these two places use different schemes to number their users.
In this case, it is better to merge on the email addresses. Users might be using different email addresses on these two platforms, but matched email addresses give a stronger guarantee that you're matching the right accounts. The columns are named differently in each data set, so we must specify them by name.
```{r, collapse = TRUE}
# in R
merge(baby1, baby2, by.x = "email_address", by.y = "email")
```
In Python, `merge()` is a method attached to each `DataFrame` instance.
```{python, collapse = TRUE}
# in Python
baby1.merge(baby2, left_on = "email_address", right_on = "email")
```
The email addresses `[email protected]` and `[email protected]` exist in both data sets, so each of these email addresses will end up in the result data frame. The rows in the result data set are wider and have more attributes for each individual.
Notice the duplicate email address, too. In this case, either the user signed up for two accounts using the same email, or one person signed up for an account with another person's email address. In the case of duplicates, both rows will match with the same rows in the other data frame.
Also, in this case, all email addresses that weren't found in both data sets were thrown away. This is not necessarily the behavior you want. If we wanted to make sure no rows were thrown away, that would be possible, too; in that case, though, rows whose email address wasn't found in both data sets will be missing some information. Recall that Python and R handle missing data differently (see \@ref(how-r-and-python-handle-missing-values)).
```{r, collapse = TRUE}
# in R
merge(baby1, baby2,
by.x = "email_address", by.y = "email",
all.x = TRUE, all.y = TRUE)
```
```{python, collapse = TRUE}
# in Python
le_merge = baby1.merge(baby2,
left_on = "email_address", right_on = "email",
how = "outer")
le_merge.iloc[:5,3:]
```
You can see it's slightly more concise in Python. If you are familiar with SQL, you might have heard of inner and outer joins. This is where Pandas [takes some of its argument names from](https://pandas.pydata.org/pandas-docs/version/0.15/merging.html#database-style-dataframe-joining-merging).\index{merge!merging in Python}
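If you would like to experiment with the different join types, the `how=` argument to `.merge()` accepts `'inner'` (the default), `'outer'`, `'left'`, and `'right'`. Here is a sketch reusing the two baby data sets from above.
```{python, eval = FALSE}
# a "left" join: keep every row of baby1, matched where possible
baby1.merge(baby2, left_on = "email_address", right_on = "email",
            how = "left")
# an "inner" join (the default): keep only rows matched in both
baby1.merge(baby2, left_on = "email_address", right_on = "email",
            how = "inner")
```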
Finally, if both data sets have multiple values in the column you're joining on, the result can have more rows than either table. This is because *every possible match* shows up.
```{r, collapse = TRUE}
# in R
first <- data.frame(category = c('a','a'), measurement = c(1,2))
merge(first, first, by.x = "category", by.y = "category")
```
```{python, collapse = TRUE}
# in Python
first = pd.DataFrame({'category' : ['a','a'], 'measurement' : [1,2]})
first.merge(first, left_on = "category", right_on = "category")
```
## Long Versus Wide Data
### Long Versus Wide in R
Many types of data can be stored in either a **wide** or **long** format.
The classical example is data from a *longitudinal study.* If an experimental unit (in the example below, this is a person) is repeatedly measured over time, then in a long-format data set, each row corresponds to an experimental unit *and* an observation time.
```{r, collapse = TRUE}
peopleNames <- c("Taylor","Taylor","Charlie","Charlie")
fakeLongData1 <- data.frame(person = peopleNames,
timeObserved = c(1, 2, 1, 2),
nums = c(100,101,300,301))
fakeLongData1
```
A long format can also be used if you have multiple observations (at a single time point) on an experimental unit. Here is another example.
```{r, collapse = TRUE}
myAttrs <- c("attrA","attrB","attrA","attrB")
fakeLongData2 <- data.frame(person = peopleNames,
attributeName = myAttrs,
nums = c(100,101,300,301))
fakeLongData2
```
If you would like to reshape the long data sets into a wide format, you can use the `reshape()` function. You will need to specify which columns correspond with the experimental unit, and which column is the "factor" variable.
```{r, collapse = TRUE}
fakeWideData1 <- reshape(fakeLongData1,
direction = "wide",
timevar = "timeObserved",
idvar = "person",
varying = c("before","after"))
# ^ varying= arg becomes col names in new data set
fakeLongData1
fakeWideData1
```
```{r, collapse = TRUE}
# timevar= is a misnomer here
fakeWideData2 <- reshape(fakeLongData2,
direction = "wide",
timevar = "attributeName",
idvar = "person",
varying = c("attribute A","attribute B"))
fakeLongData2
fakeWideData2
```
`reshape()` will also go in the other direction: it can take wide data and convert it into long data.
```{r, collapse = TRUE}
reshape(fakeWideData1,
direction = "long",
idvar = "person",
varying = list(c("before","after")),
v.names = "nums")
fakeLongData1
reshape(fakeWideData2,
direction = "long",
idvar = "person",
varying = list(c("attribute A","attribute B")),
v.names = "nums")
fakeLongData2
```
### Long Versus Wide in Python
With Pandas, we can make long data wide with [`pd.DataFrame.pivot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#), and we can go in the other direction with [`pd.DataFrame.melt()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt).
When going from long to wide, make sure to use the [`pd.DataFrame.reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html#) method afterwards to turn the row index back into a regular column. Here is an example similar to the one above.
```{python, collapse = TRUE}
import pandas as pd
fake_long_data1 = pd.DataFrame(
{'person' : ["Taylor","Taylor","Charlie","Charlie"],
'time_observed' : [1, 2, 1, 2],
'nums' : [100,101,300,301]})
fake_long_data1
pivot_data1 = fake_long_data1.pivot(index='person',
columns='time_observed',
values='nums')
fake_wide_data1 = pivot_data1.reset_index()
fake_wide_data1
```
Here's one more example showing the same functionality: going from long to wide format.
```{python, collapse = TRUE}
people_names = ["Taylor","Taylor","Charlie","Charlie"]
attribute_list = ['attrA', 'attrB', 'attrA', 'attrB']
fake_long_data2 = pd.DataFrame({'person' : people_names,
'attribute_name' : attribute_list,
'nums' : [100,101,300,301]})
fake_wide_data2 = fake_long_data2.pivot(index='person',
columns='attribute_name',
values='nums').reset_index()
fake_wide_data2
```
Here are some examples of going in the other direction: from wide to long with [`pd.DataFrame.melt()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt). The first example specifies value columns by integers.
```{python, collapse = TRUE}
fake_wide_data1
fake_wide_data1.melt(id_vars = "person", value_vars = [1,2])
```
The second example uses strings to specify value columns.
```{python, collapse = TRUE}
fake_wide_data2
fake_wide_data2.melt(id_vars = "person",
value_vars = ['attrA','attrB'])
```
## Exercises
### R Questions
1.
Recall the `car.data` data set [@misc_car_evaluation_19], which is hosted by [@uci_data].
a) Read in the data set as `carData`.
b) Convert the third and fourth columns to *ordered* `factor`s.
c) Order the data by the third and then the fourth column (simultaneously). Do not change the data in place. Instead, store the result under the name `ordCarData1`.
d) Order the data by the fourth and then the third column (simultaneously). Do not change the data in place. Instead, store the result under the name `ordCarData2`.
2.
```{r, eval = FALSE}
day1Data <- data.frame(idNum = 1:10,
measure = rnorm(10))
day2Data <- data.frame(idNum = 11:20,
measure = rnorm(10))
```
a) Pretend `day1Data` and `day2Data` are two separate data sets that possess the same type of measures but on different experimental units. Stack `day1Data` on top of `day2Data` and call the result `stackedData`.
b) Pretend `day1Data` and `day2Data` are different measurements on the same experimental units. Place them shoulder to shoulder and call the result `sideBySide`. Put `day1Data` first, and `day2Data` second.
3.
If you are dealing with random matrices, you might need to **vectorize** a matrix object. This is not the same as "vectorization" in programming. Instead, it means you write the matrix as a big column vector by stacking the columns on top of each other. Specifically, if you have a $n \times p$ real-valued matrix $\mathbf{X}$, then
\begin{equation}
\text{vec}(\mathbf{X}) =\begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_p \end{bmatrix}
\end{equation}
where $\mathbf{X}_i$ is the $i$th column as an $n \times 1$ column vector. There is another operator that we will use, the **Kronecker product:**
\begin{equation}
\mathbf{A} \otimes \mathbf{B} =
\begin{bmatrix}
a_{11} \mathbf{B} & \cdots & a_{1n} \mathbf{B} \\
\vdots & \ddots & \vdots \\
a_{m1} \mathbf{B} & \cdots & a_{mn} \mathbf{B} \\
\end{bmatrix}.
\end{equation}
If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $p \times q$, then $\mathbf{A} \otimes \mathbf{B}$ is $pm \times qn$.
a) Write a function called `vec(myMatrix)`. Its input should be one `matrix` object. Its output should be a `vector`. Hint: `matrix` objects are stored in column-major order.
b) Write a function called `unVec(myVector, nRows)` that takes in a vectorized matrix as a `vector`, splits it into pieces of `nRows` elements each, and then places those pieces together, shoulder to shoulder, as a `matrix`.
c) Write a function called `stackUp(m, BMat)` that returns $\mathbf{1}_m \otimes \mathbf{B}$ where $\mathbf{1}_m$ is a length $m$ column vector of ones. You may check your work with `%x%`, but do not use this in your function.
d) Write a function called `shoulderToShoulder(n, BMat)` that returns $\mathbf{1}^\intercal_n \otimes \mathbf{B}$ where $\mathbf{1}_n^\intercal$ is a length $n$ row vector of ones. You may check your work with `%x%`, but do not use this in your function.
4.
This problem uses the Militarized Interstate Disputes (v5.0) [@mid5] data set from [The Correlates of War Project](https://correlatesofwar.org/). There are four `.csv` files we use for this problem. `MIDA 5.0.csv` contains the essential attributes of each militarized interstate dispute from 1/1/1816 through 12/31/2014. `MIDB 5.0.csv` describes the participants in each of those disputes. `MIDI 5.0.csv` contains the essential elements of each militarized interstate incident, and `MIDIP 5.0.csv` describes the participants in each of those incidents.
a) Read in the four data sets and give them the names `mida`, `midb`, `midi`, and `midp`. Take care to convert all instances of `-9` to `NA`.
b) Examine all rows of `midb` where its `dispnum` column equals `2`. Do not change `midb` permanently. Do these two rows correspond to the same conflict? If so, assign `TRUE` to `sameConflict`. Otherwise, assign `FALSE`.
c) Join the first two data sets together on the dispute number column (`dispnum`). Call the resulting `data.frame` `join1`. Do not address any concerns about duplicate columns.
d) Is there any difference between doing an inner join and an outer join in the previous question? If there was a difference, assign `TRUE` to `theyAreNotTheSame`. Otherwise, assign `FALSE` to it.
e) Join the last two data sets together by `incidnum` and call the result `join2`. Is there any difference between an inner and an outer join for this problem? Why or why not? Do not address any concerns about duplicate columns.
f) The codebook mentions that the last two data sets don't go as far back in time as the first two. Suppose then that we only care about the events in `join2`. Merge `join2` and `join1` in a way where all undesired rows from `join1` are discarded, and all rows from `join2` are kept. Call the resulting `data.frame` `midData`. Do not address any concerns about duplicate columns.
g) Use a scatterplot to display the relationship between the maximum duration and the end year. Plot each country as a different color.
h) Create a `data.frame` called `longData` that has the following three columns from `midp`: `incidnum` (incident identification number), `stabb` (state abbreviation of participant), and `fatalpre` (precise number of fatalities). Convert this to "wide" format. Call the new table `wideData`. Use the incident number column as the unique row-identifying variable.
i) Bonus Question: identify all column pairs that contain duplicate information in `midData`, remove all but one of the columns, and change the column name back to its original name.
### Python Questions
1.
Once again, recall the `"car.data"` data set [@misc_car_evaluation_19].
a) Read in the data set as `car_data`.
b) Order the data by the third and then the fourth column. Do not change the data in place. Instead, store the result under the name `ord_car_data1`.
c) Order the data by the fourth and then the third column. Do not change the data in place. Instead, store the result under the name `ord_car_data2`.
2.
Consider the following random data set.
```{python, eval=FALSE}
indexes = np.random.choice(np.arange(20),size=20,replace=False)
d1 = pd.DataFrame({'a' : indexes,
'b' : np.random.normal(size=20)})
d2 = pd.DataFrame({'a' : indexes + 20,
'b' : np.random.normal(size=20)})
```
a) Pretend `d1` and `d2` are two separate data sets that possess the same type of measures but on different experimental units. Stack `d1` on top of `d2` and call the result `stacked_data_sets`. Make sure the `index` of the result is the numbers $0$ through $39$.
b) Pretend `d1` and `d2` are different measurements on the same experimental units. Place them shoulder to shoulder and call the result `side_by_side_data_sets`. Put `d1` first, and `d2` second.
3.
Consider the following two data sets:
```{python, collapse = TRUE, eval=F}
import numpy as np
import pandas as pd
dog_names1 = ['Charlie','Gus', 'Stubby', 'Toni','Pearl']
dog_names2 = ['Charlie','Gus', 'Toni','Arya','Shelby']
nicknames = ['Charles','Gus The Bus',np.nan,'Toni Bologna','Porl']
breed_names = ['Black Lab','Beagle','Golden Retriever','Husky',np.nan]
dataset1 = pd.DataFrame({'dog': dog_names1,
'nickname': nicknames})
dataset2 = pd.DataFrame({'dog':dog_names2,
'breed':breed_names})
```
a) Join/merge the two data sets together in such a way that there is a row for every dog, whether or not both tables have information for that dog. Call the result `merged1`.
b) Join/merge the two data sets together in such a way that there are only rows for every dog in `dataset1`, whether or not there is information about these dogs' breeds. Call the result `merged2`.
c) Join/merge the two data sets together in such a way that there are only rows for every dog in `dataset2`, whether or not there is information about the dogs' nicknames. Call the result `merged3`.
d) Join/merge the two data sets together in such a way that all rows possess complete information. Call the result `merged4`.
4.
Let's consider Fisher's "Iris" data set [@misc_iris_53] again.
a) Read in `iris.csv` and store the `DataFrame` with the name `iris`. Let it have the column names `'a'`,`'b'`,`'c'`, `'d'` and `'e'`.
b) Create a `DataFrame` called `name_key` that stores correspondences between long names and short names. It should have three rows and two columns. The long names are the unique values of column five of `iris`. The short names are either `'s'`, `'vers'` or `'virg'`. Use the column names `'long name'` and `'short name'`.
c) Merge/join the two data sets together to give `iris` a new column with information about short names. Do not overwrite `iris`. Rather, give the `DataFrame` a new name: `iris_with_short_names`. Remove any columns with duplicate information.
d) Change the first four column names of `iris_with_short_names` to `s_len`, `s_wid`, `p_len`, and `p_wid`. Use Matplotlib to create a figure with 4 subplots arranged into a $2 \times 2$ grid. On each subplot, plot a histogram of these four columns. Make sure to use x-axis labels so viewers can tell which column is being plotted in each subplot.
e) Let's go back to `iris`. Change that to long format. Store it as a `DataFrame` called `long_iris`. Make the column names `row`, `variable` and `value`, in that order. Last, make sure it is sorted (simultaneously/once) by `row` and then `variable`.