---
title: "RStudio & Redshift Connection"
---
Creating an ODBC Connection with Amazon Redshift Serverless
```{r}
library(DBI)
library(reticulate)

# Point reticulate at the Python interpreter that has boto3 installed
path_to_python <- system("which python", intern = TRUE)
use_python(path_to_python)

boto3 <- import('boto3')
client <- boto3$client('redshift-serverless')

# Look up the Redshift Serverless workgroup and namespace, then request
# temporary database credentials (valid here for one hour)
workgroup <- unlist(client$list_workgroups())
namespace <- unlist(client$get_namespace(namespaceName=workgroup$workgroups.namespaceName))
creds <- client$get_credentials(dbName=namespace$namespace.dbName,
                                durationSeconds=3600L,
                                workgroupName=workgroup$workgroups.workgroupName)

# Open the ODBC connection using the temporary credentials
con <- dbConnect(odbc::odbc(),
                 Driver='redshift',
                 Server=workgroup$workgroups.endpoint.address,
                 Port='5439',
                 Database=namespace$namespace.dbName,
                 UID=creds$dbUser,
                 PWD=creds$dbPassword)
```
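The credentials returned by get_credentials() are temporary (3,600 seconds in the call above), so it is worth knowing how to check that the connection is still usable and how to close it when you are finished. A minimal sketch using standard DBI functions:
```{r}
# Returns TRUE while the connection is still usable
dbIsValid(con)

# Close the connection when you are done with it; commented out here because
# the rest of this walkthrough keeps using `con`.
# dbDisconnect(con)
```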
Now we will use the DBI package's dbListTables() function to view the existing tables.
```{r}
dbListTables(con)
```
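If the database contains many schemas, the odbc backend's dbListTables() method also accepts optional filter arguments. A small sketch, assuming the odbc-specific schema_name argument (not part of DBI itself):
```{r}
# Restrict the listing to the synthetic schema (odbc-specific argument)
dbListTables(con, schema_name = "synthetic")
```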
Use dbGetQuery() to pass an SQL query to the database.
```{r}
dbGetQuery(con, "select * from synthetic.users limit 100")
dbGetQuery(con, "select * from synthetic.cards limit 100")
dbGetQuery(con, "select * from synthetic.transactions limit 100")
```
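dbGetQuery() also supports parameterized queries through its params argument, which is safer than pasting values into the SQL string. A small sketch, assuming the ODBC-style ? placeholder and an example year value:
```{r}
# Bind the year value at execution time instead of interpolating it into the SQL
dbGetQuery(con,
           "select * from synthetic.transactions where year = ? limit 10",
           params = list(2019L))
```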
We can also use the dplyr and dbplyr packages to execute queries directly in the database. Let's count() how many transactions are in the transactions table. But first we need to install these packages.
```{r}
install.packages(c("dplyr", "dbplyr", "crayon"))
```
Use the tbl() function while specifying the schema.
```{r}
library(dplyr)
library(dbplyr)
users_tbl <- tbl(con, in_schema("synthetic", "users"))
cards_tbl <- tbl(con, in_schema("synthetic", "cards"))
transactions_tbl <- tbl(con, in_schema("synthetic", "transactions"))
```
Let's run a count of the number of rows for each table.
```{r}
count(users_tbl)
count(cards_tbl)
count(transactions_tbl)
```
So we have 2,000 users, 6,146 cards, and 24,386,900 transactions. We can also view the tables in the console.
```{r}
transactions_tbl
```
We can also view the SQL that the dplyr verbs are generating under the hood.
```{r}
show_query(transactions_tbl)
```
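show_query() works on any lazy query we compose, so it is an easy way to preview the SQL a dplyr pipeline will send to Redshift before executing it. For example, the count-by-year query used in the next step translates as follows:
```{r}
# Preview the SQL generated for a grouped count without running it
transactions_tbl %>%
  count(year) %>%
  show_query()
```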
Let's visually explore the number of transactions by year.
```{r}
transactions_by_year <- transactions_tbl %>%
  count(year) %>%
  arrange(year) %>%
  collect()
transactions_by_year
```
```{r}
install.packages(c('ggplot2', 'vctrs'))
```
```{r}
library(ggplot2)
ggplot(transactions_by_year) +
  geom_col(aes(year, as.integer(n))) +
  ylab('transactions')
```
We can also summarize data in the database as follows:
```{r}
transactions_tbl %>%
  group_by(is_fraud) %>%
  count()
```
```{r}
transactions_tbl %>%
  group_by(merchant_category_code, is_fraud) %>%
  count() %>%
  arrange(merchant_category_code)
```
Suppose we want to view fraud using card information. We just need to join the tables and then group by the attribute.
```{r}
cards_tbl %>%
  left_join(transactions_tbl, by = c("user_id", "card_id")) %>%
  group_by(card_brand, card_type, is_fraud) %>%
  count() %>%
  arrange(card_brand)
```
Now let's prepare a dataset that could be used for machine learning. Let's filter the transaction data to just include Discover credit cards while only keeping a subset of columns.
```{r}
discover_tbl <- cards_tbl %>%
  filter(card_brand == 'Discover', card_type == 'Credit') %>%
  left_join(transactions_tbl, by = c("user_id", "card_id")) %>%
  select(user_id, is_fraud, merchant_category_code, use_chip, year, month, day, time_stamp, amount)
```
We will clean the dataset using the following transformations:

- Convert is_fraud to a binary attribute.
- Remove the "Transaction" string from use_chip and rename it to type.
- Combine year, month, and day into a date object.
- Remove $ from amount and convert it to a numeric data type.
```{r}
library(stringr)  # for str_remove() and str_trim()

discover_tbl <- discover_tbl %>%
  mutate(is_fraud = ifelse(is_fraud == 'Yes', 1, 0),
         type = str_remove(use_chip, 'Transaction'),
         type = str_trim(type),
         type = tolower(type),
         date = paste(year, month, day, sep = '-'),
         date = as.Date(date),
         amount = str_remove(amount, '[$]'),
         amount = as.numeric(amount)) %>%
  select(-use_chip, -year, -month, -day)
```
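Note that discover_tbl is still a lazy query at this point, so these transformations are translated to SQL and run in Redshift rather than in R. If you want to confirm the translation before collecting (assuming dbplyr can translate each of the functions used above), show_query() can be used here as well:
```{r}
# Sanity check: inspect the SQL generated for the cleaning pipeline
show_query(discover_tbl)
```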
Now that we have filtered and cleaned our dataset, we are ready to collect it into local RAM.
```{r}
discover <- collect(discover_tbl)
summary(discover)
```
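If you want to reuse the collected data frame in later sessions without re-running the query, one option is to save it locally; the file name below is just an example.
```{r}
# Persist the collected data frame with base R; reload it later with readRDS()
saveRDS(discover, "discover.rds")
# discover <- readRDS("discover.rds")
```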
Now we have a working dataset to start creating features and fitting models. We will not cover those steps in this blog as we want to highlight how to work with database tables in Redshift to prepare a dataset to bring into our local environment.