-
Notifications
You must be signed in to change notification settings - Fork 0
/
p8105_hw2_sl5232.Rmd
174 lines (137 loc) · 8 KB
/
p8105_hw2_sl5232.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Homework 2"
author: "Shangsi Lin"
date: "2022/10/01"
output: github_document
---
```{r load_libraries, message = FALSE}
library(tidyverse)
library(readxl)
```
## Problem 1
Import data from `NYC_Transit_Subway_Entrance_And_Exit_Data.csv`. Then clean variable names and selects the interested column of variables. Using the `recode()` argument to change the entry variable from `YES` / `NO` to a logical variable. Also change the route 8-11 data type from double to character as the route 1-7 data.
```{r chunk_clean}
transit_df =
read_csv(
"data/NYC_Transit_Subway_Entrance_And_Exit_Data.csv",
col_types = cols(Route8 = "c", Route9 = "c", Route10 = "c", Route11 = "c")) %>%
janitor::clean_names() %>%
select(
line, station_name, station_latitude, station_longitude,
starts_with("route"), entry, exit_only, vending, entrance_type,
ada) %>%
mutate(entry = recode(entry, "YES" = TRUE, "NO" = FALSE))
```
These data are not tidy, after the cleaning above, the current dataset contains variables including line, station name, station latitude, station longitude, route information, entry, exit only or not, vending, entrance type, and ADA compliance. The dataset dimension is 1868 x 20. It would be helpful to change rount numbers and route as variables for investigations interested on specific routes.
To find out the number of unique stations, select station name and line and use `distinct()` to find all unique combinations that each represents a unique station. The number of rows represent the number of unique stations.
```{r chunk_unique}
transit_df %>%
select(station_name, line) %>%
distinct
```
Turns out, there are 465 unique stations.
To find out the number of ADA compliant stations, filter `ada == TRUE` first and then select unique stations the same as before. The number of rows represent the number of ADA compliant stations.
```{r chunk_ADA}
transit_df %>%
filter(ada == TRUE) %>%
select(station_name, line) %>%
distinct
```
Turns out, there are 84 ADA compliant stations.
To find out the proportion of station entrances / exits without vending allow entrance, first filter stations that don't allow vending, and then take the mean of the `entry` variable. The result is the parameter of interest.
```{r chunk_proportion}
transit_df %>%
filter(vending == "NO") %>%
pull(entry) %>%
mean
```
Turns out, the proportion of station entrances / exits without vending allow entrance is about 0.377
Reformat data so that route number and route name are distinct variables, then identify stations that serve the A train and how many of those are ADA compliant using the `filter`, `select`, and `distinct` arguments.
```{r chunk_reformat}
reformat_df = transit_df %>%
pivot_longer(
route1:route11,
names_to = "route_number",
values_to = "route")
reformat_df %>%
filter(route == "A") %>%
select(station_name, line) %>%
distinct
reformat_df %>%
filter(route == "A", ada == TRUE) %>%
select(station_name, line) %>%
distinct
```
Turns out, there are 60 stations that serve the A train and 17 stations that serve the A train while being ADA compliant.
## Problem 2
Import two data sheets from `Trash-Wheel-Collection-Totals-7-2020-2.xlsx`, clean the names od the variables and filter out only dumpster-specific data. Round the number of sports balls to the nearest integer and converts the result to an integer variable. Rename variable name "dumpster" into "dumpster_number" for clarity, and used brackets to mark units for weight and volume. Add a new variable called `wheel_name` to keep track of which trash wheel does the data belong to.
```{r chunk_clean2}
mr_trash_df =
read_excel("data/Trash Wheel Collection Data.xlsx", sheet = 1, range = "A2:N549") %>%
janitor::clean_names() %>%
filter(dumpster != "NA") %>%
mutate(sports_balls = as.integer(round(sports_balls)),
year = as.double(year)) %>%
mutate(wheel_name = "mr_trash") %>%
rename("dumpster_number" = dumpster,
"weight(tons)" = weight_tons,
"volume(cubic yards)" = volume_cubic_yards)
professor_trash_df =
read_excel("data/Trash Wheel Collection Data.xlsx", sheet = 2, range = "A2:M96") %>%
janitor::clean_names() %>%
filter(dumpster != "NA") %>%
mutate(wheel_name = "professor_trash") %>%
rename("dumpster_number" = dumpster,
"weight(tons)" = weight_tons,
"volume(cubic yards)" = volume_cubic_yards)
```
Combine both data sheets to produce a single tidy dataset .
```{r}
trash_wheel_df =
bind_rows(mr_trash_df, professor_trash_df) %>%
arrange(dumpster_number, increasing = TRUE) %>%
relocate(dumpster_number, wheel_name, everything())
```
```{r chunk_calc2}
total_weight = pull(professor_trash_df, "weight(tons)") %>%
sum
filter_df = mr_trash_df %>%
filter(year == '2020')
ball = pull(filter_df, sports_balls) %>%
sum
```
The total number of observations in the resulting dataset is `r nrow(trash_wheel_df)`, key variables include the dumpster number, the date, weight in tons, volume in cubic yards, types of trash collected and the amount of each type of trash collected, and homes the trash wheel powered. The total weight of trash collected by Professor Trash Wheel is `r total_weight` tons. The total number of sports balls collected by Mr.Trash Wheel in 2020 was `r ball`.
## Problem 3
Import and clean the data in pols-month.csv, snp.csv, and unemployment.csv. For pols_df, break the variable mon into interger variables year, month, and day. Replace month number with month name, and created a `president` variable taking values `gop` and `dem`. Removed `prez_dem`, `prez_gop`, and the dat variable. Cleaned snp_df with a similar fashion, and organized so variables `year` and `month` are in the leading columns. Tidy the unemployment data by switching it from "wide" to "long" format and make the month variable in the same expression style as the other two data frames.
```{r load_problem3}
pols_df =
read_csv('data/fivethirtyeight_datasets/pols-month.csv') %>%
janitor::clean_names() %>%
separate(mon, c('year', 'month', 'day'), convert = TRUE) %>%
mutate(year = as.integer(year),month = as.integer(month),day = as.integer(day)) %>%
mutate(month = month.name[month], president = recode(prez_gop, "1" = "gop", "0" = "dem")) %>%
select(-c("prez_gop", "prez_dem", "day"))
snp_df =
read_csv("data/fivethirtyeight_datasets/snp.csv") %>%
janitor::clean_names() %>%
separate(date, c('month', 'day', 'year'), convert = TRUE) %>%
mutate(year = if_else(year > 16 , year + 1900, year + 2000),
month = month.name[as.integer(month)],
day = as.integer(day)) %>%
select(year, month, everything(),-c("day"))
umemployment_df =
read_csv("data/fivethirtyeight_datasets/unemployment.csv") %>%
pivot_longer(
cols = Jan:Dec,
names_to = "month",
values_to = "unemployment rate") %>%
mutate(month = month.name[match(month,month.abb)]) %>%
janitor::clean_names()
```
Merge the snp_df into the pols_df, and later merge the umemployment_df into the result.
```{r chunk_result}
merged_df = left_join(pols_df, snp_df, by = c("year", "month"))
result_df = left_join(merged_df, umemployment_df, by = c("year", "month"))
```
The pols-month dataset contained information about the number of government officials in different roles belonging to either dem or gop side at different given times. The snp dataset contained information about the closing values of the S&P stock index on the associated dates. The unemployment dataset contained information on the percentage of unemployment in different months of the associated year.
The resulting dataset has `r nrow(result_df)` rows and `r ncol(result_df)` columns. The range of years it contains is from `r min(pull(result_df, year))` to `r max(pull(result_df, year))`, which is `r max(pull(result_df, year))-min(pull(result_df, year))` years. The names of key variables it contains are the year, month, number of government officials from either gop or dem parties at different roles, the unemployment rate and the closing values of the S&P stock index on the associated dates.