```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```
# Data Wrangling, Part 2
We will continue to learn about data analysis with Dataframes. Let's load our three Dataframes from the DepMap project again:
```{python}
import pandas as pd
import numpy as np
metadata = pd.read_csv("classroom_data/metadata.csv")
mutation = pd.read_csv("classroom_data/mutation.csv")
expression = pd.read_csv("classroom_data/expression.csv")
```
## Creating new columns
Often, we want to perform some kind of transformation on our data's columns: perhaps we want to add the values of two columns together, or represent a column on a different scale.
To create a new column, simply assign to it as if it already exists using the bracket operation `[ ]`, and the column will be created:
```{python}
metadata['AgePlusTen'] = metadata['Age'] + 10
expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp']
expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp'])
```
where [`np.log(x)`](https://numpy.org/doc/stable/reference/generated/numpy.log.html) is a function from the NumPy module that takes in a numeric value and returns its natural log.
Note: you cannot create a new column by referring to it as an attribute of the Dataframe, such as: `expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)`.
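A quick way to convince yourself of this (a small check, not part of the original analysis) is to try the attribute assignment and see that no such column appears in `expression.columns`:
```{python}
# This only sets a plain Python attribute (pandas will warn); no column is created.
expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)
print('KRAS_Exp_log' in expression.columns)  # False -- use expression['KRAS_Exp_log'] instead
```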
## Merging two Dataframes together
Suppose we have the following Dataframes:
`expression`
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
|--------------|------------|----------------|
| "ACH-001113" | 5.138733 | 1.636806 |
| "ACH-001289" | 3.184280 | 1.158226 |
| "ACH-001339" | 3.165108 | 1.152187 |
`metadata`
| ModelID | OncotreeLineage | Age |
|--------------|-----------------|-----|
| "ACH-001113" | "Lung" | 69 |
| "ACH-001289" | "CNS/Brain" | NaN |
| "ACH-001339" | "Skin" | 14 |
Suppose that we want to examine the relationship between `OncotreeLineage` and `PIK3CA_Exp`, but they are columns in different Dataframes. We want a new Dataframe that looks like this:
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
|--------------|------------|----------------|-----------------|-----|
| "ACH-001113" | 5.138733 | 1.636806 | "Lung" | 69 |
| "ACH-001289" | 3.184280 | 1.158226 | "CNS/Brain" | NaN |
| "ACH-001339" | 3.165108 | 1.152187 | "Skin" | 14 |
We see that in both dataframes,
- the rows (observations) represent cell lines.
- there is a common column `ModelID`, with shared values between the two dataframes that can facilitate the merging process. We call this an **index**.
We will use the method [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index.
```{python}
merged = metadata.merge(expression)
```
It's usually better to specify the index column explicitly to avoid ambiguity, using the optional `on` argument:
```{python}
merged = metadata.merge(expression, on='ModelID')
```
If the index columns of the two Dataframes are named differently, you can specify the column name for each Dataframe:
```{python}
merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID')
```
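To make `left_on` / `right_on` more concrete, here is a hypothetical sketch in which `expression`'s ID column is renamed to `CellLineID` (the rename is just for illustration):
```{python}
# Hypothetical: pretend expression's ID column were called 'CellLineID' instead.
expression_renamed = expression.rename(columns={'ModelID': 'CellLineID'})
# left_on names the key column in metadata; right_on names the key column in expression_renamed.
merged_renamed = metadata.merge(expression_renamed, left_on='ModelID', right_on='CellLineID')
```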
One of the most important checks you should do when merging Dataframes is to look at the number of rows and columns before and after merging to see whether the result makes sense:
The number of rows and columns of `metadata`:
```{python}
metadata.shape
```
The number of rows and columns of `expression`:
```{python}
expression.shape
```
The number of rows and columns of `merged`:
```{python}
merged.shape
```
We see that the number of *columns* in `merged` is the combined number of columns in `metadata` and `expression` (with the shared `ModelID` column counted once), while the number of *rows* in `merged` is at most the smaller of the number of rows in `metadata` and `expression`: it only keeps rows whose index value is found in both Dataframes' index columns. This kind of join is called an "inner join", because in the Venn diagram of the two index columns, we keep only the inner overlap:
![](images/join.png)
You can specify the join style by changing the optional input argument `how`; a short example follows the list below.
- `how = "outer"` keeps all observations - also known as a "full join"
- `how = "left"` keeps all observations in the left Dataframe.
- `how = "right"` keeps all observations in the right Dataframe.
- `how = "inner"` keeps observations common to both Dataframe. This is the default value of `how`.
## Grouping and summarizing Dataframes
In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in `metadata`, the observations are cell lines, and perhaps we want to group cell lines by their respective cancer type, `OncotreeLineage`, and look at the mean age for each cancer type.
We want to take `metadata`:
| ModelID | OncotreeLineage | Age |
|--------------|-----------------|-----|
| "ACH-001113" | "Lung" | 69 |
| "ACH-001289" | "Lung" | 23 |
| "ACH-001339" | "Skin" | 14 |
| "ACH-002342" | "Brain" | 23 |
| "ACH-004854" | "Brain" | 56 |
| "ACH-002921" | "Brain" | 67 |
into:
| OncotreeLineage | MeanAge |
|-----------------|---------|
| "Lung" | 46 |
| "Skin" | 14 |
| "Brain" | 48.67 |
To get there, we need to:
- **Group** the data based on some criteria, such as the elements of `OncotreeLineage`.
- **Summarize** each group via a summary statistic performed on a column, such as `Age`.
We use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`.
```{python}
metadata_grouped = metadata.groupby("OncotreeLineage")
metadata_grouped['Age'].mean()
```
Here's what's going on:
- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a **Grouped Dataframe object**. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped.
- We subset to the column `Age`. The grouping information still persists (this is a Grouped Series object; see the quick check after this list).
- We use the method `.mean()` to calculate the mean value of `Age` within each group defined by `OncotreeLineage`.
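If you are curious about these intermediate objects, you can inspect their types (a quick check, not part of the original lesson):
```{python}
# The grouped objects are not regular Dataframes/Series -- they record the grouping.
print(type(metadata_grouped))
print(type(metadata_grouped['Age']))
```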
Alternatively, this could have been done in a chain of methods:
```{python}
metadata.groupby("OncotreeLineage")["Age"].mean()
```
Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned last week, such as `.mean()`, `.median()`, and `.max()`, can be used. One new summary statistic method that is useful for this grouping and summarizing analysis is [`.count()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html), which tells you how many non-missing entries there are within each group.
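For instance, a short sketch counting how many cell lines have a recorded `Age` within each cancer type:
```{python}
# .count() tallies the non-missing Age entries in each OncotreeLineage group.
metadata.groupby("OncotreeLineage")["Age"].count()
```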
### Optional: Multiple grouping, Multiple columns, Multiple summary statistics
Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously.
For example, you may want to group by a combination of `OncotreeLineage` and `AgeCategory`, such as "Lung" and "Adult" as one grouping. You can do so like this:
```{python}
metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"])
metadata_grouped['Age'].mean()
```
You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the [`.agg(x)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) method on a Grouped Dataframe.
For example, coming back to our age case-control Dataframe,
```{python}
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})
df
```
We group by `status` and summarize `age_case` and `age_control` with a few summary statistics each:
```{python}
df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]})
```
The input argument to the `.agg(x)` method is a [Dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries), which lets you structure information as key-value pairs. You can learn more about dictionaries at the linked Python tutorial.
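As a minimal standalone illustration (not part of the original analysis), a dictionary pairs each key with a value, which is how `.agg(x)` knows which summary statistics to apply to which column:
```{python}
# Keys are column names; values are the summary statistic(s) to compute.
summary_spec = {"age_case": "mean", "age_control": ["min", "max", "mean"]}
summary_spec["age_control"]
```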
## Exercises
Exercise for week 4 can be found [here](https://colab.research.google.com/drive/1ntkUdKQ209vu1M89rcsBst-pKKuwzdwX?usp=sharing).