---
title: Introduction to Python and Pandas
author: Kevin Nota, Robin Warner, and Maxime Borry
---

:::  {.callout-note} 
This session is typically held in parallel with the Introduction to R and Tidyverse. Participants of the summer schools choose which to attend based on their prior experience. We recommend the [introduction to R session](r-tidyverse.qmd) if you have no experience with either R or Python.
:::

::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.

To do this, use `wget` or right-click and save to download this Zenodo archive: [10.5281/zenodo.11394586](https://doi.org/10.5281/zenodo.11394586), and unpack it:

```bash
tar xvf python-pandas.tar.gz 
cd python-pandas/
```

You can then create and subsequently activate the environment with

```bash
conda env create -f python-pandas.yml
conda activate python-pandas
```
:::

Over the last few years, _Python_ has gained popularity thanks to the numerous libraries (packages with pre-written functions) in bioinformatics, statistical data analysis, and machine learning.
A few years ago it was often necessary to turn to _R_ for routine data manipulation and analysis tasks, but nowadays _Python_ has a vast ecosystem of useful libraries for working on metagenomic data.
Libraries exist for many of the file formats encountered in metagenomics, such as fasta, fastq, sam, and bam.
Furthermore, Python is fast and extremely useful for writing programs that, like many existing tools, can be easily called from the command line.

This tutorial/walkthrough provides a short introduction to `pandas` ([https://pandas.pydata.org/](https://pandas.pydata.org/)), a popular library for data analysis.
This library has functions for reading and manipulating _tabular data_, similar to the _`data.frame()`_ in _R_, together with some basic data plotting.
This will set the base for learning Python and using it for data analysis.

There are many IDEs in which Python code can be written.
For data analysis, Jupyter is a powerful and popular choice. It looks and functions similarly to R markdown: code is written in code blocks, with space in text blocks for annotations.
In this tutorial/walkthrough, we will use Jupyter notebooks for running and visualising Python code.

Learning objectives:

- Get familiar with the Python code syntax and use Jupyter Notebook for executing code
- Get a kickstart to utilising the endless possibilities of data analysis in Python that can be applied to our data

## Working in a Jupyter environment

This tutorial/walkthrough uses a Jupyter Notebook ([https://jupyter.org](https://jupyter.org)) for writing, executing, and annotating Python code.

Jupyter notebooks have two types of cells: _Markdown_ and _Code_.
The _Markdown cell_ syntax is very similar to _R markdown_.
The markdown cells are used for annotating code, which is important for sharing work with collaborators, reproducibility, and documentation.

Change the directory to the working directory of this tutorial/walkthrough.

```bash
cd python-pandas_lecture/
```

To launch Jupyter, run the following command in the terminal.
This will open a browser window with Jupyter running.

```bash
jupyter notebook
```

Jupyter Notebook should show a file browser with all the files from the working directory.
Open the `student-notebook.ipynb` notebook by clicking on it.
This notebook has exactly the same code as written in this book chapter and is only a support so that it is not necessary to copy and paste the code.
It is of course also possible to copy the code from this chapter into a fresh notebook file by clicking on:  `File` > `New` > `Notebook`.

::: {.callout-note title="If the notebook is not there" collapse="true"}
If you cannot find `student-notebook.ipynb`, it is possible the working directory is not correct.
Make sure that `pwd` returns `/<path>/<to>/python-pandas/python-pandas_lecture`.
:::

### Creating and running cells

There are multiple ways of making a new cell in Jupyter, such as pressing the `b` key, or clicking the area at the bottom of the page that says _`click to add cell`_.
Cells can be set to `code` or `markdown` using the drop-down menu at the top.
Code cells are always in edit mode.
Code can be run by pressing `Shift + Enter` or by clicking the `▶` button.
To edit a markdown cell, double-click on it; it switches from display mode to edit mode.
To leave edit mode, run the cell.

::: {.callout-tip collapse="true" title="Clear your code cells"}
Before starting it might be nice to clear the output of all code cells, by clicking on:

 `edit` > `Clear outputs of All Cells`
:::

### Markdown cell syntax

Here are a few examples of the syntax for Markdown cells, such as making words **bold** or _italic_.
For a more comprehensive list of syntax, check out this Jupyter Notebook cheat-sheet ([https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet)).

List of _markdown cell_ examples:

- `**bold**` : **bold**
- `_italics_` : _italics_

Code

- \`inline code\` : `inline code`
 
LaTeX maths

- `$ x = \frac{\pi}{42} $` : $$ x = \frac{\pi}{42} $$

URL links

- `[link](https://www.python.org/)` : [link](https://www.python.org/)

Images

- `![](https://www.spaam-community.org/assets/media/SPAAM-Logo-Full-Colour_ShortName.svg)` ![](https://www.spaam-community.org/assets/media/SPAAM-Logo-Full-Colour_ShortName.svg)

::: {.callout-note collapse="true" title="All roads lead to Rome"}
In many cases, there are multiple syntaxes, or 'ways of doing things,' that will give the same results.
For each section in this tutorial/walkthrough, one way is presented.
:::

### Code cell syntax

The _code cells_ can interpret many different coding languages including _Python_ and _Bash_.
The syntax of the code cells is the same as the syntax of the coding language, in our case _Python_.

Below are some examples of Python _code cells_ with some useful basic Python functions:

::: {.callout-tip collapse="true" title="Python function print()"}
`print()` is a Python function for printing lines to the terminal or, in Jupyter, below the code cell.

`print()` plays the same role as `echo` in bash.
:::

```{.python eval=False}
print("Hello World from Python!")
```

::: {.callout-note collapse="true"}
## Expand to see output
```
Hello World from Python!
```
:::
::: {.callout-tip collapse="true" title="Running bash code in Jupyter"}
It is also possible to run bash commands in Jupyter, by adding a *!* at the start of the line.

```{.python eval=False}
! echo "Hello World from bash!"
```
```
Hello World from bash!
```
:::

Strings or numbers can be stored in a variable by using the `=` sign.

```{.python eval=False}
i = 0
```

Once a variable is set in one _code cell_, it is stored and can be accessed in other downstream _code cells_.

To see what value a variable contains, the `print()` function can be used.

```{.python eval=False}
print(i)
```
::: {.callout-note collapse="true"}
## Expand to see output
```
0
```
:::

You can also print multiple things together in one `print` statement such as a number and a string.

```{.python eval=True}
print("The number is", i, "Wow!")
```
::: {.callout-note collapse="true"}
## Expand to see output
```
The number is 0 Wow!
```
:::
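
As a small sketch of how `print()` joins its arguments: by default they are separated by a single space, which can be changed with the `sep` argument. An f-string is another way to build the same line. The variable `message` is only an illustrative name and is not used elsewhere in this chapter.

```python
i = 0

# print() separates its arguments with a single space by default
print("The number is", i, "Wow!")

# the separator can be changed with the sep argument
print("The number is", i, "Wow!", sep=" | ")

# an f-string builds the whole line as one string before printing
message = f"The number is {i} Wow!"
print(message)
```

The f-string form is often easier to read once several variables are mixed into one sentence.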

## Pandas

### Getting started

Pandas is a Python library used for data manipulation and analysis.

We can import the library like this.

```{.python eval=False}
import pandas as pd
```

::: {.callout-tip collapse="true"}
## Why import as pd?
We set `pandas` to the alias `pd` because we are lazy and do not want to write the full word too many times.
:::

Now that `pandas` is imported, we can check that it worked correctly and see which version is running by accessing the `.__version__` attribute.

```{.python eval=False}
pd.__version__
```
::: {.callout-note collapse="true"}
## Expand to see output
'2.2.2'
:::

### Pandas data structures

The primary data structures in `Pandas` are the `Series` and the `DataFrame`.
A `Series` is a one-dimensional array-like object containing values of the same type, and can be imagined as one column in a table (@fig-pythonpandas-figpandasseries).
Each element in the `Series` is associated with an index, by default running from 0 to the number of elements minus one, but this can be changed to labels.
A `DataFrame` is two-dimensional and can hold different types of data, such as numbers and strings, in different columns (@fig-pythonpandas-figpandasdataframe).
Its size can change after it is created by adding and removing rows and columns.
The columns and rows are labelled.
By default, rows are unnamed and are indexed in the same way as a `Series`.

![A single row or column (1-dimensional data) is a `Series`. The dark grey squares are the index or row names, and the light grey squares are the elements.](assets/images/chapters/python-pandas/01_table_series.svg){#fig-pythonpandas-figpandasseries}

![A dataframe with columns and rows. The dark grey squares are the index/row names and the column names. The light grey squares are the values.](assets/images/chapters/python-pandas/01_table_dataframe.svg){#fig-pythonpandas-figpandasdataframe}

::: {.callout-tip collapse="true" title="More details on pandas"}
For a more detailed pandas getting started tutorial, see the official guide ([https://pandas.pydata.org/docs/getting_started/index.html#](https://pandas.pydata.org/docs/getting_started/index.html#)).
:::
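
To make the two data structures more concrete, here is a minimal sketch that builds a `Series` and a `DataFrame` by hand. All names and values (`ages`, `people`, the cities) are invented for illustration and are not part of the chapter's dataset.

```python
import pandas as pd

# a Series: one-dimensional, values of one type, default index 0..n-1
ages = pd.Series([34, 51, 27])
print(ages.index.tolist())

# the default index can be replaced by labels
ages.index = ["anna", "ben", "cara"]
print(ages["ben"])

# a DataFrame: two-dimensional, columns may hold different types
people = pd.DataFrame({"age": [34, 51, 27], "city": ["Leipzig", "Jena", "Tartu"]})

# columns can be added after the DataFrame is created
people["has_income"] = [True, False, True]
print(people.shape)
print(people.dtypes)
```

Each column of `people` is itself a `Series`, which is why the two structures share most of their functions.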

## Reading data with Pandas

Pandas can read in _csv_ (comma-separated values) files, which are tables in text format.
The format is called _csv_ because each value is separated from the others by a comma.

```verbatim
A,B
5,6
8,4
```

Another common tabular format is _tsv_ (tab-separated values), where each value is separated by a _tab_ (`\t`).

```verbatim
A\tB
5\t6
8\t4
```

The dataset that is used in this tutorial/walkthrough is called `"all_data.tsv"`, and is tab-separated.
Pandas by default assumes that the file is comma-delimited, but this can be changed by using the `sep=` argument.

::: {.callout-tip collapse="true" title="Pandas function pd.read_csv()"}
`pd.read_csv()` is the pandas function for reading in tabular text files.
The separator can be specified with the `sep=` argument; `sep=","` is the default.
:::

```{.python eval=False}
df = pd.read_csv("../all_data.tsv", sep="\t")
df
```
::: {.callout-note collapse="true"}
## Expand to see output

```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_1_all_data.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
:::

::: {.callout-tip collapse="true"}
## Help
When you are unsure what arguments a function can take, you can view its _help documentation_ using `help(pd.read_csv)`.
:::
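
The effect of the `sep=` argument can be tried without any files. The sketch below feeds the two small example tables from above to `pd.read_csv()` as text, using `io.StringIO` from the Python standard library to make a string behave like a file. The variable names are illustrative only.

```python
import io
import pandas as pd

csv_text = "A,B\n5,6\n8,4\n"
tsv_text = "A\tB\n5\t6\n8\t4\n"

# io.StringIO lets pandas read a string as if it were a file
from_csv = pd.read_csv(io.StringIO(csv_text))             # sep="," is the default
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")   # tab-separated

# both calls give the same table with two rows and two columns
print(from_csv.equals(from_tsv))

# forgetting sep="\t" squeezes every line into a single column
wrong = pd.read_csv(io.StringIO(tsv_text))
print(wrong.shape)
```

A table with only one column after reading is therefore a common sign that the wrong separator was used.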

In most cases, data will be read in with the `pd.read_csv()` function; however, internal Python data structures can also be transformed into a pandas `DataFrame`.
For example, a nested list can be used, where each row in the `DataFrame` is a list `[]`.

```{.python eval=False}
# stored under a new name so that the imported df is not overwritten
df_from_list = pd.DataFrame([[5, 6], [8, 4]], columns=["A", "B"])
df_from_list
```
::: {.callout-note collapse="true"}
## Expand to see output

| |A|B|
|-|-|-|
|0|5|6|
|1|8|4|
:::

Another useful transformation is from a dictionary ([https://docs.python.org/3/tutorial/datastructures.html#dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)) to a `pd.DataFrame`, where each key becomes a column name.

```{.python eval=False}
# each key is a column name, each list holds that column's values
table_data = {'A' : [5, 8],
              'B' : [6, 4]}

# stored under a new name so that the imported df is not overwritten
df_from_dict = pd.DataFrame(table_data)
df_from_dict
```
::: {.callout-note collapse="true"}
## Expand to see output
| |A|B|
|-|-|-|
|0|5|6|
|1|8|4|
:::

There are many ways to turn a `DataFrame` back into a dictionary ([https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html#pandas.DataFrame.to_dict](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html#pandas.DataFrame.to_dict)), which might be very handy for certain purposes.
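
As a brief sketch of what these different ways look like, the `orient` argument of `to_dict()` controls the layout of the resulting dictionary. The small table `small` below is made up for illustration, and only three of the available `orient` values are shown.

```python
import pandas as pd

small = pd.DataFrame({"A": [5, 8], "B": [6, 4]})

# default layout: {column -> {row index -> value}}
print(small.to_dict())

# orient="list": {column -> [values]}, the same layout used to build the table
print(small.to_dict(orient="list"))

# orient="records": a list with one dictionary per row
print(small.to_dict(orient="records"))
```

Which layout is most convenient depends on whether the downstream code works column by column or row by row.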

## Data exploration

The data for this tutorial/walkthrough is from a customer personality analysis of a company trying to better understand how to modify their product catalogue.
Here is the link to the original source ([https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)) for more information.

### Columns

To display all the column names from the imported `DataFrame`, the attribute `columns` can be accessed.

```{.python eval=False}
df.columns
```

::: {.callout-note collapse="true"}
## Expand to see output
```
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'MntWines', 'MntFruits', 'MntMeatProducts',
       'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth', 'Complain', 'Z_CostContact', 'Z_Revenue'],
      dtype='object')
```
:::

Each column has its own data type, and these types are highly optimised.
A column with only integers has the data type `int64`.
Columns with decimal numbers have the data type `float64`.
A column with only strings, or a combination of strings and integers or floats, has the data type `object`.

```{.python eval=False}
df.dtypes
```

::: {.callout-note collapse="true"}
## Expand to see output
```
ID                       int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
Kidhome                  int64
Teenhome                 int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
Complain                 int64
Z_CostContact            int64
Z_Revenue                int64
dtype: object
```
:::

::: {.callout-tip collapse="true"}
## What does 64 stand for?
The 64 indicates the number of bits each value is stored in.
64 bits is the largest size pandas handles for these types.
When it is known that the values fall in a certain range, it is possible to reduce the number of bits to 8, 16, or 32.
Choosing the smallest suitable size can reduce memory usage; however, to be safe, the 64-bit range is so large that it covers most use cases.

```{.python eval=False}
df['Kidhome'] = df['Kidhome'].astype('int8')
```
:::
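
To get a feeling for what changing the number of bits does, the following sketch compares the memory used by the same made-up column stored as `int64` and as `int8`. The values are invented, and `memory_usage()` is used here only to report the size in bytes.

```python
import pandas as pd

# toy column of small counts (values made up for illustration)
kids = pd.Series([0, 1, 2, 1, 0] * 1000)
print(kids.dtype)

# int8 can hold whole numbers from -128 to 127
kids_small = kids.astype("int8")

# memory_usage() reports bytes; index=False leaves out the index itself
print(kids.memory_usage(index=False))
print(kids_small.memory_usage(index=False))

# the stored values are unchanged by the conversion
print((kids == kids_small).all())
```

The saving only matters for large tables, and values outside the smaller range cannot be stored faithfully, so it is worth checking the minimum and maximum of a column before converting it.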

### Inspecting the DataFrame

To quickly check how many rows and columns the `DataFrame` has, we can access the `shape` attribute.

```{.python eval=False}
df.shape
```
::: {.callout-note collapse="true"}
## Expand to see output
```
(1754, 20)
```
:::

The `.shape` attribute of a `DataFrame` provides a tuple representing its dimensions.
A tuple is a Python data structure that is used to store ordered items.
In the case of `shape`, the first item is always the number of rows, and the second item is the number of columns.
To print or access either value, its index in the tuple can be used:
`.shape[0]` gives the number of rows, and `.shape[1]` gives the number of columns.

```{.python eval=False}
df.shape[0]
```

::: {.callout-note collapse="true"}
## Expand to see output
```
1754
```
:::

```{.python eval=False}
df.shape[1]
```

::: {.callout-note collapse="true"}
## Expand to see output
```
20
```
:::
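
Because `shape` is an ordinary Python tuple, both values can also be pulled out in one step. This is a minimal sketch on an invented table called `toy`; the names `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"A": [5, 8, 1], "B": [6, 4, 2]})

# a tuple can be unpacked into separate variables in one step
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() of a DataFrame also gives the number of rows
print(len(toy) == toy.shape[0])
```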

:::{.callout-tip}
It is often useful to have a quick look at the first rows to get, for example, an idea of whether the data was read correctly.
This can be done with the `head()` function.
:::

```{.python eval=False}
df.head()
```

::: {.callout-note collapse="true"}
## Expand to see output
```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_2_all_data_head.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
:::

:::{.callout-tip collapse="true"}
## Functions and attributes
The difference between calling a function and accessing an attribute is the `()`.
`.head()` is a function and will perform an action,
while `.shape` is an attribute and will return a value that is already stored in the `DataFrame`.
:::

What we can see is that, unlike _R_, _Python_ (and by extension _Pandas_) is 0-indexed instead of 1-indexed.

### Accessing rows and columns

It is possible to access parts of the data in a `DataFrame` in different ways.
The first method is subsetting rows and columns using the row name and column name.
This can be done with `.loc`, which *loc*ates row(s) and column(s) by their names, given as `[row, column]`.
When the rows are not named, the row index can be used instead.
To print the second row, the index is 1, since indexing in Python starts at 0.
To print all the columns, the `:` is used.

```{.python eval=False}
df.loc[1, :]
```

::: {.callout-note collapse="true"}
## Expand to see output
```
ID                           2174
Year_Birth                   1954
Education              Graduation
Marital_Status             Single
Income                    46344.0
Kidhome                         1
Teenhome                        1
MntWines                       11
MntFruits                       1
MntMeatProducts                 6
MntFishProducts                 2
MntSweetProducts                1
MntGoldProds                    6
NumWebPurchases                 1
NumCatalogPurchases             1
NumStorePurchases               2
NumWebVisitsMonth               5
Complain                        0
Z_CostContact                   3
Z_Revenue                      11
Name: 1, dtype: object
```
:::

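To print a range of rows, the first and last index can be written with a `:` in between.
To print the second and third row, this would be `[1:2, :]` (the `, :` can be omitted when all columns are wanted).
Note that, unlike ordinary Python slicing, `.loc` includes the last index of the range.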

```{.python eval=False}
df.loc[1:2]
```
::: {.callout-note collapse="true"}
## Expand to see output

```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_3_all_data_first_2_lines.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
:::

Printing all the rows of a certain column works the same way as subsetting rows, but with the column name after the comma.

```{.python eval=False}
df.loc[:, "Year_Birth"]
```

::: {.callout-note collapse="true"}
## Expand to see output
```
0       1957
1       1954
2       1965
3       1984
4       1967
        ... 
1749    1977
1750    1974
1751    1967
1752    1981
1753    1956
```
:::
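
The examples above use the default numeric row index. As a sketch of how `.loc` behaves when rows do have names, the invented table below uses the row labels `a`, `b`, and `c`; none of these names or values come from the chapter's dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Year_Birth": [1957, 1954, 1965], "Education": ["Graduation", "Master", "Basic"]},
    index=["a", "b", "c"],  # named rows instead of the default 0, 1, 2
)

one_row = toy.loc["b", :]             # a single row comes back as a Series
two_rows = toy.loc["a":"b", :]        # label ranges include the last label
one_cell = toy.loc["c", "Education"]  # row and column together give one value

print(one_row)
print(two_rows)
print(one_cell)
```

The same `[row, column]` pattern applies whether the rows are labelled with numbers or with names.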

It is important to note that almost all operations on `DataFrames` are not performed in place, meaning that the original `DataFrame` is not modified.
To keep the changes, the output has to be stored with `=`, either in a new variable or by overwriting the existing `DataFrame`.

To make a new `DataFrame` with only the `Education` and `Marital_Status` columns, the column names have to be placed in a list `['colname1', 'colname2']`.

```{.python eval=False}
df.head()
new_df = df.loc[:, ["Education", "Marital_Status"]]
new_df
```

::: {.callout-note collapse="true"}
## Expand to see output
||Education|Marital_Status|
|-|-|-|
|0|Graduation|Single|
|1|Graduation|Single|
|2|Graduation|Together|
|3|Graduation|Together|
|4|Master|Together|
|…|…|…|
|1749|Graduation|Together|
|1750|Graduation|Married|
|1751|Graduation|Married|
|1752|Graduation|Divorced|
|1753|Master|Together|
|1754 rows × 2 columns|
:::

It is also possible to remove rows and columns from a `DataFrame`.
This can be done with the function `drop()`.
To remove the columns `Z_CostContact` and `Z_Revenue` and keep those changes, it is necessary to overwrite the `DataFrame`.
To make sure `Pandas` understands that it is columns that need to be removed, the `axis` has to be specified.
Rows are `axis=0`, and columns are `axis=1`.
Without specifying the axis, `drop()` looks for the names among the row labels (`axis=0` is the default) and raises an error, since in this case no row is called `Z_CostContact` or `Z_Revenue`.
It is therefore good practice to always add the axis to make sure `Pandas` is operating as expected.

```{.python eval=False}
df = df.drop("Z_CostContact", axis=1)
df = df.drop("Z_Revenue", axis=1)
```
::: {.callout-tip collapse="true"}
## Can be done in one go
```
df = df.drop(["Z_CostContact", "Z_Revenue"], axis=1)
```
:::
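
The following sketch illustrates, on an invented three-column table, that `drop()` returns a new `DataFrame` and leaves the original untouched unless the result is stored. It also shows the `columns=` keyword, which is an alternative to `axis=1`. The names `toy`, `trimmed`, and `without_first` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"ID": [1, 2], "Z_CostContact": [3, 3], "Z_Revenue": [11, 11]})

# drop() returns a new DataFrame; without storing it, toy stays unchanged
toy.drop("Z_Revenue", axis=1)
print(toy.columns.tolist())

# columns= is an alternative way of saying axis=1
trimmed = toy.drop(columns=["Z_CostContact", "Z_Revenue"])
print(trimmed.columns.tolist())

# axis=0 removes rows by their label
without_first = toy.drop(0, axis=0)
print(without_first.index.tolist())
```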

### Conditional subsetting

So far, all the subsetting has been based on row names and column names.
However, in many cases, it is more helpful to look only at data that meet certain conditions.
This can be done using conditional subsetting, which is based on Boolean values, `True` or `False`.
`pandas` interprets a `Series` of `True` and `False` values by returning only the rows or columns where a `True` is present and ignoring all rows or columns with a `False`.

For example, if we are only interested in individuals in the table who graduated, we can test each string in the column `Education` to see if it is equal (`==`) to `Graduation`.
This will return a `Series` with Boolean values `True` or `False`.

```{.python eval=False}
education_is_grad = (df["Education"] == "Graduation")
education_is_grad
```

:::{.callout-note collapse="true"}
## Expand to see output
```
0        True
1        True
2        True
3        True
4       False
        ...  
1749     True
1750     True
1751     True
1752     True
1753    False
Name: Education, Length: 1754, dtype: bool
```
:::

To quickly check if the `True` and `False` values are correct, it can be useful to print out this column.

```{.python eval=False}
df["Education"]
```

:::{.callout-note collapse="true"}
## Expand to see output
```
0       Graduation
1       Graduation
2       Graduation
3       Graduation
4       Master
…       …
1749    Graduation
1750    Graduation
1751    Graduation
1752    Graduation
1753    Master
Name: Education, length: 1754, dtype: object
```
:::

It is possible to provide pandas with multiple conditions at the same time.
This can be done by combining multiple statements, each wrapped in parentheses, with `&`.

```{.python eval=False}
two_at_once = (df["Education"] == "Graduation") & (df["Marital_Status"] == "Single")
two_at_once
```

::: {.callout-note collapse="true"}
## Expand to see output
```
0        True
1        True
2       False
3       False
4       False
        ...  
1749    False
1750    False
1751    False
1752    False
1753    False
Length: 1754, dtype: bool
```
:::

To find out the total number of graduated singles, the `sum()` function can be used, since each `True` counts as 1 and each `False` as 0.

```{.python eval=False} 
sum(two_at_once)
```

::: {.callout-note collapse="true"}
## Expand to see output
```
252
```
:::

These `Series` of Booleans can be used to subset the dataframe to rows where the condition(s) are _True_:

```{.python eval=False} 
df[two_at_once]
```
:::{.callout-note collapse="true"}
## Expand to see output

```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_4_all_data_conditional_subset.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
252 rows × 20 columns
:::

It is not actually necessary to create a `Series` every time for subsetting the table; it can be done in one go by combining the conditions within the _`df[]`_.

```python
df[(df["Education"] == "Master") & (df["Marital_Status"] == "Single")]
```
:::{.callout-note collapse="true"}
## Expand to see output
```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_5_all_data_conditional_subset2.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
75 rows × 20 columns
:::
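
Besides `&` (and), Boolean `Series` can also be combined with `|` (or) and flipped with `~` (not). These two operators are not used elsewhere in this chapter; the sketch below shows them on an invented four-row table so that the results are easy to check by eye.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Graduation", "Master", "Graduation", "Basic"],
    "Marital_Status": ["Single", "Single", "Married", "Together"],
})

is_grad = toy["Education"] == "Graduation"
is_single = toy["Marital_Status"] == "Single"

both = toy[is_grad & is_single]     # & : both conditions are True
either = toy[is_grad | is_single]   # | : at least one condition is True
not_single = toy[~is_single]        # ~ : flips True and False

print(len(both), len(either), len(not_single))
```

When conditions are written directly inside `df[]`, each one needs its own parentheses, as in the example above with `Master` and `Single`.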

### Describing a DataFrame

Sometimes it is nice to get a quick overview of the data in a table, such as means and counts.
`Pandas` has a native function to do just that: `describe()` outputs the count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum for each numeric column.

```{.python eval=False} 
df.describe()
```
:::{.callout-note collapse="true"}
## Expand to see output
```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_6_Describing_df.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
8 rows × 18 columns
:::

We can also directly calculate the relevant statistics on the numeric columns or rows we are interested in, using the functions `max()`, `min()`, `mean()`, `median()`, etc.

```{.python eval=False} 
df["MntWines"].max()
```

:::{.callout-note collapse="true"}
## Expand to see output
```
1492
```
:::

```{.python eval=False} 
df[["Kidhome", "Teenhome"]].mean()
```

:::{.callout-note collapse="true"}
## Expand to see output
```
Kidhome     0.456100
Teenhome    0.480616
dtype: float64
```
:::

There are also certain functions that are useful for non-numeric columns.
To find out which strings are present in an `object` column, the function `unique()` can be used; this returns an `array` with the unique values in the column or row.

```{.python eval=False} 
df["Education"].unique()
```
:::{.callout-note collapse="true"}
## Expand to see output
```
array(['Graduation', 'Master', 'Basic', '2n Cycle'], dtype=object)
```
:::

To find out how often a value is present in a column or row, the function `value_counts()` can be used.
This returns a `Series` with a count for each unique value.

```{.python eval=False} 
df["Marital_Status"].value_counts()
```
:::{.callout-note collapse="true"}
## Expand to see output
```
Marital_Status
Married     672
Together    463
Single      382
Divorced    180
Widow        53
Alone         2
Absurd        2
Name: count, dtype: int64
```
:::
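
Counts are sometimes easier to compare as fractions. As a small sketch on invented data, `value_counts()` accepts `normalize=True` to return proportions instead of counts, and `nunique()` gives the number of distinct values. The `Series` named `status` is illustrative only.

```python
import pandas as pd

status = pd.Series(["Married", "Single", "Married", "Together", "Married"])

counts = status.value_counts()                # sorted from most to least frequent
shares = status.value_counts(normalize=True)  # fractions instead of counts

print(counts)
print(shares)
print(status.nunique())                       # number of distinct values
```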

### Getting summary statistics on grouped data

`Pandas` is equipped with lots of useful functions which make complicated tasks very easy and fast.
One of these functions is `.groupby()` with the argument `by=...`, which groups a `DataFrame` using a categorical column (for example `Education` or `Marital_Status`).
This makes it possible to perform operations on each group directly without the need for subsetting.
For example, the mean income for the different education levels in the `DataFrame` can be obtained by specifying the grouping column with `.groupby(by='Education')`, selecting the column to perform the action on with `["Income"]`, and then calling the `mean()` function.

```{.python eval=False} 
df.groupby(by="Education")["Income"].mean()
```
:::{.callout-note collapse="true"}
## Expand to see output
```
Education
2n Cycle      47633.190000
Basic         20306.259259
Graduation    52720.373656
Master        52917.534247
Name: Income, dtype: float64
```
:::
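
More than one statistic can be requested per group. The sketch below uses the `agg()` function on an invented five-row table to calculate the count, mean, and maximum income for each education level in one call. The table and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Basic", "Master", "Basic", "Master", "Master"],
    "Income": [20000.0, 50000.0, 22000.0, 54000.0, 52000.0],
})

# one statistic per group
mean_income = toy.groupby(by="Education")["Income"].mean()
print(mean_income)

# several statistics per group in one call
summary = toy.groupby(by="Education")["Income"].agg(["count", "mean", "max"])
print(summary)
```

The result of `agg()` is a `DataFrame` with one row per group and one column per statistic, so it can be subset like any other table.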

### Subsetting Questions and Exercises

Here are several exercises for practising conditional subsetting.
Try them yourself first before looking at the answers.

::: {.callout-tip title="Question" appearance="simple"}
How many single people in the table also graduated? And how many are single in total?
:::
::: {.callout-note collapse="true" title="Answer"}
```{.python eval=False} 
sum(df["Marital_Status"] == "Single")
```

::: {.callout-note collapse="true"}
## Expand to see output
```
382
```
:::
:::

::: {.callout-tip title="Question" appearance="simple"}
Subset the `DataFrame` into people born before 1970 and people born in or after 1970, and store both `DataFrames`.
:::
::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
df_before = df[df["Year_Birth"] < 1970]
df_after = df[df["Year_Birth"] >= 1970]
```
:::

::: {.callout-tip title="Question" appearance="simple"}
How many people are in the two `DataFrames`? 
:::
::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
print("n(before)   =", df_before.shape[0])
print("n(after)   =", df_after.shape[0])
```
::: {.callout-note collapse="true"}
## Expand to see output
```
n(before)   = 804
n(after)   = 950
```
:::
:::

::: {.callout-tip title="Question" appearance="simple"}
Does the total number of people in the two `DataFrames` add up to the original `DataFrame` total?
:::
::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
df_before.shape[0] + df_after.shape[0] == df.shape[0]
```
:::{.callout-note collapse="true"}
## Expand to see output
True
:::
```{.python eval=False} 
print("n(sum)      =", df_before.shape[0] + df_after.shape[0])
print("n(expected) =", df.shape[0])
```
:::{.callout-note collapse="true"}
## Expand to see output
```
n(sum)      = 1754
n(expected) = 1754
```
:::
:::

::: {.callout-tip title="Question" appearance="simple"}
How does the mean income of the two groups differ?
:::

::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
print("income(before) =", df_before["Income"].mean())
print("income(after)  =", df_after["Income"].mean())
```
:::{.callout-note collapse="true"}
## Expand to see output
income(before) = 55513.38113207547

income(after)  = 47490.29255319149  
:::     
:::

::: {.callout-tip title="Question" appearance="simple"}
**Bonus**: Can you find something else that differs a lot between the two groups?
:::

::: {.callout-note collapse="true" title="Answer"}
This is an open ended question.
:::

## Dealing with missing data

In large tables, it is often important to check if there are columns or rows that have missing data.
`pandas` represents missing data with NA (Not Available), which in numeric columns is typically displayed as `NaN`.
To identify these missing values, `pandas` provides the `.isna()` function.
This function checks every cell in the `DataFrame` and returns a `DataFrame` of the same shape, where each cell contains a Boolean value: `True` if the original cell contains `NA`, and `False` otherwise.

```python
df.isna()
```
:::{.callout-note collapse="true"}
## Expand to see output

```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_7_isna_df.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
1754 rows × 20 columns
:::

It is very hard to see if there are any `True` values in this new Boolean table.
To investigate how many missing values are present in each column, the `sum()` function can be used.
In `Python`, `True` counts as `1` and `False` as `0`, so the sum of a Boolean column is the number of `True` values.

```{.python eval=False} 
df.isna().sum()
```
:::{.callout-note collapse="true"}
## Expand to see output
```
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 19
Kidhome                 0
Teenhome                0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
Complain                0
dtype: int64
```
:::

In this case, it is not clear what a missing value represents: are these individuals who did not want to state their income, or who did not have an income?
For the tutorial, we are going to keep them in the data.
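
Before deciding what to do with missing values, it can help to look at the affected rows. The sketch below uses a small made-up table (`toy` and its values are hypothetical, not the tutorial dataset) to show how `.isna()` can be used for conditional subsetting, and that `.mean()` skips `NA` values by default, which is why keeping them does not break the summary statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing incomes
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Education": ["Basic", "Master", "PhD", "Master"],
    "Income": [21000.0, np.nan, 58000.0, np.nan],
})

# Keep only the rows where Income is missing
missing_income = toy[toy["Income"].isna()]
print(missing_income)

# .mean() ignores NA values by default
print(toy["Income"].mean())  # 39500.0
```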

There are two possible actions for dealing with missing data.
One is to remove the row or column using the `.dropna()` function.
By default, this function removes every row that contains an `NA`; with `axis=1`, it removes every column that contains an `NA` instead.
The second is to fill the empty cells using the `.fillna()` function, which substitutes each missing value with a set value, such as `0`.
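
As a minimal sketch of both options, assuming a small made-up table (`toy` is hypothetical and not part of the tutorial data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing income
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Income": [52000.0, np.nan, 61000.0],
})

dropped_rows = toy.dropna()        # removes the row that contains the NA
dropped_cols = toy.dropna(axis=1)  # removes the column that contains the NA
filled = toy.fillna(0)             # replaces the NA with 0

print(dropped_rows.shape)             # (2, 2)
print(dropped_cols.columns.tolist())  # ['ID']
print(filled["Income"].tolist())      # [52000.0, 0.0, 61000.0]
```

None of these calls change `toy` itself; each returns a new `DataFrame`. Note that filling with `0` changes summary statistics such as the mean, so the fill value should be chosen with care.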

##  Combining data

### Concatenation exercises

Data is very often present in multiple tables.
Think, for example, about a taxonomy table giving count data per sample.
One way to combine multiple datasets is through concatenation, which joins multiple `DataFrames` along either their rows or their columns.
The function in Pandas that does just that is called `pd.concat()`.
This command combines two `DataFrames` by appending all rows (the default) or all columns (with `axis=1`): `pd.concat([first_dataframe, second_dataframe])`.
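
The difference between the two directions can be sketched with two tiny made-up tables (`first` and `second` are hypothetical names and values):

```python
import pandas as pd

# Two hypothetical toy tables with the same columns
first = pd.DataFrame({"ID": [1, 2], "Education": ["Basic", "Master"]})
second = pd.DataFrame({"ID": [3], "Education": ["PhD"]})

# Default: append the rows; the original row labels are kept
stacked = pd.concat([first, second])
print(stacked.index.tolist())  # [0, 1, 0]

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# axis=1 places the tables side by side, aligned on the row labels
side_by_side = pd.concat([first, second], axis=1)
print(side_by_side.shape)  # (2, 4)
```

Because the row labels are kept by default, a concatenated table can contain the same label more than once, which matters when selecting rows by label.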

In the `DataFrame`, there are individuals with the education levels Graduation, Master, Basic, and 2n Cycle.
PhD is missing; however, there is data on people with the education level PhD in another table called `phd_data.tsv`.

With everything learned so far, and basic information on the `.concat()` function, try to read in the data from `../phd_data.tsv` and concatenate it to the existing `df`.

::: {.callout-tip title="Question" appearance="simple"}
Read the _tsv_ "phd_data.tsv" as a new `DataFrame` and name the variable `df2`.
:::

::: {.callout-note collapse="true" title="Answer"}
```{.python eval=False} 
df2 = pd.read_csv("../phd_data.tsv", sep="\t")
```
:::

::: {.callout-tip title="Question" appearance="simple"}
Concatenate the "old" `DataFrame` `df` and the new `df2` and name the concatenated one `concat_df`.
:::
::: {.callout-note collapse="true" title="Answer"}
```{.python eval=False}  
concat_df = pd.concat([df, df2])
concat_df
```
:::{.callout-note collapse="true"}
## Expand to see output
```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_8_concat_df.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
2240 rows × 20 columns
:::
:::
::: {.callout-tip title="Question" appearance="simple"}
Is there anything weird about the new `DataFrame` and can you fix that?
:::
::: {.callout-note collapse="true" title="Answer"}

We previously removed the columns "Z_CostContact" and "Z_Revenue" but they are in the new data again.

We can remove them like before.

```{.python eval=False} 
concat_df = concat_df.drop("Z_CostContact", axis=1)
concat_df = concat_df.drop("Z_Revenue", axis=1)
concat_df
```
:::{.callout-note collapse="true"}
## Expand to see output
```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_9_doncat_drop_df.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
2240 rows × 18 columns
:::
:::

::: {.callout-tip title="Question" appearance="simple"}
Is there something interesting about the marital status of some people that have a PhD?
:::
::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
concat_df[concat_df["Education"]=="PhD"]["Marital_Status"].value_counts()
```
:::{.callout-note collapse="true"}
## Expand to see output
```
Marital_Status
Married     192
Together    117
Single       98
Divorced     52
Widow        24
YOLO          2
Alone         1
Name: count, dtype: int64
```
:::

There are two people that have "YOLO" as their Marital Status ...

:::

### Merging

Besides concatenating two `DataFrames`, there is another powerful function for combining data from multiple sources: `.merge()`.
This function is especially useful when we have different types of related data in separate tables.
For example, we might have a taxonomy table with count data per sample and a metadata table in another `DataFrame`.

The pandas function `.merge()` allows us to combine these `DataFrames` based on a common column.
This column must exist in both `DataFrames` and contain matching values.

To illustrate the `.merge()` function, we will create a new `DataFrame` and merge it with the existing one.
Let's rank the different education levels from 1 to 5 in a new `DataFrame` and merge this with the existing `DataFrame`.

As shown before, there are multiple ways of making a new `DataFrame`.
Here, we first create a dictionary and then use the `from_dict()` function to transform this into a `DataFrame`.

A dictionary is made up of `{'key': 'value'}` pairs.
By default, the `from_dict()` function uses the `keys()` of the dictionary as column names.
To use the `keys()` as row names instead, the `orient=` argument has to be set to `"index"`.
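
To make the effect of `orient=` concrete, here is a small sketch with a made-up dictionary (`toy_dict` is hypothetical; its values are lists so that the default orientation also works):

```python
import pandas as pd

# Hypothetical toy dictionary; each key maps to a list of values
toy_dict = {"a": [1, 2], "b": [3, 4]}

# Default orientation: the keys become the column names
as_columns = pd.DataFrame.from_dict(toy_dict)
print(as_columns.columns.tolist())  # ['a', 'b']

# orient="index": the keys become the row names
as_rows = pd.DataFrame.from_dict(toy_dict, orient="index")
print(as_rows.index.tolist())    # ['a', 'b']
print(as_rows.columns.tolist())  # [0, 1]
```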

```{.python eval=False} 
education_dictionary = {
    "Basic": 1,
    "2n Cycle": 2,
    "Graduation": 3,
    "Master": 4,
    "PhD": 5
}

education_df = pd.DataFrame.from_dict(education_dictionary, orient="index")
education_df
```
:::{.callout-note collapse="true"}
## Expand to see output
||0|
|-|-|
|Basic	|1|
|2n Cycle	|2|
|Graduation	|3|
|Master	|4|
|PhD	|5|
:::

The resulting `DataFrame` has the education level as its `index` (the row names), and the column name is `0`.

`0` is not a particularly useful name for our new column, so we can use the `.rename()` function to change it.

```{.python eval=False} 
education_df = education_df.rename(columns={0: "Level"})
education_df
```
:::{.callout-note collapse="true"}
## Expand to see output
||Level|
|-|-|
|Basic	|1|
|2n Cycle	|2|
|Graduation	|3|
|Master	|4|
|PhD	|5|

:::

Now that there is a new `DataFrame` with all the needed information, we can merge it with our previous `concat_df` on the `Education` column.
Here, we pass four arguments to the `.merge()` function:

- `left=`: the `DataFrame` on the left side of the merge, in our case `concat_df`
- `right=`: the `DataFrame` on the right side of the merge, in our case `education_df`
- `left_on=`: the column of the left `DataFrame` to merge on, in our case `Education`
- `right_index=True`: indicates that the right `DataFrame` should be merged using its index, because that is where `education_df` stores the education levels

If the values were in a column instead of the index, we would use `right_on=` instead.

```{.python eval=False} 
merged_df = pd.merge(left=concat_df, right=education_df, left_on="Education", right_index=True)
merged_df
```
:::{.callout-note collapse="true"}
## Expand to see output

```{r}
#| echo: false
#| results: 'asis'
#| message: false
#| warning: false

library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/python-pandas/table_10_merged_df.csv")

# Create the table with gt
data %>%
    gt() %>%
    tab_options(
        table.width = pct(100),
        table.layout = "fixed",
        container.overflow.x = "scroll"
    )
```
2240 rows × 19 columns

:::
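
For comparison, the following sketch shows the `right_on=` variant next to the `right_index=True` variant on two small made-up tables (`people` and `levels` are hypothetical stand-ins for `concat_df` and `education_df`):

```python
import pandas as pd

# Hypothetical toy tables
people = pd.DataFrame({"ID": [1, 2, 3],
                       "Education": ["Basic", "PhD", "Master"]})
levels = pd.DataFrame({"Education": ["Basic", "Master", "PhD"],
                       "Level": [1, 4, 5]})

# The lookup values sit in a regular column, so we use right_on=
by_column = pd.merge(left=people, right=levels,
                     left_on="Education", right_on="Education")

# The same lookup with the education levels stored in the index
by_index = pd.merge(left=people, right=levels.set_index("Education"),
                    left_on="Education", right_index=True)

print(by_column)
print(by_index)
```

Both calls attach the same `Level` to each person. By default `.merge()` keeps only rows whose key is found in both tables, so it is worth checking that the number of rows after merging is what we expect.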

## Data visualisation

Just looking at `DataFrames` is nice and useful, but in many cases, it is easier to look at data in graphs.
The function that can create plots directly from `DataFrames` is `.plot()`.
The `.plot()` function uses the plotting library `matplotlib` by default in the background.
There are other plotting libraries, such as `Plotnine`, which will be shown later in the tutorial.

The main arguments of `.plot()` are `kind=...` and the plot axes `x=...` and `y=...`.
The `kind` argument specifies the type of plot, such as `hist` for histogram, `bar` for bar plot, and `scatter` for scatter plot.
Check out the `Pandas` documentation ([https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)) for more plot `kind`s and useful syntax.
There are many functions for adjusting the look of a plot.
These functions, such as `.set_xlabel()` or `.set_title()`, are called on the object returned by `.plot()`, as shown in the examples below.

### Histogram

```{.python eval=False} 
ax = merged_df.plot(kind="hist", y="Income")
ax.set_xlabel("Income")
ax.set_title("Histogram of income")
```

:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-incomehistogram.

![This is the histogram of income that should appear when we run the code above.](https://github.com/Kevinnota/teaching_stuff/blob/main/histogram.png?raw=true){#fig-pythonpandas-incomehistogram}
:::

::: {.callout-tip title="Question" appearance="simple"}
This does not look very good because the x-axis extends so much!
Looking at the data, can you figure out what might cause this?
:::
::: {.callout-note collapse="true" title="Answer"}

When we look at the highest earners, we see that somebody put _666666_ as their income.
This is much higher than any other income, so the histogram is stretched out to include this person.

```{.python eval=False} 
# Sort the Income column from highest to lowest
merged_df["Income"].sort_values(ascending=False)
```
:::{.callout-note collapse="true"}
## Expand to see output
```
1749    666666.0
1006    157733.0
1290    157146.0
512     153924.0
504     105471.0
        ...
1615    NaN
1616    NaN
1616    NaN
1621    NaN
1744    NaN
Name: Income, Length: 1754, dtype: float64
```
:::

:::

::: {.callout-tip title="Question" appearance="simple"}
Use conditional subsetting to make the histogram look nicer.
:::

::: {.callout-note collapse="true" title="Answer"}

```{.python eval=False} 
ax = merged_df[merged_df["Income"] != 666666].plot(kind="hist",y="Income")
ax.set_xlabel("Income")
ax.set_title("Fixed Histogram of income")
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-incomefixhistogram.

![To "fix" the histogram, the one person with the income of 666666 is removed, making the plot look a lot neater.](https://github.com/Kevinnota/teaching_stuff/blob/main/fixed_histogram.png?raw=true){#fig-pythonpandas-incomefixhistogram}
:::
:::

### Bar plot

Instead of making a plot from the original `DataFrame` we can use the `groupby` and `mean` methods to make a plot with summary statistics.

```{.python eval=False} 
grouped_by_education = merged_df.groupby(by="Education")["Income"].mean()
grouped_by_education

```
:::{.callout-note collapse="true"}
## Expand to see output
```
Education
2n Cycle      47633.190000
Basic         20306.259259
Graduation    52720.373656
Master        52917.534247
PhD           56145.313929
Name: Income, dtype: float64
```
:::

```{.python eval=False} 
ax = grouped_by_education.plot(kind="bar")
ax.set_ylabel("Mean income")
ax.set_title("Mean income for each education level")
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-barplot.

![Barplot of the mean income for each education level.](https://github.com/Kevinnota/teaching_stuff/blob/main/bar_plot.png?raw=true){#fig-pythonpandas-barplot}
:::
### Scatter plot

Another kind of plot is the scatter plot, which needs two columns: one for the _x_ axis and one for the _y_ axis.

```{.python eval=False} 
ax = df.plot(kind="scatter", x="MntWines", y="MntFruits")
ax.set_title("Wine purchases and Fruit purchases")
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-scatterplot.

![A scatter plot with wine purchases on the x-axis and fruit purchases on the y-axis.](https://github.com/Kevinnota/teaching_stuff/blob/main/scatter_polt_01.png?raw=true){#fig-pythonpandas-scatterplot}
:::

We can also specify whether the axes should be on the log scale or not.

```{.python eval=False} 
ax = df.plot(kind="scatter", x="MntWines", y="MntFruits", logy=True, logx=True)
ax.set_title("Wine purchases and Fruit purchases, on log scale")
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-logscatterplot.

![The scatter plot with wine purchases on the x-axis and fruit purchases on the y-axis, with both axes on a log scale.](https://github.com/Kevinnota/teaching_stuff/blob/main/scatter_polt_02.png?raw=true){#fig-pythonpandas-logscatterplot}
:::

## Plotnine

Plotnine is a Python clone of ggplot2. It is very powerful, and great if we are already familiar with the ggplot2 syntax!

```{.python eval=False} 
from plotnine import *
```
```{.python eval=False} 
(ggplot(merged_df, aes("Education", "MntWines", fill="Education"))
 + geom_boxplot(alpha=0.8))
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-boxplot.

![Boxplot with the amount spent on wine per education.](https://github.com/Kevinnota/teaching_stuff/blob/main/boxplot.png?raw=true){#fig-pythonpandas-boxplot}
:::

```{.python eval=False} 
(ggplot(merged_df[(merged_df["Year_Birth"]>1900) & (merged_df["Income"]!=666666)],
        aes("Year_Birth", "Income", fill="Education"))
 + geom_point(alpha=0.5, stroke=0)
 + facet_wrap("Marital_Status"))
```
:::{.callout-note collapse="true"}
## Expand to see output
Results in @fig-pythonpandas-facetplot.

![Plot of the income of people born after 1900, faceted by marital status, and filled by education level.](https://github.com/Kevinnota/teaching_stuff/blob/main/facet_scatter.png?raw=true){#fig-pythonpandas-facetplot}
:::


### Advanced Questions and Exercises

Now that we are familiar with Python, pandas, and plotting, we can apply these skills to new data.
There are two tables from _AncientMetagenomeDir_ that contain metadata from metagenomes.
Using the code in this tutorial, we should be able to explore these datasets and make some fancy plots.

```verbatim
file names:
sample_table_url
library_table_url
```
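
One possible way to start exploring an unfamiliar table is sketched below. The helper `first_look()` is a hypothetical convenience function, and the toy table with its column names is invented for illustration; it is not the real _AncientMetagenomeDir_ data, so the column names will need to be adapted after loading the actual tables.

```python
import pandas as pd

def first_look(table: pd.DataFrame, column: str) -> dict:
    """Collect a few basic facts about a table (hypothetical helper)."""
    return {
        # Number of rows and columns
        "shape": table.shape,
        # Number of missing values in each column
        "missing_per_column": table.isna().sum().to_dict(),
        # The three most frequent values in the chosen column
        "top_values": table[column].value_counts().head(3).to_dict(),
    }

# Toy stand-in table; the column names are invented for this sketch
toy = pd.DataFrame({
    "sample_name": ["s1", "s2", "s3"],
    "material": ["dental calculus", "sediment", "dental calculus"],
    "publication_year": [2014, None, 2020],
})

summary = first_look(toy, "material")
print(summary)
```

The same three checks (shape, missing values, most frequent values) were used on the customer data earlier in this chapter.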

## Summary

In this chapter, we have started exploring the basics of data analysis using `Python` with the versatile `Pandas` library.
We wrote `Python` code and executed it in a Jupyter Notebook. With just a handful of functions such as `.read_csv()`, `.loc[]`, `.drop()`, `.merge()`, `.concat()` and `.plot()`, we have done data manipulation, calculated summary statistics, and plotted the data.

The takeaway messages therefore are:

- Python, Pandas and Jupyter Notebook are relatively easy to use and powerful for data analysis
- If you know R and R markdown, the syntax of Python is easy to learn

The functions in this chapter and the general idea of the Python syntax should help you get started using Python on your data.

## (Optional) clean-up

Let's clean up your working directory by removing all the data and output from this chapter.

When closing your `jupyter` notebook(s), say no to saving any additional files.

Press <kbd>ctrl</kbd> + <kbd>c</kbd> on your terminal, and type <kbd>y</kbd> when requested.
Once completed, the command below will remove the `/<PATH>/<TO>/python-pandas` directory **as well as all of its contents**.

::: {.callout-tip}
## Pro Tip
Always be VERY careful when using `rm -r`.
Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
:::

```bash
rm -r /<PATH>/<TO>/python-pandas*
```
Once deleted you can move elsewhere (e.g. `cd ~`).

We can also get out of the `conda` environment with

```bash
conda deactivate
```

To delete the conda environment

```bash
conda remove --name python-pandas --all -y
```