Add Notebook for Loading Data to NestedPandas (#85)
* Add Notebook for Loading Data to NestedPandas
* Clear notebook output
* Run pre-commit hooks
* Address review comments
Showing 1 changed file with 289 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,289 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading Data into Nested-Pandas" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. The following command can be used to install it:" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# %pip install nested-pandas" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from nested_pandas.datasets import generate_parquet_file\n", | ||
"from nested_pandas import NestedFrame\n", | ||
"from nested_pandas import read_parquet\n", | ||
"\n", | ||
"import os\n", | ||
"import pandas as pd\n", | ||
"import tempfile" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading Data from Dictionaries\n", | ||
"Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.\n", | ||
"\n", | ||
"We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n", | ||
"\n", | ||
"We can then create additional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"nf = NestedFrame(data={\"a\": [1, 2, 3], \"b\": [2, 4, 6]}, index=[0, 1, 2])\n", | ||
"\n", | ||
"nested = pd.DataFrame(\n", | ||
" data={\"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1], \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4]},\n", | ||
" index=[0, 0, 0, 1, 1, 1, 2, 2, 2],\n", | ||
")\n", | ||
"\n", | ||
"nf = nf.add_nested(nested, \"nested\")\n", | ||
"nf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading Data from Parquet Files" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For larger datasets, we support loading data from parquet files.\n", | ||
"\n", | ||
"In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the `read_parquet` function.\n", | |
"\n", | ||
"First we load each file individually as its own data frame to be inspected. Then we use `read_parquet` to create the `NestedFrame` `nf`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"base_df, nested1, nested2 = None, None, None\n", | ||
"nf = None\n", | ||
"\n", | ||
"# Note that we use the `tempfile` module to create and then clean up a temporary directory.\n", | |
"# You can of course remove this and use your own directory and real files on your system.\n", | |
"with tempfile.TemporaryDirectory() as temp_path:\n", | |
" # Generate parquet files with random data within our temporary directory.\n", | |
" generate_parquet_file(10, {\"nested1\": 100, \"nested2\": 10}, temp_path, file_per_layer=True)\n", | ||
"\n", | ||
" # Read each individual parquet file into its own dataframe.\n", | ||
" base_df = read_parquet(os.path.join(temp_path, \"base.parquet\"))\n", | ||
" nested1 = read_parquet(os.path.join(temp_path, \"nested1.parquet\"))\n", | ||
" nested2 = read_parquet(os.path.join(temp_path, \"nested2.parquet\"))\n", | ||
"\n", | ||
" # Create a single NestedFrame packing multiple parquet files.\n", | ||
" nf = read_parquet(\n", | ||
" data=os.path.join(temp_path, \"base.parquet\"),\n", | ||
" to_pack={\n", | ||
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n", | ||
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n", | ||
" },\n", | ||
" )" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"When examining the individual tables for each of our parquet files we can see that:\n", | ||
"\n", | ||
"a) they all have different dimensions\n", | ||
"b) they have shared indices" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Print the dimensions of all of our underlying tables\n", | ||
"print(\"Our base table 'base.parquet' has shape:\", base_df.shape)\n", | ||
"print(\"Our first nested table 'nested1.parquet' has shape:\", nested1.shape)\n", | |
"print(\"Our second nested table 'nested2.parquet' has shape:\", nested2.shape)\n", | |
"\n", | ||
"# Print the unique indices in each table:\n", | ||
"print(\"The unique indices in our base table are:\", base_df.index.values)\n", | ||
"print(\"The unique indices in our first nested table are:\", nested1.index.unique())\n", | ||
"print(\"The unique indices in our second nested table are:\", nested2.index.unique())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"When we inspect `nf`, the `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we can see that the nested parquet files were packed according to the index values they share with the index in `base.parquet`.\n", | |
"\n", | |
"The resulting `NestedFrame` has the same number of rows as `base.parquet`, with `nested1.parquet` and `nested2.parquet` packed into the `nested1` and `nested2` columns respectively." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"nf" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to packing the dataframes directly with `NestedFrame.add_nested`, as shown in the next section." | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Packing Together Existing Dataframes Into a NestedFrame" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Saving NestedFrames to Parquet Files\n", | ||
"\n", | ||
"Additionally, we can save an existing `NestedFrame` as one or more parquet files using `NestedFrame.to_parquet`.\n", | |
"\n", | |
"When `by_layer=True`, we save each individual layer of the `NestedFrame` into its own parquet file in a specified output directory.\n", | |
"\n", | |
"The base layer will be written to \"base.parquet\", and each nested layer will be written to a file named after its column. So the nested layer in column `nested1` will be written to \"nested1.parquet\"." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"restored_nf = None\n", | ||
"\n", | ||
"# Note that we use the `tempfile` module to create and then clean up a temporary directory.\n", | |
"# You can of course remove this and use your own directory and real files on your system.\n", | ||
"with tempfile.TemporaryDirectory() as temp_path:\n", | ||
" nf.to_parquet(\n", | ||
" temp_path, # The directory to save our output parquet files.\n", | ||
" by_layer=True, # Save each layer of the NestedFrame to its own parquet file.\n", | ||
" )\n", | ||
"\n", | ||
" # List the files in temp_path to ensure they were saved correctly.\n", | ||
" print(\"The NestedFrame was saved to the following parquet files:\", os.listdir(temp_path))\n", | |
"\n", | ||
" # Read the NestedFrame back in from our saved parquet files.\n", | ||
" restored_nf = read_parquet(\n", | ||
" data=os.path.join(temp_path, \"base.parquet\"),\n", | ||
" to_pack={\n", | ||
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n", | ||
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n", | ||
" },\n", | ||
" )\n", | ||
"\n", | ||
"restored_nf # our dataframe is restored from our saved parquet files" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.\n", | ||
"\n", | ||
"Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `by_layer=False`.\n", | |
"\n", | |
"Our `read_parquet` function can load a `NestedFrame` saved in this single parquet file without requiring any additional arguments." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"restored_nf_single_file = None\n", | ||
"\n", | ||
"# Note that we use the `tempfile` module to create and then clean up a temporary directory.\n", | |
"# You can of course remove this and use your own directory and real files on your system.\n", | ||
"with tempfile.TemporaryDirectory() as temp_path:\n", | ||
" output_path = os.path.join(temp_path, \"output.parquet\")\n", | ||
" nf.to_parquet(\n", | ||
" output_path, # The filename to save our NestedFrame to.\n", | ||
" by_layer=False, # Save the entire NestedFrame to a single parquet file.\n", | ||
" )\n", | ||
"\n", | ||
" # List the files within our temp_path to ensure that we only saved a single parquet file.\n", | ||
" print(\"The NestedFrame was saved to the following parquet files:\", os.listdir(temp_path))\n", | |
"\n", | ||
" # Read the NestedFrame back in from our saved single parquet file.\n", | ||
" restored_nf_single_file = read_parquet(output_path)\n", | ||
"\n", | ||
"restored_nf_single_file # our dataframe is restored from a single saved parquet file" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
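
The notebook describes nested tables being packed "according to the shared index values" with the base table. As a rough conceptual sketch of that idea, the following plain-Python snippet groups nested rows under the base index value they share, reusing the toy data from the dictionary example. This is not how nested-pandas implements packing: `pack_by_index` and the variable names are hypothetical helpers introduced only for illustration, and no nested-pandas API is used.

```python
from collections import defaultdict


def pack_by_index(base_index, nested_rows):
    """Group nested rows under the base index value they share.

    base_index: iterable of index values present in the base table.
    nested_rows: iterable of (index_value, row_dict) pairs.
    Hypothetical helper for illustration only; not part of nested-pandas.
    """
    groups = defaultdict(list)
    for idx, row in nested_rows:
        groups[idx].append(row)
    # One entry per base row; base rows with no nested rows get an empty list.
    return {idx: groups.get(idx, []) for idx in base_index}


# Same toy data as the dictionary example: index values 0, 1, 2 with three nested rows each.
index = [0, 0, 0, 1, 1, 1, 2, 2, 2]
c = [0, 2, 4, 1, 4, 3, 1, 4, 1]
d = [5, 4, 7, 5, 3, 1, 9, 3, 4]
nested_rows = [(i, {"c": ci, "d": di}) for i, ci, di in zip(index, c, d)]

packed = pack_by_index([0, 1, 2], nested_rows)
for idx, rows in packed.items():
    print(idx, len(rows), rows)
```

The result has one entry per base row, which mirrors the notebook's observation that the packed `NestedFrame` has the same number of rows as the base table while each nested column holds the rows that share that row's index.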