diff --git a/docs/gettingstarted.rst b/docs/gettingstarted.rst
index 8505d8b..8ca08e9 100644
--- a/docs/gettingstarted.rst
+++ b/docs/gettingstarted.rst
@@ -9,4 +9,5 @@ we encourage you to open an issue on the
    :maxdepth: 1
 
    Installing nested-pandas
-   Contribution Guide
\ No newline at end of file
+   Contribution Guide
+   Quickstart Guide
\ No newline at end of file
diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb
new file mode 100644
index 0000000..8f938d8
--- /dev/null
+++ b/docs/gettingstarted/quickstart.ipynb
@@ -0,0 +1,233 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Quickstart"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# % pip install nested-pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Nested-Pandas is tailored towards efficient analysis of nested datasets. Let's load a toy dataset to show how it works."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nested_pandas.datasets import generate_data\n",
+    "\n",
+    "# generate_data creates some toy data\n",
+    "nf = generate_data(10, 100)  # 10 rows, 100 nested rows per row\n",
+    "nf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. In this example, we have the top-level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". The \"nested\" column contains a dataframe in each row. We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf.loc[0][\"nested\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\"). Alternatively, we could inspect the available columns using some custom properties of the `NestedFrame`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Shows which columns have nested data\n",
+    "nf.nested_columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Provides a dictionary of \"base\" (top-level) and nested column labels\n",
+    "nf.all_columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let's look at `query`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Normal queries work as expected, rejecting rows from the dataframe that don't meet the criteria\n",
+    "nf.query(\"a > 0.2\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above query is native Pandas; however, with nested-pandas we can use hierarchical column names to extend `query` to nested layers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Applies the query to \"nested\", filtering based on \"t > 17\"\n",
+    "nf_g = nf.query(\"nested.t > 17.0\")\n",
+    "nf_g"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This query does not affect the rows of the top-level dataframe, but rather applies the query to the \"nested\" dataframes. If we look at one of them, we can see the effect of the query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# All t <= 17.0 have been removed\n",
+    "nf_g.loc[0][\"nested\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A limited set of functions have been extended in this way so far, with the aim being to fully support this hierarchical access where applicable in the Pandas API."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to Pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# use hierarchical column names to access the flux column\n",
+    "# passed as an array to np.mean\n",
+    "nf.reduce(np.mean, \"nested.flux\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This can be used to apply any custom function you need for your analysis. To illustrate that point further, let's define a custom function that just returns its inputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def show_inputs(*args):\n",
+    "    return args"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Applying some inputs via reduce, we see how it sends inputs to a given function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf_inputs = nf.reduce(show_inputs, \"a\", \"nested.band\")\n",
+    "nf_inputs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf_inputs.loc[0]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}