diff --git a/README.md b/README.md
index 59973d0bd..753b70b42 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ The goal is to offer high-level APIs for developers to quickly get started in wo
 ## 📝 Table of Contents
+
 - [About](#about)
 - [Run your first transform](#gettingstarted)
 - [Scaling transforms from laptop to cluster](#laptop_cluster)
@@ -32,6 +33,7 @@ The goal is to offer high-level APIs for developers to quickly get started in wo
 - [Papers and Talks](#talks_papers)
 
 ## 📖 About
+
 Data Prep Kit is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning, RAG or instruction-tuning.
 Data Prep Kit contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case.
 These modules have been tested while producing pre-training datasets for the [Granite open source LLM models](https://huggingface.co/ibm-granite).
@@ -92,6 +94,7 @@ conda create -n data-prep-kit -y python=3.11
 conda activate data-prep-kit
 python --version
 ```
+
 Check if the Python version is 3.11.
 
 If you are using a linux system, install gcc using the below commands:
@@ -100,28 +103,44 @@ If you are using a linux system, install gcc using the below commands:
 conda install gcc_linux-64
 conda install gxx_linux-64
 ```
+
 Next, install the data prep toolkit library. This library installs both the Python and Ray versions of the transforms.
+
 ```bash
-pip3 install data-prep-toolkit-transforms-ray
-pip3 install jupyterlab
+pip3 install data-prep-toolkit-transforms-ray
+pip3 install jupyterlab ipykernel ipywidgets
+## install custom kernel
+python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
 ```
+
 Test your installation. If you are able to import these data-prep-kit libraries successfully in Python, your installation has succeeded.
+
 ```bash
-from data_processing_ray.runtime.ray import RayTransformLauncher
-from data_processing.runtime.pure_python import PythonTransformLauncher
+## start python interpreter
+$ python
+
+# import DPK libraries
+>>> from data_processing_ray.runtime.ray import RayTransformLauncher
+>>> from data_processing.runtime.pure_python import PythonTransformLauncher
 ```
+
+If there are no errors, you are good to go!
+
 ### Run your first transform
-Let's try a simple transform to extract content from PDF files. This [notebook](examples/notebooks/Run_your_first_transform.ipynb) demonstrates how to run a data preparation transformation that extracts content from PDF files using the data-prep-kit, leveraging Ray for parallel execution while still allowing local processing. To run this notebook, launch jupyter from the same virtual environment using the command below.
 
-```bash
-conda install ipykernel
-# Add the Conda environment to Jupyter
-python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
-jupyter-lab
-```
-After opening the jupyter notebook, change the kernel to dataprepkit.
+Let's try a simple transform to extract content from PDF files. We have the following notebooks that demonstrate how to run a data preparation transformation that extracts content from PDF files using the data-prep-kit.
+- Option 1: Pure Python notebook: [examples/notebooks/Run_your_first_transform_python.ipynb](examples/notebooks/Run_your_first_transform_python.ipynb) - the easiest way to get started
+- Option 2: Ray notebook, which uses the Ray framework for parallel execution while still allowing local processing: [examples/notebooks/Run_your_first_transform_ray.ipynb](examples/notebooks/Run_your_first_transform_ray.ipynb)
+
+You can try either one, or both 😄
+
+To run the notebooks, launch jupyter from the same virtual environment you created, using the command below.
+
+`jupyter lab`
+
+After opening the jupyter notebook, change the kernel to `dataprepkit`, so all libraries will be properly loaded.
 
 Explore more examples [here](examples/notebooks).
 
@@ -140,24 +159,28 @@ The annotator design also allows a user to verify the results of the processing
 - **Filter** A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
 A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing.
+
 For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).
 
 One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template to add new simple transforms. Follow the step-by-step [tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) to help you add your own new transform.
 
-For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document] (../../data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms] (../../data-processing-lib/doc/transforms.md) and [runtimes] (../../data-processing-lib/doc/transform-runtimes.md).
+For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](data-processing-lib/doc/transforms.md) and [runtimes](data-processing-lib/doc/transform-runtimes.md).
 
-Additionally, check out our video tutorial (https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.
+Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.
 
 ## 💻 -> 🖥️☁️ From laptop to cluster
 Data-prep-kit provides the flexibility to transition your projects from proof-of-concept (PoC) stage to full-scale production mode, offering all the necessary tools to run your data transformations at high volume. In this section, we show you how to run your transforms at scale and how to automate them.
 
 ### Scaling of Transforms
+
 To enable processing of large data volumes leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) or [Spark](https://spark.apache.org) wrappers are provided, to readily scale out the Python implementations.
+
 A generalized workflow is shown [here](doc/data-processing.md).
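+
+As an illustration, switching runtimes is mostly a matter of swapping the launcher. Below is a minimal sketch for the `pdf2parquet` transform used in the notebooks above, on the Ray runtime; the pure Python runtime follows the same pattern, with module and class names for the Python variant assumed here by analogy:
+
+```python
+from data_processing_ray.runtime.ray import RayTransformLauncher
+from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration
+
+# The launcher reads its parameters from sys.argv; the notebooks show how to
+# populate them programmatically with ParamsUtils.dict_to_req.
+launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
+launcher.launch()
+
+# Pure Python runtime equivalent (assumed module/class naming):
+# from data_processing.runtime.pure_python import PythonTransformLauncher
+# from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration
+# PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration()).launch()
+```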
 ### Automation
+
 The toolkit also supports transform execution automation based on
 [Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP),
 tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. There is an
@@ -179,6 +202,7 @@ You can run transforms via docker image or using virtual environments. This [doc
 ## 🎤 + 📄 Talks and Papers
+
 1. [Granite Code Models: A Family of Open Foundation Models for Code Intelligence](https://arxiv.org/abs/2405.04324)
 2. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)
 3. Talk on "Building Successful LLM Apps: The Power of high quality data" [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
diff --git a/examples/notebooks/.gitignore b/examples/notebooks/.gitignore
new file mode 100644
index 000000000..65e0ef546
--- /dev/null
+++ b/examples/notebooks/.gitignore
@@ -0,0 +1,2 @@
+Input-Test-Data
+Output-Test-Data
\ No newline at end of file
diff --git a/examples/notebooks/Run_your_first_transform_python.ipynb b/examples/notebooks/Run_your_first_transform_python.ipynb
new file mode 100644
index 000000000..361783992
--- /dev/null
+++ b/examples/notebooks/Run_your_first_transform_python.ipynb
@@ -0,0 +1,430 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data Prep Kit - Hello World (Pure Python)\n",
+    "\n",
+    "This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.\n",
+    "\n",
+    "This notebook is a pure Python version; for the Ray version, see this notebook: [Run_your_first_transform_ray.ipynb](Run_your_first_transform_ray.ipynb)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-1: Setting up Python Dev Environment\n",
+    "\n",
+    "Please follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-2: Get Data\n",
+    "\n",
+    "For this example, we will show the PDF processing capabilities of DPK. We will download and use these PDF documents:\n",
+    "\n",
+    "- [IBM Granite model](https://arxiv.org/abs/2405.04324)\n",
+    "- [Attention is all you need](https://arxiv.org/pdf/1706.03762) - seminal paper on transformer/attention architecture\n",
+    "\n",
+    "The code below will download the PDFs. Feel free to try your own PDFs to test it out."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "INPUT_DIR = 'Input-Test-Data'\n",
+    "OUTPUT_DIR = 'Output-Test-Data'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Input-Test-Data/Granite_code_models.pdf (1.27 MB) downloaded successfully.\n",
+      "\n",
+      "Input-Test-Data/attention_is_all_you_need.pdf (2.22 MB) downloaded successfully.\n"
+     ]
+    }
+   ],
+   "source": [
+    "## This cell will download the input files\n",
+    "\n",
+    "import os\n",
+    "import shutil\n",
+    "import requests\n",
+    "from humanfriendly import format_size\n",
+    "\n",
+    "def download_file(url, local_file, chunk_size=1024*1024):\n",
+    "    # Check if the local file already exists\n",
+    "    if os.path.exists(local_file):\n",
+    "        file_size = format_size(os.path.getsize(local_file))\n",
+    "        print(f\"Local file '{local_file}' ({file_size}) already exists. Skipping download.\")\n",
+    "        return\n",
+    "\n",
+    "    # Create the directory if it doesn't exist\n",
+    "    os.makedirs(os.path.dirname(local_file), exist_ok=True)\n",
+    "\n",
+    "    # Stream the file download\n",
+    "    with requests.get(url, stream=True) as r:\n",
+    "        r.raise_for_status()\n",
+    "        with open(local_file, 'wb') as f:\n",
+    "            for chunk in r.iter_content(chunk_size=chunk_size):\n",
+    "                if chunk:  # filter out keep-alive new chunks\n",
+    "                    f.write(chunk)\n",
+    "    print()\n",
+    "    file_size = format_size(os.path.getsize(local_file))\n",
+    "    print(f\"{local_file} ({file_size}) downloaded successfully.\")\n",
+    "## --- end: download_file ------\n",
+    "\n",
+    "## setup input/output directories\n",
+    "shutil.os.makedirs(INPUT_DIR, exist_ok=True)\n",
+    "shutil.rmtree(OUTPUT_DIR, ignore_errors=True)\n",
+    "shutil.os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
+    "\n",
+    "## Download PDF files\n",
+    "download_file(url='https://arxiv.org/pdf/2405.04324', local_file=os.path.join(INPUT_DIR, 'Granite_code_models.pdf'))\n",
+    "download_file(url='https://arxiv.org/pdf/1706.03762', local_file=os.path.join(INPUT_DIR, 'attention_is_all_you_need.pdf'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-3: Extract Text from PDF\n",
+    "\n",
+    "This code sets up a data transformation that extracts text from PDFs. We will save the output in Parquet format."
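+    ,
+    "\n",
+    "\n",
+    "The configuration is assembled as plain Python dicts and handed to the transform through `sys.argv`, since the launcher reads its parameters the same way it would read command-line arguments. A minimal sketch of the pattern (folder names here are placeholders):\n",
+    "\n",
+    "```python\n",
+    "import sys\n",
+    "from data_processing.utils import ParamsUtils\n",
+    "\n",
+    "conf = {\"input_folder\": \"my-input\", \"output_folder\": \"my-output\"}\n",
+    "params = {\"data_local_config\": ParamsUtils.convert_to_ast(conf)}\n",
+    "sys.argv = ParamsUtils.dict_to_req(d=params)  # consumed by the launcher\n",
+    "```"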
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import ast\n",
+    "\n",
+    "# Utilities from the data-prep-kit's data-processing-lib library provide functions and classes for parameter management.\n",
+    "from pdf2parquet_transform import (pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types,)\n",
+    "from data_processing.utils import GB, ParamsUtils\n",
+    "\n",
+    "\n",
+    "ingest_config = {\n",
+    "    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n",
+    "}\n",
+    "\n",
+    "# local_conf: A dictionary specifying the local input and output folders where the PDF files will be read from and the transformed data will be saved.\n",
+    "local_conf = {\n",
+    "    \"input_folder\": INPUT_DIR,\n",
+    "    \"output_folder\": OUTPUT_DIR,\n",
+    "}\n",
+    "\n",
+    "# params: A dictionary containing various runtime parameters for the transformation.\n",
+    "# data_local_config: Configuration for local data access, such as input and output folders, converted into a format compatible with the transformation using ParamsUtils.convert_to_ast.\n",
+    "# data_files_to_use: Specifies that only PDF files (['.pdf']) will be used as input data.\n",
+    "\n",
+    "params = {\n",
+    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
+    "    \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n",
+    "}\n",
+    "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3 - Execute"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now it's time to run the transformation.\n",
+    "\n",
+    "You will notice that the code will download models to execute the transformation. These models will be used to process PDFs."
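+    ,
+    "\n",
+    "\n",
+    "The launch call returns a status code, with `0` indicating success. A minimal sketch of checking it (assuming `launcher` is the transform launcher constructed below):\n",
+    "\n",
+    "```python\n",
+    "return_code = launcher.launch()\n",
+    "assert return_code == 0, \"transform run failed\"  # non-zero means failure\n",
+    "```"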
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "00:21:36 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': False}\n", + "00:21:36 INFO - pipeline id pipeline_id\n", + "00:21:36 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'pure python', 'job id': 'job_id'}\n", + "00:21:36 INFO - code location None\n", + "00:21:36 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data\n", + "00:21:36 INFO - data factory data_ max_files -1, n_sample -1\n", + "00:21:36 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "00:21:36 INFO - orchestrator pdf2parquet started at 2024-09-04 00:21:36\n", + "00:21:36 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}\n", + "00:21:36 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "81aa36c6436e4331bc67bd80c5d72945", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 7 files: 0%| | 0/7 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0attention_is_all_you_need.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1541931a75f9df-e478-43b6-bab6-075e6e4cc52cpdfe8417f232bdadc1760dd998dd64ee650f6140493f1685e...1311732024-09-04T00:22:21.74044710.386293attention_is_all_you_need.pdf
1Granite_code_models.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...2817320aaf7c1f6-592b-4b5e-94d3-d79d887e6be2pdf153d0ed14d3c71894252d0e8584479ec71d793a8d9d7ea...5848262024-09-04T00:22:11.34310324.315634Granite_code_models.pdf
\n", + "" + ], + "text/plain": [ + " filename \\\n", + "0 attention_is_all_you_need.pdf \n", + "1 Granite_code_models.pdf \n", + "\n", + " contents num_pages num_tables \\\n", + "0 {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 15 4 \n", + "1 {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 28 17 \n", + "\n", + " num_doc_elements document_id ext \\\n", + "0 193 1a75f9df-e478-43b6-bab6-075e6e4cc52c pdf \n", + "1 320 aaf7c1f6-592b-4b5e-94d3-d79d887e6be2 pdf \n", + "\n", + " hash size \\\n", + "0 e8417f232bdadc1760dd998dd64ee650f6140493f1685e... 131173 \n", + "1 153d0ed14d3c71894252d0e8584479ec71d793a8d9d7ea... 584826 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-04T00:22:21.740447 10.386293 attention_is_all_you_need.pdf \n", + "1 2024-09-04T00:22:11.343103 24.315634 Granite_code_models.pdf " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df = read_parquet_files_as_df(OUTPUT_DIR)\n", + "output_df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('[\\n'\n", + " ' '\n", + " '\"{\\\\\"_name\\\\\":\\\\\"\\\\\",\\\\\"type\\\\\":\\\\\"pdf-document\\\\\",\\\\\"description\\\\\":{\\\\\"logs\\\\\":[]},\\\\\"file-info\\\\\":{\\\\\"filename\\\\\":\\\\\"attention_is_all_you_need.pdf\\\\\",\\\\\"document-hash\\\\\":\\\\\"bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697\\\\\",\\\\\"#-pages\\\\\":15,\\\\\"page-hashes\\\\\":[{\\\\\"hash\\\\\":\\\\\"8834a09ad99e9297886c9f8ad786c2784b7dc66dc6e6adfeff6bf2c1f07926d6\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":1},{\\\\\"hash\\\\\":\\\\\"72ded7022ad3cbfa9b5c4377a9c9b44511251f9489973956c23d2f3321e6307e\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":2},{\\\\\"hash\\\\\":\\\\\"38733274891513257d051950018621d95f73d05d5c70bfd7331def2f1194973d\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":3},{\\\\\"hash\\\\\":\\\\\"699ed16bf81021d0f86374d05c7b4b2b1049e63a28d2951ec1fb930747d755b9\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":4},{\\\\\"hash\\\\\":\\\\\"a17e6b313bdd51eff07a824253eff394d78ae1d6ebc985de3580bdfece38d2e1\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":5},{\\\\\"hash\\\\\":\\\\\"b3e9b63f2e8728fa83a5b7d911df2827585cf6040d2a4734cb3b44be264da6b6\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":6},{\\\\\"hash\\\\\":\\\\\"7b23bd1c80383b757a39456a4fd95ed2e9aaefd6a04512f181279c27a66c54a4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":7},{\\\\\"hash\\\\\":\\\\\"c1dbbbf5b2ad441bf20149c26fc440a95f714987edf1f39690d703e67699dbc4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":8},{\\\\\"hash\\\\\":\\\\\"ae2f68db548ba7a95d11ba1a1f0e36ca46c7d71d40a27c73dc56cf32932a9638\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":9},{\\\\\"hash\\\\\":\\\\\"58c67da2ef05339a90ab5c62b29eece0f60a7d8bb2f9eb390ba45a6d49e042f5\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":10},{\\\\\"hash\\\\\":\\\\\"b39d3fefe795d9d05aad7432498b5baeee7b255ed2da5bc88f60c051b8fae865\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":11},{\\\\\"hash\\\\\":\\\\\"1e5bf23dc7cd6799ee19684b7152560e0bda459b639ac3ea3f6f3fabf96362a4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":12},{\\\\\"hash\\\\\":\\\\\"0ea258bf68e3835a534da65a943f679a19d02a19ae9b8afe9cd34af2cd29e9c4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\"
,\\\"page\\\":13},{\\\"hash\\\":\\\"34c515052914fa661b7cedf6a25d7d4bd22a4d329971f74b9d1e75b3f6f02d16\\\",\\\"model\\\":\\\"default\\\",\\\"page\\\":14},{\\\"hash\\\":\\\"6ab332fddb6e68c5979d7481a53c9b75d91ac31564cad8972e3fa12cd7362769\\\",\\\"model\\\":\\\"default\\\",\\\"page\\\":15}]},\\\"main-text\\\":[{\\\"text\\\":\\\"arXiv:1706')\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Inspect contents\n",
+    "\n",
+    "import json\n",
+    "import pprint\n",
+    "\n",
+    "column_list = output_df['contents'].tolist()\n",
+    "column_json = json.dumps(column_list, indent=4)\n",
+    "pprint.pprint(column_json[:2000]) # display first few lines"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "data-prep-kit-3-py311",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/examples/notebooks/Run_your_first_transform_ray.ipynb b/examples/notebooks/Run_your_first_transform_ray.ipynb
new file mode 100644
index 000000000..8157d85bc
--- /dev/null
+++ b/examples/notebooks/Run_your_first_transform_ray.ipynb
@@ -0,0 +1,440 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data Prep Kit - Hello World (Ray)\n",
+    "\n",
+    "This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.\n",
+    "\n",
+    "This notebook uses the Ray framework; for the pure Python version, see this notebook: [Run_your_first_transform_python.ipynb](Run_your_first_transform_python.ipynb)\n",
+    "\n",
+    "[Ray](https://docs.ray.io/en/latest/index.html) is a powerful framework that enables parallelization while still allowing you to run it efficiently on a local machine, such as your laptop."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-1: Setting up Python Dev Environment\n",
+    "\n",
+    "Please follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-2: Get Data\n",
+    "\n",
+    "For this example, we will show the PDF processing capabilities of DPK. We will download and use these PDF documents:\n",
+    "\n",
+    "- [IBM Granite model](https://arxiv.org/abs/2405.04324)\n",
+    "- [Attention is all you need](https://arxiv.org/pdf/1706.03762) - seminal paper on transformer/attention architecture\n",
+    "\n",
+    "The code below will download the PDFs. Feel free to try your own PDFs to test it out."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "INPUT_DIR = 'Input-Test-Data'\n",
+    "OUTPUT_DIR = 'Output-Test-Data'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Local file 'Input-Test-Data/Granite_code_models.pdf' (1.27 MB) already exists. Skipping download.\n",
+      "Local file 'Input-Test-Data/attention_is_all_you_need.pdf' (2.22 MB) already exists. Skipping download.\n"
+     ]
+    }
+   ],
+   "source": [
+    "## This cell will download the input files\n",
+    "\n",
+    "import os\n",
+    "import shutil\n",
+    "import requests\n",
+    "from humanfriendly import format_size\n",
+    "\n",
+    "def download_file(url, local_file, chunk_size=1024*1024):\n",
+    "    # Check if the local file already exists\n",
+    "    if os.path.exists(local_file):\n",
+    "        file_size = format_size(os.path.getsize(local_file))\n",
+    "        print(f\"Local file '{local_file}' ({file_size}) already exists. Skipping download.\")\n",
+    "        return\n",
+    "\n",
+    "    # Create the directory if it doesn't exist\n",
+    "    os.makedirs(os.path.dirname(local_file), exist_ok=True)\n",
+    "\n",
+    "    # Stream the file download\n",
+    "    with requests.get(url, stream=True) as r:\n",
+    "        r.raise_for_status()\n",
+    "        with open(local_file, 'wb') as f:\n",
+    "            for chunk in r.iter_content(chunk_size=chunk_size):\n",
+    "                if chunk:  # filter out keep-alive new chunks\n",
+    "                    f.write(chunk)\n",
+    "    print()\n",
+    "    file_size = format_size(os.path.getsize(local_file))\n",
+    "    print(f\"{local_file} ({file_size}) downloaded successfully.\")\n",
+    "## --- end: download_file ------\n",
+    "\n",
+    "## setup input/output directories\n",
+    "shutil.os.makedirs(INPUT_DIR, exist_ok=True)\n",
+    "shutil.rmtree(OUTPUT_DIR, ignore_errors=True)\n",
+    "shutil.os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
+    "\n",
+    "## Download PDF files\n",
+    "download_file(url='https://arxiv.org/pdf/2405.04324', local_file=os.path.join(INPUT_DIR, 'Granite_code_models.pdf'))\n",
+    "download_file(url='https://arxiv.org/pdf/1706.03762', local_file=os.path.join(INPUT_DIR, 'attention_is_all_you_need.pdf'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-3: Extract Text from PDF\n",
+    "\n",
+    "This code sets up a data transformation that extracts text from PDFs. We will save the output in Parquet format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import ast\n",
+    "\n",
+    "# Utilities from the data-prep-kit's data-processing-lib library provide functions and classes for parameter management.\n",
+    "from pdf2parquet_transform import (pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types,)\n",
+    "from data_processing.utils import GB, ParamsUtils\n",
+    "\n",
+    "\n",
+    "ingest_config = {\n",
+    "    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n",
+    "}\n",
+    "\n",
+    "# num_cpus_available: Determines the number of CPUs to use for parallel processing. It's set to one-fourth of the total available CPUs on the machine.\n",
+    "# worker_options: Specifies the resources each worker will use, including the number of CPUs (num_cpus) and the memory (memory), set to 2 gigabytes (using the utility constant GB).\n",
+    "\n",
+    "num_cpus_available = os.cpu_count()/4\n",
+    "worker_options = {\"num_cpus\" : num_cpus_available, \"memory\": 2 * GB}\n",
+    "\n",
+    "# local_conf: A dictionary specifying the local input and output folders where the PDF files will be read from and the transformed data will be saved.\n",
+    "local_conf = {\n",
+    "    \"input_folder\": INPUT_DIR,\n",
+    "    \"output_folder\": OUTPUT_DIR,\n",
+    "}\n",
+    "\n",
+    "# params: A dictionary containing various runtime parameters for the transformation.\n",
+    "# run_locally: A flag indicating that the transformation should run locally.\n",
+    "# data_local_config: Configuration for local data access, such as input and output folders, converted into a format compatible with the transformation using ParamsUtils.convert_to_ast.\n",
+    "# data_files_to_use: Specifies that only PDF files (['.pdf']) will be used as input data.\n",
+    "# runtime_worker_options: Specifies worker configuration options for Ray, using the CPU and memory settings defined above (2 GB of memory per worker).\n",
+    "# runtime_num_workers: Number of workers to be used for the transformation.\n",
+    "# runtime_pipeline_id and runtime_job_id: Identifiers for the pipeline and job, respectively.\n",
+    "\n",
+    "params = {\n",
+    "    \"run_locally\": True,\n",
+    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
+    "    \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n",
+    "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
+    "    \"runtime_num_workers\": 2,\n",
+    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
+    "    \"runtime_job_id\": \"job_id\",\n",
+    "}\n",
+    "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3 - Execute"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now it's time to run the transformation.\n",
+    "\n",
+    "This will launch a *local Ray cluster* to execute our code in parallel (using 2 workers, as configured above). You can view the Ray dashboard at the URL printed below.\n",
+    "\n",
+    "E.g. http://127.0.0.1:8265\n",
+    "\n",
+    "You will notice that the code will download models to execute the transformation. These models will be used to process PDFs."
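+    ,
+    "\n",
+    "\n",
+    "The worker count and per-worker resources come from the `runtime_num_workers` and `runtime_worker_options` values set in the configuration cell above. A minimal sketch of scaling up on a larger machine (names as defined above):\n",
+    "\n",
+    "```python\n",
+    "worker_options = {\"num_cpus\": 2, \"memory\": 2 * GB}  # resources per Ray worker\n",
+    "params[\"runtime_worker_options\"] = ParamsUtils.convert_to_ast(worker_options)\n",
+    "params[\"runtime_num_workers\"] = 4  # run more workers in parallel\n",
+    "```"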
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "00:34:57 INFO - Running locally\n", + "00:34:57 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': False}\n", + "00:34:57 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data\n", + "00:34:57 INFO - data factory data_ max_files -1, n_sample -1\n", + "00:34:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "00:34:57 INFO - pipeline id pipeline_id\n", + "00:34:57 INFO - code location None\n", + "00:34:57 INFO - number of workers 2 worker options {'num_cpus': 4.0, 'memory': 2147483648, 'max_restarts': -1}\n", + "00:34:57 INFO - actor creation delay 0\n", + "00:34:57 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-04 00:34:59,763\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:35:04 INFO - orchestrator started at 2024-09-04 00:35:04\n", + "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:35:04 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}\n", + "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:35:04 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 5.692151642404497, 'object_store': 2.846075820736587}\n", + "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:35:04 INFO - Number of workers - 2 with {'num_cpus': 4.0, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:35:04 INFO - Completed 0 files (0.0%) in 7.74065653483073e-06 min. 
Waiting for completion\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314343)\u001b[0m 00:35:08 INFO - Initializing models\n",
+    "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 102300.10it/s]\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314343)\u001b[0m /home/sujee/apps/anaconda3/envs/data-prep-kit-3-py311/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314343)\u001b[0m   warnings.warn(f\"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}\")\n",
+    "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:36:16 INFO - Completed processing 2 files in 1.2009095589319865 min\n",
+    "\u001b[36m(orchestrate pid=313386)\u001b[0m 00:36:16 INFO - done flushing in 0.0009620189666748047 sec\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314342)\u001b[0m 00:35:09 INFO - Initializing models\n",
+    "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 166818.91it/s]\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314342)\u001b[0m /home/sujee/apps/anaconda3/envs/data-prep-kit-3-py311/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)\n",
+    "\u001b[36m(RayTransformFileProcessor pid=314342)\u001b[0m   warnings.warn(f\"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}\")\n",
+    "00:36:26 INFO - Completed execution in 1.479572606086731 min, execution result 0\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Transformation run completed successfully\n",
+      "CPU times: user 511 ms, sys: 264 ms, total: 775 ms\n",
+      "Wall time: 1min 30s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "\n",
+    "from data_processing_ray.runtime.ray import RayTransformLauncher\n",
+    "from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration\n",
+    "\n",
+    "launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())\n",
+    "return_code = launcher.launch()\n",
+    "\n",
+    "if return_code == 0:\n",
+    "    print(\"✅ Transformation run completed successfully\")\n",
+    "else:\n",
+    "    raise Exception(\"❌ Transformation run failed\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-4: Inspect the generated output\n",
+    "\n",
+    "We will use pandas to read the parquet files and display them.\n",
+    "\n",
+    "You should see one entry per PDF input file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import glob\n",
+    "import pandas as pd\n",
+    "\n",
+    "## Reads parquet files in a folder into a pandas dataframe\n",
+    "def read_parquet_files_as_df(parquet_dir):\n",
+    "    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')\n",
+    "\n",
+    "    # read each parquet file into a DataFrame and store in a list\n",
+    "    dfs = [pd.read_parquet(f) for f in parquet_files]\n",
+    "\n",
+    "    # Concatenate all DataFrames into a single DataFrame\n",
+    "    data_df = pd.concat(dfs, ignore_index=True)\n",
+    "    return data_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0attention_is_all_you_need.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...15419339684ed7-3033-47fb-ae8f-26f470932cd3pdffee309974aabb59c48dbfaeb011ee8a2c78f2e492747b9...1311672024-09-04T00:35:41.54711020.117517attention_is_all_you_need.pdf
1Granite_code_models.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...2817320204e53cb-c39f-4ea6-81c4-177adda365b3pdf254309b69f2010ecff8af0907d3ae643daba6e8dfa1250...5848222024-09-04T00:36:16.45101455.192019Granite_code_models.pdf
\n", + "
" + ], + "text/plain": [ + " filename \\\n", + "0 attention_is_all_you_need.pdf \n", + "1 Granite_code_models.pdf \n", + "\n", + " contents num_pages num_tables \\\n", + "0 {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 15 4 \n", + "1 {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 28 17 \n", + "\n", + " num_doc_elements document_id ext \\\n", + "0 193 39684ed7-3033-47fb-ae8f-26f470932cd3 pdf \n", + "1 320 204e53cb-c39f-4ea6-81c4-177adda365b3 pdf \n", + "\n", + " hash size \\\n", + "0 fee309974aabb59c48dbfaeb011ee8a2c78f2e492747b9... 131167 \n", + "1 254309b69f2010ecff8af0907d3ae643daba6e8dfa1250... 584822 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-04T00:35:41.547110 20.117517 attention_is_all_you_need.pdf \n", + "1 2024-09-04T00:36:16.451014 55.192019 Granite_code_models.pdf " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df = read_parquet_files_as_df(OUTPUT_DIR)\n", + "output_df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('[\\n'\n", + " ' '\n", + " '\"{\\\\\"_name\\\\\":\\\\\"\\\\\",\\\\\"type\\\\\":\\\\\"pdf-document\\\\\",\\\\\"description\\\\\":{\\\\\"logs\\\\\":[]},\\\\\"file-info\\\\\":{\\\\\"filename\\\\\":\\\\\"attention_is_all_you_need.pdf\\\\\",\\\\\"document-hash\\\\\":\\\\\"bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697\\\\\",\\\\\"#-pages\\\\\":15,\\\\\"page-hashes\\\\\":[{\\\\\"hash\\\\\":\\\\\"8834a09ad99e9297886c9f8ad786c2784b7dc66dc6e6adfeff6bf2c1f07926d6\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":1},{\\\\\"hash\\\\\":\\\\\"72ded7022ad3cbfa9b5c4377a9c9b44511251f9489973956c23d2f3321e6307e\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":2},{\\\\\"hash\\\\\":\\\\\"38733274891513257d051950018621d95f73d05d5c70bfd7331def2f1194973d\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":3},{\\\\\"hash\\\\\":\\\\\"699ed16bf81021d0f86374d05c7b4b2b1049e63a28d2951ec1fb930747d755b9\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":4},{\\\\\"hash\\\\\":\\\\\"a17e6b313bdd51eff07a824253eff394d78ae1d6ebc985de3580bdfece38d2e1\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":5},{\\\\\"hash\\\\\":\\\\\"b3e9b63f2e8728fa83a5b7d911df2827585cf6040d2a4734cb3b44be264da6b6\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":6},{\\\\\"hash\\\\\":\\\\\"7b23bd1c80383b757a39456a4fd95ed2e9aaefd6a04512f181279c27a66c54a4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":7},{\\\\\"hash\\\\\":\\\\\"c1dbbbf5b2ad441bf20149c26fc440a95f714987edf1f39690d703e67699dbc4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":8},{\\\\\"hash\\\\\":\\\\\"ae2f68db548ba7a95d11ba1a1f0e36ca46c7d71d40a27c73dc56cf32932a9638\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":9},{\\\\\"hash\\\\\":\\\\\"58c67da2ef05339a90ab5c62b29eece0f60a7d8bb2f9eb390ba45a6d49e042f5\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":10},{\\\\\"hash\\\\\":\\\\\"b39d3fefe795d9d05aad7432498b5baeee7b255ed2da5bc88f60c051b8fae865\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":11},{\\\\\"hash\\\\\":\\\\\"1e5bf23dc7cd6799ee19684b7152560e0bda459b639ac3ea3f6f3fabf96362a4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":12},{\\\\\"hash\\\\\":\\\\\"0ea258bf68e3835a534da65a943f679a19d02a19ae9b8afe9cd34af2cd29e9c4\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"p
age\\\\\":13},{\\\\\"hash\\\\\":\\\\\"34c515052914fa661b7cedf6a25d7d4bd22a4d329971f74b9d1e75b3f6f02d16\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":14},{\\\\\"hash\\\\\":\\\\\"6ab332fddb6e68c5979d7481a53c9b75d91ac31564cad8972e3fa12cd7362769\\\\\",\\\\\"model\\\\\":\\\\\"default\\\\\",\\\\\"page\\\\\":15}]},\\\\\"main-text\\\\\":[{\\\\\"text\\\\\":\\\\\"arXiv:1706')\n" + ] + } + ], + "source": [ + "# Inspect contents\n", + "\n", + "import json\n", + "import pprint\n", + "\n", + "column_list = output_df['contents'].tolist()\n", + "column_json = json.dumps(column_list, indent=4)\n", + "pprint.pprint(column_json[:2000]) # display first few lines" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "data-prep-kit-3-py311", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}