From bd724237c3f9cd89ddb64704d423310d203bcae3 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Tue, 15 Oct 2024 23:19:35 -0700 Subject: [PATCH 01/19] DPK intro example v1 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/README.md | 13 + .../notebooks/intro/dpk_intro_1_python.ipynb | 4204 +++++++++++++++++ .../notebooks/intro/dpk_intro_1_ray.ipynb | 3909 +++++++++++++++ .../data-prep-kit-3-workflow.excalidraw | 2832 +++++++++++ .../intro/images/data-prep-kit-3-workflow.png | Bin 0 -> 101303 bytes .../intro/input/solar-system/earth.md | 17 + .../intro/input/solar-system/earth.pdf | Bin 0 -> 58535 bytes .../intro/input/solar-system/mars.md | 17 + .../intro/input/solar-system/mars.pdf | Bin 0 -> 57872 bytes examples/notebooks/intro/my_utils.py | 55 + 10 files changed, 11047 insertions(+) create mode 100644 examples/notebooks/intro/README.md create mode 100644 examples/notebooks/intro/dpk_intro_1_python.ipynb create mode 100644 examples/notebooks/intro/dpk_intro_1_ray.ipynb create mode 100644 examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw create mode 100644 examples/notebooks/intro/images/data-prep-kit-3-workflow.png create mode 100644 examples/notebooks/intro/input/solar-system/earth.md create mode 100644 examples/notebooks/intro/input/solar-system/earth.pdf create mode 100644 examples/notebooks/intro/input/solar-system/mars.md create mode 100644 examples/notebooks/intro/input/solar-system/mars.pdf create mode 100644 examples/notebooks/intro/my_utils.py diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md new file mode 100644 index 000000000..53d21433c --- /dev/null +++ b/examples/notebooks/intro/README.md @@ -0,0 +1,13 @@ +# Data Prep Kit Introduction + +This is an example featuring some of the features of data prep kit. 
+ +## Running the code + +## Intro + +This notebook will demonstrate processing PDFs + +`PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings` + +[python version](dpk_intro_1_python.ipynb)   |   [ray version](dpk_intro_1_ray.ipynb) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb new file mode 100644 index 000000000..6f4cf757e --- /dev/null +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -0,0 +1,4204 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Python version\n", + "\n", + "This notebook will introduce DPK and showcase some of its capabilities.\n", + "\n", + "Here is the workflow:\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** Easiest option; no setup required. Click this link to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook works in both environments" + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use simple PDFs about the solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "0a38a7b5-238e-433a-c378-78444908aa8a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "9b44b764-d284-4da1-ad55-f08d5c9c0f89" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " !mkdir -p 'input'\n", + " !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + 
" !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "id": "1fcec577" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " deepsearch-toolkit\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so the newly installed libraries are loaded\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "42a9edae-205f-4dce-cd4e-a159bd8f620b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + 
"execution_count": 5, + "id": 
"33345487", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "33345487", + "outputId": "79b40d76-b4dd-48ea-9638-461c78a637a1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", + "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", + "MY_CONFIG.RAY_MEMORY_GB: 2\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + " pass\n", + "\n", + "MY_CONFIG = MyConfig ()\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", + "else:\n", + " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + " \n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "\n", + "## RAY CONFIGURATION\n", + "### For local runs, we can use more parallelism\n", + "### For google colab, be conservative\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.3\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "else: # local run\n", + " num_cpus_available = os.cpu_count()\n", + " # print (num_cpus_available)\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "\n", + "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", + "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", + "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = 
os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "5c305d54-1c91-455d-d0e2-b514b61a068b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This 
step reads the input folder containing all PDF files and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format, which makes them easy to chunk in the later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "90eb1f89-35d1-4b6f-ea34-7667680dd256" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 625, + "referenced_widgets": [ + "8226b2522ce446f6bd3a36c4e227370c", + "7616f1b493e1461c9fd1319fae3bc10b", + "4f63bfad92b64e7bae18e720376d402d", + "6957a659451b46dab702c1c62fa9cdd2", + "2eea7bc810e54eaeb325136352b71e66", + "ebc626c0750c470db6789b26acf15f60", + "3077f04af3a9447ab98717bd3131cd8f", + "709685da1c6c4164bed658357a2191bf", + 
"0a1ed94698ca4e4291c553929e0ca66c", + "5dbc6889a9c243c5a922f8cc5f1a704c", + "d6e520e4da004c818031ccfcc3588e5d" + ] + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "e2c85b44-f605-4817-c120-2cdce79e3c84" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "18:40:02 INFO - pipeline id pipeline_id\n", + "18:40:02 INFO - code location None\n", + "18:40:02 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", + "18:40:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "18:40:02 INFO - orchestrator pdf2parquet started at 2024-09-18 18:40:02\n", + "18:40:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "18:40:02 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6454e0eb538145aebeed98e2ec662b22", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 7 files: 0%| | 0/7 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + "
| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mars.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 | pdf | 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... | 2800 | 2024-09-18T18:40:07.682106 | 0.838944 | mars.pdf |
| 1 | earth.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | e1053a34-3cc1-45c1-abe7-204a240152c0 | pdf | 18713f970989055625bef22209b6f4b6830b9ca22046bf... | 2686 | 2024-09-18T18:40:06.831334 | 0.857239 | earth.pdf |
\n", + "" + ], + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 0 11 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:06.831334 0.857239 earth.pdf " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#output_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : text\n", + "- **document_id**: unique id (UUID) assigned to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time taken to convert this PDF, in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "f70bfa9f-62f8-417d-d91a-30c1f024ccbd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': '· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "300e7688-692a-4039-c4a4-a86887d9138b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we have extracted text from our PDFs. 
But the content of each entire file is still stored as 'one row' in our parquet output.\n", + "\n", + "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", + "\n", + "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "a787385b-214a-41b2-975d-0d3c5529c2c4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5b7b18d5", + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + 
}, + "id": "5b7b18d5", + "outputId": "cb338503-3dca-45bd-a60a-bd214843a97b" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator doc_chunk started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 861 ms, sys: 140 ms, total: 1 s\n", + "Wall time: 1.21 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": 
input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We should see that the documents have been split into many chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 893 + }, + "id": "d8138d43", + "outputId": "0d08e0a6-e743-44d9-b8f1-eec98b222a92" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 15)\n" + ] + }, + { + "data": { + "text/html": [ + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbbox
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...
2mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...
3mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...
6earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...
7earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see that the 2 PDF files are split into 8 chunks (4 per file). The documents are split along 'natural boundaries': paragraphs and bullet points.\n", + "\n", + "See how **document_id** is carried throughout. 
This helps us identify original documents.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "cf9bd956-7b31-42bc-ef77-9ebded8ba08e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "2b48675c-328d-4d24-d689-ad77231ef4b7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "7ad1c60d", + "metadata": {}, + "source": [ + "## Step-5: DOC ID generation of Chunks\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_column** to the name of the column where you want to store it.\n", + "\n", + "**This is a prerequisite for fuzzy dedup** in the pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "1afaa0fd", + "metadata": {}, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "6ffd6f54", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the chunking stage\n", + "# The output of this stage makes it possible for the fdedup component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f78a51b7", + "metadata": {}, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5fc77557", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - 
data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator doc_id started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 19.2 ms, sys: 603 μs, total: 19.8 ms\n", + "Wall time: 16.2 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"chunk_hash\",\n", + " \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9a8c1fa", + "metadata": {}, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "You will notice we have two extra columns:\n", + "\n", + "- **chunk_hash** (the hash column)\n", + "- **chunk_id** (the integer id column)\n", + "\n", + "The number of rows is the same as before." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "da9adede", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 15)\n", + "Output data dimensions (rows x columns)= (8, 17)\n" + ] + }, + { + "data": { + "text/html": [ + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_hashchunk_id
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
3 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", + "metadata": { + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" + }, + "source": [ + "## Step-6: Exact Dedup\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "2a135853-c54f-4aa4-ffc4-83c2bc7a68ce" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { + 
"base_uri": "https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "b9b3de92-4304-4540-dfba-a4549fa157eb" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator ededup started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "18:40:09 INFO - Starting from the beginning\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 15.4 ms, sys: 478 μs, total: 15.9 ms\n", + "Wall time: 12.9 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " 
\"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # ededup parameters\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 358 + }, + "id": "d824ebf6", + "outputId": "14aa660f-6f1a-4f93-9b61-5f8f8adcf3fe" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (7, 18)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_hashchunk_idremoved
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[]
2mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[]
3earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[]
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[]
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[]
6earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[]
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "82cc9bb0", + "outputId": "2aff0a5f-8cc7-408c-e1cf-62c0b14b18fb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "337b015f-3795-4c45-98a3-03ae817d4dca" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 2------\n", + "Basic facts about Mars:\n", + "Ā· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "Ā· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "Ā· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "Remember that we started with 8 chunks. Now we have 7: one duplicate chunk has been removed.\n", + "\n", + "If you look at the PDFs, the following paragraph, which appears in both `earth.pdf` and `mars.pdf`, has been removed from one of the documents (here, `mars.pdf`). Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedup\n", + "\n", + "After exact deduplication, fuzzy deduplication is applied to remove **very similar** chunks.\n", + "\n", + "Fuzzy dedupe is only available in the Ray version, so this step uses the Ray runtime." + ] + }, + { + "cell_type": "markdown", + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "metadata": { + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + }, + "source": [ + "### 7.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "4450ed63-3b09-42e4-8085-2951e700cf8f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'\n" + ] + } + ], + "source": [ + "## Input to this component is the output of the exact dedupe stage.\n", + "\n", + "STAGE = 5\n", + "\n", + "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_fuzzy_dedupe_dir\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "metadata": { + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + }, + "source": [ + "### 7.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "2baa790d-6944-4d20-f0c1-fc2979eb1686" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - Running locally\n", + "18:40:09 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_fuzzy_dedupe_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "18:40:09 INFO - actor creation delay 0\n", + "18:40:09 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:40:11,503\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - orchestrator started at 2024-09-18 18:40:12\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of files is 2, source profile {'max_file_size': 0.009611129760742188, 'min_file_size': 0.009521484375, 'total_file_size': 0.019132614135742188}\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.208082581870258, 'object_store': 4.104041289538145}\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files in 0.014 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files (50.0%) in 0.014 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - Completed processing 2 files in 0.047 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - Done submitting long buckets\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1192188)\u001b[0m 18:40:18 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - Done processing buckets in 0.011 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - creating document snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - Completed processing 2 files in 0.131 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - done flushing in 0.004 sec\n", + "18:40:37 INFO - Completed execution in 0.462 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 457 ms, sys: 296 ms, total: 753 ms\n", + "Wall time: 29.2 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.utils import ParamsUtils\n", + "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "\n", + "# create parameters\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # Orchestration parameters\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # columns used\n", + " \"fdedup_doc_column\": \"contents\",\n", + " \"fdedup_id_column\": \"chunk_id\",\n", + " \"fdedup_cluster_column\": \"chunk_hash\",\n", + " # infrastructure\n", + " \"fdedup_bucket_cpu\": 0.3,\n", + " \"fdedup_doc_cpu\": 0.3,\n", + " \"fdedup_mhash_cpu\": 0.3,\n", + " \"fdedup_num_doc_actors\": 1,\n", + " \"fdedup_num_bucket_actors\": 1,\n", + " \"fdedup_num_minhash_actors\": 1,\n", + " \"fdedup_num_preprocessors\": 1,\n", + " # fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "}\n", + "\n", + "# Pass commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a6f8cd11", + "metadata": { + "id": "a6f8cd11" + }, + "source": [ + "### 7.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e899ad60", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 222 + }, + "id": "e899ad60", + "outputId": "17aaaea8-a106-4c9a-ceb3-6760d92f8b59" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (7, 18)\n", + "Output data dimensions (rows x columns)= (6, 18)\n", + "Duplicate chunks removed by fuzzy-dedupe: 1\n" + ] + 
}, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 earth.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "1 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", + "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", + "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + "\n", + " removed chunk_hash \n", + "0 [] -1 \n", + "1 [] -1 \n", + "2 [] -1 \n", + "3 [] 5 \n", + "4 [] -1 \n", + "5 [] -1 " + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "ab7ea52b", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81 + }, + "id": "ab7ea52b", + "outputId": "8e57385f-c925-4ac7-9e0d-ebc64e92530a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "1 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "2 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "3 earth.pdf Solar System\\nFor more details about our Solar...\n", + "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "5 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "6bdd3515", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "00705442-b6ae-4238-b0f5-c94de690ecb4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 1------\n", + "Basic facts about Mars:\n", + "Ā· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "Ā· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "Ā· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "2b34d9c6", + "metadata": { + "id": "2b34d9c6" + }, + "source": [ + "### 7.4 - Understanding the output\n", + "\n", + "We started with 7 rows and ended up with 6. Fuzzy dedupe removed one of the following two **very similar** chunks (the one from `mars.pdf`).\n", + "\n", + "The two chunks are identical except for the words 'our' and 'the'.\n", + "\n", + "**earth.pdf**\n", + "\n", + "`For more details about *our* Solar system see Chapter 1.`\n", + "\n", + "**mars.pdf**\n", + "\n", + "`For more details about *the* Solar system see Chapter 1.`\n", + "\n", + "Pretty neat, eh? 👍\n", + "\n", + "### Configuring Fuzzy de-dupe\n", + "\n", + "You can tune fuzzy dedupe by adjusting the following parameters:\n", + "\n", + "```python\n", + "# fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "```\n", + "\n", + "In our case, we set the `fdedup_threshold` parameter to 0.7. 
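To make the role of a similarity threshold and a shingle size more concrete, here is a minimal, standalone sketch of the general idea behind fuzzy matching: break each text into overlapping shingles and compare the two sets with Jaccard similarity. This is not DPK's implementation. The helper names (`shingles`, `jaccard`) and the choice of 5-character shingles are assumptions made purely for illustration; judging by its `num_permutations` and minhash settings, DPK works with MinHash-based estimates and may shingle the text differently.

```python
def shingles(text: str, size: int = 5) -> set[str]:
    """Return the set of overlapping character shingles of length `size` (illustrative helper)."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


earth = "For more details about our Solar system see Chapter 1."
mars = "For more details about the Solar system see Chapter 1."
other = "Mars, the fourth planet from the Sun, is a cold, desert world."

threshold = 0.7  # same value as fdedup_threshold above; the comparison itself is only a sketch

for name, text in [("mars", mars), ("other", other)]:
    sim = jaccard(shingles(earth), shingles(text))
    print(f"earth vs {name}: similarity={sim:.2f}, near-duplicate={sim >= threshold}")
```

In a scheme like this, raising the threshold makes matching stricter (fewer chunks count as near-duplicates), while lowering it makes matching more permissive.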
\n" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode the text chunks as embeddings for vector storage." + ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "e1795167-9fac-4b7c-9417-f655c30848a1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "f4c2cba4-aed0-4eee-873b-d1a8abf60cbd" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:39 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "18:40:39
INFO - pipeline id pipeline_id\n", + "18:40:39 INFO - code location None\n", + "18:40:39 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "18:40:39 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:39 INFO - orchestrator text_encoder started at 2024-09-18 18:40:39\n", + "18:40:39 INFO - Number of files is 2, source profile {'max_file_size': 0.009204864501953125, 'min_file_size': 0.009014129638671875, 'total_file_size': 0.018218994140625}\n", + "18:40:41 INFO - Completed 1 files (50.0%) in 0.003 min\n", + "18:40:41 INFO - Completed 2 files (100.0%) in 0.003 min\n", + "18:40:41 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:41 INFO - done flushing in 0.0 sec\n", + "18:40:41 INFO - Completed execution in 0.032 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 816 ms, sys: 204 ms, total: 1.02 s\n", + "Wall time: 2.53 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", + "\n", + "local_conf = {\n", + "    \"input_folder\": input_folder,\n", + "    \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + "    # Data access.
Only required parameters are specified\n", + "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + "    # text_encoder\n", + "    \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + "    raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors (embeddings). We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 205 + }, + "id": "7b1c1d09", + "outputId": "86c49244-9f9f-4116-fb17-c27ff6c29bc7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (6, 18)\n", + "Output data dimensions (rows x columns)= (6, 19)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_idremovedchunk_hashembeddings
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...6[]-1[0.07728295, 0.024970993, -0.043180738, 0.0580...
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7[]-1[0.10598018, 0.025460618, 0.023627337, 0.03905...
2earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...0[]-1[0.0077404436, -0.02055944, 0.026426593, 0.011...
3earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...1[]5[-0.062105548, -0.0053322907, 0.031277698, 0.0...
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...2[]-1[0.072435796, -0.058001805, -0.019771898, -0.0...
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...3[]-1[0.091821924, 0.015197902, 0.07716932, 0.01711...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 earth.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "1 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", + "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", + "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + "\n", + " removed chunk_hash embeddings \n", + "0 [] -1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "1 [] -1 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "2 [] -1 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "3 [] 5 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "4 [] -1 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "5 [] -1 [0.091821924, 0.015197902, 0.07716932, 0.01711... " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "aa667c65-8421-4d4d-f57e-47ccc4ea41ad" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "āœ… Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + 
"source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"āœ… Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "0a1ed94698ca4e4291c553929e0ca66c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "2eea7bc810e54eaeb325136352b71e66": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": 
null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3077f04af3a9447ab98717bd3131cd8f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4f63bfad92b64e7bae18e720376d402d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_709685da1c6c4164bed658357a2191bf", + "max": 7, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_0a1ed94698ca4e4291c553929e0ca66c", + "value": 7 + } + }, + "5dbc6889a9c243c5a922f8cc5f1a704c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6957a659451b46dab702c1c62fa9cdd2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5dbc6889a9c243c5a922f8cc5f1a704c", + "placeholder": "ā€‹", + "style": "IPY_MODEL_d6e520e4da004c818031ccfcc3588e5d", + "value": "ā€‡7/7ā€‡[00:00<00:00,ā€‡221.60it/s]" + } + }, + "709685da1c6c4164bed658357a2191bf": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7616f1b493e1461c9fd1319fae3bc10b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_ebc626c0750c470db6789b26acf15f60", + "placeholder": "ā€‹", + "style": "IPY_MODEL_3077f04af3a9447ab98717bd3131cd8f", + "value": "Fetchingā€‡7ā€‡files:ā€‡100%" + } + }, + "8226b2522ce446f6bd3a36c4e227370c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": 
"HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_7616f1b493e1461c9fd1319fae3bc10b", + "IPY_MODEL_4f63bfad92b64e7bae18e720376d402d", + "IPY_MODEL_6957a659451b46dab702c1c62fa9cdd2" + ], + "layout": "IPY_MODEL_2eea7bc810e54eaeb325136352b71e66" + } + }, + "d6e520e4da004c818031ccfcc3588e5d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "ebc626c0750c470db6789b26acf15f60": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 
4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb new file mode 100644 index 000000000..7ce746c67 --- /dev/null +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -0,0 +1,3909 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Ray Version\n", + "\n", + "This notebook introduces DPK and showcases some of its capabilities.\n", + "\n", + "Here is the workflow:\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** The easiest option; no setup required. Click this link to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", + "- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook works the same way in both environments." + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use simple PDFs about the Solar System.
The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n", + "\n", + "### (Optional) How to create PDFs?\n", + "\n", + "If you would like to experiment with different input files, follow these steps to regenerate the PDFs.\n", + "\n", + "**Option 1 (Easiest): Use a word processor or Google Docs**\n", + "\n", + "Write your content and export it as a PDF.\n", + "\n", + "\n", + "**Option 2: Markdown -> PDF**\n", + "\n", + "First, edit the Markdown files using any text editor.\n", + "\n", + "Then use [pandoc](https://pandoc.org/) to convert them to PDFs.\n", + "\n", + "```bash\n", + "pandoc earth.md -o earth.pdf\n", + "pandoc mars.md -o mars.pdf\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "6fe04a4c-8092-49bb-f4ee-ffdcd42b6c11" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + "    print(\"Running in Colab\")\n", + "    RUNNING_IN_COLAB = True\n", + "else:\n", + "    print(\"NOT in Colab\")\n", + "    RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 
2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "5af8cfbc-346d-41bd-c14e-c917d0f403f3" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + "    !mkdir -p 'input'\n", + "    !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + "    !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + "    !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "93aa2df3-0cf5-4b04-84bb-6803bbf46df6" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + "    !
pip install --default-timeout=100 \\\n", + " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " deepsearch-toolkit" + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so that the newly installed libraries are loaded.\n", + "\n", + "You can do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)." + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "8a316776-582c-4d01-80de-cd530081a080" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + "    print(\"Running in Colab\")\n", + "    RUNNING_IN_COLAB = True\n", + "else:\n", + "    print(\"NOT in Colab\")\n", + "    RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "33345487", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "33345487", + "outputId": "47dca359-2740-493d-83eb-1291617d3db1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", + "MY_CONFIG.RAY_NUM_CPUS: 1\n", + "MY_CONFIG.RAY_MEMORY_GB: 2\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + "    pass\n", + "\n", + "MY_CONFIG = MyConfig 
()\n", + "\n", + "if RUNNING_IN_COLAB:\n", + "    MY_CONFIG.INPUT_DATA_DIR = 'input'\n", + "else:\n", + "    MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "\n", + "## RAY CONFIGURATION\n", + "### For local runs, we can use more parallelism\n", + "### For google colab, be conservative\n", + "\n", + "if RUNNING_IN_COLAB:\n", + "    MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "    MY_CONFIG.RAY_NUM_CPUS = 0.3\n", + "    MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "else: # local run\n", + "    num_cpus_available = os.cpu_count()\n", + "    # print (num_cpus_available)\n", + "    MY_CONFIG.RAY_NUM_CPUS = 1\n", + "    MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "    # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", + "    MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "\n", + "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", + "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", + "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + 
"id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "704d5f45-5d49-43b0-afeb-1dddf2aa326d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + "    raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step reads the input folder containing the PDF files and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format, which makes them easy to chunk in later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + 
"execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "5ef25857-46d4-463e-f847-369d18cb2d8d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "šŸƒšŸ¼ STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", + "print (f\"šŸƒšŸ¼ STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "7a069b9a-1159-4993-d2b0-b26b16235f6b" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:32 INFO - Running locally\n", + "18:49:32 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "18:49:32 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", + "18:49:32 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:49:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "18:49:32 INFO - 
pipeline id pipeline_id\n", + "18:49:32 INFO - code location None\n", + "18:49:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "18:49:32 INFO - actor creation delay 0\n", + "18:49:32 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:49:33,959\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - orchestrator started at 2024-09-18 18:49:37\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.135861206799746, 'object_store': 4.06793060246855}\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m 18:49:40 INFO - Initializing models\n", + "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 167772.16it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - Completed processing 2 files in 0.14 min\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m 18:49:40 INFO - Initializing models\n", + "18:49:56 INFO - Completed execution in 0.4 min, execution result 0\n", + "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 38031.25it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:1 completed successfully\n", + "CPU times: user 4.1 s, sys: 1.17 s, total: 5.27 s\n", + "Wall time: 28.2 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import ast\n", + "import os\n", + "import sys\n", + "\n", + "from pdf2parquet_transform import (\n", + " pdf2parquet_contents_type_cli_param,\n", + " pdf2parquet_contents_types,\n", + ")\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n", + "from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration\n", + "\n", + "from data_processing.utils import GB, ParamsUtils\n", + "\n", + "\n", + "# create parameters\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS, \"memory\": MY_CONFIG.RAY_MEMORY_GB * GB}\n", + "ingest_config = {\n", + " pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n", + "}\n", + "\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + "}\n", + "\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))\n", + "# create launcher\n", + "launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())\n", + "# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "5ca790e0", + "metadata": { + "id": "5ca790e0" + }, + "source": [ + "### 3.3 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "fe59563d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 254 + }, + "id": "fe59563d", + "outputId": "9ba799f3-a183-4467-d50f-44dbbc86d19a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Output dimensions (rows x columns)= (2, 12)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 0 11 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:45.937701 1.966178 earth.pdf " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#parquet_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : text\n", + "- **document_id**: unique id (UUID) assignd to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "e759dddf-64ac-4b55-a9bf-d0722620d6ab" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': '· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "d9eab8cc-79ac-4f5e-99f3-596e357a2e39" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we extracted text from our PDFs. 
But the content of each entire file is still stored as 'one row' in our parquet output.\n", + "\n", + "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", + "\n", + "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "d680cc28-2d3a-4793-9373-c56635a308c9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5b7b18d5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + 
}, + "id": "5b7b18d5", + "outputId": "7151d997-74f1-42fd-90a2-0124c6a68c84" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:58 INFO - Running locally\n", + "18:49:58 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "18:49:58 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "18:49:58 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:49:58 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:49:58 INFO - pipeline id pipeline_id\n", + "18:49:58 INFO - code location None\n", + "18:49:58 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:49:58 INFO - actor creation delay 0\n", + "18:49:58 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:00,178\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - orchestrator started at 2024-09-18 18:50:02\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.085193634033203, 'object_store': 4.042596817016602}\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - Completed processing 2 files in 0.033 min\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - done flushing in 0.001 sec\n", + "18:50:14 INFO - Completed execution in 0.271 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 917 ms, sys: 285 ms, total: 1.2 s\n", + "Wall time: 18.6 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from doc_chunk_transform_ray import DocChunkRayTransformConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We should see that the documents are split into many chunks." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 893 + }, + "id": "d8138d43", + "outputId": "3cbc98f8-1dcb-4a32-9259-f801a83cf241" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 15)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbbox
0mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...
1mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...
2mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...
3mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....
4earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...
5earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...
6earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...
7earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries': paragraphs and bullet points.\n", + "\n", + "See how **document_id** is carried throughout. 
This helps us identify original documents.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "fa82f54b-53a3-4447-a4ca-2fe92dea452a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "87a8d7a0-0bc0-4735-9edb-57e9c9e5a8e1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "20217298", + "metadata": {}, + "source": [ + "## Step-5: DOC ID generation\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_column** to the name of the column where you want to store it.\n", + "\n", + "**This is a prerequisite for fuzzy dedup** in the pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "66811f5b", + "metadata": {}, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "1f747c0d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the chunking component\n", + "# Output of this stage makes it possible for the fuzzy dedup (fdedup) component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "18aa0fe1", + "metadata": {}, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "f6e9e145", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:16 INFO - Running locally\n", + "18:50:16 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "18:50:16 INFO - data factory data_ is using local data access: 
input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "18:50:16 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:16 INFO - pipeline id pipeline_id\n", + "18:50:16 INFO - code location None\n", + "18:50:16 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:16 INFO - actor creation delay 0\n", + "18:50:16 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:17,977\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - orchestrator started at 2024-09-18 18:50:19\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.074102020822465, 'object_store': 4.037051009945571}\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed processing 2 files in 0.013 min\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - done flushing in 0.001 sec\n", + "18:50:29 INFO - Completed execution in 0.231 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 107 ms, sys: 137 ms, total: 244 ms\n", + "Wall time: 15.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"chunk_hash\",\n", + " \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "4954402f", + "metadata": {}, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "The output has two extra columns:\n", + 
"\n", + "- **chunk_hash** (named by `doc_id_hash_column`)\n", + "- **chunk_id** (named by `doc_id_int_column`)\n", + "\n", + "The number of rows is the same as before." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "1911179a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 15)\n", + "Output data dimensions (rows x columns)= (8, 17)\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
7 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "852829dc", + "metadata": {}, + "source": [ + "## Step-6: Exact Dedup\n", + "\n", + "Remove chunks whose **contents** are identical, keeping a single copy of each." + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "7998935d-3f72-4617-ea03-fd2a40ad9f23" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "aa460fea-a393-47d3-b084-59d47f26f0a7" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:31 INFO - Running locally\n", + "18:50:31 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "18:50:31 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "18:50:31 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:31 INFO - pipeline id pipeline_id\n", + "18:50:31 INFO - code location None\n", + "18:50:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:31 INFO - actor creation delay 0\n", + "18:50:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:33,176\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - orchestrator started at 2024-09-18 18:50:34\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.064273834228516, 'object_store': 4.032136917114258}\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - Completed processing 2 files in 0.014 min\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - done flushing in 0.001 sec\n", + "18:50:45 INFO - Completed execution in 0.23 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 99.9 ms, sys: 168 ms, total: 268 ms\n", + "Wall time: 15.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # ededup parameters\n", + " \"ededup_hash_cpu\": 0.5,\n", + " \"ededup_num_hashes\": 2,\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 358 + }, + "id": "d824ebf6", + "outputId": "89f1013d-6dcf-418f-a0d7-5f78b19b74ac" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (7, 18)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nFor more details about our Solar... 
$.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 \n", + "\n", + " removed \n", + "0 [] \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "82cc9bb0", + "outputId": "293489a5-a840-4d5c-fafd-245db30d81c0" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "cf6393e6-c4c7-4606-87e5-892c26b28801" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + "    print ('==========' , f, '===========')\n", + "    chunks = output_df[output_df['filename'] == f]['contents']\n", + "    for idx , chunk in enumerate(chunks):\n", + "        print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "Remember, we had 8 chunks initially. Now we have 7: one duplicate chunk was removed.\n", + "\n", + "If you look at the PDFs, the following paragraph, which appears in both `earth.pdf` and `mars.pdf`, has been removed from one of the documents! Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedup\n", + "\n", + "After exact deduplication, fuzzy deduplication is applied to remove chunks that differ only by **slight variations**, thereby reducing bias in\n", + "the data further.\n", + "\n", + "Small variations are quite common. In code data, for example, they take the form of different variable values, added logging statements, etc." + ] + }, + { + "cell_type": "markdown", + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "metadata": { + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + }, + "source": [ + "### 7.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "4548fff6-f86f-45d4-a812-49aa061fdef2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/03_docid_out' --> output='output/05_fuzzy_dedupe_out'\n" + ] + } + ], + "source": [ + "## Input to this component is the output of the doc_id generator component.\n", + "\n", + "STAGE = 5\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_fuzzy_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "metadata": { + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + }, + "source": [ + "### 7.2 - Execute" + ] + }, + { + "cell_type": 
"code", + "execution_count": 27, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "1164345a-93db-4f8e-ad34-58a1c3d0c116" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:46 INFO - Running locally\n", + "18:50:46 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", + "18:50:46 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "18:50:46 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:46 INFO - pipeline id pipeline_id\n", + "18:50:46 INFO - code location None\n", + "18:50:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:46 INFO - actor creation delay 0\n", + "18:50:46 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:48,381\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - orchestrator started at 2024-09-18 18:50:49\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.067702485248446, 'object_store': 4.033851241692901}\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files in 0.131 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files (50.0%) in 0.131 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - Completed processing 2 files in 0.215 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - Done submitting long buckets\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1219171)\u001b[0m 18:51:05 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - Done processing buckets in 0.011 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - creating document snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - Completed processing 2 files in 0.098 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - done flushing in 0.001 sec\n", + "18:51:22 INFO - Completed execution in 0.592 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 174 ms, sys: 166 ms, total: 341 ms\n", + "Wall time: 36.7 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.utils import ParamsUtils\n", + "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "\n", + "# create parameters\n", + "\n", + "local_conf = {\n", + "    \"input_folder\": input_folder,\n", + "    \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + "    # where to run\n", + "    \"run_locally\": True,\n", + "    # Data access. 
Only required parameters are specified\n", + "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + "    # Orchestration parameters\n", + "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + "    \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + "    # columns used\n", + "    \"fdedup_doc_column\": \"contents\",\n", + "    \"fdedup_id_column\": \"chunk_id\",\n", + "    \"fdedup_cluster_column\": \"chunk_hash\",\n", + "    # infrastructure\n", + "    \"fdedup_bucket_cpu\": 0.3,\n", + "    \"fdedup_doc_cpu\": 0.3,\n", + "    \"fdedup_mhash_cpu\": 0.3,\n", + "    \"fdedup_num_doc_actors\": 1,\n", + "    \"fdedup_num_bucket_actors\": 1,\n", + "    \"fdedup_num_minhash_actors\": 1,\n", + "    \"fdedup_num_preprocessors\": 1,\n", + "    # fuzzy parameters\n", + "    \"fdedup_num_permutations\": 64,\n", + "    \"fdedup_threshold\": 0.7, # (default 0.8)\n", + "    \"fdedup_shingles_size\": 5,\n", + "    \"fdedup_delimiters\": \" \"\n", + "}\n", + "\n", + "# Pass commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + "    raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a6f8cd11", + "metadata": { + "id": "a6f8cd11" + }, + "source": [ + "### 7.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e899ad60", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 222 + }, + "id": "e899ad60", + "outputId": "70d040ab-b1d5-4797-f725-11982ef82413" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (6, 17)\n", + "Duplicate chunks removed by fuzzy-dedupe: 2\n" + ] + 
}, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_idchunk_hash
0mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...04
1mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...15
2mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...2-1
3mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....3-1
4earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...6-1
5earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...7-1
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + "\n", + " chunk_hash \n", + "0 4 \n", + "1 5 \n", + "2 -1 \n", + "3 -1 \n", + "4 -1 \n", + "5 -1 " + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "ab7ea52b", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81 + }, + "id": "ab7ea52b", + "outputId": "13a1847a-bdd1-4dc9-a281-a8faac59c3a8" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfEarth\\nEarth is the third planet from the Sun....
5earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "5 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "6bdd3515", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "5a214fa3-c420-42d7-dcab-574b661e0cd8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 1------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + "    print ('==========' , f, '===========')\n", + "    chunks = output_df[output_df['filename'] == f]['contents']\n", + "    for idx , chunk in enumerate(chunks):\n", + "        print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "2b34d9c6", + "metadata": { + "id": "2b34d9c6" + }, + "source": [ + "### 7.4 - Understanding the output\n", + "\n", + "We started with 8 rows and ended up with 6 (the input to this stage is the doc_id output, not the exact-dedupe output). Fuzzy dedupe removed the exact-duplicate `Solar System` paragraph as well as the following **very similar** chunk.\n", + "\n", + "These chunks are nearly identical; they differ only in the words 'the' and 'our'.\n", + "\n", + "**earth.pdf**\n", + "\n", + "`For more details about *our* Solar system see Chapter 1.`\n", + "\n", + "**mars.pdf**\n", + "\n", + "`For more details about *the* Solar system see Chapter 1.`\n", + "\n", + "Pretty neat, eh? 
šŸ‘\n", + "\n", + "### Configuring Fuzzy de-dupe\n", + "\n", + "You can tweak fuzzy dedupe by tweaking the following parameters\n", + "\n", + "```python\n", + "# fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "```\n", + "\n", + "In our case, we set `fdedup_threshold` parameter to 0.7. \n" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode text for the vector storage." + ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "1c7835d1-1f2c-4545-8533-d9ab7a3ad0aa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "šŸƒšŸ¼ STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"šŸƒšŸ¼ STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + 
"base_uri": "https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "91dd893c-3056-4d2a-bffe-49645e584a12" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:51:23 INFO - Running locally\n", + "18:51:23 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "18:51:23 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "18:51:23 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:51:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:51:23 INFO - pipeline id pipeline_id\n", + "18:51:23 INFO - code location None\n", + "18:51:23 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:51:23 INFO - actor creation delay 0\n", + "18:51:23 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:51:25,784\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - orchestrator started at 2024-09-18 18:51:28\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of files is 2, source profile {'max_file_size': 0.008937835693359375, 'min_file_size': 0.00830841064453125, 'total_file_size': 0.017246246337890625}\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.01370926015079, 'object_store': 4.0068546291440725}\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:33 INFO - Completed processing 2 files in 0.084 min\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:34 INFO - done flushing in 0.001 sec\n", + "18:51:44 INFO - Completed execution in 0.334 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 611 ms, sys: 194 ms, total: 805 ms\n", + "Wall time: 22.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from text_encoder_transform_ray import TextEncoderRayTransformConfiguration\n", + "\n", + "local_conf = {\n", + "    \"input_folder\": input_folder,\n", + "    \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + "    # where to run\n", + "    \"run_locally\": True,\n", + "    # Data access. 
Only required parameters are specified\n", + "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + "    # orchestrator\n", + "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + "    \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + "    # text_encoder\n", + "    \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())\n", + "# Launch the ray actor(s) to process the input\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + "    raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 205 + }, + "id": "7b1c1d09", + "outputId": "9e695b9d-f196-4cb7-c56f-3789251e7860" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (6, 17)\n", + "Output data dimensions (rows x columns)= (6, 18)\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n", + " <th>filename</th>\n", + " <th>num_pages</th>\n", + " <th>num_tables</th>\n", + " <th>num_doc_elements</th>\n", + " <th>document_id</th>\n", + " <th>ext</th>\n", + " <th>hash</th>\n", + " <th>size</th>\n", + " <th>date_acquired</th>\n", + " <th>pdf_convert_time</th>\n", + " <th>source_filename</th>\n", + " <th>contents</th>\n", + " <th>doc_jsonpath</th>\n", + " <th>page_number</th>\n", + " <th>bbox</th>\n", + " <th>chunk_id</th>\n", + " <th>chunk_hash</th>\n", + " <th>embeddings</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nOur solar system is a vast and f...</td>\n", + " <td>$.main-text[2]</td>\n", + " <td>1</td>\n", + " <td>[132.84518433, 588.96014404, 479.40917969, 623...</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>[0.0077404897, -0.020559434, 0.026426662, 0.01...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nFor more details about the Solar...</td>\n", + " <td>$.main-text[3]</td>\n", + " <td>1</td>\n", + " <td>[133.18510437, 570.83258057, 374.99838257, 581...</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>[-0.051861413, 0.0035226392, 0.030617053, 0.04...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Mars\\nMars, the fourth planet from the Sun, is...</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.87440491, 500.84011841, 477.48345947, 534...</td>\n", + " <td>2</td>\n", + " <td>-1</td>\n", + " <td>[0.07728298, 0.024971062, -0.04318075, 0.05809...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Basic facts about Mars:\\n· Distance from the S...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.2026062, 482.90710449, 237.04431152, 493....</td>\n", + " <td>3</td>\n", + " <td>-1</td>\n", + " <td>[0.1059802, 0.025460616, 0.02362733, 0.0390564...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nEarth is the third planet from the Sun....</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.91053772, 512.46295166, 477.84887695, 534...</td>\n", + " <td>6</td>\n", + " <td>-1</td>\n", + " <td>[0.0724358, -0.058001805, -0.01977186, -0.0243...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nBasic facts about Earth:\\n· Distance fr...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.30151367, 494.86206055, 240.17156982, 505...</td>\n", + " <td>7</td>\n", + " <td>-1</td>\n", + " <td>[0.091821924, 0.015197907, 0.07716932, 0.01711...</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\n· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + "\n", + " chunk_hash embeddings \n", + "0 4 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", + "1 5 [-0.051861413, 0.0035226392, 0.030617053, 0.04... \n", + "2 -1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", + "3 -1 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", + "4 -1 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", + "5 -1 [0.091821924, 0.015197907, 0.07716932, 0.01711... " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "e6a04d78-b8e9-431a-e9f5-1f9ad1aee3a7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + "source": [ + "import shutil\n", + 
"\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0a6728", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw b/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw new file mode 100644 index 000000000..c0525c556 --- /dev/null +++ b/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw @@ -0,0 +1,2832 @@ +{ + "type": "excalidraw", + "version": 2, + "source": "https://excalidraw.com", + "elements": [ + { + "type": "image", + "version": 128, + "versionNonce": 146671843, + "index": "b45", + "isDeleted": false, + "id": "nQdFTOsh8Rjwn3poFcnOO", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 258.1818181818182, + "y": 213.63636363636363, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 222183398, + "groupIds": [ + "4aSnKsxGoqeqA7eYu4s2e" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726186954844, + "link": null, + "locked": false, + "status": "saved", + "fileId": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "scale": [ + 1, + 1 + ] + }, + { + "type": 
"image", + "version": 240, + "versionNonce": 2054222979, + "index": "b46", + "isDeleted": false, + "id": "hlPJZs7lUbLYhuRbSmYHs", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 260.90909090909093, + "y": 285.4545454545455, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 961787386, + "groupIds": [ + "4aSnKsxGoqeqA7eYu4s2e" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726186954844, + "link": null, + "locked": false, + "status": "saved", + "fileId": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "scale": [ + 1, + 1 + ] + }, + { + "type": "arrow", + "version": 2550, + "versionNonce": 1240871476, + "index": "b47", + "isDeleted": false, + "id": "FVhCmDYbWjGck9rgcESwp", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 823.5583207607388, + "y": 273.73602641681657, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "width": 154.2895204048931, + "height": 2.3372664247598323, + "seed": 1954615226, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726708776348, + "link": null, + "locked": false, + "startBinding": { + "elementId": "Wxv71stEiYRpNjyhzzXgO", + "focus": 1.202109076005182, + "gap": 9.103775306193256, + "fixedPoint": null + }, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 154.2895204048931, + 2.3372664247598323 + ] + ] + }, + { + "type": "text", + "version": 324, + "versionNonce": 1281521869, + "index": "b4M", + "isDeleted": false, + "id": "zSJvmm-7DrsR5-qRb96Kl", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": 
"solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 595.4118679291607, + "y": 242.27481706603328, + "strokeColor": "#1e1e1e", + "backgroundColor": "#ffc9c9", + "width": 141.51840079198635, + "height": 59.453152259008114, + "seed": 409665722, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + }, + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + } + ], + "updated": 1726186894805, + "link": null, + "locked": false, + "fontSize": 23.781260903603247, + "fontFamily": 1, + "text": "2. split into\nchunks", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2. split into\nchunks", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 848, + "versionNonce": 138401069, + "index": "b4N", + "isDeleted": false, + "id": "JMprrs8mNVD4CrqUlVm7i", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 329.1268602850381, + "y": 278.24885892455757, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 185.2530890548909, + "height": 2.823455039174007, + "seed": 1319994682, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726186962183, + "link": null, + "locked": false, + "startBinding": { + "elementId": "hlPJZs7lUbLYhuRbSmYHs", + "focus": -1.189794049219074, + "gap": 7.205686529987929, + "fixedPoint": null + }, + "endBinding": { + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": 1.1403432588201572, + "gap": 6.460959750980123, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 185.2530890548909, + -2.823455039174007 + ] + ] + }, + { + "type": "text", + "version": 757, + "versionNonce": 361576332, + "index": "b4O", + "isDeleted": false, + "id": "G0k27V_VE7lyh7YGr_fts", + 
"fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1128.9917648038, + "y": 212.9780740734803, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 110.85037231445312, + "height": 58.225670034857664, + "seed": 970452474, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + } + ], + "updated": 1726708803406, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "4. dedupe\n(exact)", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "4. dedupe\n(exact)", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 598, + "versionNonce": 1689279715, + "index": "b4g", + "isDeleted": false, + "id": "XUbC5cWQCm-GEFrdqZW7g", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 333.94038113680745, + "y": 243.15978750685963, + "strokeColor": "#1e1e1e", + "backgroundColor": "#ffc9c9", + "width": 173.54608154296875, + "height": 28.457738187179977, + "seed": 1458850132, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726187078639, + "link": null, + "locked": false, + "fontSize": 22.766190549743982, + "fontFamily": 1, + "text": "1. extract text", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1. 
extract text", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "image", + "version": 145, + "versionNonce": 1461008621, + "index": "b4h", + "isDeleted": false, + "id": "XH-Rt0Q5-K2g4tM9reh76", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 520.8409090909091, + "y": 209.88636363636368, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 1159948140, + "groupIds": [ + "KKvJ56bTHwzAbN8YXYU0-" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726186894805, + "link": null, + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ] + }, + { + "type": "image", + "version": 193, + "versionNonce": 1127846733, + "index": "b4i", + "isDeleted": false, + "id": "YFlD_rDw6IwCctPG9BjYf", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 520.8409090909091, + "y": 279.8863636363637, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 1369151980, + "groupIds": [ + "KKvJ56bTHwzAbN8YXYU0-" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726186894805, + "link": null, + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ] + }, + { + "type": "arrow", + "version": 753, + "versionNonce": 1832909987, + "index": "b4j", + "isDeleted": false, + "id": "0wYqjwjKHCGbx7CfmDR__", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 587.6995151292258, + "y": 276.08728311464677, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", 
+ "width": 160.10395921482052, + "height": 0.6238794650969908, + "seed": 1397245780, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726186894829, + "link": null, + "locked": false, + "startBinding": { + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": -1.1101505124640194, + "gap": 3.799080521716917, + "fixedPoint": null + }, + "endBinding": { + "elementId": "zSJvmm-7DrsR5-qRb96Kl", + "focus": -0.1259939432648205, + "gap": 10.873205622899263, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 160.10395921482052, + -0.6238794650969908 + ] + ] + }, + { + "type": "text", + "version": 19, + "versionNonce": 1725165603, + "index": "b4t", + "isDeleted": false, + "id": "56KAsZE3Fub50OzL9XJ35", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 344.7055268721148, + "y": 290.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 137.6798553466797, + "height": 25, + "seed": 961622755, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726187031887, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "(pdf2parquet)", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "(pdf2parquet)", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 89, + "versionNonce": 1217800429, + "index": "b4u", + "isDeleted": false, + "id": "GEwyTqhl4LrSwcaOeKRT5", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 514.7055268721148, + "y": 356.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 74.97993469238281, + "height": 50, + "seed": 31755757, + "groupIds": [], + "frameId": null, 
+ "roundness": null, + "boundElements": [], + "updated": 1726187172155, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "parquet\nfiles", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "parquet\nfiles", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 273, + "versionNonce": 821721012, + "index": "b5F", + "isDeleted": false, + "id": "ZGkHBN9UBrJLYPIlm-KTj", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1355.555487199263, + "y": 305.51136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 118.5198974609375, + "height": 50, + "seed": 1591407981, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708923087, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "duplicate 'B'\nis removed", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "duplicate 'B'\nis removed", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 747, + "versionNonce": 104645940, + "index": "b5G", + "isDeleted": false, + "id": "DolT9H5aqzEugA7sUfNlx", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 827.643003983931, + "y": 226.3985286189349, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 166.41502380371094, + "height": 29.112835017428832, + "seed": 466678605, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708795102, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "3. document id", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3. 
document id", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 1071, + "versionNonce": 474965812, + "index": "b5U", + "isDeleted": false, + "id": "cXhTkxU13WdQeAv3Z_1mR", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1318.993474938044, + "y": 401.3233033689122, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 0.8539592148204065, + "height": 113.62612053490295, + "seed": 605419139, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726709016812, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 0.8539592148204065, + 113.62612053490295 + ] + ] + }, + { + "type": "text", + "version": 976, + "versionNonce": 988237964, + "index": "b5V", + "isDeleted": false, + "id": "Ba_pxAykcwH_ZsTbAtduc", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1218.815207047896, + "y": 429.9549461276493, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 184.07017517089844, + "height": 29.112835017428832, + "seed": 1665190893, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726709020882, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "5. fuzzy dedupe", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5. 
fuzzy dedupe", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 580, + "versionNonce": 693951668, + "index": "b5h", + "isDeleted": false, + "id": "XFHbtP2KmiHNNjZhz8ajW", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1299.1022727272725, + "y": 517.40625, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 410701101, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "OdGsWefGyr6uqMl0wC6mH" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 323, + "versionNonce": 1216816692, + "index": "b5i", + "isDeleted": false, + "id": "OdGsWefGyr6uqMl0wC6mH", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1315.9786418568, + "y": 522.40625, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 593665933, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "XFHbtP2KmiHNNjZhz8ajW", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 573, + "versionNonce": 1856782260, + "index": "b5j", + "isDeleted": false, + "id": "NzWqph0M7tEkeTDKLPGZR", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1301.1931818181815, + "y": 564.5880681818182, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + 
"height": 35, + "seed": 2053187053, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "K1QK2dyVWiWfd32P8ovQK" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 264, + "versionNonce": 334637364, + "index": "b5k", + "isDeleted": false, + "id": "K1QK2dyVWiWfd32P8ovQK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1317.219552473588, + "y": 569.5880681818182, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1350557773, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "NzWqph0M7tEkeTDKLPGZR", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 680, + "versionNonce": 1002365620, + "index": "b5l", + "isDeleted": false, + "id": "Lf5-FqrnO7iDVhOKUtEnT", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1306.9204545454545, + "y": 619.3267045454547, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 999837357, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "cTJ-8HZCMcNbXqDHggxAH" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 375, + "versionNonce": 213412916, + "index": "b5m", + "isDeleted": false, + "id": "cTJ-8HZCMcNbXqDHggxAH", + "fillStyle": "solid", + 
"strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1324.2668248956852, + "y": 624.3267045454547, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1515450637, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Lf5-FqrnO7iDVhOKUtEnT", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 141, + "versionNonce": 1757726132, + "index": "b5n", + "isDeleted": false, + "id": "LK6nmMo09HhGvAeViRfcK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1274.397727272727, + "y": 523.3664772727274, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 975980397, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 196, + "versionNonce": 761917108, + "index": "b5o", + "isDeleted": false, + "id": "LbPBuhQ2btuEnjbeSDvuK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1278.397727272727, + "y": 569.6164772727275, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 2104152525, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + 
"boundElements": [], + "updated": 1726708993287, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 385, + "versionNonce": 800257204, + "index": "b5p", + "isDeleted": false, + "id": "tEnh5H4Dm1tA62FJY7ZnT", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1279.647727272727, + "y": 629.6164772727275, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1129349773, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726709003336, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 307, + "versionNonce": 51819060, + "index": "b5q", + "isDeleted": false, + "id": "TExMhRi4612k0BcybcpHE", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1251.2855058149858, + "y": 678.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 143.59986877441406, + "height": 50, + "seed": 2082336653, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "near duplicate \nA' is removed", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "near duplicate \nA' is removed", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 1039, + 
"versionNonce": 199529869, + "index": "b5r", + "isDeleted": false, + "id": "KvvwHoDnDT0vBh2bOfiTz", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1245.243474938044, + "y": 579.5733033689121, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 192.8960407851796, + "height": 1.126120534903066, + "seed": 1004556899, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188444758, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + -192.8960407851796, + 1.126120534903066 + ] + ] + }, + { + "type": "text", + "version": 989, + "versionNonce": 923042467, + "index": "b5s", + "isDeleted": false, + "id": "cPSHqIr9Peb5h5TNxl3Bb", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1100.5103669600053, + "y": 536.2049461276495, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 138.99639892578125, + "height": 29.112835017428832, + "seed": 865272429, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726188447614, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "6. vectorize", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "6. 
vectorize", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "diamond", + "version": 103, + "versionNonce": 679668419, + "index": "b5vV", + "isDeleted": false, + "id": "tPvUjMUp7lW3F8V3H2MGV", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 960.0454545454546, + "y": 515.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 782762477, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "diamond", + "version": 117, + "versionNonce": 224511779, + "index": "b5w", + "isDeleted": false, + "id": "uOIVUAj_hGKNiZ3NnQm2n", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 961.9204545454546, + "y": 564.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 1245990083, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "diamond", + "version": 122, + "versionNonce": 1205596301, + "index": "b5x", + "isDeleted": false, + "id": "ylh6O0GmjhRAHndHyuEo2", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 966.9204545454546, + "y": 615.7613636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 499397773, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 260, + "versionNonce": 1136192621, 
+ "index": "b5y", + "isDeleted": false, + "id": "ekXIjXxtZ6f2w_A-9CVUV", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 938.2855058149859, + "y": 670.7613636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 107.5399169921875, + "height": 25, + "seed": 1616985635, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726188507123, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "embeddings", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "embeddings", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 381, + "versionNonce": 1618061620, + "index": "b5z", + "isDeleted": false, + "id": "Uv-8TiLeECJuuNx1yJjtv", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 768.5454545454545, + "y": 280.72727272727275, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 637818278, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "type": "text", + "id": "B8Nj-HzRDl-FA-5UJ2hiw" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 140, + "versionNonce": 1472181260, + "index": "b60", + "isDeleted": false, + "id": "B8Nj-HzRDl-FA-5UJ2hiw", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 783.2418233698064, + "y": 285.72727272727275, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + "height": 25, + "seed": 1971906541, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + 
"frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Uv-8TiLeECJuuNx1yJjtv", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 391, + "versionNonce": 1280205492, + "index": "b61", + "isDeleted": false, + "id": "l7XMM15Xwzq5xmDF0QvyN", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 764.090909090909, + "y": 186.09090909090912, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1556091898, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "SZp9x_uNQ-65LQPMQ768C" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 132, + "versionNonce": 809849484, + "index": "b62", + "isDeleted": false, + "id": "SZp9x_uNQ-65LQPMQ768C", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 780.9672782204367, + "y": 191.09090909090912, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 912377443, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "l7XMM15Xwzq5xmDF0QvyN", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 413, + "versionNonce": 1599597620, + "index": "b63", + "isDeleted": false, + 
"id": "Wxv71stEiYRpNjyhzzXgO", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 767.1818181818182, + "y": 234.27272727272725, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 775085434, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + }, + { + "type": "text", + "id": "zyU1230-bmsHaQTSoi7Ov" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 102, + "versionNonce": 1402151180, + "index": "b64", + "isDeleted": false, + "id": "zyU1230-bmsHaQTSoi7Ov", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 783.2081888372248, + "y": 239.27272727272725, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1842733667, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Wxv71stEiYRpNjyhzzXgO", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 397, + "versionNonce": 997475764, + "index": "b65", + "isDeleted": false, + "id": "IkaeA2i4mlTdmulYEI_na", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 771.3636363636363, + "y": 325.3636363636364, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1839286010, + 
"groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "IgKDOIQhfqb_x9gQh30eh" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 89, + "versionNonce": 421732236, + "index": "b66", + "isDeleted": false, + "id": "IgKDOIQhfqb_x9gQh30eh", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 787.3900070190429, + "y": 330.3636363636364, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1893385699, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "IkaeA2i4mlTdmulYEI_na", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 440, + "versionNonce": 1439264564, + "index": "b67", + "isDeleted": false, + "id": "qGfihx9_lQSyc1F8oQTu0", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 772.909090909091, + "y": 369.01136363636374, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1381062179, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "0DIl-np94wHje4sIubFJp" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 133, + "versionNonce": 1496272396, + "index": "b68", + "isDeleted": false, + "id": "0DIl-np94wHje4sIubFJp", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", 
+ "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 790.2554612593218, + "y": 374.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1722325443, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "qGfihx9_lQSyc1F8oQTu0", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 70, + "versionNonce": 247294132, + "index": "b69", + "isDeleted": false, + "id": "lkM4ke2d8E4KSisX5yE08", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 762.5454545454546, + "y": 429.51136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 64.55995178222656, + "height": 25, + "seed": 1905848653, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "chunks", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "chunks", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 527, + "versionNonce": 1269467404, + "index": "b698", + "isDeleted": false, + "id": "JNHVvikjirDDllCKotbJC", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1025.9545454545455, + "y": 275.68750000000006, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 848769955, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + 
"boundElements": [ + { + "type": "text", + "id": "8Msc7tXcZdg2UUH2NmUn-" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 287, + "versionNonce": 1779271564, + "index": "b69G", + "isDeleted": false, + "id": "8Msc7tXcZdg2UUH2NmUn-", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1040.6509142788973, + "y": 280.68750000000006, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + "height": 25, + "seed": 1297532739, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "JNHVvikjirDDllCKotbJC", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 565, + "versionNonce": 1888269836, + "index": "b69O", + "isDeleted": false, + "id": "fkbHGW5tJ-Ay0sh8h-9hJ", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1022.5, + "y": 182.05113636363643, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 2116216547, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "BNiP4zX7PtFTn_e_5vXX3" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 308, + "versionNonce": 1814172812, + "index": "b69V", + "isDeleted": false, + "id": "BNiP4zX7PtFTn_e_5vXX3", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1039.3763691295276, + "y": 
187.05113636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 1804210819, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "fkbHGW5tJ-Ay0sh8h-9hJ", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 558, + "versionNonce": 981967628, + "index": "b69d", + "isDeleted": false, + "id": "QYKbNgibs7-HxaNNr8tfG", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1024.590909090909, + "y": 229.23295454545456, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1716177443, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "C-rwFmAbwI_qgVqpkXy7m" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 249, + "versionNonce": 1916232076, + "index": "b69l", + "isDeleted": false, + "id": "C-rwFmAbwI_qgVqpkXy7m", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1040.6172797463155, + "y": 234.23295454545456, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 592678339, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": 
"QYKbNgibs7-HxaNNr8tfG", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 653, + "versionNonce": 1248546828, + "index": "b69t", + "isDeleted": false, + "id": "m2Wj9fp76PKCAhrulCmTa", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1027.318181818182, + "y": 365.97159090909105, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 901963107, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "MNgTOO1UYazXucNSjXZ_z" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 348, + "versionNonce": 52260492, + "index": "b6A", + "isDeleted": false, + "id": "MNgTOO1UYazXucNSjXZ_z", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1044.6645521684127, + "y": 370.97159090909105, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1223112963, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "m2Wj9fp76PKCAhrulCmTa", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 127, + "versionNonce": 1292352780, + "index": "b6AG", + "isDeleted": false, + "id": "J1KVE_C00rdGo7FWIwu1X", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 998.7954545454545, + "y": 188.01136363636374, + "strokeColor": "#e03131", + 
"backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1442121325, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 181, + "versionNonce": 832846732, + "index": "b6AV", + "isDeleted": false, + "id": "TIEDsM4QhNNDJARAJnvDz", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1001.7954545454545, + "y": 234.26136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 846611715, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 229, + "versionNonce": 2066541068, + "index": "b6Al", + "isDeleted": false, + "id": "tGvqUuD_kCzfMYn-UX8o-", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1004.2954545454545, + "y": 283.01136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 758667053, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + 
"originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 360, + "versionNonce": 479971468, + "index": "b6B", + "isDeleted": false, + "id": "IQM8OVr381UGBDKQtda8U", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1004.0454545454545, + "y": 371.26136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 618433805, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 611, + "versionNonce": 430626572, + "index": "b6BV", + "isDeleted": false, + "id": "fJGd6Pf-SaTmbDMUGHhUW", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1028.3972327492456, + "y": 322.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1491526540, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "Ax-8fSsrXvrkMhlGAgJgO" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 302, + "versionNonce": 1859392908, + "index": "b6C", + "isDeleted": false, + "id": "Ax-8fSsrXvrkMhlGAgJgO", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1044.423603404652, + "y": 327.2812500000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + 
"seed": 1943704076, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "fJGd6Pf-SaTmbDMUGHhUW", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 259, + "versionNonce": 2035385356, + "index": "b6CV", + "isDeleted": false, + "id": "07qZABiLS71UbigBsFpnK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1002.0335963856091, + "y": 327.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1965424820, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "4", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "4", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 2600, + "versionNonce": 1259679372, + "index": "b6D", + "isDeleted": false, + "id": "M_WCuesgPRdSQ_zqaUtz0", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1113.5321305627851, + "y": 279.97561555378826, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "width": 154.2895204048931, + "height": 2.3372664247598323, + "seed": 1489010356, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726708895234, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], 
+ [ + 154.2895204048931, + 2.3372664247598323 + ] + ] + }, + { + "type": "text", + "version": 176, + "versionNonce": 14571020, + "index": "b6E", + "isDeleted": false, + "id": "wkavhEPwz2TNGwf8xFeLA", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1263.0335963856091, + "y": 188.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 809955212, + "groupIds": [ + "uHtPh4-PiLJtgc-p_Cdgo" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708942969, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 538, + "versionNonce": 1071049484, + "index": "b6F", + "isDeleted": false, + "id": "Qaz1byDgzm-0ZrVLBmU4v", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1288.9545454545455, + "y": 273.1875000000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 144156909, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "D2HbgzHXdGyxGppwaWbBy" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 296, + "versionNonce": 2108300212, + "index": "b6G", + "isDeleted": false, + "id": "D2HbgzHXdGyxGppwaWbBy", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1303.6509142788973, + "y": 278.1875000000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + 
"height": 25, + "seed": 2062418765, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Qaz1byDgzm-0ZrVLBmU4v", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 569, + "versionNonce": 509454732, + "index": "b6H", + "isDeleted": false, + "id": "-LxVJeZLqj0MgI5FEg_pm", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1281.5, + "y": 179.55113636363643, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1514803629, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "trFDjiJr6cfNlCSEKqNjE" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 311, + "versionNonce": 1054115124, + "index": "b6I", + "isDeleted": false, + "id": "trFDjiJr6cfNlCSEKqNjE", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1298.3763691295276, + "y": 184.55113636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 1674925069, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "-LxVJeZLqj0MgI5FEg_pm", + "originalText": "A", + 
"autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 566, + "versionNonce": 713594892, + "index": "b6J", + "isDeleted": false, + "id": "Kxu9owye4gMpRvh7kJ1Nl", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1287.590909090909, + "y": 226.73295454545456, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1938377325, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "UP92rSYiIXnnBFhov6WNx" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 256, + "versionNonce": 301317812, + "index": "b6K", + "isDeleted": false, + "id": "UP92rSYiIXnnBFhov6WNx", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1303.6172797463157, + "y": 231.73295454545456, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 707753165, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Kxu9owye4gMpRvh7kJ1Nl", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 593, + "versionNonce": 5355148, + "index": "b6L", + "isDeleted": false, + "id": "KMOsOR4pOx-ute2ztnw1k", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1293.318181818182, + "y": 361.4715909090911, + "strokeColor": "#e03131", + 
"backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 635317229, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "SsRO-f6mzQzf5jQOudz6C" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 287, + "versionNonce": 800311348, + "index": "b6M", + "isDeleted": false, + "id": "SsRO-f6mzQzf5jQOudz6C", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1310.6645521684127, + "y": 366.4715909090911, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1382819405, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "KMOsOR4pOx-ute2ztnw1k", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 206, + "versionNonce": 745735436, + "index": "b6N", + "isDeleted": false, + "id": "US1PK13ekocRlMvOrHSJL", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1265.0335963856091, + "y": 231.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1525760780, + "groupIds": [ + "bQ__H1TgpJXskAm32UBLZ", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + 
"originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 241, + "versionNonce": 1274323380, + "index": "b6O", + "isDeleted": false, + "id": "NxUqy-MsYDga_9XDrU9l7", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1267.5335963856091, + "y": 277.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1116920372, + "groupIds": [ + "4mN8vM1PMjtKHfzWdqXES", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 240, + "versionNonce": 342262668, + "index": "b6P", + "isDeleted": false, + "id": "lSEPKkiY8if2M9pDun8DS", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1270.5335963856091, + "y": 370.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 932194828, + "groupIds": [ + "Z8bVLPerSCYHViV4Ld1Ed", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + } + ], + "appState": { + "gridSize": 20, + "gridStep": 5, + "gridModeEnabled": false, + "viewBackgroundColor": "#ffffff" + }, + "files": { + "83ba3062a1490699e3ccc129acb25b1f4ec5534d": { + "mimeType": "image/png", + "id": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "dataURL": 
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABd1JREFUeF7tm39sE2UYx79vu61b2cbKVjackBHX6TpNFIzxDzaBIBInEHQQgxoCqIygyIBoDGZSRRNRcIHAnPIjDpBfhgTkp6DRiIAYfyErsk62yToDY+vItv6+vebewmyX9u5Kr9263P3V3Pvc93neT5973ifv3RGEeTSPeUCXoOEWApgJwABgJAASpoyguTotzZNQaFyQtWv7Tjl1g2mFFbjVYJwP4CMAI6IZmGp4OjTjxrup0/5cZu32L6PpSzKAVkPRuxT0rWgGc1ubB5A0fhwopU5qt5fpd35+JFp+JQFozS98gRJSG60g+uveBsCfp4AdDufMrNptp6LhXxTAdaMx1eNBA4DsaAQQTNMfAIPQS3uos7tUv2PH93LHIAqgJb9wMSFks9yOhfT6A2C2FJ2U630ia+sn5+WMRRRAq8F4hAJPyulUTCsoAN9S0+7h6OTsLdUXxDSkjosCsBqMFgD5UgXlsAsFgCUCRZuGcBPTamrMcviSAqA92sueUBEMOklKrV41SrKrq69ECkEcQIGxAxS6SB2Fc71QBvjpXFV5SbFu66bmcLT728YzABBK/+a0mmJ9VdW/dwohrgGwSRNyKXFYckn6unU37gRC/APwVcaLKq2mWFdV1RkuhKEBwLc6/EIzUifp167tCgfCkAHAJq0ip+2Ge6aOXr7cIRXC0ALAZwJRfZdpLJhGli51SYEw5AD4CqPq2IicrBlk9WqvGIShCYAxUB3STSp5msyZwwlBGJQAiFaLxAJ+synCg/N+qj+wf1HcAYhw2v9fTmDLrTcL7l4NygxQAMhFQMkA5RZQaoBSBJVVQFkGlT5A6QQFCCidoFjTZR2AXWGxmCSPK52g0gkqnaDSCcaiE8z44H0kT54UUJuo2wWusQmOYyfQ88Uetm+tnV2G9NdXBNr19MBjscB54iQcB78C9Xj6xjPeMyF56tSQNc956ht0vinw0kqsiqBuw8dImRY60J5du3HTtAbDnp+L4ZWrQk7Ia2lAe/kScFdbmI1u/YdIeSr0k3nH0eOwLQsEGiA+EABcp8+As1qhHpUDTfEE/tEV+/evTZmG5MdK+gB4LtbBY2lAwpjRSBr3kM8OgLepGW3TZ4G6XAEAXGfPgfvnasD83BfrYN+7P/SqOBAAOpa8BudJ3+s8mVtqoCmZwH7z59XZI/sAdG3cjK6Nm9hY4n33YsRn1VBn+97C6aw0wb5nXwAA27KVcBw9JrkFYIYDCUCdk43M2m1IyMtjsbTPfwkJY/OCAuDHtXPKkLHGxGx5gDww/1vAeepbeK809gFw/3GhD3RIKgMBIFgwvTYbrk18HNpnZoUEwGeB/tABdrmnzoy2WbMFa0DP7r24+fY7whkxGAD0dnXB9moFXGfOBhRB/1uA3Qb3F0F/YB+bkPvCn7hR9mx8AuDT19vYDOp2w9vUBNcPP4LPAP7wXwX6A0hd9CLSV1QwO8ehw7CtfCO+a0Cw3AwKQK1GyvRSZJgqQVJS2GW2pRVwHP96aAPg2trQ22GDetQoqNLT+ni5zp1H+7wFbOn0L4JxtwqIZUCwcb5r7FxVCdrdzYbjBkDaK4uR9PB4FnTXhk1w//pb0OqcPGUyqwP+B3U44bVY4DhxklV//yO1/GVoHn2EnequroHrp5+Fq37/0VitAuFFFUNrBYCyIaJsiCgbIrHYEIlhWQvPlVIE5SiCBmPMX5cP728WtO7ItZgzhSzEnwwZjPW3vg+UMa6YSf2VazEXRgSg1VB0mIKWxixkWR3Rw7mWS9MjAtBiMJYToFrWuGIkRoHyuy3mmogAsM/mvLCAIidGccvl5nqSisvXX74s+Pa4aA3go2k1FM2loLvkiiw2OmRurqVut5gvSQB4EWuB0QSKSjHBQTFOYMqtN6+WEotkALxYS0HRPELp+
lh/RSZlIrds2gmlFXc1XNoh9ZqwAPCijXkPZiQmuRcSihkgKABln9SGrSM1QBE7CoJroKinBAc97qRtY5t+D+uzmf8A6hsfbisiXOQAAAAASUVORK5CYII=", + "created": 1711006482453, + "lastRetrieved": 1726708752969 + }, + "fffa228d79e3bc7053142e0031890d5aaf369b8a": { + "mimeType": "image/png", + "id": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "dataURL": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABGRJREFUeF7tm0tsE1cUhv8Zu3FiJ3HihJCkpbxKQI55ryq6aKXQFigPBYG6aQAJEbHgKZAQCxYIlQ2gQJFSEK2gRRUtKuUhNSAC2YDEgleIbdpCpEhGEEIgNsGOnYxn0L0G14HAzI3H44w9d+N5nHt8z3fPPef4eoZDljcuy+2HAYDVA7zetgUAdwjAh6x9hyN/+Lemqw27tn42nL5K+jB7gMfj9nEcPlKiXA2Z9Tt+hGvK2JRBYAbg9bolNQxTqoMAIG2ac8K1fTs3f6q0n1I53QBIFQRdAUgFBN0BUBuCLgGoCUG3ANSCoGsAakDQPYBkIWQEgGQgZAyA4UIY8QCUVnSv5ZxOF5NNTMLkS7QuhQ0AjAQMDzCWgBEDmOIak7ARBI0sYKTBIeuA/T+fRXvHQ8aElZz4pPGVWLdq0VtK0pIGsx5AcnOpbu+0eIC6JiSnzQBgVIJpqASzPghmPYDkwpa6vY0gaATBNARBdZ04OW3GEkjFEgjU1izgAc2eCknOB0hv7gHPod7258W/5XQp2hDprZ3rAyTNngqRG7TC+76CU80fy8kqBFCj6VMhcoNWer/gVLOsfbIC5Mt6aw0Ash4g8SaEps8GzOYhJ4gP9iLvrhuhqTPBCQPI87YNkutzToVkMsPqvo2+KS6I+QVDT7QgwNp6A5wYlXUETT0g/MlkPFm59r2Dsl84h8BXCwFJwujGfch56KPy/ZVj8HjtZoDjUNT0F/xfL6HH72qjjjYi9/6/IwuAUORA94p6iCYzYLEgasunAzQH/JCiUZgiYRSfPoGu+k2QeB6WjnaUHfmBynStXofIuIngRBGjG/ei55ulEArs4EwmCPYiKmMKvgAiEfBRAaXHDsHsfzayACSOJjh9Np4t+45eqmj4Huburvht/7zF6J3zBT0vOXGUfj79diX9LLzSAvv5M3FZobQMjzZup+eOk7/C1npD1uhEAU2XgFIAUo4FnRu2QbAXw9QTm8VosQPmQA/KG3aDG+jPbADEur4qJ7rr1gyazdLjR5D3j3vQtYz0gNcWEtcmBtI40d1Fl8qbLWMBhKbNwtPldYPsLfnjF1jv3Mx8DxBt+ehcv41miQ86Y3+mDJRXgg8GUXFgN3gS7V+1jPQAkh1IliCt7KeDgCjSNEjyvq31Ohwnj2cugMRiydp2CyW/H4ulweV1IMuCNJLj8+7dpce69YCQa0Y8t1fs2RkvWp5//iUCNfNpqisn9UHAH0uDhXaa70mKtF9qQmHLhRiAIgcebdlBj0nNQMpklpa2OoBUepGJVeDDYeT4OuJjJtf7XDPAh4JvlbLhCVUQrVZYva2A+P9Pj/4x4yDm5sLS/h+tFFla2gCwDDKVsioCyPINkeDSufNFiWyJ6WZXyMeLYr3t9OUmOQ9TtCGSqMR4UFLjl6bkZvDN+ynfFtf6tTlGAD6n0yW7EZqok3kJeDyeeYB0WMt3BxVC8EkSt6a6uvq8QnkqxgyARbkeZA0AepilVI7xJZGIcV/ibAoaAAAAAElFTkSuQmCC", + "created": 1721376622438, + "lastRetrieved": 
1726708752969 + } + } +} \ No newline at end of file diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.png b/examples/notebooks/intro/images/data-prep-kit-3-workflow.png new file mode 100644 index 0000000000000000000000000000000000000000..851adbfebc0511560625bf5e7afd35dfc9ef9d1c GIT binary patch literal 101303 zcmd43WmuH`w>AujFiJ>E=g=iccf-&rEeZojmwTS z=gAVSYeBL0v+w5*a!ysiXd=wynYWX& z-d^DQ_291d_rroQI1~EruCELVv%Y{nh4{m*-gXwEv;Ge+s6#?79j)v#op@FC*vnxt3P zcCneZq@<9AhiVvZtCML^^r39$o6quAxgJdu>l=mp>+?f}?sP>0ORZrHc#M-je^ao9dgwL`_gGBUvkM`=4QVD^1O-bCS#+ zk6_GZqaTFB!MRE=BOInjy*a)RJ28A!FMk&jQrMBHV|QOyyk`p%C18Si$pzF@*+qL@ zu+dLo3(5p#qJwPMi`?k^l`OCOgv>juPsT<(R&C{4NhP*?`Lflc;`oZvP64vtcxdwKB`Nx zv&k!y5kf%F;WYZxT8jb6WnVJf&pej46s|i8J0tvL)7sIQL=atLXuZw=qMw}KuI``M zS7!&!G%A>P7@N`)&SILpMU3_-tP*Vfg^)51VBnmq%YHLKt*L>YrdpDsVCAtGMexk4 zlxfqw(bUWq3%!l6x06ldJ$?C1)C9M93Y4SR4Zpr%kB%dP{3?zsP|5g!hx@@|Oo6nq z-?j{H6(?vvXZ=r9Q2zl*q^ulB`a1>U=!W4kGT-;$(;pPXe>63sc#`A#ixk!Qt+&SW z_S^AB^`tm0+#g8`Vhj`&Q4t2|#h_qL6sCz8Vx{Kafk$>wdUhofgu)gkuAp<= z^auJ0T8Cjo?>~nrp^ob>)7|@oHu~N2GsbMA9l4F@-rhts!iMUEi%cbto754^O$KSb z)a&(h2N_c{`ukw~sGHEzU#%pVqHl1ENdwN_clT6V#Yf1ZaS<~2y{2{jd!*Sg8G)au z9GMU>;TFn{a#g0Qv^ks#j{@vi|0PG_?cJHy_OFt7@|iqY32UWvK`lIvCK>+UD1k*p znvdye`#kr*bF?|=_fo)f_FNlXtA#onJsN4Y1OzUfayp*p`S>TM_q;-LDGKt`X5-+2 z+m46VG6q2$#!6=s^iT*atY^oZW!3(z-aL!1}M!S9B$=__1|?uA`w)u=vkJds3f@9GS)AnnlWSdNr#dwXSgdA<|5}UPcj5o77XmtX>o8tU;R~g z=XJ5u1YiHz$3l?yHhUJay)9q&{2{N#TD%Qde=0hQ*(AVlA zmm4ai;-1s(H-iXSiT2mX7x|l0Aad*YgN&4bcQ70me!JkJSoGBA3=x?ixTu!*^`TzC58#FAGAA|$o) zqljrBX5U$=4+g0JaMhs%tS%T`S?=krEZK8e!6!v`m}0M0_|*7H14}d)Gc(!0EIWlk zw>DDwM4fK-ouAHrx~sDo0q_5mr>}Iej4s>ZO!D+i1Q5d;dKu4uVTF1v?pLV6E5=cx zPMwZ>mJ(t2!QL;~2ja!d%y2&I1dTi#R)`92u?lFt%7|Gw3_Edk&vs(}&6qf+Yplq) zB?vPWxC-aB$^!J)dDFXO{GdEJ+&BcY7TD?N9sL9fueyP%4Bc-9l|+gl^r0mzGG+0! 
z1=nPr=6lcBYt+=Fo9kUTw?&lv_Z#OQU-fIDE8~UEreWA!t?zbnB%Oa+eYw)qzH+%3 z!NB8aRbOMgjx9=3fBG15^m+e;+@v#VionYXLq^E-TQY6%^NWK-v+FZlUq4>6&;lNt zO?MxXY##<$K|#f$tLOqE90+LL=eC+=yr5ApkBkSlnn7-Lw^OFM^+Vd{Ak=9;jhD3; zYVc!{31$_~sI&^B0e;Q<>7NMRr^w%v*lQ}(sF+9lnY_e!DEO2~J&EIoq38EIW#g8T zIG87UCt;NE(f#=$ddm!&7sFZXS_iGKU!~jH%`09btmIOQ1{4BceKLRw0_AD~L+T%p zSiAvxVXA^pupGQIgv+6WcMaTA95GtgUat`D=`K}CQ)v|UfnJoe6e z(n2JbEVBnjJA(>4LtZN0MlcR?97Jp64=-owU5DeY{=IXDP;&Rgcee>K&@sx7c3i%l zO-yB`Qciit?XmQ*JoDxCLA#-6hVy4#^Crv9y77{YJ*K7Q7dofKt!JW(&mvFg!*}w!;op5s0&ymqy zmLE;?3dC=}uD6~ik$?SI(!*o7<3csqkJ!q>_?&L3KE(tWMC8EJbVMfS!=-(M)wqoF zve#ITj;)w)>*V3mV*V&vymHIk0OQ6YA?tcBnLR9>vr*#h^h0*yL&?C^CgM6pSO0g{8nR<}e@1~Dd;5YX-9ZBZG@sL?KeRWp_M?%XoZdUOCy$5jdL9k!AaF1btD7x6c$B zaJgLO)X>nV`5p8E+Vgs>Un>Y*X-rG)+>dy!K5=BRCH=>Xoy~3W3YAF5U$M_y9M+pfp zeN9c;=~xQZ?1!VE*1tcbxj*b^L&9T|$BI-i6hxDhF;L@hU}%q(GaOf8ciH5m<6ejw zYG55p37>3f=xMNd340r?7!B|6Z7UGHD$osT790P?ADVpo z)}IbMF6|F*$1O*-xkntpbx3MB!1LfIoDu)~JZN-CsYLUoiC`e?c?337{4YLUwqV$5_H;08Ir8;i^n?fffgoyW-v9+os>`H_)|uyOY5`Kb@s+F3_4)`Ip&7Xr{A1N#)&vlI*{C!mM_u zmBzKyrh0OJiZP?;(87K-RIOLNWw&d;R?J&nD-&t2c@mx~_ruArA7QPLUr-2{>yHDH zE$_9bvx*G0%P+Qp6ka6*>sp#~|F3A$Gfdp-C(g6svy z`zv^2GNX;}jmqO6=+UqND&TO_p_t%LHF|RUEZOaC#iFSm46U1WqJXYtKYrxi!mmHP zO&WdM2rMm?7M#8(vP|UV7p_d{D&4&vJ0tCsfLTrD$~jGz(3S8nmqXL0=_(l~c~BKk zPR*x-UMMqo&s(w{j=#eNEn}9U2CnbQfKNGZ zy>_*^D=uQ%E{y(ph*zM5IICnMJXnfVkNh_Od#S=dVgNLDHoiRYEt?qM8sk3);|8&F zsLE^O79WH5u2wN-1)<>y0^|obq=fVlJq&x(V>WCkl0`a^q~<1Q9SK00tar7kx~X7f zLg9THkhj>faTv+Thw3FuwA@rnlf}xNP?-uoYuv324>muwYp@DE3@2~uw?P1%8NbDx z=|BEKodYU}C1)0lB!Z0FH9&&aVouBU9DJ{2?TPtS9ZzlbI@%H9M-qa^;dJXM^>!P@ z(%5nH49c)Nzk6o!(R9WL`_(U>3JX?Y4E@ zN32``01>0IrMmqXPQ{G~FV&!RoFCg<#LvxRjsg z9Ks6Eea=d(sX45@&cF;zmjg8b$7a3ey>qkrQKrjOHq~rR1Hk90F3D+*e_5PS zkVFxo_jw}t1u>4(H`qw+yKXd~@3vn@bgqOWJp$A3NY73eDl zdul3B+uYwh$9&H$2+&Li4aB!Xe5KV8l92}k}n@tZ(33D2H-mlD#+dT5&rLn>87On(yweKaEp96_J$Lhy^%V-W7J-p0)&$Y< z1SH)>6_|dgpD5Qk48T`Fu|E!@WO3S`g8(6^Cql*+kkn!tazqkow`ibw!K~d)Wo-9e 
zMS}VR)WGRY{T@xsz?Mn4+0Yy2r8YzGx5%xOzG!-d7Ln8>E!LZJDs9!Z0YpHGMN<~3=Clkwth|BzIl#C>JF2A=wgq1l2@1ky`P zvx}miG=;-uOic^IU41_y1l$gIIw}Om$A{&=$57bcND30Bp1O}_mrSHe)PfDhi#13& zx!5X`=BC1PU{+p49gMSDMM^(MGl2tO(lj!<#(x*7Y70IFTCwx0HiI2EA&zf_HckMO zg!2KlrnVN$>uRGA&NFwvujbc3bm+}N5i6w3vHL8Sh^A#P0PXFN7HX*kwck%#NVbrU zW?xJ+Q@Qh-3iz32whphNOmrF_-vTa^%i+!s$yID&HgCc*C#Tg9OAC$eQU%NJj$FaWi_S)qf5&L3)uN4IPdR6c zkvMp(2=g57cqMeV@ZhKXQ!FNSJs{uQ%2TtlPQa{kx%&|wRo>v^ zW7mzwd~97QcI|IAqPSuHW{W})zja*5-t6n!L?+cgML=S?62GM$sI}Y;7Fw+o0(_G0 zw`v6ZM`5Ig*r(llE90-Wewhz$m$Rkn8Gfg;ltYKN;`gw?k#zcpn?sdq#e$H-sgKjf z?gQx#?aawW&JoI4B7z^VuyjNvcm0XVfyl!RlK=?z+e>lSVt1(0K$a*CG75La!L#p? zGO`fue7W(XA~JA&8;hD0O8c8U&|iNE*ElO4ooBpMXv%6OI-qrt^8{%7`J0~N32w1# zh39;Bn$wjAu8G!`(h>d|dFA?aO^2;!le95*^F7UWO*Zo7T74aZF&NrZ>P6tQ9dpxy zV|{iJ+&@3_!^#aCR5==dqj+3ze+l~4uRev3!2Zzr=qgSi)F|a3iTR(qc+A%}O8pW{ zbQ8t6QWfu|tsHk%G`EhrrGKGSC;(?HBA+8Pz(YSNLd7UrlSB)Uy;l57A561?TUx%* z2^(r2woOC#O>tFzIPn765X6o`*yJSm(7^&PD?MC14?pn{*ZCw8V8 zUxATa={S6_U92TPE*JHx5s#DJV>BbYOAp{70Uk{v2lvxf$4;k4z8oZ5%m?}-NAiZy zZzbgA;0wS-?trrSzB}}3^Y!zgd4yH`$Njk}Z@<0j?}K@=k%5}-#XHXw*XkZIYq%aN zid@k?34s%eCfJNn*7zd-n~ko-r=Q_fZ~Il z-{GQ-QNJlZ{z!+}W#U#w+fa|fT$2{L*M-7nxxQRtpjWcR(PWuEU^;7I6o2z3Gu2zCFEf2FzFbf9j1Y?F@F1`M#IoSF@Ot}K6;PXozgYIQk=>yWn_w; zmLf!!r8dO{5vSI#Ex!ymTPuR0XyB$2g$!BNTEGR4+#;-h(iYf_|JTErqIQKS=ZBC= zt5j(}J|6WHbtSoBy*3Fe^NgNBG6#sqNO(=y9osF@p#8rp*!$RTFHO8@rcM0X2nY!W z3Lx?8<4T;1RU>7(RY^k}Ey~yezmfFYy_;m@<-^KNINQ$Pm=r>jlk&>7B(E({3l9Kq zPUy4Aee~w)M{iaVzT#n?YZESz(3UAml>>rJc?fFt%F!R7T1dy;tjvKce^UlH(oQ6( zPn2on9#JB1=_y>)mZ4PS2`G-%34Ryu(y8q_*W(KG2W#NN;!?!$2%U@rrB$H$gBkw@>a zee@1H@#;tdON$VKXn@Z32`0tAKyzQ<_^Cn~hOje81rB5{R-!Qu@ZoKp(tj90CX~2v zB(Cu+uLD8Bm0RS4$CMR8FBz zHu3dP-+2LKB&$}U$GhQ1N{lc};1tWsD=8~}=4o{Cm*KZ1u6|MofLd!5lof>{=G-^< zm2CS#|6mLC79479169?S;S52oQSk@whRlcUFWJgbh+h?kL!Sv*1F>S*U#f=H8x;iI3OY8E(%vR2exB&=b~ACwsQ; zi_EHZ^0Bk4G=JQt1=49|hU+i=M50sR;-Lg)E$CFKjwX<5D1_a3tiP#d3V9f|QC2-Z zioZlOJtix}HE_$F>jkGhTD42|<7 z?!Ktd+ZUSuAoIsiY{n5d`yu^Qv&7?cvo9pRgN)|`wb$hVA*ghJz9D@t6{ZJRw1Dkx 
z>kOC~gdlwRa%gY1b_1w(yc}N;VFFggKl-A{7;7y?xJg7`vgv&_nd)(AIsavd6|rdz zrJ%nl<~7;ryo=X!)-~~|6Z!i)$iAf=I)G<2RhmfCUcTt}aL14uROWiPWXzzJ#rEvk zvl}1z2f0{EyS|=T>X$T|M)=aGp5(N-B3%**A>n^_tHxzyZjPJ__S37r_bAv7B^544 zMDyymP@nF?|Iq>zs#3&%>c3U$7MJN#pHAcZh0To>VxSz=L+%aZXW=sQ)z*gi!&R(uhCuZKm-zkPd?SyIL;xT%LLd-sXg`kCY)#>C zwz!Gd!>voc>CH+QRn$#s)p^d0G`ky6lG5E8=YV^!&zQv~T8yM6GGDgeo+xcEwz$vc zf^ENvy6w-^0XP$|s>|juv+m@>-SR_XoGUp-6J)mdGa_e$f!g4AxBr)9e7%1h=@eIy~m`MeOCSpM_>X?dP) zwoaz*?B(g^NPHZfJZTpO%J!U16)($cg|8C;m(ZAEVPP@Y@o?vm|F-sxo*vUyhD%Ju z#;;sy^a*qjJLdQx-1iKFFopA6;+tfGxus=#4?0-Fk=Mwis(FVcA~rU8+4p9|JKAuQ zwW?LqjVX90Gq&s2GFXiQkP1xv+cEuld%d9xAVL@RDa)Prc1cY?u|9H8o0Iq_P=OetzL77ck#$ zZy>EpO9N6zFgCzUNrlz@DgYvt7autRmu*koc{o#8S6!ueiX3&x>rix!CDe;Nm9)E0 zJ#9}bt0&vLCVZmkW9mE+htYg}Vw2O)3))HM;x~VdHEAEgZh9}rIz&T5|E1YpV2j7; zgO5o^dk+Jn(q3*@(yx;A6-pC494Q>>M&mq0^XaJYQ19#0#vDa#aMH}O-@|@VdCl>- z0EHB0VDyX?anIaLwb_WJw%JyudltF(QD=?o?i+lUDDGH+Rgm<%_wAF?^;!ySd;1N` zNoz5}_8nPz?xSzpJfA@@U(KBiadxDPeaZ4HrD3ZOz5lx=XD8n(bfB}>N9K^ACU({1 zEoUX~m&Rjdx?q#-du{btTff6zhB8Oh7Nq||L1_w9uGLE_SfT- zi1#IO0y%W6+c8AwPfn1vy`|0jv4iGa4v*6jew8Cv)7LsJ=H`&T;Afg&ecSYT45iOI z$>%b6ldeWtG%s*1l9*f$)Kc@H%i?xdm5K>9Ld=mKEfWi(D}flO&T5Oz7af|mp+oO* z1%=%Y5=JwHlX2xok7pcL)=A|mo>n8gF74MWp8f23pe_OJsj>09mkRas8!--O@7Rbk zt=s;7C{#)w1Oomob^#U{A8{}Fey4eoSM12=i~afZb>x{G$)Kp465g9zS&w?(JI}SQ zh%W51VotC&^}O{&kxJQ}8@9}o695XFt#s30RknZ=m!(6A$5|Glwji6Uy>f&vCz|UB z2#FNO|NLC}rEI4bOs6h9z++n6+ozbRnpw~pJuL5hm=1(a_J^PzoccZ+>OOveY%Gn?t55D32ljCgZ)j@EBiUbK#xKYDqRy&1g-+ZH_=ei{dtEDUKz4= zyyF4trT(-CI0haTVVX*@M7~c2x+&&UeE=s%Sjrie#}x#iK6}K{SJQS4iwwn+Y)+Q3 z!dL6@(cB|qWPJ7uOqN+5MOUPks&LkvH*}p{E@PsWrZWM8GKPwZp9P8FF+DiEy?j)D ztu#NVOIPRWFfe2dWM@vi6}(O$`58mTEpJv;MKnm!SKbygo0Z)iQ3e<&6_FI1nt~4s z@ab2sZDjNETIJQo9*4nIkIXG=#B3xne~6DLh}5(fMK?MaiD(&|%kKTOj7s zrRjqr=GMvPx~q8NXQpjc-!Wdc*?w~n51{Pzd6t36&E48cgOkYo1KP9nJz$0Kj5-91WjsI;V zN3%lv!~M;jr_Uhsmz0rbW&N}u@y@@`7GgE`KeJZ88q?6u6)`O1Hrb(3C0BgC6DZ-r zYqU=DQ=?e27}u;4XrmY`NI$_)11_3LcLe*T`z0c2(m4*>Z|wDo4a>U5vk$RSFRr$L 
z*_vEpr~xk?<|^mPf(yO-h9YEdT5Yk5?)?wif@oZoA`)JGBsi!C%$vBrF+>aPN2Gq1 zSbyZd)?+g3jc{LqpyoQns0(N`TVHGZ*XQd!o@S`nj6oPie0G1m%{<%%W&*`b@c4L@ zs4OBVZDd58-F*E3&}i6kkLtH<7o|s)<~9#|XhhuLVhBAR+YUvzCedxcD^uQF3(5Bv zKpXOmD_(>vJVearIz?JdXdCQtKBX%v$jGlCSbcFC}J33`Jjra9TH2!z<5^u?vp(YJHLn%H$zn8n)nSpKnl02nW~qC3v6 z3*V2)Va&=o&7+P-2-jWzvTs|i1m}ul z2G(D-TLz*ZJu-7O=8^j3wM-25B$9(|Tnd=l>}(4pR0`F7MpZh*J@Y?urAIV+RJ++L zn%R8%ail-vlQC;XZE0QKUaj_z66t)=5w(Uj;8-(Ns28scQnK6teNGfDCLD`4+LBF% z5c;dqxCLN26GIoepYpzbtwqH7*4F6Yj*Cy_F92`NLwX$#rHXRPyJd-3IXL(tp0Ru` zQHbJMop_JHg~wPzOilAVvb+I!CKn(ds=gq<$b5CXt;;k3{_Tw32_P!7hd-x5@y*zz?4Z zDOdU%MQXjTS!rwYUku26lQ0y^Dn(sQRs4!c@Njr^V@I2t9V(qeD2sy7lX~c!0;PORUY-+ zw?G^^bsv@IwiM4Xoid?Z81FB3t2zcv{O-(&1}j5bUd}$Km*J`n7)qA}VURebg4z$r zZ(H+f94Wj2T4;E>VZk|4V?sL_7rLc)8)9HB2B$%m z*C3YNz-^E1c8X^!A=aZ4pX}?2R%>w8*fGxC=;cg9Bk&ooe+f{HvCP#eac>RBwT_9! zuP(OG7wNw)nXW72IaoHyTePLtd4q-OWj&S!=XNXiK)*3-;eJie%CBpmE}-!^MP)=d zaHda;uYVDLNLX&L=atPjNEdO;DcBpr*;DfwR#U$CIa@n5GC@UszTo_*Q-G!|;eD3q z$8nS|GT%CgtMEI7-v)NU>gf3R$pyqB&O*m#+rjRRl=XhtT|5`3$d*aIwzM9DU(u`N{12qn`lO>C; zc31T&cJN)#%PU#?8azFc^xX&RY^F)4KP)f{fBZxJa6R2IeAlm5oTbEP$mkj`(n&_Q zjWVDtc6%g!b$7Wu!hm`F+tFNfA2BiP;5X&Ega|Xcw4Vb5I$RcDn5hX&9ScTj!W32< zIwIDpI>41?@>ak=R2T9W1yVHIo-xx<^>KF+LV38{QSRk;f7hVwxk2c$1xQN@^q1nPa7C(&d8JMynAi6H{-7i(+4$`hUewytE?ig?gcw@p*G1bCQZI~vM@Wx@#}9U zn(ge9eI|v*o9Svv>qQ|$jiKotnm&cFnuuEE5EWfygGRW^*fu(H5iNiXbE26B9iO(G zjVl%UJ=`b4)0xJI)e_7f`?_aR*ze`CH8Ncf7G~KzHDXtit|qnALJ&h|L#beY2{Sgw zr-ro_(N6ef96`#F(DMkEiPGJ{h&@ja`@H+!jl}ZIID6s4cEgh-_V~VrWqZEND8iRY z1|7bS8gN90xXEgaVBea(0XI7=-6rg`|+xwp}NvB0Ur_nshsiv`8}KTe7L=#&5)hN|1he)eAm6=b<*3ApWn$Yj%6uoHU~ zy4nEd9+LI?`(eRTfDlO}b7<87n(`s;OVzFBXIarV#&_4}cYTd{D8H&q92V}{uXk8o zKr9iEc!pFV4cm9nGX&}^(9=gL(bxy#@Ek-%xi!)VHjvTx0_+v1`0Y`*@0SGt&u}~b z{#Nslze6h^1}0F%a}TXSZGeDS9c;J93oVd$Im(t}D1?VU;g}6f^?NtH>qI$q0Ldu) zK&SHl0X+=T>nVI*h!G4PAPowFJr!LgBpF%Z#;6kP{-}+tEj>gTbiBOpO)pt&e*7b1*k}i+Gug=uSk}f(*H`zZ* zJ4jwzD%NtDfpg_w#I}6M#wwnCf~)3K)lCxy-1VdAfq?#3&VvjaGf!*^7IC`;rVg=( 
z)}HB#T_?juiVvnNL<9{;8Wszp9NMqsk~mE`?B)<$`sxES(`{bvF1L^D z9UwCf-3y%Q-n>mlqURt%z7_`>6leWAkcjU9eR4{QB2y!1PlYPnlHc#%+eOT2Up)&K zBbv<%@qV&c!x>=UlOlVy0(?O{c zF^JC}8VKoY25QAAiF1>o2y)atmneY`jo#BTYj>kw?Y)yFnumx9kpt750Yq1u*T&IO z+XlgETny<(Jw-Am6~}U=3Fj^m8x9=*uUU`wuyL&%e!|};d0B>!{0AU`G?c;xk)WiF zy#7B?W&$cW-=@2=nf@$3NWk#Tub^m<3z%xkQ+R)cc=-Y1E9V|bbch+=bJQQ0xfi#m z@P{EzNh}V`fDl$nu-47RK9&ZdLI~;e+S*YeB?W&8qJE%*o$Qg5PSk0zHM~HVG@u`YQz)?$DEY!B+zI z*;GK)$?!U8l0)9IX1bT_S}!fDWYL-s1vt~b z%wK6E2uK(DoHeARt4nI4513PY#|g2__NfOb74*Pe)_ppVfFdvVN4$p^GxZd8D*N(L zG%oqC4*;j4y5xJa?{Az4^ufHJr4&Ss__@M|AE~nwyw$@;>dZ$1Sz4K%zy~KB`a-sC zGcGZqKMBZVU~v-zK;3oxCBsye?pgnG@>>>qugk#zKmn^UIv`%kBmhioA=v^y8$Hmv zq$&Ut1Jda~Q%c57C|sZFzGAKtt#XXr_19JxkyzImzYuvxP=*dHH~f6t7;nfOee@#W zlNg{6*iY0QJl>xhNaq+!zwMIp*b8}ysyh|~NURS z0PTf|JhDmhPlTf2df%qWQ^s-n-a2)-Nq`Z;M-tmkgUrrTvzAIjMJjjJWumDNZVvL! z(a8`pdXhiXZ>a+*z4og85!q7ZLn{I3lpiLQitEZH!(DfOjlkGvfELx#H2pn}0!%Bh z_fE=Oe&H2?W$|}_DvFAT9Db zoUNZVoM%CeCKSM28Dv4k>nV>kN4?$r;8Z|59a`tdpRzAW+*!kA#U?C3@`H3m#4LjF zSHD>xUmfQxu=X7HW$OBxpx6cXuSW&)jV1v8!C3kgrO(;Vt}TgK03b(w#>qPcGCY58 zUTF|hNPqA*7d|r1=m&Z*IOTpqT#XG_&nOsO2~)bOG#;YJ`j0{jz4f8J2Szb1J(MbH z=q!E(rnWcICeb{H7-bC1e1WEg1yzLYhart=P9lomd)S8x;X=NK76`o7a#PFZ5EO~c z%PZLSxrB7HR-Kg!Wif8!IMLH1(@YfK{${O&y@5cS2LO)wQqER6YBv1Fbl>2Lk#)$n zy^Y%wfpD1UE30o7CVuL5-4z0eZvX}LJ3#WNGQiVDMi2m_$(a?AMsuCfM)=&gUG3G5 zMH%XqBFCE^QBHdL-op#6Z%+WdA7|@P zpay3X1CoSm$aN`DQsow(YM7ZN>vLn7z6c!4lpM{0OA&^W0rKXkF?5QL>WTXet~f){ z0yd+enwYM{pKy|+WRO$nC&I((?H8MCwA8zC52L?5;6NM2#>H1Vf68M9VmLP zh5mKn%VPX3Bf^V${xX=L&%H4qFSxsWJUHb;kDLpif*64*^iW39GTQRxr?_=w?oq>_ zD3Lt?(*cC26q_Ky2OOIyDj8~GJ!z+KVOE(63X=uV%mHPG-L1bJoJ}$p_DgJb?YEkF zt&XvS9y`nf#_b~Spau$4jDP@tt}5Hfo&H%s8trcklwt*s(H4gM*@)2por z#sUFhL(>-`>TtQJqX8^ZZuZjwM*4?ole;WBJ)^nv19kwLG>2>#(8a(8+?TAlJ<+-n zfOhq>fm~CQ!ZSl(yvk5re7F~S^AD{96Rhu!;y zAXX>B8Q?ioRrx;ifXLiGGY-DfP&h$a(5A7Y?xC(uL2>qzV$d;%y@9E~!h2L*_y%h$ zvqul$Ny@d~SG%^o?dR+5qLrAeQ4`)X4O7pAvxnwk5|S>jRYF9#R)66}V$#OlXLxL; znveMFnIT8vKEW~_CpE_;<>_bI$(XY7J}jSciNQsu3BQhw$_3bnvgecy%WFxW^I)Is 
zu}FDd4N5t>$E4O_$r6eBAzw*YeyJ!?5DNpR5)-IyCZSuZa7_ ziJ@~`{0r3j99&ZS{$K`he?^<;!2W}eLoX^dNwCxD?H@XbQ-q;@J}1(?kvgurc^b<^ zpl5*9%4h%AkLs@W-A=TZRN}s^&w46|?tx`4v++3rBx^2D;r;Vz^DZUM0d#yuMf8{a z(U~M9&Kt!VHBR$<9$pIqf)_ue?@*X5r&tEOAAGVI276|^C|>a5gVoI608_?CPR&UN zT(4U7wrzL?^^|dx#6As2d31+PO-OF|6@gBpxU;XDTRroOIwEu28w`|ZPf4r2ZTnP| z3|6`zwp`)Nn1%R*wcn}y7J8I~foWf~7Yhfhct|~33M#pp8|>o7>S}jbR@H93)7|e2 z!Q^w4*nfH)tVD3Wk_O^H^rk8YL4y=(pAjeoN*LO@X(l)qcPf1IFU25yX};>b@hk>M z<8hLTu$>);*)w&OYoHAZStcvp04Q=UG%b4OBkV2A11G6Rua;|8ghYJWtlWl8hWdD- z7~Pz@hc26jkp_97RPn({M>atIhl^u@@0RbU@z6HETbc(zV0~rtOW~tbAmY;wbA<4P zo|-DVXi<&?cVA3m;Q8(oU`timXFNB^R_t`xK)&eswyTh|!Lri0Z05}Sx6OSVYz=4g zH#9qrw{03s3u!rjga5K(W1&fN>vz!I4)@@BV61o&Il;(l%WIt^(W}Y(JNoku`{T&I zSbi5IQ?{c0CH8FJ%gojq+tk~y?}IX2Z;wxgf8_e&Tf?lH#P0~c%E%%kOJGEN_%A%n zp2zvl&KpVtP=U~BU1|fH0R+$yeC=ygLPbe-hWm6lolN{K zjLsbgy6uht){r+C7fOhV$-**Pdl23Zdj*qVsx-nvK^uSml(gRLJGS)oPmW0o9I2ka zl21e85dMwOo~^E@_`=%-@tDV9a4fnv`iB63920-w@srleP3yYDYf$#hhu0j;;wTp* z`hB;p!SDSSf_Tthnhxfo)3<9C)BQxL1ZYL-pFvd|J3X>=ot;V# z^zjkN7G@1i7f+}4!sROIhFNfFG)d^x%7HbIsO<^*T|lY#Ig32R`CJ15A(n&pfqqErEX8H-Bpc^-*id!}6+X7@f)6pbmMuLeG8UtTd^SSP5XPesPs?y%S+;|Ur2guy`>|;mK1fMz~ z0Maa;LMpv7{RJMbVX2Mb8>qUFZ?lpCxbz_-op`Iw*CXgRkAaK1LI*218w!6+^- zV0?$Qd5~ZWGbq0MHUfWB-wG)0NOV711}Q*a`!BbX7Lb3Mu`R+$X-v(d>@p_w-e%R4 z5HJXu2_c&OM+*R>4<#SuQ!{d;85>8#P5`J8OpjmdSzZP0@fhX0`!DcPiCE4f*8HD< z>nz6%@|D+`8C7XHmfxBOobTj;4uhJq)!nm9LehkId8e2hX>MtDJ%~Q#d14(d>5ezU z@ve3!wN!qM@(m@YFW-{_ixI3<;bu1%{`Ay;UfXAA`3bQSj}u0F}PhZb~Hi?&OfKkSXAl z7on@CRZ);5albPz`?y4S4T!|+)6bhf#$9h^?Q)?8&`x~fhouy#+LmLc^A0hMQ4dlN z8`=G5Z{}(&)pAfqe2ZSO$U|h(Gw1QYFAzbwqn|P3^DqR&W#IN%Oa-!;CX3)jXWB8$ zvpv@%N1e09S<_Mz)d9x;#D7+OII5cGb!C;MU;h=hK(!W0ZK^ z%W7_Q{Z^?-QGk_TUs4h6rKY9^mJu8WKhC{t|CT?KL$ZV8d|InSjkp-$Urza}1Dja+ z5JjbrhL4Abdh#NNVD4Ax?15ju+gG?aYr-3QkvbXTL5FdfA?Vw5JP9b5cuscL9nGo(pa@p z6MAD~E?3S!>DpNGi_fgH^}O}80L-bTx#S^ldeL-z9VFfN=jWpk1KITyGw~YSQLHeI zh6(HOy$j5mnx=|+;>vNb*l~A}u8|_eH!Uaq3}ZV6ejfFDv=|RXyLPy3hFEfTw~Wj^ 
zkuanadA_nq?z!JpVOJ9PVxKkIp=D$(Y0h@}aa~6cf}7g@05pa*g6t})qSqf@GkyY; zRp$Lbf2$(u`M71o(6V)C*f2N`DDW_n*tm9f>G2u*-}_pOj(&RZkzTbbGKmc{dQpx7 z5E_9BMb%oL*HadaiN5zdiO}xu4Wi(jCT5JAq*`zSohdQAUeHFdE53xy)M~{ph#8dw zCPX+l?WOW?aEJ2$-0na^CC$++Cnq+3f-cugsX$mlKMGL z6;nnwe=)55=B2$WFSD2mK)sMSw9&7beCpgUBqzhCivqG86gxn^!?0Er&8k1^^YydN+}l6$9$4lRsH{Xdgti4 z-Y@LCNyElyoJJGdw$<2nV>LD#V`AI3(KKdb+cq0K=lgq~=lv^dWzDR0&pG#;+4tVp z^|=JAbTeeXAQT?e(1Sm&Sw_5e>EprZWo|}{mhO_zgxBCTzU7W<1!~QQv^~aWMVW^S zdRofsfPbKB@DeI?4t>gaU95VT+D~IL%QmZ!9?j%S4e9WayCwo6m-sNsS({|KL_Mgv zXfd)@Mmx^oj_^nHA7tu?d(?J9rwdgGKd%K+G}h0Zyr0YV1i2RuA2b%j>ppl?c77qE zxZl$;gdGRbewE-OTN@xY3de;y7CS>(OLxHD{fM^#P6n)zh8I1~liZSUTQN6?E!Gru z)sDGG6-YCLM}^Rh0ZUS*JfhTxvGD`7dL6FCVFQuai&WxK;c3ZU&cV}qp;`dlY>Srj) z195S0n7~sG`whOzOMn`}|FC2-NFRwKd1ie5rypS_aO)4F4JiDJzO;Zf^8{Q`Pmm z5vPmTlFH8E8x4~Aw@w4ol_809WetAZ*<66oC7QNmg5k#)=!m1brr)0=<69~! zS88SOS_Kr-P}6`;L)twTpa-~EW&G66xsPTt(w^vYwlpobl-9eO9BNGvunbHHL@kfO zX`>lD<)@mW#q9PB(*Lq8-kW#l&Bpe!OLCIGm1L480DFglVA;dtW_B>zoMoaAk2AyK zqC??*za&Hd1!&LQUUW1Wlbm0+9zC;v4#?N**aM7a$vd5VkAx8s;5}Sc&_*OGA@`I3 zZ5m(z!%kDzJ4sjps=}+_wkI9?F0=Z}1sId{YhY55G_k$Rw9$0xcPngx=>Bn&J%~TR zuFgWDxTJfy9(2=Gu4&x_p`-BfW27+=P?r_!6s7@gH4fwL&5gfcOBKqbc*_uj8gRbt zDu|*jWR8E|G`^IIdbvT;^{!QMrz1-XP>#Rdb2`4ZDH$tk%dVTmG7m{B#bZ|g6QA)i zpd=3onb{km1bpIs^VWqwtgX2B5d3s$+opOnw|_vT%l(0d7-34Qj?-sG|# zG4kgg-Mdl+px8-t0LpYxM+_T;7FQ3E;;%!C37rMJG8f?`5n=c&RfS?Fg>UqA#~r2$ zzY56h#w0ACo+;52GU&-&cK`bq=rD1X+!fe7qc<^UROlGHSSRf|oU(uRbEIqr1(T4) zYlqor$hYW)+tRkSUC*rg*LT~i;5#Pzse+mI^!cm|C?wZ(N)&SJAb%xjA&_o>SCizU z;ALX0v2mY?mpj=Z(+gzCbx0HJF$+yi@A`>H4yJM#EKNgRG-0u1FyfD8O z8XN@VGm#(N@$S$+@#5+66U7{^lezf04EyTqU>tl_gHb!sYKQ(x2e4ZqEP7iYG)We*4+1B9@{ZJ9~{e0WJz+e87$)H zuqX@9!l?!p@pGM4WV5cDJ~?}SH)UdgMtk;HrO+-V0m<%T&GlD*qn(-rV2Md5TW&az zQvDe<$DJR<=6mB9E)1cj3hE%jA@SPq?jT4@X8zfF_Iy=vW3uG{2cb<{FO*=}x#iAv z3n`Q)Bs4BXI_n@v7L?3uGA-+`l86H$g@eUryE+)sTK^m%jjt5ahPQ`RnC!E>Q@$^I zr6l#qShbZk(ee!Eo8Rko@wdehAtoa9?FELZ;YRn7(t&D3_d}H; zE{5-*=9uS+XQdEm3P4&nxJiB;G=QGb)-huQN0Byo-)1(3RWR-p?FhO3S$Kk30Z`vK 
zN+Tx4PuuSfMu&m|pGZjpdj_GB!hOXZ2OO-)3zB(d0i6rb+)njM6NBmxDPtf*U`p#- z!<|B{56?r)&1&QY)BiPHdF=UMz~;a5&vZSjyXgxzFMS z_gshWc|YX=GH|*&-$aa&-4a_3s8?~~kK3u}ialYVrJv#ZnuAwV zb3-9Hi#5KXC|nbfGj}VSo4V(LtA~Q4C!PZ`yK>viM z1<-()J?cFs0yKcm4l#nB5B*T(?e?6at%fODap2OVJDV*GgztUkdkr!|mF|>MlCMQI zCxOM47?d%XGo}(-THc0^e)%i{ry-0(~SYB0woB$8=ripuqKVR3JY%ouI%#p4!zU)HJs0Nz1c=!BPd{(T>1l$C8XRrMCL3 z_&~5cT{>}6Y!@F&E0A!srW-yv7QgcX5B9i#u5S^i4LA5`P)hy!ZBD(u-6f?o5YqDL z9$Wcxh<)eBif+X2Zqe2iGI+Kvvj1X@1m0R*CW~-2I=$-q3P*T3X~Peuk;c(lGw@hf zAak1u7+X%%50Pr85G3Y57raUCQnafEim{*)O`FNSx>X1nsU zXD3Xn&lw!f{73=%R!8BHEZkpbqDwf)3fYH|qAJCgy}gA=M8rr1tCVpwK5(D7X`znZ zGTWTfk3okzyweU0A8u{LzK+2_GwdW1A}0e=2PxrzP3GC*!LJ%6F@-?)jPm~6T8hPO zw4riM_t+OG`RV!fy1id4xqD;dAdUgF1$T~bxv4gnuGC6sm^kqa_IQSUhrCq~Pf}Yu zr?ayzDD6)HmK%E*n-n$X)`Gi2ULj@mDWTii_JgN>hiet!ZRj00@)+8l56nk|#=Yc` zVmKJF=}uVBxEB%+eT0kf-l=TIyFJ5xAg|AxO0-zz0c8IeEPbm;QVZPUb)_klYn=)6 zfZ>oyyqMaO$YiyWG zc)`ARQYa3W$|{HjAJ}&qJ8X75P%e8cC+!cjtqCYqX^r{ZyNuwec;p@%f-U0|I^{I! z)bB`d4oOGH##o^i>&?a{(%; zb$e1#gD3I_FY*D9^Uu(Ald!DeR#(xT0V6w=g3yL#(k|PSo+YlFOZYVPi>E<((7(op zs8#dV7lXp%35%B!^Cc0W2ql`g&-)Xn2-#Tt%ijU5nmgI@ zu6)&I8YU$x;6Ia_c`Uo>`x`DCD`R`h_uWk+P&8zzB-aj8wWjzTV}KI_O(C1}+s~#^ zQDR1quK}t~f+d_` z>iq)-Z!H!{0o`LJ=mfdb*^*ci4qXGqfL1%rPrH{2#-E1st!ug#p#C0)jE*H?V@KXJ z+sdl>;hb<`xj;QgK9t6wrj z`@%eg0G9Jb)k78NY#l+D*-}`UV!%Mq5^yGP*=0DkYWd}%Vo0XrMhs1^1(R%5UPF$X2)zC|JV{EG8;7pmHklgY-1;tg($@iyx764oDKkP)_{DdBraP#54 z+)I$>WB*SQf_n)}Kr7U7&Hm@vcnTFNq@A^jq{aS1FU6dwR>?`cNW(Hrs>_{b;Y4uj zVa>q@=j_ypUwLh3q2D|B6DVxt53KrBh|#yYhIRe9eSzTt(y_#HX(e~Im`o+M%^FL` z^kgwE4Tr`Q?x~aXdiA-XN{(}?0jkXJ`O;sHrHodtw4R3D6P-F$U!dM!RagLb(%M(r z*O)RKt4e3}$~b`QWF3OwURqZ*aj%A-23V?hYv`4gTQvZ`rqp&sP0^22b?W7gmboNn z0}cfZ7W(aC0WjB@0IRvzhR>$y@z3uz0G?yDW~Z!u$N2v+(L9QvQ^LQ#25 z+v?2q>J-zq`n2Mg+15+tmV}*=^+2I2x^9dPXP`;JZdco%T|mU4$n8@Q z3&muyV;s#ankKPI!kel}tniCKh#oEtf`aZUO_9q&07Xc~TzE#KBFP}xP&B-jJdfT<3u$+tD(A@2^Z8PnzqdnnD5DSX$ z(jjsmuasd-t5NEUq9mXSoRWX?T4YEI?0dfKA?K%+b-KBcTlvQy5;Az-;1~omL&2@Z 
z(`ZSCYFYBlr*X=KJ0}&8)$Ubmkx^fElLGg*@U%MN$tp_inD(J*Bq`F3 z{8BEA?%?%;!{o}>aQBND+oi_>g}Y^Kz@LurXVBm)KKl4_Y6C_?6gS@&tBnlxb(rUV ziH?b2OKI){7{Dk=Q&QOyoE{j$v|0GLThzZLB#}Q~cB+(L;jw`7b@@qVi04$>^l59>a$VY2h8x#KrG#oL zRmGKcO{LgxwfgzfurSE3*QeqK$%qnq|aE^ui)F6kH~i_t`*HLW%@nMT+;q(ns~ zWQi4FAS4Kwu8Y92n+dQ|L$edSrFe<`3_w{t%);G&6L^En*@?!X5CbO=2_+N)q0i z;Y_eB0}Y-l3JI$cVS9SX`4Y%?G1!&G1o^ZJO_Mi%lFMmnWRD zZqlJcDwYeO{u_;DV1C8jY#83gjRaSbg6F`_YmaLm^JWrh4}jMYAfdRh|336gqp|?| z3yrYlFwbFpg+L`nlio2)k?zdUOxcc#BjeXo2ADcP2mC*&>h{{4e`9pp=S$>+H=EM> z>5K9w3P7;spnCM0+w02CvA2BXni3Z&F90T4=Kd0l<&4QOT^S#cIbplTBi(m!_S`qv z=FCFA(pwyV3-b+MGgyjb4FZ2?14qUcbEn$M^hmIORWt&R_RUR$Z01vmu{#PsQV5wd$qS-bfb|s4^cI>SpZwjzOdGInbHE8OQWY)*uq4XSwJI`R?T8+4TP3w6-=Leqj^agX?;y33 z8FPgiuJtoIMgIwfL~vunRLvYi$jn{8aBaz4{e>PHLL)qtwM(>R1-QvfxJYOZL#)*Z z9?aym&Xm=P3LUExgv!tQ^IrYOYhhgDg5yGx+N?ZSTwPiGp}y7k;$Z{<);fH05)7gj zg|0Z?N!z+1QMjG@V+KRs1mcx&^a(srDU$YObyjxcSi<|WlN|9#Blrl1nyWF(^t~hH zOy0^I9V70UwY*Rhz57`ikN#tcYQENie71FvM2tj^mKC$4|9$J-Y2xX4jH{Iwwuy|l z7gd%@i}`{Zj(wP9u`-b(8-GX`tHZpZ$uy1?KBf=-5A@n{>tz`c%aodrRgS%G3ms&o z@KONbwUW2XkkIj<={k4Xmq!XH{y_`Zbz5Dwxym_NUcGtKHRsl`4)FWh$65j(w$iaS zOBE^h&sD*l#af8Iy-h6$oYL!uMgq~||BsG&dQ5HPOu(!(W8kf%{EkngDi5#^yd`@< zq-fOgV#LOUs&Q!o{qkQtU2GKdUX7gzT4ck>51-6A&`>XF;3(lZ zF)ha;(U-WNU0rAi%;lgOkJqj3lnw1RS_R9Tae!t zrI7Z`V~`@L;U$k>wzk5F>tztezyKh+Ssg4Z3R zsH?+KE$8ITCq{?abW4aYHH8|4nnvJ=7u~n>KM7O!xmACmH$C9!$ZNSqN`iI(0}QfJJNY*ypB@<)fgsr?w{2!D0M z+jn8B_{!)-*t9{h&FAcfMfwD5w*+0tcZ}QE!*G|_J%W{? zhOybozKg4RhSO#n)ou=4gtuTgXZ55V4#;B#5!AV!N%Np@9fOv2AbbpOIN|JBz5-fV zUr1_6Zq$-S2<4h~L>75vl_JNf?D=`9oKrKBtmhF*il^59EYD9dz_4^4EMS#bxCIIL z^wC61=hZbB`Z$;tMb|^}@OFJhI?~;L$7x?|n5gNIl$LoAu@aTgReU%3oc>tA&f!@_ z8q>z+?85PDsdQ)}@saY7= zt;8px4yC~kyY7bgiBNL)>x`E%%WBl`-0f(l2Sw&3uW5Z!GiyjV*sp;%`7gxxM9jMpMaU@_@m%#uY6;C6X6 z%YhGvl_O~MvCBh(xl4IcQD7djQ1H|KqV9bzy!hcJ#k9_+#aDN`dlIQYtmtY-T|oh+L_uvuY2lE$;>>mH!@~jy$23+J+)jUq3*B;XuZ@^qew~ZdcB`?==$F5|h~XMJ$%5N? 
z1e^%{{cl+EZdq&$)6QmGeAV#^j~c^qYQ-|Gd1p20{>*FRKgn(cmy{b8{~58;asbNF za_1xSFzx$P2d;_9%AkQWL!=ws;WBrWT9`8;{xBeWY4sSYuQQFtsT?O+x-z!ywReLD zYl`On2uf@SMnX=sHaU|AFK*&4qTX5+N6u#ZPBG+v&m2RgOaw$lx~-xlPZ>8necee>m09Qk5^Z=n zhIJg(N2NeYVcHN|6B9 zPh>red5s}r!@Pl%vg4%Hqu83Xn3qxCWw^n@dRU+U&Vc_Ymy61@s2yy(U^E;cH z_I-g1*NBIlXtQ}i%rn|8W40tdz~WlT;#-p;AzE+i@t9$-zreMRxlfmqwIN+D$)e`g zYnOSo7nGgl#FCM-Yvux{OE2IaN?lx2%B~F}&A5-ODnDLUDbLl$?`1l*n`XHt&&jV+ zgcH$oFwe=H3`$P|FO-;)ExItsD7*%@fG%og2z3WmXPwIxgu`VJlLN2yWl)`f(|2~PVi$}5qxV(GXHA)&Z z!V5(_$`dA_tFBlsqR&GqpQkW4(g3??!y%kGfxesh*=Pcpiy=R5osP+UL%X4z{~gA8 zSXH(2)sW9wshxW$>Do;0v*ULG+?uc+oGrQXfFC+)W;hx#p-AMGIemHMyZ=j%D8+$L zEl(RCdAU!3S#Os<%jP$dFD(-M=;U&8vT9;UT57iu+EohNRv+QR&X0Jca1Jjw1_Nx8 zv0&BOTN{mitXn0H(4|jK51PUOgCEPUCe?wX8SIeGNQ-V9!E62D81gpk2QT9z9*4V@ z5D-TU)%ebq__K5OM3=M^bK-H5A`6Ubq?N~gX zG=J_B1mv*`_<56O4_P$Tlp@WQa{1rCCv=L(%!a9Uw^6-@hZWFVeT%K=V7(bro$a>}ukTjf1)Qi{RC@9$p0NqU%_k2JxKdqQr90Xh)nmSJ^ZVx&LXY9huNV-q zfIKf`NzW8-Q%Z_~*&SmrgE`dvU~2Ojaw=nCW&_9Sn|UV$=((RW1srH!&emAWh&nMK z-FJl)@;ifj zU|({>8vA2%s0?fOHhGayu+SOBTmR;H3}tKKH%iMZ(*z4fd28cV=POKF zjhQzx7E+b!%!^~SFtnz0mJ{Lc002plp33qh{`qV3LaX+as2W|4fw;6{a^)>|rHlCd z;1&0%NmV%ax3NWE8ja@(E_h|n)i?sv0%)2=n=C)chusqsxqpvjc?v{;b6A{U$%w91 z6jm%6XWL5CR6saagfth;Din`>F51c=i-&2Ux+;>vH{D5`{hL-D1Kz8g zP%OuWtfHEF^432Z!7;2Z*d2=>zLSO?6oR5xvJTNpyFM_FYCjPlh0x>YDOLOGZhqxd zRR*12kNxcaxM4-P^r^%rPc%beDZYc#Ot8X!)BmxlIuo@ZmgOm5u2gUbKMW__8v+NH z2Ga}Nr6~-1I7=&OwGh_=F*B$U3Um?GfTX6ctVt-=3zbUtRGnzMobsL2`x#K1!AtVL zI4kc+$>K#xqtJhV?Zw~bxI$ShHrQ^97qQwTpSkRP(*#F|!3M7qqdaV^cHPrS#jkT4 z>=T^5`5^qEzmegDQDMwqLEeA%<(ci3wX`c-Fy{^q`BZPw@O?Fj%{#A~*ju_XBmVJT zlP?c|waVq(?IOyjQW@1G?y_@9!!~iiPh|1GQ4m$@uD#FJJ2ayAoBF!IrpZ~{ zz-Ayb*%!rKDSfkbXX#xT*#LH~g_GJuly+B=CKXm^DJA{;5G|Z|0waEkU0zsY&AFkm4U<26Q|8t~cqVNl;ZJ}y# z=Y~D8ronqshm*(=3)Vi0GA)_QIJjLa=D}&i((=IoEqX{cD3*X0F7YeGONo*XSmi0< z7N=Z8(7q-Ajdp);fn0Y0Nv+7}PL|&C`9b>pKoS9~gfCvV~H8 z!sk1gFK}Ff*l){Ew|mlRcoO;JOYQ8CU;Lcp1-?5mwkn2TsY!&mdhkmQoJSfc){O?y 
zu=&PIRA9d3+>EW9R(E{&u~g6Wa}xg}^FRv+2N~$k5AfM5~peAxn31_!}>1WJxQ&Ca~=`@ z>ZHaJN?goL#kE*=DWF<`W?oMgB>$d2_Bov2dj`!`JpYIm)%>iRfqxFzh+ZmEtN+~w)q^Pwox;uR2ymZ@v) zdY=$TXka9_swk8t+rl(bKPoWQt2WacRhyRO@Hy6%tq$}jCWcdG13nimtf{tm#}Bg} zY_eYbyI&dtRd#xKpg=gTx%T}4(# z`D-@?)H-w(9nE_;B4^%d%yrz%RtPO?W=<&yXyAHRBinKd8BD3*b2qH9`b?9WzP`|s zdff@NBFm@GmKA#+5=Pp8QL{u36tYC!x^-HJiFM?AEA&L^>iRu#QW6X-@xQD8zdz(8 z#HQ(SmN$skj!>oU+nlFk$leyq;riiCb90y=5Wy_{4jNi~SS({ke6=0teqX;t6iyzG zCDvkLe~4^SN16ArhZR!y$=v;Ejnwnj=ly6?n>jcm4TE#FTSuC|Gkzb0xTgzRGUNS> zWo!IqeBC4hk|}Rpcnz+;m2mbn^HQF&|Nl~e6&WHxV1q3e^N7g@D3u;!qr}dVZ|Muk zp&$&zg=ooi9v7RM`gpif0RdR=w{x?ez{y$;qiR#8@MFn))tgxZ%V(w+!P2b zDy05@7c`$oeB&p=O>E*C+V>hEJcgR+lTDwB#XAtAaPA!3(S2yvH%CCsA*`)k zUJb{NZh7_zpHS~>5{-Bw9?&)Nw>?=_OCU5~rq=Al$>P0DM>jtm5o93gEsNMj>eL^b zNW5-sdVadzt&KBO7m&hoc8Vr;MrD~03=hWjE4gl!$PNDm4k8{k`c7sjTUC5`STu{D z3=ah%EcCxelyI-ne=E!)Q(J@cn=wm7CH9}!1&z=b4Vuk~6X3e_WB zHVyc=EBtPf;Sxlm=+b1Y;|{+HSF2-J@;%Gp>B?w4jTLsi-36mmH`xDy&#aBZZ!y0! zIYTQ0Z0*&=LF|TB$r4Sxg$@%S%8aPM>ve_AI@2ENInkSGA zn#o7x6~vo2i<`7IDispsgWb|T{^Lxp7IW-+oeAuJgi(MKsFY}F*ADd~Y|`mGUm%RV z&m@Vv{U_m1j4PHZ;;m1R&7(s1h6P(fCr)Ai+t&ioLeq>p5&zJ?6Q*QDp6PYW`Ib3h zt92;kc&<}ivanQbt#N0z-@9UhvgXgtZOIP|xWCM6F#PgMq7g`H0nx&7Mf&p_pFhB? 
zCvK+y_lm<-m!@cfZWU)>1?Li(2bTv~S|cIF-yw`M-U@i%RSvZlJcp zPen>ANK~9a2_e{X`%&4qQ=|QqY7T4U6GeaD-qne8XU3jkB-5FDafoA5p-ux^ZDO;I zHx1wc?P;Dnh_{&jyF~8S<_zzUBfAtlovq)Hqtag?MZ7kBMlkeH+N6;K{V7qmCO*k0 zN75LzHmi@4_NP-$23L~6VaEDPRMt7$PqQd0q*$l?y8U_j;h`V>-7K%9PtPE@)oKGT zDf3TX8Xo~aN|D~5b`!>-dW_OiDSnrMF`hvGOHv{BmnWCmxzmsupACxRBo?n`(iz10 za+0(5Q^T-`KoTD9H(#$Y3ZEoZUL!w$DFpI6;y*@5kr-mT^7FVhEv6d&Uk@K z(XR7ZbAfFX!C)COqF)9MwF-N*((ObLrx|lQ>SWid%1ar$?%E@}+-}m&OBo-pPdkea zrry(U*MnQD8#%2XP)}E1K-_#^&A5M6k4|J*uXZ%+HAz}zZ)OkwUchs@)#uq`-x0+( zyKHxTaOwHEyezPS<|IiLVY7b<7rr-)_wcYYOz?2k+!D&Vd!8kG+C->Ta^xEcUf+*a zolTFW%DlNz{iPQwE^W*ryiTl==Cl(G{T(>*ct-usFx+srHabSsz3YdC$@T*=DXwGQ zQt1BcxzAJ02s_O0vWR}wn}YjsDGgfi9+)BVI#mY=+`u8KRgil2e!{`$m?Q7^K9}08 zg}_~HrdHD)yxA`=Q<^J?Jz+7B9&JH&piUdC%4n#MyB<(3?7GGpx3p{*`@={IiB|AU z@@U+|h@9CYZN=RB+qPe|$dga?A4bbH0egW0!=#Y>3%+aP*K<(mmfl8vfTv zW|~nECg^3&T^>hknVr9v)-;+SlYq_3SMWGCYv-#CkNuTuL6|)3I)ZjmP*p1ND`#3U zCbcEAHM{h*OXG+22?i7tvDc;kJyUpNZ8rb@!O}%q;^(^%*aP9+doy> zB=wRgV)sXRZklSo$2!f#fL%@unXr8BC-Y^`C5DGA_(h_3-xRM!HbmQZhuV~0CnnS7 z)QHCk3ff|-jK{ntIEU!9LN#=|ehhW(+B5j>EWkhYgh8p$lH7@$CY(=*bon3g#qh5m zHJSKX^j4QaoU=`jB%2!q5AW3rRC!Q?dDrBL7|VF}ZwFOb=+xIgUw7tYt_YL}fmL2h z@_2qeifF1lbO%VM4&vTqe0P6?soSW|ajM74=4pd)z{}6vK>Xpn{Uel_(+t_NAzLIWXvB>{6>NR+)Fv#SNOth^UGT!$CqG5@{ z$*gNYg^({C1XJ|wG6n*u9s0Nb{Hg;~q^dRg>x8nynf@=mx{V}ZuK%my5}iiXNa5V! 
zA#|Wx7JfJIao|fGO^dEOnA@*~0r^WV(uGzg`G{FrZQ-p=Cm@AO zjPj4ObP9ok{Q4T15 z$9h0%8tU8!-e7GrtL%cR`lVl=p&hCvld_1Nmbk4r-YWY+UAFbvyr1v6P{#nH`~6;{ zUtDt6Y9a+q=)3+mZDhRQZoQ73X9%-~7AXK5?SlPvs?q6P?E3Ema6Ul&TOUQdWeIhb z6c#=NE}4=#Uf}h(0adph;1zxUcO~lcQ|cR&;U|E$775tj@&NWw5zun*frsqAAt2{Y zW;22WI#mrgJ`HQMA4f{Pru9DNZ*Poz)o*rs^)@x$%PZq+FOPwcxw%3v4+YPyBpm8V z)?%@Q2nEP|nR_gZL~w~>xoViuoV@qbv=gXbL;T}VbC_`tzXFP*LhZXSG&mwY%@pkY zKeRc`t3L}H!@d5v4z8MJ@FS$p82`9*ce`k5bk=$Gtl}eN&+nnClc|sPll`}YP)M%} zry>$qFYv|Y@24csM+r@8!C7)-g0SIgy><12Hls(oyEHB|_6oH>FK(MnnaNfE9=0zW zcbRtUh8!M|IG1?!I6ksFjiGOz+x{cZv+2XYLoNV^P>RJMV#P^Rs+Z`bhpMl#iZrA0 z{xz91fq{@$Ilm2CLHl=iKiMp2psug4p8#O(&pPsGxeNgx9srG*&9W~xuntgF6sxq^ z=c;sC&84$MQk$?Umkqm&KMx$x^qFX`A7+QGB<&dX#)Ung2H0YD{ku z+e%vUz}^)A{*MarMm;$Oyl(VXqka$dV1?Lun%|Fl<#!~}Ao23IftS zD=TYUaxzQV6Uoa0_363fT*E6I2Yh#XOI5N z@q54i@qQ;p7BHkd2~5^?78Zut4mLX6?8kONLjvzlCf)!}soba!u!u*errsh+N@2ZU z{Xwy+c})X@jd35RC5g5}S$rFgl!SC^HKmIdUSYHtr>66D%qRuRXQC!D$jq0Ru!*;` zsin%VVR%gbz=)yX!|d9&ghRA>m z&9@gZ$_(A#Kztv3Q5t$7sn7^($sKaK0%*01T`jG8t7JzqCk!@jHt>WYRkw8%dyU-B z&w+#?7@iMe!k#=k0?$yPPwwn=ZXdpxN3*cMf$i`RghC*Y2b%3m=XBH2*jycq`ke!d zva>=VfA!oSp4>?>nG*|el_J*xe6IOa<6aVNyz^OVq6d$6^L{TF6_rK48bvslfcUSm zuf`a|-Pt}kEiJsiKLH?^KES0lKlUoweB2UW9D5Yl0cR!?pMMX~U$wX9})n@kp#E zfPT#k=#5k5D4<|qi~$;BEAXy)-38MX;41+D4>gPd5&crDW5Ek=cq8yYH9naulTmvF zJh?^#5nrX`8s8?aXNqK}0p73E6=Gu#bn@K~88&`F$;b;al_I4ynlc}R2F|r1_R2h02NOQCN3Ppnmxj=rh)_9M7 z(&qKh%LoGc(I=W$hHaG6&AC7a9h6bHRHTL!1eZ676897CRbDdu467BHrn-a13->B z+;hcdv6y|4lKY;`hs*DIS4-eGMCfJ+{Pso+=cA9SuX&zUS#<;KBA{wx_yZhs>i~S4 zTq+X;=cac9=?=hUr57zDAe$ciJKRWH-3kfN@1^AB)nXIG@VOj_Ls5vTM>%l;msmCF zeB;sIyw5O5c)>tC2pN#DX)#6oSW@8AJO9tQkqTY4I8KZ#?Et}u`V#V|t-u9>rc6wr zze0*rl`ee$pbU`CGWZn|Vgt5x3OB?6d8tZx!C&yQ6!QAsq=ej{L@O4H8v?3*)>oF>;T9uWyMSZA8nTl zL?B+qZ!wTO83+ve1MnWdO5eYn)&N!OM(ry<6_r%HIe>6cgd$He1NB9mev-L~9I9PQ zE=qmP3SRqnDc;N5|NpcAJvGC%z=I9<7YdO&Xw_v=@7I*TKe@75slV~#v%_qsw~~+- zK%RwC1lwuu&N&bm(#QvrN8odr@314R>Q5lor@Oxi`8(Q?YCYa)$>e&xgJ3BOTyq9A 
zsn&aSF~KYURn;;(`hD#VtgtZPfsmq+4BDrIT*3Z{7clr-TH7RF<@1lU(?Vf>d}x|D%{(mX=QFk~-9KOB1Gz@qHBB_D$IYJ@`W|;mlCw@nt$We< z@j(7!7|^wlIF`FaHFr|HhHH(MW6ijU_uGSG5zq=7FEaR@nReWKD|9n1$bdD7&qKNw zlMQFJ608j0Qbmb@02wH>jCzHYHfYl&55Ow&ouY*E0VJ?pJDgau23U@)tQb1>`QK$v zlD4gY59z_w5KvFX004&VuEz}(&?-RoDT8Z{ib9m-Hh0uF1zr7@&e18L8QrRtlPLyr5n*V~re-}sdc|LHI`=#(JwrWZIr|}{hTp{3oCUFu- z^TEH73DvNnTQyls-2A8qW@4ah{>UehBoj+)3lD7GXIz~Y31Ie?=-Nua2=~MWd5}o} z>bXT`QLmABUv7X4ixT^*ra|I#WeCM-t6km|FBj?Y>e#|Ye;X~u?(K&|Wce*3BGv9I554bjgpxY53b#~3zohd;#>uftS3J(Hn)g%pNVo}Z;-3tM7%X@hpz5n_p`P?oVa+=a zjz@n<_Ol)UL>jF)LC`;3o}WKLECsgmN>avBq@Qvg^)*iy?5@R(>W9AP45F+yZ*NRy zp`jDlj2N6QG(GOdqFJ#edm4ad&@ILliNMcBS|X^WQgE4`ua2fw1}LI7m88=*Q<=h( zmpv8b(X9%C;)X^h_>xRUX+UVp`t-@Y?z0f!LA z(mMO1t_d~*ckeA+>#an3N1OeQn9W+3j*mlki3;Ty%)jsH!G9Ze$qP733Y+Jdy+Jsc#Y9rcy zcmbtO%q$Y*0NM}Gq*QKGxa~!~u5>y1ng`_#1_s*4{UXy``TpNE(5zg^_cpn<_vvyq z98UXhc6O^+m$A4(qQDYDhlqy$@Ov6&Ms~m>q#gXFP7Wdv1Ns{2+>9vPJ(Yjjm_I_e z>DPa=W%8bpvP7#`2CuBU{Z_20rkcwP(va|igK*U}UDmrlndfiJKTMhaoHnTE6{I*e zG9ujJdX5>7iG@|Hs&6BKijHo4D~5@QStyvJiRTAmVQc;4w^;xfU2(B!{R2FZ^sI0WB7A$tp+u7$z<-;8LHUMw;QjITSm7OsaPh0Wb-BT!Vl8zR$iRpK&_s*3*oBpv zbdbfnL&CS$?zEe{8huF_DZh$8!#{nic3e{7}uAhl>@@STeV>XK+0b{16^=+az(_;Dc(hIYwov2FEKEn^i3s^9)VCi|Q z988|;i4;3B`}sjFR0YL(T>yP2ZGj@iv9U1~y(@Qi<`Wg9~42G z<3MTUNK>q@=Ag-^1|ZugA}XI$z5{uwb&C+6CJQh?eg$e{H4r|pQ~m8Bzys>{^y{GK&+vQ*Bs$evkB?8!+uAWyGPRxM+LFvPRiEc=m?(VO0d-KF0FcHt$TOUkoD%`gBS}v0Ip5( zIo16?RJ~PLR$aI*EJ&B6gfuVR-O}A1Qj(HVQqn2i-Q8W%-6dU0H%KcX@ejVW*WTBE z_3=T7Hm>n6K_zMx+QLg_phzmcjt5+YDIzMhMq94=#-z@yNx+k}Lm;W%xw zggX#lhy9|U+JRxEFqHN5eK)3j@N>!qZ-mhGO5KK z^<+Mauc5aW75Wz9yXXf}={B8f?<2mNZu=$hI-z?H5ha&QDGL5g@yINxHARFzo^wFJ zjFfk3LdhU5?a2Cy*7o#>+jrKErDf;+sJdacsA?_I7Vx3&UVMc?ylv z=1a_8DwSDQp@d>obCB=YNJIOtuQ38HprO%~tC;{A|60rwHJfzK zEkGs+{&O6D)v2%4Gewfm^z%3Fg#}iJEpapZNiZO5J}*5+hvCM3#ypkq-MiLZ><98w zn%B3NZ#~Fozg2Q0aK@Pe%q;Daio(HwL)xRZAg=m#U&tG+EKA6D6Jui=12o9Moq+z> zq?xa3SZ(-T9Hm?WQl$7TXz~cgppYcx;?lfpyx`1n+Mf*5mXniHLq&6osN4{MOfUVZ 
z2Z7V0#kLQ(7fE8g>op(y>>GZJFRWDm-i{V(YImt=xUOnDRW&6u{E)JG`l}5mb#PrL zc+?yL^IoKzhi5`yBU@w$P+HS=J&enLRk3E(YI8=cYcl5<+Sz9HX>U!A4g4Cz3&`D1ZOef(Zv5& z=_`JvQ@q`H?fKo&tadd#>RWd@h1Ax)`9wxgH`{ynM~?!}hgwfbd;C0{ujJrMZ2`Rl zKe3d^p06#V$vzOK7GWYF2pa|5BmDMiW+P`*ST&5c&9Ns-QUYqBEO8E;Y+k25N}}hC ziBRq;$U%i*JkLJJrA~mZ0yBXuqbPZZEGMxXa}fwH= zdC%b(hu#GiDF(X7-f!_uXqB_D6y zH0}>arJpXOfNj2l09%GzNsIgg>%JpzYEr!e`y|>|l)VU)Xfj>$)WJ6xaIaA!3COS4 zzG72UYl%j_7lr6YNK%v>7%gX)2V_Zch+1d~(@8hv6wiQL#}(irm49@?hSNcu{1d_R z!_^yYBo~lCLQk&j8O@ON0EC$byc*PIP7NnX~rS`N;ex z^#$8W#o3yf0>|?uenM}E`}e>>rPfk6#+B`Lf`>`1lw_C1B&JSvDFZWZj$Ke#s+eU_ ztQ2xMj3It~7Cl<1{HlD>*K*d#d4_#ryHF1ULptL<5s*g}T**+tZRJ@fNv zENi8hJKz=ffzZj6z_kT2hWtWrcMx>g3e*h`OnIv=`cf-)3&vxW(!e(gkg0=M=bd4U z^yS)z@LfwxTy(Iw=cV9=U?9qPSjT3()WD75BxF&|P8Jqy|CkmJVS(f>lz0SQTJ*qZ`4(Gb~w`(G3o{mfwcQCCwfiZf?N|>kxO(MdQHKSMflH) z{l3I1K=mLY)!o`(zE`G!)+fTo2>RkM47xh|Ol*h_B2}wk=#EwzFxVza*{K6yp;i%i z#EO3NI!kjBmS7WvO;O3$ED2~=C&)g}^%iLfO=N_&GBtF{7}Uej$|ZfPF4OzTB$Bfg zV`+-^`ZdPVj%=awmFp(Ft5ybLl4jI9ZZxqFG!3G=pEzJ7Q+pDZK%FSf$fL&Me@es# z{eW5%h}wKFEit8a5-x@>d3}iv?;le?nNRM)V(w#rr&C_lDw|rdik$Rsx+_dOPCZ|? 
zvOTbPq+w?3Ee1yD?QL2i5!Dbf^e!e8a_ZpcKz$?OPOoQgfCCXXnwZpDt#Cwm?-NNB z{Z8S2qd`ut7o4lPM+^SKq=})UP|!oIE)q{IP9+^*%KO^LQs3Tf1bl1#s~a+eE%@U@ zp`K__;$6asut+x(^tHT6J~F>*qpiaXEsCRnunq>`ywtc4CTHEj0+3dG1HVlcMAf+Ij zC_)*x(@2|ocn>a;AXDp?vY@#w`6QD!WHgM6vW8d*gQ5+!2xKI&%>m@jpuxulyWngQ zAaJ9F?%q0zT(!m)5-!Sx%-HVoL#9v-6YrncpW!Gl2#MI^Z@tCNDk&P&UF_VsCoT3BnHiUVEXH?cmVI)EUEC;kqFIX+f3l zZ57SwjkUx@Xf#LYxUsvfFPI(+B#@*2+kj*NK=i~$9%rZeoioR*N?&g086n%943qnxe?i{sv6Xj6M2^VjXuNEA@xJEp%6;P zHX7$_&Ef3D@4?8W_n)&a6&y6au}x@5j|y}K3$N^FZ-ZL)vi92h@zk=5R}^NGgpjGU zpc-O9<@vI^kdDV#dG@AHekr+Pzll6X);sh^6ItvCQIhqkOm$1AGEmTAPAG)g7{6;W z0*d$Jvr#4y_)Pcv5#4QnEW2OKJJUf@2jd4mpZM!G3IE4yGmMu9rkV4deUn-M;S4X( zI~gr4;tMRuR&H@vRB41e0`SOpJv}|&K|)Fb$-6N&8Go>TqN=Eso%g6OUK|+tXE^?m-Y+>pT>}L@{{?dM$z0Jt zy6pS6ZwLU0%$-9eTvCaJkn&^d4!r<6!;^WxaCAClR4)PcsY1t&2{N8<$+wUw#~@5H zMpivc-{g`QR3wl8w$h)18kxi`)saE3q-1oq%JvMIWhn{aZVl6wqlxs3;Reex{>8Cq z0soVqdJkW&Y4&{nF-{g1mNFSA-5dsx34U#b(NxnVz|3-&Ot54UQKPz`gc;}MmG<1h z*g07V@_iOu_y5ooczFJC=X5>VcQg^>-v~8Jg0iNDbG0Wb3zu!!LE9-EC(0VruB_)V zBi_*h;Mi2NdGSdRmZV4>evk%oevYz#=l|+oe#cjskp@p8A$QyJQ%f4|axHB(1q>@P zG6bttxZDoAknD6>dZHFpke?I{I&$c$IXsGRaCMuNj=SQ;(^yS{kFYI?Ac>NZ1VQoC zN@XGC>ZShFS#F}0vt=6e5^1c`=&?Do{#W`A12`|Bd(X@i7&`MoLzzo6E2uJhX*{x; zptf}mqiaQ@E|7XRCeb*fW{Z&&f9bI0czvKUkM7svdAk{cTe_(oCkt)6*&AfUt;(hR z!$11wcv0z*TS$lmR5*5!|5}oX5q~SE`tSPCgzO59h_8V7c)5OGf{1&$B|#e<%UX!n zMc*|4RnjCH0f%&DpIyoLON*@5vqeBi&zsBzZn)zf`rnlUALd?)cr z@1gjEtnRc)wr)gdQ7IM^z%Fj6?tBurUb;=%MEJhWGyjEw+Uyx230fGZ6}? 
zXF8*yZv5=u|TLK`Uv2;;7B(mey+J~qmJbgN{QBj zk^2^4+H>1K*+<64$K@Z^pC3ueG%L&O+d(PBoYI~uhxD8Z?RQWF8N!Tz8gBZ@oh}`R zZDMnl86FARK+4xN?Gk&r9l$Rce6DC-25wN#5J-+=p=_DS{uhPxJj-9LFH`7S@8=HH z;tBUXeCM|NBeqJFpQL27dPv3cnj#p|glsS$0n62H`ypFQc#itL>nP+SiKuX*ui+OQ zV~{hSTQOCf=DqPs^(~XS;agb*=t*+r%@GdhVYaCa>uL$<=ntE?#%4=L46SHU#d)El zbeJ43<$DtDkV9RL z@ja>p8@@)Y0chG|2Af1~!bgq~i&T?oQjsBE8T9Gb5ms!NIuZe2jV(c3OlMidExvj- zDn1(wP@K0&Pe>X^73Q3n%M4`J{t#{kdm zoSYh^7cV=Etjh8R2!62>4uC^tuwwTZP{`!dhz)}bG`In}7+YlcF?Pe`2YjUWrxR^i z|L6jeP=zWrLbB0^WoM(C(_d%?ILiATpob*PreLd5ATY*qcas4$r9nfhH=L*Mue{@W z1yZDr7~`ee07m!lughJ#leg)-am6F=rt==D+Vh?^>-#=&1e51ynm?igC^3rORidhU zrNao2_YW-O{9pWQKZ<1s_?Z$I8>8T%xczq89yMKT9W2TC2PS0ZOIH(1$|0*xqia_aEI3Rp zbx+H}NcH3|hcgm|VrJ=k`bDLKCC>Q-6(x(ya?BkCm*ixUK3t3`{OOcH)-mWm4bbZ< zY)i;!L-4DURvAorwOdC0R&|A*$#TdK>DE}`pIts(mjx!v)U36;Auh8D+@*GvHa|8# zUH@LyicN>`$~+{7fLll)LtS2}lBjCl@MJp-;(% z$NsVH2FPFCg!rgq0|FqE&yd3c^>CVq%=OpdoIoLwJS?7u8VcP|i3=Tp`j-yS+Z*-@ zAojHqwFBE6lmwx;N!)b;F$I&~9k%B<7^K?wv%V(Kg+XSH=F6EcxF}MkdZWU)FvSL7 z5vk}Jw9{mTSoaj>Q#Ca$tuu;h^1^HYrHo6Y+U4!FJ0cm)O*g|~MleQ>&k%XC-qoc# zbKz=z#uldr!K~CUS%|kTV~Y!8h&59tlAJl-=n44#bbp+96p>rHf6xa+(kBHnyM#BA zdGV&CW6G)Qh;np;gs8^dSL0te>mpI{rL!<^;_O%3-Rm!t>EC<|m*;mlV&Jk_4@sEY z5=_+Va5b?Y9Dr%4TwH7Iwmm4D_*OYT*mGI) ziAkaQq{K7ElF5H^%1HP;GE;f?cPdlb^o6CiVkjWBLlTV$tB%bGf z60qi(F!MS#Q8{PMggwskPa&`SHijS3o14jt$W>MebYH!JQ>$`sJa`g&a1%;y{@V;7fV^7NBGfG0M? 
z#02Jph`@4Vq7~O~0p5m`s5frM;KfEBa$Ijp}=$T6*{bHrs?r zvGCg&Ug)EAizE^Gl$C91Rjfmr)_Y*BO-;EvHv zG^$&EB}1R|xsDbZk?SOCjeCVWQP_D5!$YAaCMWM!Y?lDe5~Es1i*|%-VT+!-j+y%Tq~%y&D{uIqsz-ssn_K6)umoAUmCCj3ZL#&RkREkNgJpFE67balLP-@fc+B5>T%)+R|4e#7R+cZYD7?6sh7fiuAmU9(Nr;a8b7WY9t>B7&ScpMT*@92ZkO~OX?Akr$}uUrLqxi zD!Cw%bSf5YN4fd*JrRozmrC76byv-auaQ3aXZTZCQ?w&|r$OIiQfAW_w5LRH$fw9c zm7W-w$+9eWcjZUlkMcwB`pJi<3K`nvjr|Lv`{%?DohGC@ULeD=v)ke&sKp?xkw(jj z10+cOpaY7&D1jHa3cBH-2}&gs(+&?*@Q&xj?E8XQ#HChk^+ab@HJ3axq8$_&!~Xs9 zc&{b`rCf+K<8vERMUuHAPLUV+L1|J z0xgzbFNq`S4%X2_b$};`xyB*+$m|xa$kZKc!kZjpsrk(#UfzTNk}Li9Tt7+!OVocsY0_KejA*OuO_~~zbNY)d-;7>OZE4Zh2X-?y**%o_2x^-{{7RmCIIy*t6 zPeCZWjw}c7-uY(VE$F6ZP=GDzHn~xL?`{h_MLi;!v|LwCD<^K7`1ulX{Mfg^!hzT$ zYNukzBv5{KTPHRH#!n7I*>pwh4O!>6sX}HW38zhPbtjNZBv^LZ4??6mD013olbEHn zSDU^at?4fwUfyM2{%-!}dBVzyzxXFLZfEozbeQ(P2d5pIW&abee?}#ms4VI8Ac&G( z%Q$C&9%c4$HfcT4vC2YeH+OVhkf=R7Ux}MBmd4G&LODUSN8TOFBq2~oQ8B7a-9&@% z?3k0p`GW}!TOK9OSs$%I9ZuSxc|aR#;AC~Yd<0~GE|w39;9ORRlfy5KF;?!8FvFE0 zvK~n5z^Vx8Qp=TKn`mt-;l3gdKAGXQnxoOsEicXyA{Z09>s_rikmf` zh87Xo_o>!E&N;(>)`8t`##tks2;*-Ch(0?U8ez$k=SDYF^L=Wq&YIX8!2-ILPrr5| z-2i8;QmRUc9I|;z_^DbP1Bo&BHa<2cCg&uF#*u79k~|*g6F43FN8+gy@j?R$-ci8e z7?|j031s=>>l*Q5iWceQ7QnMH($mYOUe{-^v$0`+H$w;)XKB>181($P7yQFzn8WVokMQ489x93+p-R0vk8ov$G?^byRV2JC%AKBlbIi zxVD_)HVQJ+G#ptE(%F_rIPcwK6`37qK0`7o-j52vTAAAT;6Qr68f)(WrK{%cgnS)qn>(B z1PqtF4Aja#J#%LYAYyx!ixeAivE%f+MS13Y8#r&0-i zN~9zXCFcpGu;=~lD6m6m(i{fKo5&_mt7Z;5io?CbJTHhB{5?|`PO-1aJ){gZXqUoq zSv)?onmeyrUzOHriJ2uP-0JRuX00RhVi(v==Q2e=iH-e+&pEn$D8QB_PJqa%Qs@q2 zrz4I|BM8mQ8x={bI=Q<<$}9VyZ_JBKST$VXPTN*oC4xV|V6DBYP`BJ-D(m}h(g)$Q zYLLNO^7SH*GV*jd$drH=#pNw3GP2VUChmmH%yHXUXCKhuF?&2D8+xl077b|R)y7hI z%^u_{W(D;~!rAC{Q^8@s{y*h< z`K~U+hQj_|7aW+UV4yUKrpu2H5@R?&t+hQ13m`*wzpu5rdgoKY>Ab( z{`tLDwS;wPBUZGi#cAI_olRFYOSI??A$5@Bt^joArg(@RiEQp3VEHQ1xUgEz6d-C= zXm1{{G!qT290E}=(MincAll1cZJ1-~BO^T`bd{(u?Oe$gBaLsZgNgW>%R7`rx)v*s z;=PY8c|)s>`{P1|4htTNnc%(wJ`hY+ zm+CqQI!_h+OkBr24^?XjUZLMm&yfg{FBhi%(4iwjZmqjdnX 
zaRIuu$)oi3Gb2D;_2$~wGYtrNr~p!INfyN$;XFoeZmn2P^(Nl@BrLXxkYNt$YFd1| zpjW9*7I3Y)D$YaG+|--k zrKPLqHwen62CsyC_x8xcDt|1mWI)o;JbCr&+*8H>$KUy4^Yny+(ll0qm58>7dmI70 zNiaa9MHBg+0H~Lv8vY_+lQ4+2scZT?Iy|hro1>-Y0ION4Yqre)I3!(Ht2e_(>~TOQ zQDZ6c%#!3*cTx**+yUbuyL51mWsi=46z5usk}v03sZKLH&j-`_WqR@R=KpE0XtuLt zwtjhDO@w8vG}*qBHXpFcUbQh5ZI^a&Py+aos#Hw?)xLSXutfkbV8_arEDwq$KPr6s zRH7z_m}(B2y_RUHm}!a6@M5cR@*uc${T*nzN|nMkqyE@#K*7R-PL@cHVehVwxAG;> z@DE-`5uH0CbgPta!3WD;X`zC6PTj4&2%{6fs>lN;?cGk^TYjw@6h!DF^SA8~OZO*ZsOz z&!&Ak<8uG=zW%?jAJ|r5pWQ+5@Y>ND=7Z0*(8%;VjPSniH1dfOOt00av)gGH?|!ce}lWmPq`M9MiDT<4hLDSgt;hj+?&FKGDA+Jz5~M!ce5M^wabT!lVS0G{cwkQvp(H?W zq51zljQ?wcPDfDwC<(g)ZY2$TNI)5V*|78yi*&;DNmh@fn|eX4y47M;rbon3Y=QBz z2|#0OL4n(!TaYTlNSO=yT@f&G4G^|-<7Jw8Gsy3Qp`u+8$A~8*-Q@$5Xu%>`-bU7e z%$XN93Ixb%^b^mUq}|{PZhmJToK+I?3T;l;w33A`cV|V-mFA24Ngg|lIrI4MW*aEp z)y%R1saGRck$VJelj{H2$Y(}83~-b_DNFu?lz&GX;fr$_fF0pdNG;;zGctkQ6Qv~dVLtjj*s`RMpf7RwRr1Pb z5Xdgd=FER$3635uR#&%gR$Mp^wG`e*xyQX~*iBR29}MtBgNlta>}*)D5&HO$_V5%2WG!#LNTD{Pcw%Yrn{ zVqsS5(*)K6HhnQRX>=M4sG(sFp`DPBK6S>o=gre;Ejh2CN=MG8CWC2P=^kchKk|GX ztg{pOL?b^vzmc5?`#g@4Ql=TQym0J`f3Bn?v2iGFmSCU?TvfPD|NCtLH0D}N{aG=? zmy8K*kjwU@*YtO8$>Zcj6x`ypC(eN3w94EL5BM{j!$7MiP*PHs&h&uuj(&cbI@>c3 zbAXnY_q;0u4QejvbS0f_&Yu!Ze0*$l<_@6x_UCb^B!) 
zfs|1Fd#`&V@S(W@b>+MXgO;gpDkW8h@eJ>mt#GADC7;)~USE(8&asw5aTE&1KN-N}xnS48Jk_jDLX3hWR0ae|k0tnP4ReU~!F|KK;_%ny=!$7rMj=St|GgSrp-~>& zvG?^>D@=H)`w5|AakRN4=N0V!B-!UP1JmboEJttsX#WKDoSav|aDOl^uHzBtg3TCt zzP?S(ekPgOpJ+x*im^0n4K|G-RVIgyim`+pL^tKKjf!Ymto^cHc$GQ<5|q%yX7Eip z?ZP(l;Z+WkNibq!;hdCN>Ik?+?i-1Wcrr`A)s54qmq+hWa7l1M!{RyiLjTnIQJ+za35iwMaHmrC;k&V;;fi8X#VOI=oF zdazEm5+xpnXsAX!A-WbNA&|&t#soYg%_KJYNSb=8kP_W||Hw#0AX5!sx0s|c2E4DrkGBOi*uar)Si)j=D5#u0Ecqzh~HbikVy1k6#Tp=#ky!y zazqYHXOzei3Y}q9P-gx|J4%K;W(PM^wCq3@%jU9M5*adca5Xryn#u24rl;dL`sX_d zW@3MwEHB-ow4K#^*Fv4>A&($=WSgG|7~9!W_UQ{@FzoxO2`dq%Z&BcOviX7{?nub2 z=Uov<9?*O5e*p)Qd0H|qc6T)Q2wZ?_%5H^z$Q=IXZs!9Q=y#mr{Y%f}w`t{5Q)Am# z$P6ji5K1zWX$xte=XNN=X}~r#Bu4`m!je=sE8E1k@jcs^l|%EC*U(02D5|4uU7lDE zbxs?}+0i-q@E9r11f@i+G^&iozX<|=FGc($;EZ*gwH>zmfj1}yZ}jk$WM)+_1iC4k z3#Kfa@GLn!KcRXsz=y)|4-%g0WMWEtjOE{{NvaIMVHPJe1|`lOaUUKOh2y% zCoI=v$UJ9{%j5O-8X=%IA_tQ3tvE*Px^`+^Wxo{>!S!zT7N8!w{C>F6bJexK*PD<~ z@MLDQ5(t11?3TAK$CclV?H0Cwn^VL8mtUgH1v3fc(Q_v1iM6YgI4Xc4i%L_EMa5$g zPSgNcoieM0BO@!T3Wn-S6U`;%ECF(iFa=p5h7!c*AYkx<@I$^N>Q7L-mjkr?c)mM? 
zB>AIO0E4y4M19A#0h-;r2P5BMU}5DGn{+~!r1ij+X-bSpv$qD2ka>AOC&v>&d8kvX z(k)Ae{Q)dE(U@Q@1?}U~-c%QzY)M?tfIQR;ve$asz4L>TVEX{KCgS{d86D< zFN5=L&fN&Fhi1tG#bgv*s3P6ocLSZiV_?mg(b;)u5-FWP`EyM6-S(YEQEHnlzp7zo zR7{p2cB*DfT3CB#SD*711o~JHO>A2dxXHB4vX-sAg)eV$jY{x8gIaabvE>@XJ55YS zCrK*D<}=~R;7G0uOt#93*}!Q^O~6J%+^yHT9?$5+FrL?r?mH zxaRt8#N&nZag~ewd`_m&99dofN{BxbtR3;?AY=p|s!ZHXjc5(10pP8UFrFrycMv|P z=saa94anwNapagQjBPKJVJA7U78J-;35y5|ArRWlBLRRUL#s%mJQ&aALp}NT>lj*@ z#*@c?v315lveZC9{L!d1KE}#}M-YUtkRtGWdo@C@z*_XBv!FFnmKDZgT;g|w8UazZkab<@HCi*$JWvzQMy@xtC|O;rQx*~`DDVx!3Mv27;Q(CJ(vbKzL% zK%;hkan0+bF+w%MF&mpN{qiyv86QvioFa;o|L zzgMcooxvHlt5av`Ptxx0Ec?4SUmZcH(d-xc02n_^4EsR&`mAnj4SdVR^^#Gp$bUC^ zqsiguF-c`;WT{EhM+5~fr*K^vHZW9#5+dS|OXF?y%SB@GDqZsnRd#rIijF^x(w%=4 zVTc3+OEkvNJ#P&|robhDCPoRb$@1SqKiYOycB8STZgIATDrUog!paJk`5WWu9LxdT z)ZX;9nSP1D4>l(fV)f0RwRS#ePi5`n&;kV-=>{9jZN}7cD^5`$49sGQUkS0G9#=Cz z6?th==f~G^olf}BWDAOhtkS|60X3c|Ra*dwM6GQuejKK^*60Vj*2JJ>rF)8R1zYqrv2 zx?{H-_MlcgwH{bpLsuOB9(IET6Quj<1sh z7tDUUwd53aeYUe?0Fg`-7E)0dFtQ=S_saf(yOF!97ZxHXA$0)fHDRIh(bz^_taVg z;stZ(0oL&OA8N}U=)%BA+UvuM(0)pp{~ksKx{(+Z*IVGg%T5kj9O3v-lfqsf^XC^q zLDv;=FC1EpOwnCX5ZU!#;%}ZG1;!5#haWt6*@QlL_OMqD_cJicWb>2YlX#JvroFG0 z7%9sj-%w?XBv%t7WvjNz#kPPY%RL_(6FbP}NB{F%>wIdoA&16Pm$xIw^n-yI>Az&d z7!FI!JfQO4iopMNGA+i9g!6k)hvEhS^GY3q+No1^JRL!en2M%S1jVs~MP z(A-`2SjRPdgPG++%<8nqGv}aw_AmAU4E#A;WHs(r=DU89(X465atFvzXYrOc*fHtl z1aKu(ua|j;f7QOW{M;Vo-{2hh2Hz%L{wlg8Fs99e?6T!Khm;dF)zCYrf;q{6bXJ}U zyP){PC?B$7rsh%O`11$w8#}8+Mv5{hHmukp3$qSwSIHXHxaSLA!QPb}3hyQw_Sx?N zwrTglBks35q91VH0eAmKFsjBOb|@X)%=KG_-+2TZLnHGPaYb`Rn#LSSf~iSwVKaJYx~{(RVK;m2swf+5>sC3*2dcw}|_;)MzJv+76b%xg?XS(;;pu;c{yq)nta zGJjtp7(JPy)W_E`YSm<_t$)M~oHgcCd^3d}$6*mx+leuby_ID-gK1`WX0T!OLWKX8 zX-KEN*N;_KFF1^qo^`QSC>c;nb$t4WX=%pM@P#OhEuM5E9ic=qYb=SZv*u@gMVp)T z2xj}=Wecv5A*j?>(W|JKL+jiY>pj@ELEXIL( znAbMtY(6rfZ=gf5NQVI3X59g?eF2oQibBtCK`>TQY&COA+l$Zl(>N0&=hrzdcyu zMEr9BmaWy~2-9jYm^jG@W&KO(Je^VMFDb=In4fbF58qzRsqKdfC3#$Czc}O)2%zFD zdAlBi@o-DpFefv5S{gWha+#ytyZjFOR>MLl-~i1Jwnplytf1`dZ1OoORN@HIvZ+jz 
zPCH7kd)h6nDrZM3e5U^<8dr3q*U((#w;KU(eiX{#oO-bP-5^Nb(ZMh7IVzgDFY0b4LDfz4*Y#5(d}_kHH}b2%&!)t4`=j$v)HUsi9g&^ zI5L$7?AQs$lHdo$2)3cO?7hQ_n+ju7uvqjf*zQBv*&zi+pLyl(R5oU%WH&$`^2RH_ z09(~UG97aWHHQnOLnK-g&0BAogR|oc)L$eY?vG#kBM5bl^lVUXZ*Co&iFq+U3q0|h zo-C;l8FQuAI!qC_UtbK}ZfE=aC^eJwJkNe6?OtCm<-8+^x!p?_E#eKzLwO5c?a$=) zK>a81cs-i}(kiU(V?GyhvtG@ufJk5hFnc;Zizpu_u{<`C(8e1|og{=tHqvs1@4=q|w^4sl|QD zBYytAeFzC@@*s|{;h3MS;}1OeIps38dXzb2KDLk8GS)OPof4Tfw_nG7rpBs%Cg&+{VE&JJD9$C%%eO3xD*7`%W+}RS;YnB2F zrDDfL19&Syn#>8XY5yNwBpcy;EagRi35;WS+;$mK@z+SHZ_y3K5-kai7GTG?*}Jrn zW_#p5J3i7&^(^g5!WgbULDRs((n!mQtYvi!zYKOmQ94g=l_?ZCDG@3RN$#klD6B`c zhWXE<`+Sj7_gK^&b6x_?3Mpp~{F2nlxg3w?q>Q!@15f^8dE3w!bCB;}-Ve95SCLPP zx;#UHJz(Q3`ezD{DW_$W?=?r@$*&;a1k;qq$5>po*UORVq@GK0tV7Yk=q9C_RGJ3`|R#D}~FG zCk*%JQ}w4Xd0a`v z7b^u1nxYDwfLk2;j5MkL1=V-ua3IUK2czU^}hOwC+dq-c<#kl*K(evw;;24`ZE+tt;k6IY|ZU* z?JILC6+8cjSj$M}Hk9@Ddn=t}mF*c|zZxA){C}UIh;SDI;6{Hfs0;kB7GMgG$392) z97s43oM%>YIQW715{dD}A#^xg>+frVkiu0h^>@5fn0AC=$lP^GHU8EjqO7)~*<}oN8qaLP~v|J@EJ|$7DtlWcx&Atosa=#!{%*-a7}noStAa5v0Bq za-u@%=~0{*=s9-%-TRuk^BaciS}5xq0sCps(EZ8HQ0_GlFcZDz&7XA)PLV%xixnau zb?*QF8(z{2)sWhnK_90O@ozV}mjujvFLP=0U1+GQ+P-1^d;CkT${R${NbxkoU-t;D zj|O;E9i4g^ua^T!H&CF#OTz+nfknl`9%)uVE!W=9JVIG4PjvwW6KCLwacJljl%i8_ z1fHU)?UJt=g=Egr7&Ar12kv5*oJ5bI2#_s=JYLQHdpY1MwS0G@mO*Iw(Y%=WCz?7( zr`nqzWEIa>N;`YIxjG#l+91BDciDbClr%NT=>Xk$&1XGS$F$@>RH6xkB~?CbI5af0 z=SQv>EEO{;3!uq8fv*Vx0ihcpkjK*~f%4#Z&d1GQ(1gMm9+ME+LQGMDkg{{)SWHnJ zz733tQC2V2MD>nM#858hLVsJesz=Ex(yfMY>lg~O4Oy+SrbM5~V;b!;Rda*Z9pLkj zWzJ3*Kg8S~f@ak4Z{Nf{Jr$ZKm_Z^#=ePf8?0*aqUygV>V3m^G z%+A)Mfp&^C^C9ECEt{Vr21~j;&ssU|rBbHocm^*5jSXg8y*2J+1W|a&5Q$X3*G&&I zN5eS3J!fju{tq%|pS@J$t5Pi75Q~3*C2p*%Xq_fkgvn+zf|aR2$gdwJ`ta~@w=dM| zm%8^hQ$+)VEUCwKy3x=!phm87+E>(b-%ZqRaV(T-W_)?6h&wh+Yr`Uib#>FoXUF%T ze?e`YomIDXgWQK>CTt=-YoR*BSz&jkgsdEH=3RV4Bqh23T^ zTd|2ds@oAQ2nb}j5?JXrDOSpVJl`zQsU%41cwoz;pomS3f3cGrPb4vFU-!M?@!mtp z^4fqx>AF%ZUXWXFM8L2Y-`>PnMo0Pbvd{KlUpu96kp7rz ziIE?r2Fx)p4{$%u%&2~mD2bIu2FWna&Kl}vkk59zS4Z*si8xTdCHXuKlmnYTsVrDB 
zYzydYQJSXB7Ryon;D$Uw{Tq>Mmt1&fiF;v_oDtc5H{rZ=a|1S z{v{o7P{LGK(uJ$)WSyi|qs(abWI1_fplhAoj_@yuV3^^W_qz*?@+97ylYa36h7Z7Y zl8TW4E_71mw=lpr9Rwn-wq6{SpWFcbM7{u$Q;4#ftQV^eezvf`5T=8d3k!g=YS!`C?Ld(M zg#T~y0I!zxF&vk95NKg9fGSc?oWTnI$81F4B6yPa%Vx=?A&2}ofHvI*o)c9kL*R(O z<9A^I?3pz?k;_TJX!LP#NC>?R5A)&)p!a9LGETrSQO`PZ+ltK83>hgZF)I+a@+1vz z_i(#NCN>)^O1zH?BvoER-{9tGX=ypi&YH9I_V#AMgT}zdF6h*VdwGSC5k8sf?hv_> zmVI?C&b|2W%~Q-~O6``y_%T{v9)TYq>vhl5fee;Z`65uQ1)PEQM$_cyg5N)RwtH|6 zPG*KD6T8EBwYe}lJps!YJ|8{}m`N2$e>8tKEY!@0wgrOO0C?5hz5)Dv>%H(0?%6R_XnC{~ef_gnKP}{9$u=pMj;o(kT?2S|2dJ-mHDC#>6QA zF0gI>FGqo(O-Og}WUb>Jg=9oG;9=%X^pfW#4-XGXd3oF~O$(7sP=L&R-Rp+vCGiK5 zi$DvNx+Q}XePn)*>F68v|6c?$Qp8)t9I zdiVt|V9>I7{3_47qODX^k_$Kw61C3znrz){Z|$#7f3{RuQv6$Dg?ODqZE01|ToJM< zzf5j^ap*bZ5mxkcgB&<`Bb4_g^QCI=^%hf-APj31g&gM-Bow42CFVt1Pm?gPuqbzA z#hH@pNMS}maGXd1gM$Flmz2ZU;6lrREyf5C_McW7O7Alva{xue&Xs@$d|VJLWEtd{ zMLk=b33TU|%?Ld}`N1Ga1O(?9i@;#uNWgv_NG$sPw~>J^$2gakl#a$spVnAJ0v-8@ zmORYg_c)J+j%F(?jGRkdH|7V*4Cd|GS@#4Q86taX96$TnjexuA-kMYARTi?-zH3(< zi1&1@u<8ThTi#s8{Rp@i#h>bc+Gm9#T3{=XIkNQb#S+p6^xnUwtcKA)@hMu?5V-z5 zSx)-9PQ?zQTk1j3&N7hERLaCt*}V^qelgQu?fwvZ^F@q?tqOqeOwbDQ@3`$wJnw-! 
z)%Id*ASpQ+_ewGAl77p0pXJdn&kpY76 zD`<3I-!p`R0O!Vq%1V#pg^&N<9%ENmbb9@QKTKF@?E9MuMT*W0Bfwg@63MOmXZ7=& zRoh1rWs41Ax#eFQq;pU{y0yK!Hz$YhZyT-Fe!hLaFZa55yaByN@F(9r{xDRE!mh@| zn;1Uc1k8l9TdcHY``B5Dpa1=lo21+6Su(pz5}l%H?3{rBg#aTXoHqj=)u~m!%uX!r zgj%d|SYF&gp|doRT1QL>{J%8G!!&|o~Y#ye0-LtD9Zq+Y&SM zYJ;PTLNjH5JBL)2etqfri0=CehV&H_LW+c{Jl0f7pwehBJ(_Q0Hav zyNdhDX5IVu1lTaIsr6WYKQedL9K)lbnA5q+ckq2V3Ho~081P1b<-iGvpT+f9^8>h4 z#EXL^V&lm`s3t-?ICK;l)F4UZbpmSF3t%$*3E&eIZ;TA`+F5`_wB z93xpCN3=tSfI9vXp&nkTRc)Y{a|Pr^F5rD0QZH2T;7eT^zR}k0AK(}p ztFO$SZ2od{-9)5QZqkxevqz`KIzeaP=BM$sBs=l$^Dr44)c?HE!0H=|J}XN(a2zef zaD3W^G5bA+44n!Jytp^17P11eb_Q9f={(b&x0~=|Yc)MwPj|(S7yH?N_mdcz3QJ@Q zWvixz9|PVr&A#Jq;wsKGr^P7wn3kStJNEB-^Vt{kjF4LN74$#y8v9LAT+2cabIdrT zG;vOxJ+0P4@y6F^JlzsHrS{$|MGaE(uxi)~Iu@CYL$oAb*RlFe6PzxMQk3J9xg#j3{+eR1%w#*m5qw3Z@t2sAS_2A}(<(P5h!tmvP*In{RbS@pqYkRAG81XD`R_TUMG_l|b z2;Xs5F$ey0S7yu1vLULT)90tNXFc5_s~nt&pK4dHLQR0`tmDtwde;)NqC2l_Gp0OuSwh1yAF}ay zYk#IOqCf4`O|@}Z_MY8yZB3|fUAHJr61`iImCDKv8#g0EN8x~pt6m*H_h$^1R7z{W zuiYD9#z0aKtEIWM!+oGGl+^y>)0|k%I782Rv6{9+8u^G z+_{=Z^ExoNr8tC&WbsjSY(L};r7#3is#Z{fycmR6s-hI!*0Cw1>rXU8QQbU^JJaX2 z6O?YmiUcZ9^mhok+F~CBInQ`bl-Lq`B0G1Az&Ki%kH9_mfTbxbVlUk z4o3u1pns#bTe5z(?#(*ow@cCUcjkZeon4dAq7vXm5)Bkz8a_dYZ!VETTDGoE7FTmh?Ku@9*QhH0Lh|) zNJ}^a4eCpy;k4i z*Jkm%ZAV?9U{fc@k&S>nyIy7)>ob%GkYs$oBp|EtVGwje6~{G}_z&_fi|>p=hTgtI zvTB?VZqQ9br7-{Jr*M8Bw~}n6DjEt}E#W4V7kOCvI~0H9=*!^ebgzYZF_0)Dn9=6j zOSW-hK6*nYPnshaqOW=s*c+BK@jX6$RT24&W89evtdL`fedICTj`6)9kKg-BjfJF( z%_tbo*SRZRfh0N-E*bV`g}eFiPnYng+!CUYNj{hei`HvwK;>R>M=(i}nc3=|_~SZ# zD0PgN>y&5W$zKhWyNAo&yb!`VQS!4YO>M&@{$rU;qk^t6KcVNBY-`rUJ8`t4H(hgr zpV!wZUu;|K>rlHbdYYGO;@&wDXqco6xulZf@`2^%?{=9Ida@b0rFdOhmYZr%an#o? 
ztayrl@u}#z^I2S9hOZc;Qnul6Z*WtE!)BOm^KP;45P!@B8*c{8 zXrk}dNo9s#M=USv3uB&a{cDQ3#uC%(Pp+Eu1(Qa5&u44Zt7ker{$@bHEQW(o>N� z6f_Lwq_{<|c-iTeG?hF)UHbIAL(xWQmmo*R@#GcifS^B``PCLT*t%hiE-~nrg%QP# zqpTG9Vw&;##K+!dp7lXXu6*krp*0$9hOZL1p|{nU$aF3G3!>Ffp+|}QuE<+aWA!FO z{H%wyN2T+;W_f}RG4a?RAV_82T4T=J8l`;ZoE{zV`zH4q$PbjoFnKpm&u27u zyn2+5{q)51W2SohOXujMfA!H+;m*Owi0DeIh*!Brqahl;&s2JzEH4({wQrkf&hYrK zanhv3jEP>=zU|3Hy=9)pSQL7htnm8Yx^n5#Fqg-UODUKBNLO^if)Ur{Xs<@Faj9VJQ+wkt=Toku<+-FP*AmO(&Zup(#^ObJ+JY$TEZs?I7A;t({k27u)EAh5#o z{f-WQR3EE*EJ;W>8|1omoVaSXtd}YLqcSRIq^?>qlYObT>y3!~u1da$_40cGFo1qO z0Zv8j{lNHCzTkS;!I+d_ApLUHK6ciT=4Uy|079uBcMT+thpSq{S<4i+uIp-zb=frctVi z@pyV(T?^e_z$07Hn^$d{JZh}paM0Jm^@Foq_o|4;haI=ad%va^_98tU0X@3;J@+>z z1pQYc+~in6(0vjGjnD_*$Zt#1_aW-y<@IN}Dtqrf7vRwDR68SV7IR)swI4LP_URzHXU?$%ek5wrB}X|-rkBQli(F$IcGoAM1!;bd&Sa| zAaF*nR+X*d5pg;W@^NJ1o3gPzy7_$JK%AltYRMPA7toe!8(sXbRH);X&%lM ze$14nA0InP^n{+KLjY_n)~$CEt#O|Dy*U5yoFrXHK6*`2wjIo|4zIRyIJ^xt($*!= z9b4wEH=-W9lZFX;9qsZpz(Uj`zM+=psi=FENa0vw^+JCF<6x3RR1m=lwiX2TtRj%R zfQv)Gd@W#P#GtLU)up`KqCM6GqJHJTo*Hy)Jg@+Kro?5cbr6gmyAF@>%goofw1|J7 zzp%q!*dZ$JZ^fCrEygx3+T`ntns7Y0xZRSRN|jSop?&MygJa|A4K@JW2-{F>YDO+G zK3nxB?O{9>^$oh%#JCFHqdri`7l8mdaMuL%jmvj87lz>)TWK%`>@PYwIaw#nQs6%` z2}^bWKR}WJb}^#tPVL4AngVi+_(8AWgWYV5*FwIIa@T!A8R%JnU?fVCo(jV4Lc4)PT3MH)gn;Y7|gF6v%%6!L>G5vQalz;^Y8TL zYV+h8RJ+JEIA8hWT9Bk4BlHw_q2YJA2>n`{T7F@8chJ&@t0&iyLWes0PQPIxx9Qh$ zqCVB3T=kV2H*OSXzG}MRs1)W$>CMwMz904;Jw9@lz2lXuwLEV8;qK>vQt80&g_g;s}3dMKBg+BnOH&x>`pQy-P&1~gx0o-xy3J>EJWW;*2wm$sKI(E@g{=Hh#KC;@w#YkAJS5(tu>{ZeXA zT%2+87gk%#wJY&5+>~FnJBk3%tfBabbNs{rNHfPYJ_h!6)yERZsTy)_p|u2?w+t#; z48RhYO?!VEOe~K)1Yd|rW}-_d`Dmxa(S6o-c~3=5M^UZE*qlsx3+Pu@IF2_^vliA` zxVu**IlUv~uCW6J^_>=Z19gg|hsB`&+B$cprd&wUEm-I?kN7xj5KE z@yQ*dslCM8f8Mdq#AD1W6KgPrc%DVjlZXVo6S1efU|#q!@^j|5Q*hV?A~sgsy%zd* zn6v-}vWmdZA>*tZC?F{7k0`2-0*$W)J$cs;}DJ4vwOwx`ibemt?eFvK| zi|oalhRq;i&dIg5bP;9#6EU$Cb92$c-uL()Tec%u;q3l;zI~0Jgbmlr%d4-*M4DCR z0ho0Kvki+9M?25SquDjGzCHvFJUd41izQhX0KnDz@AehOu@WPTX<^(YkjMTZsnLE| 
zP;~7h(`Xm4=ZgTGswu`rPQ-U}LE6aj2Ti?M)myYvO#-r^Vtb3F+$Bm4mra$};(^ia zbVxE*w;N=WYJH#3&N9gj)^J5TAD9B)2f+w7_#W@;H?yHC%`K$uDhPXeByxasfyOB4o|4;4EyQ!^Y0aTi7eWy z9zN|MStSg}J3~9tD2kEIxq2^a@WV6lA9l*um~*O$TP&gn^Ak{sKHOid8%Ed?6_{~r zE`HqG(r>Gpj<3@1+qLdVBrx7y@0#rNxWN4Y>>^YnzUQ%kD*9D`Lu!Yt6h7UD4Y|F# z13F11AU-t^22`T(xsmEpM4)O!0~t%wP`W>47l$QP(tL6j#uhVQeBi_0P; zn3vNC*bL<9mga@pyk4vrhUn3l_W~Eda*0XqL)L=8O{PZn0iR)EsC=P-c-PD8_WeC~ z57EVoe*0h4)YR6}yEQnFs&-^u%tdTB3II592PbZ&oF`tpnkGHM>9I%6=knl z)cmWyqSu-YKTQ=y5AKP};mbaQQX2d?E8}(HS)vY!j*s{q@yL3kx929iJLaEWY|Qng z0YgIOndJ2_Yoi=iQ@vj9=MU-5@^ZbdjA@plcR^4*L-XIpoL7GsJ^#w1%BImS%ROT{)+1wH(Sd4cUueZBszb|doJPNX34l<dIo~=B-iA6NBKfF8gYU37{Ucdn2vm zbK=mucA-xrB|VnP`-fgXp_>P2NNAmg$dgfj^24)HPopRU=PQuN_`O?+<>c1N_&6=d zcq&n&(A*g|F8)&Vs)U#8Ickk1wqMtDmRZ-TaZ&SZ8AgT1Ouy1k@jO0}=;W z!|!V7N!)PRIk|K9#+Uo{atx{5ECUnYgbAwNeJJACNc&y)0xX%RhmxJPg@b*%r3S2B zRzslzww#vh7P)?_Z%Gqw9%-9Br)@hTCuNE^IL?4(>Gv(wBZ!K9g9UA|0HBbXNT!H) zj=X{#U??FC{`h*es@0i_6EDv zXsw4wA>I%SorATtwL?4TGtb;=9rfK?x^1xjQ1BVgHzZps%}lhugR^zwntsQbUKy#0 z6qyD{0~>_UZIUE9svD#>YBrQ*dp^}!5Uw=Iy!n$05Ly>0nbof;iX4~Non=x~R_5U~ zzWvaDobYnF`i*!Qe(G1#(hMDUFO!0tSg1y6C}sW{DkeT{?h4!i{J3OaN=4kuHr6}C zawjH?ax%3PGaEi{C8jfwum@}g;p>leqe!WH`W!~Af?A$eUK8K{ENDMy`l_dU@go3` z*G`;1iD_nh-e3Rj@mvpBY&ZFTv^L!CMeK7zA3wsBKQApxwX~R@3ZcJt20v1X08zpfR>BmTdel=5|7_b82P8DzU(2}dC8nzJDX5? 
z{hE8(rO6Ad6U|_>JlFo>_}SpbFhqTJS@7Bqp4KJ7|K)?W~_KHS0!Fn+#9*0MZkO= zC&XHxn$rm>!D!x?Z)C5%wPf4YpI02^GL?5@XAO(pE}QJMGf3yXWq2ZE#x6kbv`w&5 zM;`IxEORyZ9e*x3|86Fqtwn2|#{H5c=MU$&>j7=14AIh?cH7Z#4R|M073j-$T0@{S zTxcSAKZd!NjYT8EDpxDNMEZunQvccwDo>% z%Dq{5MrV9$DJj@})E}qfYwgM+PiCSilg|dhP~*gzUa`Q#Lhz;uIy25Ac)8{(67i%{ z%@7(#i(p-Kt|aP~s51KW(Rz%SI5zB{!>;rsaAJ@A=aSd^kYuN2Gru4HcoLSCdz*S} z){y?G!91b_ZncW`1nrEAHHW#6emc@?J!(!@B?@^rf%27~@FynR-&wBR@_9}dt(F;% z0#`8RiMg73{)hL(`hz42F=iff6EzOE*!hszNy-gd@#UiHO6Nh_dPf|j3O^Ri_jl!J z7T6g#tlp5BWkky&?slIEjEyeS?|R2TLK9Fq=G`W1mYt3kvE6;EaSe$iiq7ID!#4Pp zfA{qs8o{TzHS)L&xM``nZu=u3Nt` zBZBxhnJKqx+Nt4lf4>b=OT zc(6k5(8MpXbh1Gv-k%1~?qK@7==b-;OoC-V{5&hjnP#e&XT#uSN+Zz2=q`Lk1(#%% zv1hAshA!6yqe-fZP1DdG^XXdloo8FvtZW?2k1;5T@hj|1wpY~7FQJQAGqg?~_j7bR z+$;<(y7S^YI4#izdmT0mDyF!)y1MRQ4e3L|m4T|CM669pHjcjy8bVc)_Xxp(?&d3~ z{?I+&|9oENBBy~><~0X9a2KBqTwkzkzmls<&dX&gbME7F1Vvo*_TGy7s?UmSd%awk zb#dOK4DJV^BH8Ohw$U!LuSxH|lOzeY#7%%QRs!3jmSjEd4x={0j`il%jfXbo>{?|C zGbxUH+p{JLo2Qyfd^Sr)P>M`saeY4D<<4)f_zgdO8?-cgcE8pw!aCUcwcWLr6M&#h zx&WRNc*$uZgjM8+O=qGoYRFUaOs=gZWa=7B;xv_|R=zln?WV$dd)0DHu*ZhaUXQZ< zKvPJC*&tu8&g;N~YPa0_&$X3FhB`TVg!RTCOBWpzdG{Q#xXwFLq0_jBtVJN1Hw=fP z``4>o1CJw^kaGs2*X&Eyo?LX;bDJ}@&P*g*V`St)H|KC60u&B7TM z5lCNT{o#~;Pv-Mi`C1+Bb^~{&7BHnQnB3tC$HsEn1VLTzTK8z@qXVg@8;#N>P+Fh@ zAV#+nXHL!4y*tye@T0}PHBoqL?C$fyZ{@a6#KsyOyJksZM}lGI*XWM;!Jg6DTXxBd@{yb#2i|QL z1lUdQ@{u6+Z{FEqiP0I$(X=pc(~H9%$nPIVS2`rwAN{z+q25r3foP)IjYV-TaCM>< zPMsy$a=B|rWPMO(T>&6t+XeBF?#}C1?T*eJ!wr$VGd$5;u@-M@P}zC>1aU=~MaM{q zOKFMS6s5M1)KY5P)qI0zG*Ud*Y8{P-`8*tC4pTf$*UZlInFP8ut|`>6m$U_LfwqYE zk#g_B{`$)^Z@uJ|0Eq~>mV%O|*T97&_=L^bFYh0pp|!z(;w_#b>mU%JZpH6uY^+Q z{Nnap+~#~scsaYafyv(XJmt&v2#Ow4L!wW6FA~3`KS3G0BIOpQ?+`Ha;T+Vuk6yRM z4Bd?zJlNIkOmNhjzS^ZT^obS!8U0IBy}lYZ_s%4b`1sS;zo5c=s;H3bj5?$}Y4XC( zEjN`EN9xVZNrK@Z9rqoEf>O&hR^FthNBc7p`SPUJx+(Y9La@4}eWjMl{iLA#iI@oU z7ulV&5I*4XD}H+8B<_=%csIX&lT43gJ5l%LvQGzlE2`8}p$!j~A{`rweV2yy_oBI^ zX}wn_)z!K7%KHsk6HZ3lC6Uv>^|I|Bs#s|#4ON&ql_eZoYx635+GPGmi+%_1-5Yuh 
zbK4Ym0buQCfOO%XqP2O|Da>j`c3UaoTG#x>%Eb-g zW?G^3Naeis^yhjI$HlS^4t>AK_HAUP8*pJX;d?6!{lg4#aa#Ek>1Ce z$}+}$sfYtU69T8HN5pOZ`)wt8`C`ZkAA)Svd>~o~<_8n!fP!Q>s8_W*G#K5stqe47=9-EQX9yVn}w4NJqr&+rged739H+?Vn!nbYKRw94~dOLKWlDKwQ8 zd4%IkMEG}C_=--R%2jD>z{Fo+$YAkQ&BjIzGS*Lu3iXx0s3ZOqY;EF5dz_gtVBBY` zOk&&%q!U*IQFK0bpL5+!hcZ+2k_h@(4(@*&g)q#n+SA zAbuZ^zZSr~9d+#zW9azw7S8rL^5C*q_xb$F_?q1Zd@dO%&8(sp+*cC}QcLseouxOQ z&{7>vEc6!V`5qXU(O6&jHfi?oCsWZuW{O?2c}z9qKKqA@q1n>oYl=MOa>3Tc1&=^D zz4f-MT?vVbeObpvUn@=S`rbUHxP9|nuTLdf`F1nY9qy}(gIbrkxU85LZiSa20aJ0? z-6YygymMwGIboqyqE5i0fNTF7zW!dzMfZ;Z_(SIa_32OaYrshS{Cejx+g9v|#9`5t z$M_Om1*$3ao7BLWtNw*x>+RTxI-S5hymZB9%dQrbJ{KXizmh@vF=7cw59)>*G}g&d zOp}t}&%L2Djdyrc>A+R&m)oCT8VR;K6#$g!rVIu6!PX<3F`mOhyk`jPy!Y(;t4rSj zyW`u^9pX-lW9!p?XGe-t@=B|5hjhTQre0$w!)iMkR-Ud`P+)lGbKp7hMO5@WZf6SOiN$6s`=zkiz?`{BKy z#Ny5U9XQYZd>*%86KHb6wtCf-DTh1OQ87k*0v9}IUq3PEjaQa{ zeQmK)UAeqOGqW^4~mqBplld`mu;u?|vlr%jBE?9TD$Q1EWx2`;+kLQ;_++%TIRDj9>B*PbK_Xsgq z$Quh&0yU)@vWv-Eqb?9ZoWoy|x#xPN{V($ig+dS2zJH>>xv?=@K*GyabN}4vHH6?( z8uNt|-yxKxrd?kh$r8bwA%ARm z|I4ikC@{#OpC4~4q;k-wFHO~5#y`qrckuOiTh+cxO8&j^WqR~d0_iOZh*>_%J&sEP zW**!lcJu8!7;R%puW&>ewUOuBdU4x6P5{!9+4Pox5LpRc><~yxoZGlBOPe)hKR=t! zgS}wH1LQ)j*WBMg^d&1>J>M(+CO4)OSSVlgo19y8`<5!n zTBgb>XgqJxStfGTqA6)b$Y2)$3k5&o)nmBM(vlI@WIf_gM8foeR{RWjN2b&UP5}!) 
zg@L!`1FR;4^gxhDXpmVr0OZ{s?#eCb&@kYDcc$86JQzm-*4Ie;Aw&p6k^CmQOT07> zcE=F%n(ijSe!_5xeS+a|V`GPL-GI$M)#0jHw)avvH(1h)tyY># zd!yxiVPAZWH=T16sn5^7-w}l*Nv&KhuE;wc5dNRJn0}cbV;t#J0r!Gfr{?bLIm<-S z&CRPoiBFQuXy*T+Q8U9d*)kkkIaX3fE1GDfCC<`(FnU!sKU-Bw$9=>(TFmN&qV#JE zJsoCYVM0RBqb;T{9i{xMT#ZM*ZTAnoAfE5vl_)QL2Y}+}TGK}uuVrCHLQmr7_{VZ` zvkK7Xf=;Na3(yI+TZ%KHckHB{@Ns>auQ-edv7)`u#Pv^T-D7E8+j-rA+cpd;MfQ+a z&Dp&PXM7jR6H>Q5h)I`=Tj)wn-aj6EL~Ud>ljs`W5zG5tHW?Xv?}1Ko(@5nY}*&%xnQ#?Cx`~&(+oj{6g#tjd zsRv=6rl~d3o(Y#>Cy`Eqthl#*u(n5DapZ#Pb3@iKxBWo2PuA);yrW#ZhiehM#E_SHexh+0K} zJ4>%Ths|w0ue1{i0z?ZtBS{?qQ7wh-*nM!=DZ27K%t2%e8!-Zj`kG?BGAq!>-5laD zRvTQfo9G5!~2d$RL#O90ZeAxOHCG4!Am|J*Yv^3%7ct6P}Gv}Xh&WfH9I=oZd?M;vVvVR zCzkpZTCNi4nZiGm(gxWw6DPXm>w_!a5Z-&*eS_sATl0FM-$5>DMYCiK*~{0tanjvv9k-=am2XJNcz1oAN#ibodO)Gx;UA8xvD{a z+dnFtf1Zy)M9s&k)mHd8X;vc>3P&wM=zSz;y^?o)gkeZunurLG29%j%xGG-UeyAhOeIp=Qj3n(!nkRa0~UvN+dtI|^%N&X zg4jd31OzaIHVPcgfc(yqeXQBB$aUWu{K1BF!=I6NX;OHJfSsh4s`(jKAAEwF4Ey!wif#l`)*^(mWT zt1BX@Uyp8$tvnK2?WyKg)0~Lg2iq5;;7Z#n>Tzc1+L?!vQV`rQAO~~qGlGg!=T3}~ zcQ!D1Xe?FTp<2vv>nn~WpqII-nvvCFBNY92Bu^975mbV)SgU74)*Q-*u~iSX1inAW zo7(I=vCf0Y+CVh7;k-@BC2!h56I#*swY3o% zqEHLG<`PK~k;*H)62^A=wS6v>c$LFyHoZ;~bF~dF;`>hhD{*mH4|A%*KHnR{A{{g{ z2M|8*xVYy}O%hCmtMD!A82Wa?e5Y=dypw*F(`zhycbSBT3^@)v$fZ9+(mD<6{t4I* z^%A-Qv8x1=t33>9Q-%}?O+qMlkh(5(2LlCK&z87TEd=_v0!*lMND;wEo{K?Zyx8i~6=LYh5 zvdT9S^i(e^tY-u{T8#+N;6~8X;~>9S+bF0MxG(K9ygHH7f4!}zi);W_3D-bsc+%0l zb#b%ip=jKP7`)ExaZ0Y3zA|g(=nrQ}27-W>Kn*#7BxC`Q-dHPFBGpVWe;-S12J52? 
z@ZGLwTvAE`oy-|n+kV+G^zs6*e7TJ3JPUUO4rQqg%^(#3SdAVvy=t-*)z?0{OmT{d zjUGIJaKW;pfK6}F&iaf#6pohGBfr3Yv zb|3r4)Sy-{+zkUc#`hu|?{jpLs$HKnO>iQ^@>vY9WS)hk_g(})BmEy1z`djc@@Qdh zDoA`ZGGUtW3q+jFpImSWUg-0JeCU-|yTV|sW5d#H7=BUFVbt@Spe)8=5pX%-ck8UD z>H;MLNmOcLAbb*MSku%MnxiGK=O|4fi($irok093b@QCRh-)+-&Dh(mEHTOsMCQVt zdd)iqn=klM9z_)#NOTGPZ%X2K!pox^fF@w6U;P&|*0qf?!*!EH zrZM*?g516r65dqe#=WiKbKo|-I^tU38!>+twOnAXp(B>A!_RqH$3q_P!>bAT{Ic&g zg1YuDBsvk1=H)$CBkvaH@o{4bFU^AAn12xvQtuIZW+z*pruiuXP^~ zvxX}t=B2QW#vHDhI{`BRj`sRx)|QI)WPAO_!yy;&OSrS;~0Ruc>tX8i(z3S2f@r^Fhu)nUHPt3;LK7ZT^xPgcpwX@%qZ9odtK(7*vy< zyk#Kag!}c5yz_1+&F6{Qa_d?)EgqKONGYE=Ld| zsLpQp>8?5k+|V^?q5huk*AgK6q5j6CNdJR;1Nr_=nT9MK{cds<4ZSw`{+f?vb1{vH z8`qD0C5H3pZ~(s!KshXlu?Y|7fm27rig=r0MSD)UjrklNqvT1np|$GH(n4wJw$9zL ztI2e`ToL!l#mXDXRZFs!2Vv8zxiaSKLZ^TnS)2c?E&mGHFA86qH z^%7t@pcIes=HJ`7>hld7!{hstae0{+hX=P^5ty(6*7`pC2C0``g2NG=U|XH5wG=PG z!y_@#UOydo2sf?n&e}3hPMDcyj!tOJ&zzfjZxRE1qdZlynGIMfg}=We)-kos zz!Im2^W|+?pef#ay4U~F+&7Kbrqo{hw0O=`k8efX_Q^k*`!n{KY5enRtRaTr?oD)9 z&i(BJzyB%$$He;jGIOD0u=S_f?Ys;B?9VSIV{(Su8D1{y%H&7CS$coY5e(mN=4fQz z7{nD9I!hgXjriZ%Ir+~= zZjiwvAMKwl$TFatH@Ck1b1vzNVR%@!4nIuJlk`kD_phG&$27cPhUtE3mOg0}7i{fZ zrd#!!T>IDRC6N<^mG`M)^3a&aH2!FtfcT#yC=dG#CULBi=$6v@Qr}l#&ir1Jf4v7x znB%@oFw&K@Z!i9%H^&fb&{k*OpU+@zjk_cKe!oDR<%-|o{RF7K0M|uZCfx+ppRW$V zQdUVl1RDAQk{aK8vi)o61;moZLU|e>*>zxLBB2b=)4{p`@&l+QR)$iH8G_^#L-zHn zY=6F{7$%>}XS!YII;@idl0O?3$aux%9ReyaWg-)WzJQ)1Rj+V__$<~&#{zE&-9`zb?a$U z=w_SbEH3|mAYf}!@v~!tM%W$$S5O<=`q4KMK%eY`8&xfUf!7`ief$CHx9%(bo}OTV z0B{3`!A@JtQf@<&M9h$EX_$cM@m|rC3A*swA0BfpJNBF_z+we!dKQ zv$ZsAKT_q4b)`i&K1&;B2LY2&7Yqyg9yAgwxj2A-@%|mK3d*3S#+5XuzI=U4NDL|l z(_{$JWu~U1FEgiz+?&1QZLD;HNL$HvumlyhXp6~tCAQhCt}0EH>y=h1l%&$mlKF`{yEe+c_NQo&88!JRwstRcng`jZQQrQNmAdgws@N#Hs5g6G!4 zb9(x4-}mtN4ccTrF{;#@D<%`~0-jy!b_w$vQvqf9`;&_*fhpqo^)Z#rR%q51gX?qGu;_ zbUMf>XU_4P!A50Znne6s#F8##KY>_zH^#t4K0DMpQ~~G!^XMcXMKL%wL5w@^#-o!A zDJ_J*hPV!aW!mYNT09fmUww}?mxr)uQRCrmZoWZvH;@XxrZt}t|9J-z>tvg14I4MQ 
z72Sb8kT8$N+N3^qfeLi{n}vREq&AR=RzbvFMQ?JA?EdM8!VqPfV3yDu!-9HT*8B`JcGqI zux z2M*Tq`@n8J0GX1EXOEyZUwPyH6i*-2s!kONp3AOZd?9^i@c005^OX0XlGP$`Rk+DA z#Ea?xtm(w3yuZ~RrgUD|C1YI{O>-HDKu-_Gj}B6>{->btzUQ`wYB$<>j74g6iYQA^A(-o;LsW6_zd-Zh+!uXfj9kc4qofmrRzwNhy}Xo@9n!m05E7&(?6;}0qomj-~9PVTUx zL-NO-m3jW=*|jk%@WfqiL?gJmdJRkASo)ISx&6_5wHNB&+~~Pp|Ea&w*pZS$tE{)T zH)~H_kNhDMS_}?JK@c&XshKY@^4*zI)xDWF9TK{JiiQu)xfc--nf>D$p)V%j1ePtK zdd)fQh`VCthQ+3ju#vHXkBy-h+cvHdnkB!P7C63Yg%G^-v9r^Cw4FkeN~cUWa1h|n z@AI|y!IN=+gM(ky!&ASZrp`OHJ?=LyoA`;fD@yRSmw9sR7C5A$I7`5>abryE*4I$% zIInCg3>QG6(B%MRAe6RSiqSdWV)LstIfxCT?lNojVTj(A+Yk4-^%?-@i3{>Rd?9m@ z)o2bUf-#9P=}_3bm4Y=byEAj_Do^bq-?kBzbuIX^qdtY*IvAE(Na`;*Oc*=byfPom z6kY_(ZvC4#Zyxr7+-J4Vcyy``6mswgAM+y?+fey~3M#y-TZfBxDKbD8 z1V;LbfQCv`-B{Ofxou^Onlx$8$>`h{{%D`*wW+pReD_Cg$NL|IEpq{>z)yHguTca| zwZZf``(-=%{2|+r(U)yt7AzadJ-e(?Stvv<&D;0B9drP(MBW)Xc5EU*@0@VMSvZ&s z=>FKNkf6Z&IbZ);0$kEzC)7`(EwO0Q=;_QR9xTCh5k}I(G@+Z#?XFg}i@9wN`kDM{ zYkPYc$MasobXkw-sRgIHqhcXP8#@y?8X#rCW zj+@gV*1N~j&&y!kBacMTcOLpTi5wVC&P)4?B(d!l6nQ(z`w|x2|Kqg0{Lo(}{aq6U zq4NAlHs5hRw@?i;H)bXb)W$M-etqpDozYx<9+a!Oo%ixHst}+ReUQN(bzakcw8dgY zso5zR@r~W@Gcevg!qci*>Ojpb{iChqjn`w10) z1%x-DFY~a+$0mEbct;z<{Q-yh4|EXs89^0N(CuRF_bCW z%&L68*O-d`i0sQry+Wh<#XfzR?ujBZbGcX&2dOO-A4F8y8rFONPkSZt@bS;9-gkzy46rXmaq)(=Bx!=u0OK^9^MsM)SQ@B%VihkckxZ}ptnAgl0I921zZl6#a`q>+7!q7*H@0KXSNVL z`M2I-^9r||8ija~`;k?Dfg;}q8-fYSrv>BXOs0*Y1HKXFDu`+9;LD$9jIR}1dlimf`q@I_gRI*9Ey$jA;gt(49yGo5sXabV~NH94mmwqU?2~&QfwT^T`2VbgB z$%(q*jE+waJ>)>F;&x80a1%}Ql^kt7&V0eF*(Lv9fyP@Pg8j4D3rY5GZw0V$sf@>06_$TVWShTlhJa>|*^;F-u%a*nIzm2_7_F2ZVVBPl?S&dii9hP}Z-w zvG=bi9BGy%(LMH#@W&>68-6wqE$i{VhnyFePea>u&z)Ujr8)<9N5r>}mw(?LRV=9| zaLC@?ljiI>U3>DR5Bf$^D7*T}%;YR~yh#XUjxAVdD|>9ld)&>;d=J*KTQ+IU+dlPq z*!H8SD_Q5W`p+Si!`yyT8hX@&YnQAY?vi3P`;CG>{!PChZuu4Jkncf4YR z{aBB6$5CbDL5Em&0g!Od>nVR1XmqgARF-wa{AUvoct+gyey+Z2meFB5n@%Cqp_wS( z8(Ee4kz3XiFMe#2a*4~iATT$OSwBJm;(kpjl)@6Q?cH0Y$a zm2!}|W%`5Q?@e_pi>WgwsXg0#p}Ln(al1H%vYblHQ{#a_P5iB8TJ*@>F_pfQ)T zCLNfhcHO*&EqWkxhb8ON$-OtEl?lOoTX`T}R!H$q 
zPMGnvV!0ekg;nx|O@j9qc`~iKZS5}N`f)aP1FG1U?7TPU(nBE&TvC0bQh|@cT!HsK zSoelZII`EXof<@(6!IW4RK+%=aQflKXXK`qKzP&Dkn1%wDl%TGNtIz_J8vAOEo|?MBCL~DvhL{=wmDW656GDF+;Bwb-186U?WU)Q_ ztSF^XlbV)T-`N}bP-`?_QXO9j^GyxM8zmIo+bOnfuuX@aU3xQw!fR14rR6D66sX*| zqC}je(XWyHiY$-0buJ(h=;cf3ky3GIwSrFZ8yW zY#2wg`rU;K&q&Ws72IX1fJd9y2@SnpWX^iV*(8K$s%k~WP74UWkYg=Yx$!#ft45(F zaSyXhHWQkzGbCwW-I_>tctOhKEk$eEG|$|11w)KmT7BemG6t4j`9%1Wr2c`bMqI11 zRblze8pa1ucsH`iGUCv~9%)%OF>SrPCgtLu>k5<2rr2Cf7-j= zIFTz4)Ni~{inz%~C>Q!!EypNTo>?s_M#(gU@ijBwt0rax`8Pw_{N;@G6vE4xpC=O) z-$gUlYL451UJtju%3M&Wg&KvnGy0N5j$%&WzIQCkH>i zocz=|hSY9b%GZ}(t0VE7^i}WMoTZ;$t4y5}iSoS6tFk%D9Lru~+(C=0Zn#~3$%sc7 z{GpenlS;p??O#|E_6C=`gG(k9^*F0c7=`PN{>Xt+XBSXTQkT1Zpmgi>I9(>X#evVG>S^0Vr#X%Ek1PDH*y_sU^TYnoNIUYA=`k4%0L zmzJs_q4_S621h=2JuNd-M4Qw>FGxn#`GwOZ6L|qWF$QHnZ3RJ8GiM@p-g&tf~@OrAv1Rx!A_{&OgR(t#u>_c4FSevppapc+uY6U#P&cP z*GMQrfoz7EF<-azgRGAHSGJr2-LJbEDH=0#7cd-}qdkGssH-RXoH#v}?39?;-`9DFZTx+!QW8yE!wcqT96h z15SvBpr8MY4Fq4sQWp47Mmg@W4_hWUX-wm%BeOE(#0Y(w;}GmTcvpvF>n%>d?Q>n< zS2YyGs1H{;-!10OF&YI_Aie(S!;2SrxxQ7#;{B1<{?`Hm_yw}7Mr$Z((F8^9?d@u$ zJ^a~fUiJ4{3mbM@Q7!CAYNuIli!Sloc4vg_g=)@c^yjO)=6c<@j7Af|9w_%aSTHUS ze#|&Tk(8yOCE)dW+bMnF#t+tr=dwzQ-$%{uf8G0kL)dXrNMC$Tk9VE76E=DG@PRjV z{>QQhw+&9>Up`$1@P+$3Le{E+q*w`TXaG{geoGqswY}+Iyg+EE96_OW_eYyS*{ram z;{^>!&2l0tK=`Ns4et6yfP#;f2DL!f^EkJ{i4fMsVwrAk7q zgYbzfuv|EZ(VK($tD*8#=yC-tQ`5GD!T)~AFfwix|Cs84?+`kJsLk?>CKGzc%VYKQf1lO# zFR)wr=b=CB%|(AaKn(&9TJ!q;$N&D3->n7#jV$jslH{So@|q513o(g=v>+PFhMZbr~bE90zu>x_aWBRQK=tv_wV2SJ#Qp&9|ZolQTTr_6Cf5;OvtPmV^Y}j}S5h1r!F}R?ZUlV@!N~RW zCj$9ib`eP;L2!qB2u;)!JXni~48EYq#RZESC7bN^-?udwFh!Fojc-)XKKMdntG8ZN zAH1ljDt=x!F?Jd<&BZjV)6ODHu|gqKQE|TmA%7lO*x3nSB&1*sr36YSTe1SN zW<2YO03B>!@XV+Ge!Bgh$W%GpW|TJ-D>`E0ldL8Uv|k}e#Py%oHGL624i9EMda|%o zKi(kaNKe%?p1R}aBo23m4wl>_;5_|zF_hnLzr%x@6L(1piA3}&y(vv=;#a=U)EpVc z)(77)MJDe2%Qy{$`Nzy>Q}-%KBJs>*L#&NLXQ-FP11=#<6`;%Ls<_|LoBv;z(+E6I zpS`|+KI9f_TK`IPE^nfi~cDmqMo?;e&$M-~XsDu%LBD_g_xaGaiR2eqC5 z-U}dUSEpxXlw&Ey+mV}(j+FoLoss=C_U%$h!vFI2%TfNU$=Me1i1I*FJ(KLz)Q6Is 
zU{N%`6m1|w$oxHc3|SI_+-o)}faFb~C-o>$|%A57rVndqDWhNw-|F^zJ`}bQ(ro|!nyXrn5z6MEJ z!|}nvwbl*<{eO&W1fou(D-b^`JU$H&ktizgG`bnP>i(tQ6pC+bWk@_p*pdzSQLt6= z%RKyd{y*N{Dz3_{iyKv3Du_V{C?F*vQX(DFDBU0pf`GJ=B8`EdbV^HiFFH&*q#KlO zM5ODC$G!Kv*ZY0%xj7f-PWdV8S#!>3%n|?ipG*INvMbr<6Wg<#AyQxQ+~~_ zmfC)gZ=z=rU03-Dc4VEJ6~4m_E_t@|VLi56smB*iO6xQE7+(R9(9rqq1?HCibTVBk zqgWoKVa=ys^}$Q)ch9h$v?Nc5#d*`B^&LlJt}b7z4!R}RYREkwt-CV3+n9Oat&fN}LSVmz`DWr7W68{bHgPrMO_nQ5-D$W! z+P9K;&j8Vs(wocD>ToJ(q@PaK3dCGG>Pz=FS+(rZ5efum_2mfr-|L z5!LfLwN>9XXDAEqgS$}UQ{$)}au*lZ1M8(uVV%`i7s^b2h{+19OSC@yDT*tZxJC&Y_THZW$oQ$9XcQdkKOZj$%V zwtu6P!(pGEsP~m85joP$cfbFofcNL{h)fbxO#Wh82~%|LyH~vz%F3F@p(d_|8*!x( zMki^KR5{{jWwW$;?COulZ&y1qvG9n+-1iy&CFIUrtIpIqKOvlI&&Rt9XPIsyp~MJd zX|CY!>^7P>4aARG{}`x#!PzHSo%8uyq3QkM3=o&_zMRGFFs}qYr&^j*cDIoWN(z5x z<^>`1Va*2fX3}%F*m12v?E%GD)7aV0a$uuKoI5>5djqx|HKrN)*HrsYSWtR~Dzj#f z-CGn~>Yr^V5DXv}S?>dDn;lRJTLvf0h6+Y?9n2rWUdZ}~{RU8?UC8TZmTa)%> zsfeic)~NK>ubgU&tma2bO6SNr!r_DTQMxf^FVE$#i*Q;mZLRe0ydsmE5cb$J&rwae zJJ^#|ZMLXo)-}Xo73y~f0<_T@TMot*`5OPEWJ>$7;kyf@!((;&HB5P`8r`JkAFJ|i zpoK0Om}}LuwUYE!IjvYk={ac?ia$m$n%=Jon(B@{Wb>4-FES=<00Nn5f78Cslil>~ z%KbE%!6l{axqc!hX8Ygsg^fmDUjS4VS>f8+g7G#RNl=;94u#`vZR@+pT4FX*s6*>2K9-$M~0v z{K=Az0ckLA-UH#Kcx!5Bebu!hlEbskdT$aJVFQo8=N0yb%a88oY%wRsd6_2+4(3j> z{oTX=T&wy1o4KK>l*DTwO5p>R z`gxihw{B@=q8>YL+cH(#+CFZp;fD)EOvA}h`sYS$FBF1*nsv|qnohiV56x|g=d|?5 z?vCF78y)fI9O0e9ns6^^t8zRv&^BXE>GIc3Nbc=jx9|5Ys1m=Nwxx^`B5ypwpbD=I zJAI)|6W$1~O&p+=H0Zhd+^-$1RzGlkZJi#&;%<+58BL5I<0pV^q&>OH!i2f?1m1wm8ItW(EDlLIzJDM*jP4TC@<*Yy?CEEC zM3dM>Z~W#Z#%dNfoZ%YK93Mt&q3-*^%iE+Jt zj5iIB9n1f2*lX|u%~y!`kXkrz`ov$QdYkfoOkGbj)fFT1Q&7woE`P$!8$EeeGcU^E zBCbeBZ&u`3xoz*4QnL&Nv^QJr!>C02zzdZt+yYs~_i_>vY}U?`5l2ppHxu-PZ;|@l zJ&*t7wWu3_gWIDKb9(z67I|_e!7IHC)&=% z8P8l2XtM~sN*Jk{5FI&QQ)x2#SX|SXNN_hcO*YRO1ey#t7rE`5@N%>(i{rbq!c5e1 zOnWF@Xld*j+k0A~?)|^*!VA$PL)w^E{OO_`WWBCNS5hKZvn=Fpd9>v7+oCNf9u_Dh z%p=JWoWuGArnN+HE{zy3* zaOd3cdTzBWOSqDmg5N9C623tdd;{oG3$(HVM>{`3Swg>P*_Se~BD&wX?TYJzhVud$ 
z(uDi_k|aY7!pEhjC)~~0I^xW;Q_2h?ZgMXsxGtxjx$tQ?6f^^Jb!(piuWKV_JCOJ+7C{z^hkE_Sz>q2IhI*y!-te0uD16*0@FK!w(XTOU=9 zb%I!GRLCCMG z7ga&DMUZgtU6fgUftT^_Ii- zs?(2VhP9ayYEM~|cf0XED#-;~! zRH?fsf4bxUK|I1Nxf}q%mEs&694nv{jsE@nw|jhyQo%70#<_uyg0QLq)o6!rWbQ%}#$^)_0WXZCFs-;iFJ=`|K4?P#UKA&-1eq_dDu0p%$ z>ACgJ(Pa{ctS2WbzD5W8yqQ72Vr;r+?rZZOB62?VfFm~y)1tw64}PL*W=|1B_x7o3 zY58y^{m13erEx3XNk~pMK{X@YL-;l>+{d=1(2E` zIGIK!(iP~Hmk}xovd?c2u6!7;b%o);R4#C5odBgoJN!#iQ4~w3T)Y1QkEX@po<^Sfm8xJ%yF=&d=N~FW=Xm`>8y( zvh=n3g(w%Pk_7{?78dy6lp!)uy$X`4h-U%-U%xjg5csBKsV)QY(Pn?GzW&X(Kk!_@ zZP~|x-rsUoMnW-ydd_A-$XV?G<87x_Ua|Gu*HN33-TGLvRzJ}a!AKZo6hn%qiMEhP zr`wmT3EI4#{>1xFUjJ98<-pT*4$Hv)@OYi)Rb^}V$dQBUIXxxow4Bg$l07QYJPRFf zWo|CW(9ue&RaME~lnGt6AnMk;ITQ83q%z>ukK?EA{m$1KP0Q`e>^l>(?LsX? z0nb5)?=;;cxG*_W%iTJNiUbK>$-X?x3un$_SGVWZtn$C9+)&6Fs&yUAi$5Jj0cWGz zxLpn-a^s7LuyP$gvlxz)KqriairrOf9j$Aj@SLsZR3Clx(S1FuV;s>tquoQ`!e^bu z61#XncE=)48Bl|OqaBDFVJZkt;sg*s0vua`;K#bY=d?Py(C~BRTjj&%P`B9gnQsHI2~%fJVzu`bMXxJ9W17i(xS}hE0(`LoO)kg;hrm%&)FyU<|=oa zh0_9ohNW52U+kj)8jXmc@PcLKCtnV%^o)d#nkQsxM2^HA9^HuL-K{@!PG%uI`YtL+ zBtsz{0m_wvCIZ4oKzw}cS2LocuP~77*)`LNE=BJFDOfX6$bED+JTQ0nmp{@cfVbXq zF2Ob5-h8*A#${u)3hfJs+7X0u8*}=R@KFeGp}_^A93Zv>;LBhEQ?#*d1KnCW+KxN26Be)cT+W03@VnVp`ecJ1b!h@dfO4ozGX{3%lI)Si_X1R85@ zlESum**FXg3A5eNh+%`|y?Z1T15&kAhuyL)OO2HA94vc=_4DqPd~Z zkxa^S@r`3iQ!$J@G`L6K5&b(zp=!gK9@tbJxtVkKbbYQ2Iz*9PM|=3@y{OWTZkG@i z0Qh$BFER7H(=|wR-+E^aq=bTIbuu|I)ubt8;Dq%~;yo<-C($X@97V)!4be`LT<}e+IU37^vk&L>AF!hC@RRyZJgtB{I*B8{Al1@k529@}XKRkx{l#8+J z#rp1j-4H=4duTHkTr=XRl4iX@Zt>IgH@7@7{f7 zFXVo(-k7faz&mg$zq+Hn2k75q!S~{`!HjAIhQceLq*Mm9(rQN;w`!24SOyIDHHyrN2h9JaQ?hrh!RT_zkmJKxRKv5^&N$0m=%&HK*vDDmT9s*E!yPUFZ` zDFm|iC8{>j(f^QR%(4r~x}I@VYrdjim7U*n2W%c9U3D$*0N+HOF*DI)FRKGrde&Eqx{+TDky9g8`L{GNTD3Hg`BAanW*(X5)4N+n2 z4qcIYR$R7@698Tj3Cx7*d_=+s7~tQ#@6ky}JE%0u<5&<>PIHGfr7E4v&^43UK>QT9 zhhtIcmLTUA$Txchp#g1-Rj<~9I`HTQL7hIxe%DsLZ#higys-&dOG3rK;pZ)q_jxB_ z?f1NndLW_uexJI!*t>2+y0sn^F-b1@ARsO>J6g5qgK zMmadNl6%>$A~M4117ei{NY4V=V>=)R7a#J#k>9y2W_|sueU^Hj1+a!L$|%0gSNSRz zKZ 
zOj02b#8GMb42Nr_ShZLlS9VD)F~^c*pwa`c2~;(!GssScN)S&38_TtybQpt%?5R2> z$4o0=8RZn*d!KX~DR|Qi# z@{l&dZ(C2&Uz%2%hrR3e+SFTUA1TwC<(U|athCR&8^xMGB+~Hhy#`uOAZMi9_Py`S zr(Xys_8{So00yxW8|!3cFKu@7$tg1L%e_asQ9W53v6L$I8A&>3co$28#c#4GuV%`} za`T;(5)1a!1@1%#K2e)}Xvs2wpGQCAZ$kLwXNc-%N3$x9hQuyHj%LT0vG+?+wSQ0D zOfH`p=Me!G9X7W`Z&R8!F7lv-{E=+73P$ zXAN6Gj`}2C6}N;H1Pt~T!4*&5yhuM&)Tcq|S&kU17+3I5i@M!{&x9M{EwKp+t1qM{ z@YtXOoR(Y>^p zj=Yt=OFLTO!DQCIzdYQx(F$ae%D-9}ml>dO+5clTHwQ;gn)qTH9dDM$JCze4+n_V2 zqNeI|aPFtfAn$6$ zlx;guB)Uc9N%xZU_3eY{|#BpEb)gCRm`TA-n+`8|P`EO~bZy#od?!-wP zX4Q3nMMFRU*m=mQEa-|rI- zGo*Kp@@xvpIFx;?)~ML05f+^`Y-HD!wUjpJwUH-~xgPN=OCgq^sD_WOxLH0UxJOYe zRm>bTf)R$JWEjj!nP(31QAw_J(^5 z0zQ@D@h*Q{&9%(WskSniDiU%<9CMFR!BElEq`H?f4kR%=9>l>o7uTg1n4aYp&VmKVRVl>eE{<39ddj8Ts~w z?yRI{k?3nQtuK+v+S^VN?mgXWJh6ZN`qy{HR)?NI4Zj6%m)K^wCdprG@!>ps!2Of| zuFGF_`QIh_IcPrT1t(SfFYH41_2AlPj;jw{4{A*1SErN%NM|aOG8+Qq4Nxc_1mp-7q`YzFo(*7B#GwSt)<*Gp*i`dsSR`c!MM`qSvRy zsNnH#h+`3S6A%*G$6g{IVC@JLQ;I_#yL-6iPgiE}Vzp zqnNh6wr2Cz->-;teVy2S@WE=%hv3W^HY?~NPFU9olkjRDkvueesM)a58Cz%R?&y17 zREgBBs;@=rX(J5E~J0q>%hbuBpQZ#1&emRLf7-e z_g~PQ3JKoB6&U`O8EOTqn_&QXd8U3Uo;)af>3Xtd?iQm*`=6U=Z3uHs3p z^|da>xjXr(3RYjZYE++|rnMw~u_*E^#iO&_77x!tNAw=2rKzK3|M{Jc;z%~d8aXxyEh^H+`_s(m^>KZNo2iol>%R>HE6rekmWs`=SoHZ4ML#*(`;+zgLKZ}hXu z$Q7oheySc7yR79GNq2H3AaM-UOZpEMU~Y)4zGmtR=U2zZ-^ELTX~9qZKG<=f{H3W5 zTbS}1*c&z5cXo^dgY8GsK)^83ZKGY^SPk3@FM@u5iT%{%^5n&9xuE4n7t^?H>SwMF zRo&v&3Rxlh)f;v&f0yjgDkj9(XFQtkBZePzG?aVzafdIqvBCiKfNa%d?$xOEHbQjw{$FTnX8C27zh8aFe>KtAyYZU3e2%9nL=;@ zn2@J~5+dR#m_^-rCQ&EbZvQ^OU~iIwA-}XEzg>WzIslsV{tAXm0&9cmQ3hozpd4W5 zp>E#w$pXX~`xJyKo1UMRk=vm6m>F}n{`mqYlHa3-Yp?v`$ksK1<$#yC)?b(kc)hKa zIuo2!w^t?gbB-?%^pxdk8MA-(je5#_=CvT4;1K8ZW$b_ZK}m11rz85asvZ$^`&>I> zMT552J>6k5ee<|Ue_WQq0@emLHdMdxUZ(Z48KN8|bhbJGQbuast4-Y3U#@g?l$}a= z`YD@;U5~?fUa7=!39G-|e!1=4)_iXkx9zl)Wmo9}MdPo5R~I6mcHY?Mo(0{!qR_=r9Cif)?-M8mvl8{lyI=ltun82$LNiJ0Q$^SqmJ5CL!fj&7)L#u zs8f&{6sPMot<|;T;}+Syn-zBVNfo{?f$+jY;}bU&p*{?Xev9 z$!}3L6A<{1^OaU?oSm#Y=;rNHO43V7(Th@m><^pVTkqYwvInEQIT6v1BEX8G-rISt 
zECT4>U(ocnHOi?tJ^R0lGA;U;XNz0?9Sl53m?AcswMwiO?`&(jwMM$?ZFLP8=oEl# zs*qmQV0=_m?)vL5rP-js-_ELsRqS-ej~Ur($Ao3TCH*t|Neai=8Mue*5avB;)5x=; z>Z8rutK6z6ay_7=q5y04xkxk4^1%|Dj^oylMw*GNR6dW6bp2+~{=r3_9zJj!Ra*t~n1bG*ak%`?a;})*#EK)mPFX20Qz;sTOwDt}`+o zW6@CWM2w7*8EaMX+mOY2A7A&qvpgkab*sPXnf>>Htbns7 zdH8l3c3KMjb6sBbLRjqI^V?%r3Phjlj+G9Bp~?Q{nnCJ=_-KukS6@M)BV@I8ZIYdl zaci*D-026j{yfkYTD8My7UQYAVZmh&ckP@<=*$w#FKw>qOzz$W!B`nz3Jb?mw>11n zKA)hz2fI8S_~JCi1Ha%u$whi4INU1HGvshHZRggQD8N?V=JFz7WJtCHkDf1ks8 z!}Lw%&LquP)Wlas@GigLxjQ467?`cryjnd=x-|6u`I4A+w-hB$M|Q$f8Y69K(Yy19 zug&qF6F3Gum^Js0 znwGa3de4v!oC`XvM!#y-I(=|@bInWO_-KUI=dk4Qd#$hj8aekajjToLk=};E1Se9%fV9r(iKwHMrCturSpa ztDgM=WOV&QTseTBnZP1|fk?qVagy)bw8TbUK>vGppUq_rZY(wF@@ckA^O|DfzS1*_ znyyoG&T$`RUrKP`Sn9<(r#?56zLdC6t#rcs_fo$7W|xJYTTvew?>D|483J#obQaB; z!F*DoQC8mFhh+AxEUqjoQKg(q&-PdPll1SndbUM}GLQM04|yJ$%g?qkcJYpqzeCn> zLw5)>FF1iMC88R3lI;^+T1ir#tPJ%`o zJyH^8AnH2nf-Pb=fhq$PfV|oRyBxRq_x(>7dkJvOW7_@9xh%EER245^AEgN&Tg^>8 z@kI+S-f-Q`ul_~8(EztsM@Aw)`^xu*fH!y8jI@ioBHobU_q2RvFZpBD`d4i6PA2$C z)IB{~RW~4IIYyJ8`T522CP8m+wBbhvM%P?rSG`PaX=;M^{VN<2g@xy_QE?v{Kqvfy zabwlqBXmIp4b@cj859gZ<$@^aNNq;~uW6v|Ok@q}#{=uPv>8HYW&}q{P2M=lDU@+t zy;}AY3`RgT=gpl0i*ytBL$rOmWQ0gK(-u|*dn%In{a!Epi<$W+@Oc;0B=vScsE)$g zJ@R?EHI+c0U#zVmGS|*Dqe_@LVy3G*kK-$ItVWt?+k=(noo!gtEi$8x)JBu|n!=`U zt)Csays;VRfW&^AG*xV-rOcq-L*xCkO)R$tuU@)fEk2`KDwnU(at|7OCuC2W1QNidJ=Y)U!Bp{ zFfPH%&qYAYWmM13yf5#nWj!lP1J#4avid$TFW_Z=PrA#+npCHqnQ&$4+*vFgL^4T6 zfqMka)Hx#~@g|<9^Jagh^85-4KwO%Ha^Z=>rnT3UdQfN~&f8TB4oewcA zS}LIrDtN?T-6ZEf7PDX^Qla>~3#nmp?PqKFjqK8`J8up}zkgPaLJLqTl%f}MMhW_U z*iScwyssL))32^H<>RVIahZ7EFa+hu5uI9__yBMdAOdgFhIIKz z@DMRM+6X!*sc)z>?7~Sgb!~MnB<6oLH4Lxq8FD&YDLoV2)8@F7x{52a( zNA5RrJp_t)+Ju{~KL3`$*w`m=%sJrp=}Ng*y^i!hJObj=`cfJ{;Yk_UQQ8Jd%PbQfmKW&d&s~VlrAdWM_+{^t z!YddXs6>0qqBm1lRDP+N`F(vn@x`vwfy<8-qZAwU3@WV>B9wA-f2Rm z7_QEL@68xQ((7{YLGRnOh%dNp7BqT^Yp8PUTaD}tH-&8j8 zMPrPXNfv^KP#=}U1nfzp4lJ`?#>iUFjw){*^d4KJWL?i9H;3-f3GZ&2BzhFm1<%~u z%o9}!CF~dDUmPfJTdU`(T!i_z5&P#?FMr!<=?(!tnTc$*4MR2%j~aDw8m*L>M$a?6 
z6BTCNMPQL4+0^>hT7)VW=c$kO?+3nG&rhBbb7oV|iMi%pn&!B=tRhGzcvvb~Zb?$L zWMX0hT%^3VZyf9xg=H$9((s(*9Re2X&FoPYX|<&RvsAPV{{8F`xkK)+6mFb+&I+TC z()<&f;ql$DrQ%wy2o4s5k=Xofd){5Q8fjhJXHrxfMk8hJiUXj;%)n5cj$@bF{O+y@ z6&oa+HLMt$NQHL->&UUUhT2vC+XR2y=x3j?B*Vj5^{p?{5BKfxZAwp&Gl)a^FG@uv z`uCJ!`R@J2j2CDbmsF3{wx&VL(?c$Tv#gL3AtQ$zP4+no_W~tqv%tFnc6DGOxiJbi zMb_O5CCgTz+iOn&j?}ptEMR<-zmJBojRwrssPGJ%q+kqA%*fCzgz<1&SX;R= zI$3)O7Rf?Y;j92+f_t4b@WsUl9w!ek4;|~RFj={*vMbL2qAXFNE6KdUkW^WanR)!8 z7;AQ*rmo1o?HfK(M$3XGh7V6ta=BHUN}zk zD6_yfJKTSJK3Xm|d|CzsAY0F{2Fl76Wb|UhuC_gV<3$}XR&k5wm2KU-U`cz_X zTZYTK1SrId^7Fs?`#(;@EGpn1{SxH0Jc06&6G31!#a{bN4iXDX%-i5c&D0<$l^4vkaQ6{ zp*_E&j3Bf<*pZ6$&!Q&%hWHafVQGpH<-EMECuxI@-EmYhqw$dW z{ll?=ZUr3uQ-7?z|Lp{Misn5@A&(k1U%#`3*9UsdyN``*u0MbA0yQTE_wrwD#~WmD z;7mTpYyI;LkY_235|@{X;S5Ai4SyX{W|?9+ED*N(lzL2VMbJf?j{ncs3rB!d#m1d& z|NH>?>?T#EjmG4?M>0#P%_rFY4x}POfQV$;=P*2t@lR*syhsUY85nfnF7;JloA1@? zFt}p%7hz2-6ym)U1@0S+mADavqRj?*xKjbF@iXg09}@mNG8H)zvp+Ky-TgOa|J_AK zR@FP03XUy7;|DmB7(f2cA3u4+2l2)cSpKbg`S0@m`E}ShusqaH&ND;3`Ohx>bvXX~ zy8llf9xQy)akrP`nvhGx-=EfkP_%~!iap-vo8y_Zb-*pG3kU6%EeA0{xY(2)IQS z=B(ck&#ZE|wv%p`I0vwKL3_B}Y0b$%`2y**P+B4r*CRt^`54V!&7hgZ!9&{G=Fjdvx@64m8W}&7& znV}sbZ4ZH z5OCIZy8FsiYN9rA8$FSExXqKJ+2v<7WI4d^-GO6gs*xf~f{c_izZ+en7J=#$bb{EBpip>s2O|l6!Fh6`RH4(4o;oinIP>o=WMeaZ6!I2J?K5MOBO^N#1*%#_$~jic z$Boy@9{hSZ@@vRUX?!xx-=w@l%YxWsO-=J4n2CkY*%k71!cdY>YC%CRqifB{_6%;t zzPq?;a*20Z^ktgy7anJw_P;p=yYEdKoUYbkJtD@MXK&!zwzqF9TR(L%YW6CJl_u4b zTN+m&55fH`ZrqIVV*j{DRdD7MVB?rrzEMXF>jlvc$DniU;P_{qj=4Pr>D^bl!eV){ z0o|nLQ2EvYnto zF8KdECgcb%m0dF0WOT87=!>d>=LNFp1d+(hH~k!&LXk)2_1+ z7q{qxJ^Iu&=H7Dnj6eOJQcz#Em8YZg)t~5;(|NMlPM#ctiGNzj2S_lUcvpl;<-%k5 zp4h+p;k!BAV(-=0E1_FgvHjm72KGK8+8GMO_CqEoNqT0X5qgK)I)=pn-IA@IWu#MX zUDEvg?1*_I1=(wpYIe{Les8_|Q+zS~^xH>IFM`f;!dwHUfVNn;7r$e+aYpik=bF{K z*&!yZsk_PAWi)*@*R~Rk14}#c+^LJ0u_bPK8@@Kbo*`e-ca zMjlN@;o4?xwt78i5WhVp!p#zGsnZb77>VTh^W87iIztIBjNrp8VQXHK_dhSUmq;ew zPg_nU(lLJ7DrX!EN;mG$`ZBu(BN-?(q$Hsi+YgcRvL$V363g9h_`%ETwfPc{ZP^xx 
zn^rrmmA!)*!99wnhO46_7p_?(9{+qkWMnZ}qvVa=drapp`B{j+V$l8*QzCB~5RJA$sA-9Z&Ic8>bPvsLL+#u49i6*2|y|HFDdvOcMNiA)$|Ole-J` z4sbbKteBCh+~096i5X`{psO?zE}1GDm;`)@nCbTCLdt0h8BX{6O84?KuTES-^3S=)*!Od_#rVg6o(rP4 zC>6N05D*?SeJ|j8A+BvEntO1XtFjlu`)krpajIf$DG43Nel3k64VD~mCv~;W7dxHf z+9R*AnVI5WyS}_uUtX$T%bf0Pn{$=YVEKu|R3f+xJ{i!rycQ(TK|lV?(y!s>&)H#g zyLKrgQ+R5-6;PVE#c@<)q%PQa-ljs5xu1suB1r;kQ0S2B_o~jEhB`{=BKGpI2@{CY zWyd>&Ub|%~r^Fq1*mpnDX|qow!`>`sq|)ARcE^21TVQiGHg8FqVQBE^P?l4}n`i&E zd8!Ms6w@RdQ1W-Kn?A?**4{8C$}54m5!HXy{0CcR*8)ekLiC{sl_yp;w@s& z;*(b^$MY$oq7huKWoehLk>asi?+0FUYE!?j<hW!@sQe0$@V3tNfF36%5N%Yfk2iF3Zgf+F* z>;oeQ6-q4V12j8z96R_l4s%O#8~El1(}DhX4V1*IomWfe!BuhGx?D>QN8rfR^Pnhg zbBP3xT^_aO`QvmfuWQ-Ya;pWEJC^DN6oQ3~l(q@c;GJiHBpE+vji!SUrS$H`&yyH& zUd+$yK3kshXHAF+oC6M2D#p;Y<2E%a9vKHT#dXKnW#~8mq%PbjsJH4HGSm6W?lDXMfaidjs{veA~lZn^JNy!Z4Qi-rO8?!<6n=n&#S;%jD|X3RGnqHGD@g1R zgEGI#@6TB_N;>1Bi8-vIWjASC7@%hO<+8er`BF!tXfv~7GoTo~Y`J~$EwgV-eem7( zQlmPhiVIi|WDFbqiOTj9J#%GXX0vSA8pl_*x>}vr0X)VrGMgJms~Wk55jZ&cVP;fu z=5@Hc*6pORKwa5dA25jZh)i;l+n9CQ0z=Trt7e*M{zPx;ZN{|f>r6b3YYBT|+TIx} z2arWs39eq$LE??812cw0L{cs`9IK3kV5D-3=(EG(E~pPCCL^8xPZRInC9Hlah4#ngBl`Teyzn9T&Z$dDWzC;<5!kaN%JnE$A!ebZkDO$!o1xwQ|e#H@vteeVlx!W>j6+-1(sKi`Bm9F zxD%z#-K;g=EWR}AjCI7n5mvZ5_O8+*!erD_7@*hdmm-7I93O19bA@-7a(yH{G47BdHq~V00eGBY3UzprMR);hR|=q$I1?f+}&H!m;#t;U8_oa zDbRb%&j1U?0mm{Z4Jb&C9Ri(|J)c-^N3oPx*B>ofq~p&-A?ey3SEq~rsj;Z=sIV4) z@fnH~r={iDU3YqvXlNEUdVN}w%acbNk{A3gtlu$)SFm4P_s*r0Wj>6tWUi7Y6EIxX z%u&jV8EJ@g*{~Y>{t;#dSzIog>I30Pv?1fw<$aHO=$XHkUf7r$=~x-T?~(b1_1eo7 z{BXbH4gIaQjbu$wdqR87B{>Fo)IG33J#_Z+%BTKM`QNOd8NpPyAV(MbfWg?~ElN}I%UIOcZQFJ+$lAOrt&2Mc%3tJoh&@MV2O zt`ZE^8v9zu2!ie9VVOyRzBSR!`(@d{__>&%9>a5~sihbw*>9;!a8M^sP@RyL5LNs5 zJP*=O)WaGTvQ!Io9l%j*l*p3AnWvq(!^t2fhud!dKpi7UKaithivnQX*|UmF98uc( zHk|MVE))SN|Iv!N-ZBafu7 zg1w_u%KZVcoxyAKm#D8M7>|l~<08O+S*vcam-BfuWJkP`WfA6yk>FfRA)4c?1Su=* z%ihZqp5>*hV|fa59q$bU@+5IfhSAoL<`9pS<@L=E_oY1}_+N)mNe1;xcK^I73mU=5 z2m2Ci557lnra8j)LsQsWWcK$$*^_SC9DEXzDxcwcF(d;lv+km4)vg&Ng^r+o#{WFq 
zP>TGr%i63-@)U_BYzjbdcbr#m%#@h$!{W}PbXQ8N*UI2ni}k$UDhN(@OotXD_YE`B z&2E2>{t&0LP(#Jv*vAkA!Zybu=>4mPx)NXAA;QJb)EdxV@98NkBXiE2p7|p5l$yv57JUZ=uo6F~p5E`zw|4p_Y6`Zij^^&?NH1N! zW|BJY9Zj6Y&{~Iib;Wqa5Vl2!gzCt`$n)=#jix|=jfa#*JB0mDd3YB^tH~H(H}wLi zk}0^5*zvB-g4=r@}K7E0<`j*s!njRjE0 zLu{#vUql}SpU4HR{C4SZ)m30WR$?iU5!!6~UF{sdKUB~^M6oeLT{q*X*mS~}`hRq8 zPoV6urw5V@4HUSJ!xj9@92TaD{?>xP23@Gq?;NIQFuO>X{h_`2<%%RTCKf6Bc}F}N>&x)lATitySE?wJtJhGmSj3A2U-c32MUBDLDKVY-n?qp74H%F(&yp*aH|6K?3fYwOz!>H!yxVmISnDM5wM} zU@9=53^Cu4Ji1yUj=2;6`Oe-a00oPSC-n+*q5o}q{Qv#6y_;xn0Yn?|j``Zq5ZQ7D z+$5Z3<>X4iTBoPVVL|14oMdg?hm)26d_!+{s2!$GBQA0O`H>rTE~x8IPEF;RbtWJt zIVwLm-S-^GI3ESayKfDFi>)FFubFZI+hY;d|Lg*Lz`(y>8-D69{vF}*f$`4}*MP1u z_~+bz^6JK~J7BpG96%{N4y`6*n1tT^d6cYg@ZC%IKiyCx(gKW!0YZWAhbRLNa~vF; zKRz@Qb?$FCBtq6If3+~J)NYTL02g%_xG37+5C8j{A)W945)iy`(HopD8~1)M7(+v{ z7~BL&%wE$>5JKl1gj|vw9Oq++a&n-lHIpr{?4VoXKcZ5n@IJoZxeDJ;%?uKjxlY znLccZOq2e6Q*Y3OB<78Liy_a5_W+gP_Dh34Os$AjEl)f6En5SV zaa$;(ZE251r)&@yP&SBDBhVyuiXTcBM%6a%R0H$~R4cUA)YP&%S|1`|VN23$n_i#N z8-yc?rDEPY7Te8v(JPDskolsoL_BN_f_-`6Vz+W6dfaw5Wr>!EZbx~X;} z*jQUThD>!K>b{<1MD-Eyr|lS%v+HNc5dLPYJup>Qe}Ku!5IEjUzQ_W)*&yKeWgKdc zjb;iFN6C>O3BJQ0d+-Kog6S?o^dFO54#VShTGkvuJ=ot`aYAUDz!J6wbyJNZBT1f` zffY63#3v0`5l5nmS`9GJ5YhPaR*D$aNI3U7sRc9|z9MaY3tq7L&lczez+mP%ACYks z((|AB9z66fr62Mvf&zqa%HVV7XlJUUi;i91GdjzwJ`Vew-@Uyl^MjrBe((d?-T;#B z3c`m%H`B&QhpTF*N|c=fPscH5ID{kJX|j6v+YlP1D{VH=LE8e@@C(3GYgM}s2O&jS zk5%=VwnabKT!QIqn;g%px>l~amnziNj!Ux=d+Kg(HB%{^`Kv z&M*vI(y=LUb1AVmUAV;0HmyGo0%iN^$*_7kqn$F9YIiu&LO>6vQGqIqiaz7H5zuFy zMy$&#xugnJ*FDoLWTHdG6?`H{>|D&>Qj5P@R=zLS!J6hP^IG-P$<7P+dzT=>Tg#Nh zahL_7ZN`S>m-;-bC;geMeoZ&$N5+tnOI_~enw)=$x{YIkFdJ!R)*C6;NUxH+{rni% zjDL^(0q8-ORqL__f!#LWwl)MI@+notJ80c{)*&$6Q{u$}&$cuO)2^%Hr_*}Ra{Ttl z+0H4J*-*ndaVVU2NpE_O?i12bf0{^{^DpaeCQ<}wwuCOz7sH~t?PujwuiZ+!DnlEl z(~hes*F(v^_gVh*-e&X*YCqfT`}EYgwi9(uh!%BT`S_wugsX`0&D!-C%lziN*e!&K zUs*COrdpN0_Pqn?e05pow05Okn*07r`Krsr zJaB5=LEAsS>>asf_7HuStnmty+VvISdq2jCEw}NrgNqh%)Du{j)nl*iZB5oMsEi0? 
z{GqgvwV0h1TtOcqzaPbcfMtG`OK_yUYC$uwX7e%d#C0dTeU>F6qI~iz2>4OFPDWH8 zRX!!_dJ?Q)Be}&9iq;}aC`Cx>3XhGh(#hsLrh$C&&k;=bJ67Uy_aYN?_@Z%te&MX$ zZsC#xYWA1XpF+YK|BhH%SRla%bLFhpd!MI2oVwH&^|BzEo*~wM1~@QZSDz8hMCHg) zM}0ZkUt2r+t|Ik1^qT|ldq-BvClUneMjt^6Q5mbV5JT{0`cpThY_=+5<2GAO%r`du zv}ZG$DAhqZAkXMXCP-~BNag?O?#%z8Uf(~ip=Iin&LHQ=nz750RAWo1n6YIoTcpJj zGNo)Kq7Jf;F!p7J8B8156DeCLW+ZVi7&G_0@B6;4 z*L7XbR~p}5r}xk8O5AN@7?xz3OJn0|#;SJ&XDe=qf2h{DMaf8)xeLa{4r0YNH|@@y zVvFa_UQbZ zJ(eB%%8zG`CfhirX?hJNoh#?oM<2D*WtvR0g&Go_M>qv=m86?Fip<3KcdOTkZIF5T^_mhRlboceUF;on5z|(DLAbaz#)B@ zxdk17;qa>NaVfXUtj9_>y|J`V4hveDa`cAGm4gGT(8$VRNG@KEWMkFR+3CA^1jC+;5UYWTm$oF#_yL@Vh5lxk( zMl0!$93@sa`iR#ja>$kOuT6ek^N4cxN}ZS;OF(^MqYr!MSC|I}p8)r8S;#Ce ztsiXCddp|7Wikmy{awp9`}}6oup5(j$)>KamJ6!C@8?lG*5 zsw#0|1ZfHF$-5s`xBseS*1Rb${_59iNvB^Xeb6_356)-_CFVWP?c=JlBKt&$NhJ4K z&j#L}c(vrUL5yd9bA>zOE$sEZb|OD$E)GVT0q66$dLY(wUn)8|5=*q!?oebj=s0Y`Jrh9WzaJP%WL4||+?e^%U>ffSXVAU@}^A=Aq3 zaf>kO!EYM^HE|-s4*IlVA^$Id7klZ8w>U)5YAy?zgA#VbJ`{rVR$Bp|*|h9{5_(16 zUgFp85Z`LjYsNhIL@w6jJj6N|a%qHU)6J!$lJsrayku+JfGwgEP6l*#TCwEJ+J)xb zC8eCWy^q>oR7D@p{P`u){dpY))g62A$K6w8r2(UuhLzek6hihP8U%&}t$yG-LdHPi798 z$7<4YF;o5Aoqh$Dgoi1WE84!vP?Wf;hKMoWGcr`>q4#iIUkTVNZENl|0H?6<0!JlP zXnLsaN~?v}+u7Qa$77!4@I}$%BrO^IcB3EjIr*@y3!Ta&RYDdM#v@bM^-)obgacGg zcGBbV5Pm0O9PM(JMgffAs)S++`q|_p`=?^zC7k&zBkd#NVbI(}PWf@?cu2~n(LZF* zF@mx3qz&>ViNRWJp82H9WpDdeFi~e)$he^-j(ws2c)czw9^%81r|UzBn+uOd+P4f{ zeVxh4jum&gIyzBMK99|?Ut``|vG=W|li_OtV|J5w>!BEri*HsZ3$rPE`6+nS= zP&MH#>8Yul5)a*y{nt=wdNIW}cr(SGa6M)xusgRAPvU8$aD!d3J;&_unI=7Azv-#@ z!SP8n6_1yusgQ`XbX*uOUVa5kE^ShESpm$~htX(5;*G7p@2M**tc;GPaIeDVQjO`e zPA6CBu9${V+DCba6-_6^kLk_Rwk2uJ(*$$sR9N*hS}*EaOVHsQMktJ(VPFj0C z>!z{&PO{^x?VChV#5-oLkv|Xiavum&rPJze8OCGGypSY_y--_pSUuZ z#}7Ma+DWpEm4&@hRdy~Pl$-B_5I%LMfk>uausNL z*^O_A(>BbU7oZ%L-pb;7O*%qTiQf$4+NF1+OLJCfofI8ac`SF+pVUD`GQqE1rQaq_ zIM3K6fAr2GH3DxF*KDvne`c`y9Bv3(yQuW1VApK--GD-s)OrY!iOp;aQTb#K>8--A z0OcL54)#w;2`}}U{EyjPt*P3fxVzN|KR5h#r&p9WZV2 zDkIiAUvbSanVPeb;TU?%HxG_WjR5Lu2HxG6MTuQiy`-0CU!{*96J^Sm+ioxr6fZ1! 
z$S^CNoGH0#l9g1a6k>9S6|Je2;!@}n;W(2r^hd{W7dBBvgGj}9WM@nVbL5Sq$Sb_u z0hYnE+qL)A=U9<(Pg}S~XDWkL(ZTps9&L&by}8IkmxF~HVc%jTY1U1x(?;J|+O9oW z@whUqHrbUB1u^)@e=*3`Z3!z6)e+xD&zvm@CdX8R^UC6|%xv!A2z zlQXrho3d(GiB}to#pdsD+y@4mN4n}A1vv-&$GCPub5n!a?dFq&U8#KR3she)sV!t| zOXd@~1c{37-T`{7Dq$oHFJNAC4cExreyFzR^?gTMI;B%2h9DI|52r3Ul^X_;=G47L z%`6YJEtcswp1$Ue0856%J@sc_&-95o5R{o%g|N8M_(dum+C%$7?diSZH72{# zOZJ5Z5ZMQLh@mjjZ)SYIy<6;yo*J`zKsIi>ZalT3n;Ey`SY?- z6?uF&M{V5o(LkEgJ$?eFqB8o4_^(HcpPFgw@49_0K{DI-B&G|Z#oj8+oBr1JNj$r` z8&Z@RU#C+WWx_o>Y2)?A%;+;Jw55;K=qnS8M7ErEl|@?+Bc-;Qek9wRp(_pY40?1I(YPA0DT`DGU#+C$ zVIf(-u5y=UrIq%Zt<&_re<>Bv;L*u3+W>VUvJJ7)zd%?XIzdl z$$~2gT1xZca@s`(ET%Zo`|ut!VMAl@^6SJXXO?e;4G}KHJ~gPl-D5`t(#x@rNo%Qo z5$j17zKE}-gKisbxSC@Z<(jWoOPH3Xp+cxGxN8}HBPd-b21%JM8`P3a_KDq*GW&AU zHCCHot0Kmbs(g-eu@ROaOQ`ATGyRxAc}!h4h}<)7ZhN5Wr8-7v;HIf4TZqnK+>4@f z^8F?q${1lt@+Sp5tys^F^Q#>?mPOXt-I80J=Mxw#9|GHkETDS1NBcVSagkgZ3;S4a zg=A{nd$fw%yllfW!q_Y)ZXTPZm%~N)_cdwYGLCV->LaBmQfa2uO1n(H zX0cTP*NJ`Ytz2F!5M_dmQW=$$(Dg}`2k)*gCfe+1!;s#58xM@oRoj4?!EC-u#6+no zb1zBr?Dp!h+|G<`Y4{p*(G8)Q#0Mj?KJwe<&^+$`bw#zFAEp11cE2YIT7O*cu_xvf z9>0Gos{6D^kwKIkAWwCIscKhX-1Wb~`|U6O%~V4ZokDDqdgQ{Awl2R!V=-)1GaQwE ztW z%X2};PWr)$f0UQM*Fyn&@9Oep#C#t5Ia*j-REJV7_Q+ix?>>W|^HO~VljfMFMj;{RLz6&kQzr*CZ7Dt(fz_EHISF8E%M-TC9)ioA4_d>{yVZ=oXai7k&U0Q9b9 zpTA}16TtNQ3~WEy?Z%*e63|s30)H;*FqHY?E4N8@9c~jV#w-RZQ6)V@g1v-&Q3g`s z+HCQkM*HvO3uxqv(rimpA4g75u1ZitIv<=N%wGgSso1&WIams~ms2x-1C%@h<^u56 zZaaeMPhnYA0|&-erqyHLgK;8nO>=|G@0{`$67>3Jd`# zv{-^otm+i#t>-~Q^s1ihDeEu<5B21|&ancQUpuUCWr(jwq%g!4P36kOdHB%H(`oR3 zD(+`zWo_n<+p^gR4oi)Dpaoa#gj~-Fjv3jN7Ml}iVP7`=qR#d6GZ=P6SXK2cpbNB;yH~d$vW&h&>|7i4o?{eXUCr~Q+;F8h{GyM$5A#gc= z_Nm^LjcxM>q!+*eq0T9vf31)~LNK`yAV$gDp4@qOckEY0ii0m$4Y96Ig{|DWNL40q zlSC%~J$vQp*cIGEJMRzc6X?EmzaA|mdw_M9jpp1e81}vyew96r_{&4lq4ZDr#|x(r z{TEt`YFh3?2$euA0~_%-Bq`IWlq+B_jkse&d^JW)PpFv8yd_8+GKQjpWxzzF>O@1w zvl5@_v+#|->*2@Kc($~|*mJz{mQ;du6+Bdj!;P*c^*pYI`*L9Dl}?+B$;BzLTnVe4uY&LREPcz*}N8Uz8m0Ll~- zBQWTgtnT+HpSLUTnCN_>j!I}0m$<$Hg$~be%c)bRz(;$=&&DQY?54CQVM6t;3|*(B 
zyV!M7aU875ZTSWRff9jlGXp&Ht=1Elhcjy9|uTmRLt>m0S54uzVM9k>;4 z{8N@UM%SbA&m%}tEBr;m&Ze6XA-=hA&f=%TSu+#eAVQsKg*^Q2izKD-M90Yc(5Cgq z6(HUS7?K60iI!QYDKbM4yR6n_0?O9>G_^Z)6E~6qY!vi(Al2)~8 z{ULAF_t)7A;AnK;wz}2?WBt&T{ux}^i(x5nnZ{$W$Ho4{{|V<$S&>EA+e=j@hGO1V zLF_ZUWGa9QJqMPFr`9&3&k!S|Aiy3Jq3sgp<;A31td)$>@Z zdMd5%NZHoVFUUNH;830P{}z`s__=x?yfe`t9r+}eQ36xn|PzwZ?ejLoOs*E(PtV;g}oDh zZWQ!>B3JQwPyO)%A-~Ux6<=z%ToT-14f~R=S!7jemW*Hj5#WI%>qL7 zvYd?aFAIQ4PEmMxZPj;(82gZBl&%rN*Y-7Qqmchg%AuwY<|XdEJ-N#DNI(sYGsea6 zMhEeEa&UH*PQFE2Gz_$ECxObPTCOQ)MUDrf%lY?dD2h6gPg5_~-{hHHS3`KO{ zHy`9o`WSVL^igV!xP|U;qus%)uuen1NU)`_-pkNVw=OO*H_tr(4i`Fnj7_%5=Ym@! z^aRAGh=~4h>HPg`kQSgrFSe<7K~0V{xj@fGd_g1{xFrz(thmxoGAJKMh`MS+U5GWb z*4I$5oYbkRbV(XeuGTp~(P{D=9K&~j$pMw@0mT%e7r1C4>OS_GUM1eZz^g=8my{^Z zOYU((V4wN9RPuSNiqSm1;AzggnK#-`#9PsU0|#Zxw-Lu^CSnPA8D27fAaR5A4mq3Q zcKqG4LB0woXwQ|oPp;|3iQ{k%*`#^zNL)Tp#6t>6SHU!^Xm3`)spPREE7fUbV)B@cnM)Xkv>OeOkT@&<-M)ae6&v0Z!Gr^}4Jcy!$S+KLP1?P*i~22_u($P| znDH8(G+awYww>)buB@YIpQo`jt=+_*E=4=`GnM$UHmtG4c5vC;9~be&y0=ekvPv zB889e={yu|S*KiIX4D9lgodX0+p(`hpD#Z;t((Kt=s#;(PB%GdKKmesqZ%BWH&jUs9jb;;CwcdW zge^1|Q;G~*D8A0KqpXXyvF(^?`dy12Wv6Qtho&S-y7Dc4SJ2y=uY7%$D)Vxlb|bfL z5BdV;Yrj@U_p`7+^pTDyAI9gOT~?9MQFs$XSV3&@BL91D{MClD3+6_xPN&IK{4mY* zLr?-#(z+sH%`>_`-Ns0i23R`E|7sSKMe1ae-BXpEI;09BVzi8O ztWc!$%OmShtON`M_C{7vJUmeJGNyLsE*1pr9IOQYyP)XBENxs&oe1c~Yz$pYMNEzD zO-!Nq_@JC!oJ7=0*Eim*JRc6VDuSq~R7s;wO81o(JFTLm)>I^P zY1T&6A4lPqTEAO;;w|K2S0QJjLKB(~kQ~_yT7)|iTFBVrkB+&P6J-s{D@0ON$}I)r zQNJh9*HR?w7XoADzbpconaqq8MT{k1Ny&+$(D?bKbP>7bJ{rOMu})%f)qC5L@9ypg zf3`RLT;`4{h^ppq!1#fFA@}cEODrVVTCHvGN*RyrGgSDc3?rf)QL~vG@gS2^R$#>% z{U}6buj>JAE#qt{B|qO=@3s+MiF@4(bozDA%^^ZS=KOf`0&kOl31vnk(p&8W98a{+ zky$*q4rkn(GyT?fV^KQr5&xCWhbhy6Jo<5x3s4=F4Z~%3@#&I_JJ=1NNbTb^9GN^F z35Rr6TWq22F)Jix`Nq>=q$X5@)>R||X`gr6kU(z8p@&sBx+TBZzFHW|z(RgO&Jni2 zc%*;9-@?%%oBpfm-rU?F1d{t(USPGwdeT12b8lEMA6Y9b3_Bw!Bsz8AiR6d!(Im*g<*q9xdh?Aa z3(tOr-Q^n&%*Ji?J94sg)_gK+Z!FR;I2^4lc(T0`jt}?V8(nZ6ia66Q%}6ZkPcEFMsmgTHB$Ni|L}dBcTWkvWUPj`p 
zDEMRKl?)|(mrgg(B|B7zpR4!`hOdyhnp!uAf?$2y^BWJ1K@9cFRmEPx`{Q1Vmw-ky zBYU+po1dYd%X3SZLVh!Z|JmH)fq!bf-+?0|9wS;eG~k<+#ffWG_XAu-+E#`#wKMrYZ2YJGKdzXW{x8572^iTq*#8GNCITh~7Eb2> z8JP(f{znopF|x67{{JNYzd>wnGgZ;v*<_8ec6`ep+4-+{IY8VgZfpP3lyZ&~bR#*3 z3KX^YJ#)looJT)ju`wp*dR%WhYg!uxB&ur02#nwvK%|0jwJ)7=1`psEQyk-*T@ z!4%`OIQ^63)6;PRl9Z<~ERHQ;oDo^=K%Rg)wQ2yQ%xm!?<^a^p%@0#N4wF24?$ zz%o4mPy*`6uJqyr&;gsr`1vqSj;0K(jvo%dxHL7m0WA2GMg-#Y=GM~GYkjQi7#P25 zpINk%5DBb|O`x9b>sr7vIMaZQF;Fo8q6uT+3s?fMP5|jZH8R&U0RL^|QoznHASR8T82{HCZrJw1nT`lY8cvbw}ldIoSr6m^9JP)NojC@CoYeN)CgzaQ93!pkrJ zJ^0%?Bfhuq3<~Lt=!|OwfnM742B-(v4xFW1_22phXRm}e3-~K}O|&|IY5Xe#4A`7p zSxNjCIWaU8ys$Vkl8Sh2GjVS52G8$GZH*oPKE8o=@8}A)3HXONIy3MsET>Mr-v{-r zixkk^f;m3``pzzi^h{s&6P|)EiSM@dNBBw@ak>pUWZv#rzpC!6)>y=6uKj&6;v{?A_(1c;{kCcuBA>sy!EI^5p|iYSUH3CZdb z<`F;mtUn*%Q(-J{VP)wh@4W9tWcb7raCdql@4>Ret+Z zm|Q)f34O_5rlkgWd}!c|zXoj;?i9^mmEGOhWYMAC%?Wri;!piL9P+^5HcdbsfEnN% z1i&j(lkvy)Wv=*}wdtGHUt<642*L?OBYnLC;HR2403RI$A0649pnqip_WbPWF77WH zL8Beubk(dDU--8{5#hIj?#KWT=>B`^d*HZF_pjx9`OiivGalCp&Z+(hC>@v{a0MkNVtSEMr}J%kyWi{SRUR=(j}b=TD;5EwT9(B$HaZ zD^v4Fdc}XkR7aLCOmBE#5$?T#=r5h+_m;RTKJbUbrTQ;d695d2F7EH%ce*g626%o1 zw146w9{Rmt!2kXS;g?<;Ah|jyD6X1i$3JF@Ul0k-jKJ9%TpEBHoV)n=opm{0}}%#BVS5SW?;-aj}n@-mPfJ9YPWocVKg{b}CkLv!ta z)CUd@0PzfxNmxx2IQ=eUv#32p^-UUOMpZ!Ri7{8%0NP@TT%g8e-U;C_^5i}dAT$zq z1E{GoLf~ns2mH4z5Uyl9i9plUHe*l8y2O4;HRw(D3-t>!jzom)nHhs6s0lR%n96Iy zGS=uB%4K@(tce?mI}8vTiOiT=yE~?_TH)S&#lrM68TSHY5uXH_#wKIV?J{So@@*G% zl40kt_9j04j*)yvMBV%}8fv(X+{h63(^JkR;y;c&A*p8NR|fz2WA}EGK?G5Z^j0^* zbazK@-f2V1GFWM;QaT@`n!YKgI;X99yXf{c_}09NRGLL@Up1~8-Jj}^PCCVZ9vE5t z23Nea0Zx1bueTmo7yzjhwXce+=h~;Uprup8zhW;y3_-`crn6g`=0M_M$*tzUmKToy z8pQDVBjzX?*%9i5=>}FVUHbTgYjikD>(-<`LvL&Ei~*mKfcz!KP(fK>Ol%u&`qoa7 zTwtT9b|uUf;@(QWGbPAuHZH(}C?o&iM2&xl;%sN6huqI7nT>NJ$U4BeFxAU>+8|Km zha%~j*6$X$!T8<2T!^vhKIlAQE!e#`3lOC`gNVOkfx5F@*AP-f0Y0GcbrVYN3;C7v z%o=;BYfUe~0?!;6zs3KQRd#3fK#4z%PpQ{LCWOO19?qDS)S4!^SEHK_a3-xv0tahm zOkzcW8`}Qu?hONxCda{q|AFOBOP?h1(p{X`i|Ow|(ZNq$Ml|Nf07V=q49+hocFEfU 
zdn^n@gyxe+q0Euuaf%-8=FvNSLyVbOeld@>ll{-!B{kTBBlkx=7Im}smd(&s4MCcrdUhxPNS>tSLh~dL!?Mv~5g|0Jxcpu`YF6fsjCfTz;)&%( z(Q9+=nP)bXM^x+lBfl!COL2?1jQ4S0zsXWf=(($+Xk4fQNq>V1^H|#O`J(ycxa@Fk zB+Z~=hyyJ$Jp_gAJ<;DOw@m_PKBnk|%$u6Zv$@}S47KMJwZF6t#g6A_95LC3&x}WBgKC2 z)~>*?rT#VGK?U*^kjZAbL$ql^W&~I6nR&rt%|i2L+Is;LM;=&ZiANrezvAipEo*QQ zRss;4L|bk&>ln$;4~XuSCsl69O|3TZ5W>zUyC-vpKd0qd!F73ghKJ;qVf`)w?a*n8FkKlEXK~j?>R~<VG4T&EHaM|%8~Z&WDfB$l`|!Fg55Mc?Xt)A>E@-mQ*}Pvz zEo2Mc)_aT>@_bnbEHMXK4L)dFAtd&4vHTDy>VUkc^>1BEbrfU97N4_bCPOS|U}HJ< zCk@{ieV4MX?pD9f@(IR`(QK;tI{N6##X>YN2*>$5H=8=#9;`W`rqPZ3e)d)JrnF*C zN=yW*6K{O#@~gC*6N6i@_1k-UHs($}J8rB(0HiXX;>p3!<>{$xrV&(9P6e!@GQ?hH z#Ked)NH+Y}4Y=Vf2j+$sFL$CfPM$jkwky)-)-4nlSRU&Ww`vX9-8OWcenM&f@|jSh zP?yA7#o9YjxoM(H(L;Fy_msX&DVUl#k%MRp^mvZp53V)kTq9aIyYEQxKsahA*k<}UuIP4*1G_@7fR z9a`@~d+_&JB>A(?a$3;Lr6JG8D4XXJC+UaJ`wtM4H*NZ#vLY{LohCSyk#SS#{cfm+%fENnyh7DY(^da>ivNHw%H^G4n8O2J1p77Tb9$Ck1$rfLEB9qJHANj zQ2He)vTwJUVgwRQ$e(C?8h)q-Hl)g97wS9DIti|c;s6@e62dAb700D7uE(zJLzpRy z7?zFAZRe`~xMMk1o}zh0kWjP>8$YnA3bkxo7!M&H)&az_m8j&_C4{>3on5F`Dw^W) zwyX&zbFvp*(SMl*6bthSm;{l<@6D4|9gfceS8GC@TA7o|f`RVJnggJf$@%uSNya^X zcE=)&lxJ6OR*oysC7m!vU^8|oIkYfh-CCO+_kieY`oH&b<$rWFkqMqLo-}?$7^cN+ zS`&@f&=IL!*j!{Pj;ncA3}_Z~l)H>C6*AhZ-Fl8f$9~X{k_OyKw+SA4TCn#b-#}aW z>KJ(I{=$@DBexJ2BpH;-3+4coR@C@tdo1npHO27jX?;6QN;&!eX4v|^7g>8BYrwEC zBKp`Yp=2AkUu4dL0vWv(7de{z<0_~w4HFgE0)Yy3t4M-p)gQ>W1sRZInIU24Y1P{t zzII3x0JCx-q(jempmKU~-%p@L`J7C8=3)--;|*c0Mm!HD)|Dzcmcf5fL>KM9F4H8a z`0)pAj){z|D0a2ew)<}xt-f7Ohn_r79OqXd+!MeQ-bi?5jw627fT)GcRWJ#n^OrKa zj>&7Sz(H^%!1KVPV!hs@NIx4v-O(6(Jw8cS!~*83X88q%<_gDCM`w52zM_;9i2&aG zCu&bwY-pUSBP^@r&n$~u*GU}wjb@xkGd_^EZxvf+`0E;w-X)G7(tAVachw*|MepZx zWu~5w?2iS@YYxUX-F8JF3cgLiRr5|%8X|j5wR?r9GNjQdU#yD#r3=K59xE+o^Mvu# z*_(W;6-(*`90S*&qTYLPsfrRi_L_!fad-Avl0gY2zU!%2b9(19;EB%$k#F6%dHmH% z_!NaY-eeK+sK}D*X}dw$S}mWhy#98yV5@!{QD)mCy@Z?bS$I#UmR&*nTBje%r|3FZK_E{p{%;% zf8+PEjA>>P^>hc8!!XgRzEI5r6=r&M)EgIK!F&6NbMH{ovOwaT{ECci*K~f=Ci;4EZtf;#^I`A^YA!r;?B?gm)E)I0Fvh 
zyEc<+2w)|P45rK$ov#u^%$d6&Dq4D%f)5P=@v=RP)%oi^`V(Srou^K`xIp9~%aMHM zTntU@Zz=Q3={nrMC^S;g6?zZaYd15UPt~T0ab}Z-NI_*?55SkjRi`5-E$PD`I#zBX zV(H7=yCM4;?>$ufM4@zyD9fP#U9a*~8c3N|Z~%c@m?~lQ&0i|x;G(dkKqq?UoSXk zV6OK7{e|*Ps^D3Zv%UQToX5A9FVa?BJF>n-zaujVX!bHSxlJklVv zh|X@d{C=6X7#WU3Yqn$ZCuKVjOwjV%3KnQ9v;k_%s5q4z6COjJqMG~(COdZJ=}s0x zB58Oo$9E>FB!`EGsc5U+$~k>sS)?Ou^1Q{t5nr8rm3WpJU$gTsNYr&{b zS8PBv0Alr&gcjaRvGDn}wR)>yHaJktkeC^Z#FlIlMuo;;>Q4U77cBPp;qopy5zdea z*gUPl-5V#NGEq}s@(V$J+0Fj~{H-f6ROB9ejgNjt5!t;|%yZC*7Pnm9CRsw%Bd zTMPTPptZLSRJzcI>hh_J$mm?|-oKorXW*Sup-VK=gJ)3+Sel7ft=tDrM`5z7hMjCd zASz*HEVNfKBA#6b3nA=Tulle$=c#oqL+6={?;^aKsi4R~NhY1s$-eyJ$EDSV@*eED zLIF*(x#zJO%l@|0t%$>JplC&DIecBTaybpK6>jM>aPG*It;Ay}A5~K9&Y&^KBTe4hYYUdBK)3A&evNI{if2L=?XRn=sW( z&TJ1-$l0s={88LJ12gCLh{@!z(G2SqAi||{{&mAK`8{O5N>bdM=eYFg3D7Cw3qnty z+wJXj?8jgpE+O%bPThg7JRbN_wgvGCDxz!%ZCkocsr%r2bQk7ue@A;@J6Y>DNhnqn zceq!_YDsCp*|bc)Icf1v+K zeF!oBp^d22CFv#NiNH|u^D|vyPUT zlYO=&>C7fM3?9RoALHkXH4;1HJ|dr+>c&pOM2-Q^Q3`x3k3#uTMCK4opI3y+Sn%mI zy^UuiIBLtGJvL{@?biy%{&x<0#>vH)2Fh!qD>|eHv})|zdc~aB6gkCS2RNsPgcr!csc>u&`yM^zfM@14%X8;+ z`L6x)sGOdDr*@p;b)ei%y_M|Z)`g&NZ>bm0UoE9UM3h_fkc476H`NW}FccNurDr}T z!o+i|Q&5%(ic*G(fjrZo;dN)A;QvMr2t+bKNH@?!o?s9#eh(v9uCUhI!kIK8y| z6Bj9=VZEC{3UZOjSfB*Drwi1o{eQ@!@JLk`4;_B z-z>|z2PG_rQcA-;jpP<|D7U03CFkQ9*a*Wv!Eza1HA3~HzI<`mvX z_9tIEhr$_|MLX|xqAe7PP zW!PzcD`oR}=C6 z2%Rr%HxA@zkErUALYYji*K1@QzLEqMM90BTB@hROFe9%q35(Pm$}{YxVl?9&XG0i3 z>yI}wz3c;NNo*o7qn+I-m%P@(XNH`9qt?AScP9^Ie!6WgcPv9{Hx43!iRqcNd(kb1 z7V%7=5tRNuq-vmup!ekLXs2J1K^2%4&k%=g6pqXHR-!P?5>@5fHfkN27-tAAWsOn| z?nWrYf~-x_F#W+?q;Dzi(stW)xB$jluaa4Mhbi=hUV zxNZN=)}O!TF$j@bFmCXc!V= zAEnU7{1$ia1F8JxJa-$=xZt{FJCzwylCj`Lk~?(ZTxo9)g>CBAb{6Z!K?h?`g+~XB zjJwP>aq=k5r?rM2+J}^LvKjV#Icq8Ac^LsgaeT7mZntj{S>t@$Gn8h%S_MVWDEnHx zigiNr8_yWW+Z}!4)n)#L5|STI8`2T=9=-iKQHi3$Nnt9Q>8|7O&qk}hr%X{#A}>Upk+cxoI*^hqX*^eIY~?T3tEuJ6FuII5NUx+A?yX?&31C zqIJi=>Vgf}&~nryl|hZG+uk~(xt=nv*hS)okUUmdaPinljV67f z5^HG^YAXkF6D&B2;T9JL1iVLs=?5v-Wz?a{Gug8GjNIM+9A5}PRA12!Vg|) 
z;9Wo~t~=AfNS5p)WmKB=2Vz3#%G(u8Wa& zd;M#PxLZoW?cqSdN(Yiqz@%8vo|T3bRc(M$3lu3MoSwuMkrf6dq0~eD?c%wl86&p{ z%4K*nUJvJFDciJ?Wh}Q{Xum;6Z@qVD+Wv{*V^ghm(86MK1P%_eYZo9Nf={$K<(+RifPRbW5-Ds*cP7cb5usJqUVK<^d(gZ{6TjY&hqzh(T^e|KLsfRmb8q7>rQ!E<;*mEOCW+hK~!@eERn6qW{E z20hBG+`#H8gF#<7yM-EdC{I%Zi^coCl=BMC#qpv&H0+yEW3d-nMaA!nEtrbPYiRbi zRe5D!tqfJgdX?8Fdw!cUQehQOBw2j;9rW;mp|{L0JAS%d3Ww8-TG8!B9MUonTI6qX z!m9QpMnpCyf}1218HGJ29pkT?eM-BHxNMp~%9jyx$bXjyaS99OLQc7+ka2KgGU_xm zdk@d)1U#^;CG$#rLBhKpSk)b$ylw|QD0c`uY%*fki#OLUNJ=+$?GILs#ukMga*Az` z3UnS{4uyCqL+CJH5M)esi*&h%)Z2|bgN@eZVzY(-s3_)1Z< zXOWfnW4Br;C7uCRHUu6rz!~ojV`s-x%Uc!hID}E-S>ZbP>7=B!XrY$_tb)Z@{2q|s zsrYB(iPaRCaF|8q@e$crk1S_TSzI*yNm;_ol2=xAt5jSRJj$;H%0IN=nrEn$W+|?f z^kUZR?RT8!>3J6k`sz~I1Cy3I@_ut$f?UHK@b|Kj=+^K%dPc&}hz(%!_fcU@D@t}A zqr(aKDL{65Cv4M>^->5cAJMTHS}qivaSL*Y_-4c;M4>L@42oYlJ%%0So)AzKl8}Z} zic8~|+k{WHY^gadLo`rjh5lm~upV26wRb>y8&}83W&C|$#Y+uV#{6a~j%4CnVmvfc z__?)JY7QtHmje*07Lw9KB@Hna2KqBX8)Ef~ z!5HyfvC07mg3-&`zpXp*RAIDK@AHFTRZ_6&ve;2OEEYf8Q@^=0+c4=}i$eU2K)AMO zBPr8%jT{E26h79YC`i_OG<-d9Y`Z>qrajp2i4a$LdqTHa`zC$-FCqRvY1 z?$6ZK4md@+L~S`KG?)zfA-**C8gOHpP4U0@8|bUYoQL!ZW0}HmoKCUxq&i~ti-1ld z&`VWtp4M!{%`DxW^0k?O^TijNpIs0!cJ217dU|#B`X(>^Gq$^z-v-TVMjfPDIQ8JU zDnR6ODu+?1DO!{wqGA4KgH8YdBN?A_0h4p;i~I7sgeP#Of^1zIHhr2? 
za^lrV>7DLWQsC}VS963CoJ3h);}Zya_?~pu)D5jZ5<9rbaqn@vV*GMn?l$!MLnx=vB8H& zhnjI&>#fF&Y4!mG+ctp0EzE4Z%&}!z#N`94%<_6Fw232ES5nSORt+dQJo4VPav+i3 zunHBaaSFL!^ee_3$&nd4g+Mra{8jcovI13+4F{%I4V1&k*r6y%h;u?F*q!gEQX~pQ zJAfQSF0yO!LjTd~UA1S=&KJ%Ao)Pg=-u~VT)RKc1BcV&31s&Sjr)jq`Xd(NFRc{wI z9dvJd*E-=-lrXRF`xy7|WNo9Y+bVs*xjtCFD2t{=wWk5kzG-Qx))hV!>YK)#P{^T+oEK$UGW#$y5?`-Lz_EZ^LyQaE9u*ZE0Ec<5*GD+T;ffhp<0G4?;r8P}P0cvfZ3jQCb_ug5Mz?!%-osy$J8r ztC*x<=UtsjiAzY}P&nt1;OjWtK1P5(&j#eA*HHF=~9j6!=|S zjiUH!3U!2bjifQ-hS{55=-ZX)=?wL~5|>9P+`rh1 zi!DBHw;`Pa?1(klcEF_~Ji4SiY`u|&6EKW5{60NS1Wvxar$0FD$!2QIpo0Az5 z1Rq&FjzQ$)=cD-sgRyjeGIE1&b5mj01nF-}u$Nc}>LblFh|9$`8<&~13Hdq48#-XI zrG@KDnwnS!78zB5qo}UeSt&EW2c2E&pnsyKN;6#``xGYmj%-Odwwh%LE%&24p+Ls@ z6=KyeNo9)ooya{+z+5-w-3ENYY)W|@2>3;}CaNE&@3akeLc#Ik zX-JJA6AoHSK3$ZAZ>23&c_xn;7nLn0xvj$)^^3n3d>)zuf^p%=!J3LfV1Lrop7NoV zC8N78_(gq&p_N>6kj;{b$Lh)er~H*73pY<&${o&yxGuY9S8HTKeQkvcb1HF}hy}sa zU#YcZgglW^jIVS3FDw+Oo2#CL!3XI%Bv`M=*{lHXsoTCv0Qbgsmu|aBPtrMv(YKe( zE9UBc2>8(mKV?fpF@gaG}Aa48j4z|>9DO{2Yl8&jm}R<4YlJq8k^gqhm7 z7X>C^HAj4In>7EdPv&zAnt;uCc;(dEMd|^v0jt1v-5Wv@*X`*2sW!v&g}Qgxn!KEH zWYoBj!NjI!g7EzgiWyzSP&gi8ZA_T9zFI4RR)3}a2+I^C9k@`|HN-h#nTA8;@J7Tw zaoqe38_ordXdTce^{tUq%tsyF`zSJ)EcfGfC)oT6aV;=`KVB`DOo^8}DM(!8<`pWFiR02BBl%;oMwRFW&22FL! zTCMJETy(A2^{SpoRV z|4jF<{hIM2H{1$79p`B<1-*mB|C)b~vRCKEO&i`7EW{+V7X~zMD=s2dg>bBC&v!{j zuXi=6IcR$k0$HzA+^0j}jD7tS_hs{dF?5FGwyC9c$c$iVzSt$97q^<66ZBjlgY~K*H^`h_4I3W!$f(aQB#zbZ z2K>!gWpvYI)3OXP*C!zmO4AklDspXI+NA0pN02Rke&@3KH^06%%F<+NC9Pi@aq@}? z->RD$ym#a@GK3r=_L~uNdNvgdtU^jN!Np}j{_-S!)Q25+CzZUYp{6C{E;~3h0D8SZ zDb0kTu`PnZct4`Al}Ua32oz5Ad|kzTrZ7rHOG%u-1X|X0hkX9%>^dfo4Fp(n`XCC7cH+y$VV0wxhexQm(4#kt(<5! 
z`w_@Fv|Rm~4A7fyb0>Rd&}d*kN9A@CM_19FR0JZ__~RbusX*87 zajl$5F5f#88v`S{kensv=6fX-mu5-5ZAt$4now5p!yR(IQub+^0bB2&Rf7Rv)V#d@ z=xQN&F~%De82jWFm%qf@K<`-AXcJo@-VwqeNEc14%)q)x1})%8CI9=@<(r}7DCQ(8 zmqGeaK2BWX9hOi1FceOgPC3k3j2-ttTI`M&tOhp9J2`1Lg0Udr&5~%l$tumD7{6?l z93v$N-pLLX{hhyJ7@Y+;MJ07MyjBQ;;S%5J#8@N_P8v9Un1vu>!QNggO8Cs<`@J2p zk>;vDQN-;BHnrJt9Zj%>;Kh;ro-$XR$UL^$k)G_sK!&%_S|ole261D&8}p($Ah)!R zt&>y&uMf`!w-@&YFpbrKNkchXpK1{e>#6qw)XW(|@k#K{=+3wBSp2Z}17sU1{S zpSi)cW((0Qd!}REo9MJiWwF; zbe8Bj9b}sXQn`tbTGP*|96BhNdj+j29Zs0u$}Jc^)TPNw&Rlr7*-p=w;zKbueb>}Q zPi9j^uy(=S7jZnVTmJXzn@xYV{y@>wvt8-0%fv1Oh)3}FqlJ`wkkO*{^Al;Y$|@zF z(55URxH<&oc+E5ntirQKqpw6*?krXOl8iMzb;$_#?K4b2tU^GfsvVkB&)!4I`UQis zB^0K*C0eFEj^)XfiYRvL-CSIqkVh3S;I;k7;B#TIKwRfn3x|pcX@Wwl4vAroXI<+O zk!ng-sM*iok@t-j5=Z=dgCgXT?k<@F@odc;a!ded7$85n;R7`g;GR#Wql5xhP2Xk@5-oJ(ZNTg5)14_;9%NXH4NvyEba$v<2WG3pZy zXOqVpDR#IZ{~_uQva5?iV@?>MO&nEjq7m*CpQ{c*4xUPyy<JB3!N>e$ zbFVQ4a|Dqj^tH>bg^0*Tlgus(|7?B@~OHs4fzyyV2d zo$8HClB5|wx<0cjv;FHccn&~usm8ehGr79~!76zZSr>Ybi*(@9it`~-IT2R~q5Dfn z3S#*4#k#GT<+AxY=T>xW9-}6bZ+kMGSgyhLqW##Qv8FJnGLeK~Z78d!)(t~+bwcmiqAy#8%lD#|6%%9e9)0SO+zq26xl=21?!Dz9s*rgTbQCRj%N=!YZHU6-QdVNQ zT%}N2H0u>dBOq81@Qr^3K=)MUR4Up%uaPQHy^I08mi9YfjDCu%+kLAnSiezgeT+*n z)_ZCl&EVqZ;+Va3xFMLR#M~5)s?-hYa+$YjhPe(IXXqIlTsy|H#8rCn1CZ{MW1pd{ zzwl(vc{SXzULh{4!!PB%T8&K0U((%pbZbkYn|If@d=J_9f_XgAPkbjSvXvhf~LOdJvd^5d^ zv<0)+f0yIvY*rOZk$_S1cG4HENjP}Ol45k{7=Ygo5))fz0FCf?9&;YQ0^82-M=s8)G=bRF`okT zaOm;EN~zv&4riUy`zCuTgd>IYcowWqvq)F!oYmp(r{s~7bT(T-9tH=zb^M`jSY^aK4wNxWA$aEkJO~ zHUe|pc@@G%uH~A&P{KKr+a=4D!K>fMPZSO6gFnptbVtEmAG>H19W91}2N_&F_TEdG z5Tg#Iv?81RvDu%t` z+=Q3SSkR)iV<1KbUlzJJJ_{N+(qibOjL9mRt4l)PnN{XCWk8qOeOj%jgR-tE(tv_X zBL@^>A_s5Yerm>sR^P^1sADZ-L)j!u!g`)QdfAVTuOgJU%F`MQw_7d&>uo`{u0i7T z#J@IZA*;0$(a*rS7-U9{x`GDjd3-P^8coFALRz^Snud_Nwaj_$q&F}`nIU(cK=ey= zv8YdKI-Ym85IWR@_!qZGMA%(t5$q6iu45UnzE?bf?OdjQJl{FQ9gy;U5zA)GMSw1q zMp|x7fw?(Psz~QF`OoEjR!xVBhAT&@lwu7+#gWp&&-V$vnm714|Ao@^pQ+ks)CBnW 
zsS-@-%?;ho7n-T*55h9<#2}6cc^>!rQ3?*lo8c%x+rv=F2IR-+ze5`ldyes)6zaL2Yx9mZ=G;_r!tMRAwub490@ z)w%tL-FcU{xZ5=QaOho)s`de;q^6oLR<38zMs<6Hmg}C5idB$ip=S8Sv#%L>p5hZu z(N?kiKf0DV36h7~?3ma}bnPo9;%-Uw3st7E+c+!=Qo1 zT8%a5d%4^eke4BAY*11FXh+YLbW>e_%~$_7o0oMI7^4u4+B}}ZGxflfJ937n$l7y? zya(UaxfE9@PqS*m<}Radhb9RX&uvsIbQVkl69G* z73*sk1N^rX)lj>3XmU)BZUVKvUh=6R1Sl;@SAKDRh#%LhcD9E%WJ2rtIbPR-bJ7pX z#L~jp%{Az`M+ni4q!!Ph4^@0K*H7aBp2a#V(BR(nwZV8D%g0o@n_Ewi`6i`3`JZuC z4DZu?N&4bqu66UvQne%DOcd*TW#LuB>DzXZz{i!A1UrC3GN?1n>WZxYo3j&!EPb1C9wvvF&X#Gkq539D%q zQ@Lhli@?HxPN2<`i!B}K#VPc}T$i7C#nN`_&|OE8t^@WYmWQ0X?X5UFT$kKYuOfuP z4p(*Xbc-+(Y9-nimxlPgm=otY=E&N^q|X_sb`L?qG|`%X+qP}nwr#un zYumPM+qP}nwrz9%&7GLV-OQ#Ua+iyW%s8js^V9+Ql*1NBa+a9%_`Z2X3n`{Lnrc8) z_o=uLl4%Mb-k>e;%4jMA4Kd%|z*;nibiqDH(n;mhS&97jSvEsBZn9BNtLOGaZfu_T zxK>3dFLo(mIasLQ5`|KaGA#Z&nu9-$>BJY2Q0&?F=brxf8uu&b=@%clY&bfR(kLxw zi3)BVccL(F7N!$E!|Xp8KJlBi^pN}|O+>+bZYHbeu8!P~ zMf;MRCV1UnfzO}9V9?yB*zYFCy7;zD;ZN@>AZs#x4^g5PB(ong5cX;STXZcjN>9WR z%i2;R(ZlO>=_rLWNmpKPN|El1$Ma>Fmg#;S->g(iYFbbhe#lCDAkUL6vATyb{}{+T zil!bt^p5E*3aP(mAj~U+66nmJiLC323xPL=#4(YH@6m7FnhHj?Q=p6jR9Y>8s;Zl_ z*cptgCwgPJo;ix4doMItXGNNLPC{PVob7dzCh{#jbBJ!2UV5#?>)pMK(>ZI9GNO9w zmzb0UsclE1(T3%v_Ue`s-H)Hb`6y;TA`F9NMMlavTp>&!OxDXRaXG z>%8mG5sdc{S6ZEFE&Clr?pl*yQ5QIpd7>h!Ol1r`36NEPNw&Z0S`-Tl*m_KVY1>Z< z)wczes=@+B`yrPJKF!d^bp~~y*=c3@rTGg$H^?`U+e`GItIPwZtAmbYt`!`6bDWFf zx1E`lFn!OF6w`>faeaa~{7rntFmh2@v55-U_cOq*|B(jw0?%6vmd_04fP~(_PAD0 zY?pfCjSb@PbkVQK)%#&GV>x8BK_Vy~$ z)5|x9#(hDWD+EX;FC9ZrYl$(WiMB7-5eHsr6C8uXV@GfIEWj`WG`*P>7LK zIR*KvY7o-ka#LM;&PjSb8@DZRX-b-k*ZZ71q>R{OpTB~73B`&9EZ3inKrg7Yf2&Vt zJT&yp`r-6E1w9;FggwlnqpyW&JFJH}`cuP%264ti=@0?iMjv3x$79kz)|~d9aVpo+ zJoaX%5528b6+}~6fO*R@oCaZp!-E|sv>WJX2Xi|YQ$aBRs27WNEy55>&{b+ zPa!9(P@nA0@rY!0%gD*nR1(0RXLtG$-h>5d-3zCMA^?ng(JC^{7sz6j$vNc-`XDaf z|R@Y^IX3=||I-IiZ{-6z!bFVK$0h^W8-Vt}Nu)VM@G(9kMa>4N-DqXmn zkXRf{+%E}GlNd=Af*9HB9L8^gics}x$1?rjgy<*m$t-&YBm67^xvR0E&Bnu(VmCE( zf|U6$V~RI%TVSjD`yZ}_B<3Ea-G`fZD@={V?Ma%cCmh?cPS%pV3d2v|Uhec&L(l#z 
z|Fw7L2Q98sLqZ1YkSlQo_M3Ip;Y~uIvLfqu1RucKlzoq3a47$b^v~pow^w3KF0ym% zQS{w;Gs#I-_A!G)T7mnqd@nHTAXP-IEeQVXRHpwWpOFsa+{61cO9t%!BF+e@Dc{rJ z>6`|7!c8-__bg{{TnU>BZlnY7uen2Rk46D;6`4S92_)|-2@4rTw1xgaN9}|Pp#U`J zsGrov?cC_v+g@Gzlp_n)aJP!quYGt<5%Yhl8)AyuKpsHl5x^kaI+*sv@G;=18w%p@ zO9viZ1K??%*Aa31<{R%#V;Y^>xpGp7x>k+eqP$MliCu*$NOtZaQg70DQ^9xeinr7+ z-I7v^OAf*>SutvGk);$AGCfNZT0wbRMa14-($z9Qeo3RFT+@Jg}k8oa?*V`AS68_!b9_(YtRub5~9#FF|=tgTdva9M42a$ zY(m(heRrgH=wFFZ>f~o1Zy%P1%T*WZD`94yC6q+o$7!k|IWbqo#7*)P6)XkGVh#ecJWb$(=-cH|NJWBmgb(56j6LJnS97WVBH{UT`Z4ob_opvtFoQHZBpo8Xt8h1et_Iu`}WnJLZ-|atIxqq zllEQc3KYMV44$ zFFvI?bwmMq$(7P9BNN2$3O_1qVCB33BxGKi1eQrLq{Id(pTVGh+0a7P&i+ft2VgG( zQa0WyHw)hCO%aqCGCjMujAl@Sv}Eai0Y583x8UBIqvWOvqYj)0|BC4N#OZWQD<$41 zk6=n~|M~PFug}0=xp}O_!_+64Z&vgAu3}`f-yIes5%0#}|Vvyx>+O@uw{_8K^sL zGveO3lw%kt?`r74>s58oLrI~13IB&Wo5)Tgli-OgTa;mi4W#$bqYVx-q(y3>dn@9fi`FI+cb+4B4REpD5g6lx_ZG6jaLMC8};@8w3;V= z%laNshTxmol2h(A@qwTucPsWFPp7-(jp>ucJgZ#!m1CBbh!2ML==f+0y%nQxS8*0V zybERfU|psws|)3%(K}Ic6nuE-r4KM;<*_Du^OUu~#Fo+gpw#K<7zR{VFM6xiR2}v*A&Y*_6?vUTb2YORa|x-UsKE97BH#- z+dC(R21B>qFtcJ%AeRJapRp+R?;kf<*YPiA4J!&WdSW~t^C!D3)mc<%6sDsnIO>}o z^c>^S)CC0*snzCr@yJxmF>g840EU*sh$x&t-r~7QndRWvtjai~= z>KjElxXHcRceue;bUuM&baJIIgjwZal|X5S>?n2pwct(H@J|XqdyC~DB#ZOh1uPssYW-1% zZ!I%=omi^GFUH1roM_q#>f1bW$GDT{IRHok`-nvMW`Sq@!BcQ1&O7t2L~9AYz_r*q zZ@|Jf8Y1vkYgA#V4NhoH0u5DMZzFvy3kE|Y8?ceSC02=_5hHUA&onDbW98wTXk zToXh(F>G@_lLLdd+Brhx6t|WrCmno|kzH9q8WBGtW6;0}M-#m%G;?!{(@kT9*ysVa z?J%=80n$7DI&mm+di58HHEuf{?TI*!844T_aFij6cE*Q4i1D|vSsJ!n3NoTkYu9@b z#RCc|5~lOGXPtZ@?*0J+aK6+XsdGCZB7^T=#Jpt_8T{6@F5Tu`oC+{ua{>o#zh9Mh zQHgoSzs$RRqrQ*m(F_SNqLx!Ye(v;7*&#FzmDCvp)=1`OCZXMvTfEUQHs z&I4iE-VoC@o=WY(aD7-tkaMTMx!a=;reF=6HHgy0`8STMk-7#>I&xP$$4!#s0f-^U z3#e$DDx2?=k!+rdXc1-sZ?&3#0=OHc1SG$Xw-|nUg)hh5*_9&K=Rcv7HsX)b1hXyQ z9=Dx~ANLXXs>}f+M@r3W!vkg!;Cn#DloAh5U9@eECq7sEFvlOM;ydiV?>)a1?&_JD zFK;~v?nCzeLRGc!M%pGTjGO~T` za3Ds!pyvr8B)pF$Yh-|d42^GBYr6 zMgiEp3Nif^dXkt%MA#|6tX{8<4!5>9NsWpPSn(!=E-fv&f|jc)XP{n5-gc;rPqWw5 
zaU`Y$jUOV@;IBvADHJ<((-liy{(im)8=Sk~T2=7;gaVW*u~Wd0J>riN!cdj5cphQ| zNZMK6+$CCG#$P#2f~2_wu=??ex9IQRoF?3l|M%@n{7xLlzPTGT7KpZKsH6<_nv2?4 z>lIkQAp96N3Ea)M#+N75ZyHMEy3823HT2JULzO+|n9c%mQdm9^NsDgUX%c8Vr13bP zbC!n+V;v%KQQ(ZFGI+t<1^8c%lFPUE;Y|D^o`JA9nA=0526+A13W23$s6R0y-9!_X{l15s&o~>&PcWr5Q z0k0@kU+98hoM-d_C*c%$lbt038`*eAF??s+>gN^%jHcqW$$vDbC@xEaSwx+Itp}jt za*t>5SJ@5k>;*kNtuav4THU>St6Q7-nvvzYSOR;>p$B-V;shGv7wW_RRrWcskT#)I z4KEZm{Q?N>2KUf|y|&d3QznyTzf!E+j^{iC?sk4gk0Ia3dH@m{ z>3{}X8%V2huomsw1>v)+10W)jdNR$494@CF7pYPf1oM0?P^ zl@#aPI?3p|U%7~da}%5jkt6&pHd+m!D&w2R9CO*c^e05S&)e^K|b;}y}+x+4OPh*}Rpy?^c2`ZrWso>8|zuO^2#TE79 zOk5Dvp34`6zI6#Sp;Oi?U3&rbr}gLckqr!}jXt36U;#tj6abL^+`D+~M0-mTg5OOq zRlexyjqK`0hP*bst+>AC%n-tk!19chs-&v-F-U);@B?YVi z8jh~LlHK(s=>sYXjf6L+gGb%MZY>A!FoDuXn5N**-Qg3UA+goWmHb#fz}nf?E#ZS^ z;`dEi)(R9HPw_ZSd7bGFnrbfn8_s_?MecDbkHD*jgV{kq|7D}P^R^o^@RhndC!HsZ zI9i!5dYsW}Z+xMGUoSFnd78q2otYCbH?x}{t?QfAFdQ>%mp(7T`oXOjHgZFvAN7uO z{sF2t`HM7Ego_Bj$_cPnV>VkND2U?C)+=wvHvFeoImTcrqEs}gu~cUyD_g)&iZ0Ob z(K7HB%{Wvj*jM}t*Gw*2%78gQ`qJ6w>9F6!W4v6nTGSX|6up)xQfO_zKme5Gv4^Ll zH|ivjnCsRt=5_jflGf&zqmXH7{Ia^1fZt%x0w@a7;{b+!SzI>K3u3;znr2W4zXW2wSK7 zUZu94#%E)^inpkInO-Fy0BE{<;@SZydHLybie$=G7_C)0t6x`wn_$rqXTu;~w4M^k{X2t7N>M zmO(j!=J7bAE*4PvuTc3JghPasOJ+`))kQ)J6ti>>QbfMErBM%SsAGnjYZ;x6s)f4b1RK-%osbNlamk+vD2LxFob8keCx#DK%mbmoE-^TNI!$hW(i-;$r}!60Sf5#ZBxv%9C+tXlw5_U0+PX`kbFVA5?w;_DY z71?i;NV~HYJ~E-j6z0k$jl>G9S|w#Jc#yepcsNz%|AM2k{4Y2v6AK&5{}8DE;i#NU ztgQdD{C{y&W`_R{9M#iAC2Omd?N+J!KNQuCQfqsMRKT9R9o-EI#f?I(SsE_w2e6&SOESTx}2*cKy%vP7J+qqVR9647{DRH+40r2rJ)h*FLq#L z@R57b&JI$|pJ#r1c4BE}77W0-8DPALssSiDhlfxI8=xnNxF5zCwz&<2LlcNauoHk5 zmeK{*Ke{tAwGs@NnxsCotEnmJ>d!u5d44(35Ud+jeKi@#zc~YtNee5pZ;w?l7U5TI z8o)$0{MQXH{J_qQlv36~mRFS$u654}z#f1{pw^bHZ|N7ET?>M_AKwx|U0R(ytPcs` zyk@|_OpuLDUS3{}+Kru|Dya&!s_94no2ABP8z5J%))ioU3Dp?>vx?z17A@W}utv}i z{Cx-2zc7Y#d;#`(OLFtVt^~L^8G$?;=iAOGUdZ2Xtod&XXCDmcM;yy4e@nkNva&J~ z06ag27~>einI#M(Fu=f#-Ty+LX&in$@YXJnz`wYw>goQAZTmwTqKT$)2r&D8wv 
z3jX=~E*21XXUF?jG3!sKwIiu0D+YW0<}RGTza)nDoZYK!kGIL;Te_5_mWG_Rq+p^D z0+UA^jg(x%1xIs0Z@;hFZw%S5iUMNa>E~{s)yetM`T2(*C6z4}Ew#_e?%;O1^u+w=5H8W{i~ke>aS}fR6#~Wp++PQP zfZoZfbr13{Gi}c(cF#zH@Y^RRCmS$Tz>7V;FNPH2QQ*M^*wY`tK;13A?~Wh!FG7Li z9RQ8()L^_^xD|R*f8hj&HU|L6FN#p+x8yUAF#pf0ib>$LA|$D(hrJ)jIAXB$8YDCt zc<1IOB>;}#^G|i@j}3%teckJ??7}bO+~2pUMBUZV1z__pZFW?4a}?UR zz!I3P?YFDhZ|bC9VEj9ZTWceTz~)1`_{B?lL)aAE@pt{3uNQy@XS@5aBq+M1BZqi= z2DbmF4I4r!{Lcuc;@7ShfQ=Q`hVr9L=FiLew=v0)#jUa4q4~d2*?S=W+FAiS6xVT@ z{}uq=oCHBOfRCS68UWVSv2q6C(ES^@*YNeuj-nqs3VL_|S?_iWd*SH)WDnpEfa@fG z1ik=UTl^9L{bVn8!((s;Nxwnu0M?Fv2qdM4_#hD#&)|ZE<=@y3!iD9(g4qCUFZ>Wl z%13(=07#(k;DYHHKiCfT!hxBBV8D&!zczq~8GqOh0@pJ+*Mv16X~9`vZNmcx@-lF) z5SUSZ0&oD>=lBRhvybq{!NYCMA4&Uds}nmf_^$|`YJdLm-mCvT$kxAsh1AU6;2{dE z{@@`dPVT`S1>Ao%A>fBs5Sa}A=7qz#e>5qEH!k>>5h`c>mg5a(1;#P0;Tyk|d2M6= z5`KH~{%Av4Rxyt)kDlY09E2ac+PMX1RhYm2!UqOs{=}d19DMEK`c(f&&-_&`@L7-j zRVHrcqzI+hvFrTG2^D`io&-38YIFttQYV2mvDcR_@Zk_@Z~y!e0`dOapp?gN(GbF& zAIVX0@Y^r+?^&(tkB`%j3rW-8#^|3@$zKR-f32J=0WLQ5_*jNtItAcBEt?+Lb^KE@ zFi}|d?@QK#A4&d)6YBSnw$R|jR9Z+@6mI_@CBVkdc6^B-*VNe`e($N@UaeowN&M5P z{`=bs7(hU`z$wKOv)rjrL92!S;1qCiBpKDd1&Br*m44LuWM5d;T(&tenR`~BAS4nk zoGsjBA+4)YGAawyJHoX_p)_)&*`zo~svO4ziTp@gn5`YLz(6Gn6yOekFrgXoK zy;@$M-6gkqY%z~rflixjcF{;VLZ@l2YvT}1Iw(Ib!zA$pBpFe7tl-pgbxq=4!zND? 
zn5YsF0ZZz5h^jd+JrAlLbS<%dxcG3Tc!BYZSQd$@lnkW{=M=3DLCfvXP0~7=kCL%J z8~Yga&u>-fsDIFyU=g>@QnW?n9_I1QOGxSZN?5A*A^nkSdqSvwMb7hZO)c3325Jh=^sfU}|T-kyiqpG5!oqq>eAaNhdGOpCp~HHqtpa&NGq3Km3` zC`|cgS^{p0Iuw0$qrGhk2^)u{h~-097CRpj2HH(S?egHw8B(xO)vt`2&x-KN5^K?p zbDtXV{~H5xza82pK{^^s$%hbxK%;8bpr2H0kz9VKYu2{%Y}z?F{1|q+ol|{2FlsEO z-Xx`i=d~2s-6|_|Cu$~hGGc;Cow*kN99IejWyN(_rq{`@bc>9-qIhq42xNMm=58+^ zPvwW-gT?NrdZESKaN|fc?iQ=*qH;>x%*^0|GPqOe>zRlK=Aa_*UVI?f-hg;sLGfivwJ;^*uLAG>!E1#L-B}*yf6NdSMjC<##Lzo5Gz}^q-7KiT-H47OIYq zF8WfRdu39M<`7JH+Ra3JRJeV-|bhZWbVkn}lU2VZE z0Pa(iK6xVq8(zr=zcE%Ej!hZKz70N5X3k^{ zJT-Gx`Rt?uvNNNP?@~go?W&LJHX#yP7%Hd8X0VAz2`*Lbh<4x4nXn++A_t(wt#d4{ zEng>g9!0D9GVB>FgTBNE5obBePyqt1g%h);ZP>LH_-j!zpZ3da-rXfEd=bj^J(QTv zpL$ghd%;&B$QTu$ail=Vt3Lf=dPhRReB?*p^u*{1vh|Kl<*QzQRBu;` zQOa-xvHf7p4k3N7wfEM7(U9B!`kn{sy;)kG{$MkLIKmn+c6hgdO?g<2CiY3tPDr>O z_R~xMo@pIjw;s=@Iv+gss@|&1d9x^NX6R6mU*p>5qm(Rkyg|GJg&->!+l_qgu5NXC zR!0-h=ZP_31+ei)NdD{m1KZeEOL^1v)iRQ!&mDgd^qA&&)vqa&wfe0%G!%c#(kHQl z3_zz0eaXhzvhaoS?K4;E`_sE`hD9#<+6F}}{?|k@##tIU73tX-it4(M)mS7jVeTpH z3F!9CZH{r6wMF%xhB4!v+tX#s030~uyCxe^@M$G2Y~V{FIz@+4zr zaVwN3?CZ{f_MV$my+IKfJCzknXSk)$RkY@7HalarplX7~rIc7(U7_FQ_>W2GAB-Qw zcEF)zgPU*J)1 z!~CBP(YNEL7tOtp>(B@8oQX%kIv}xajfvN8z)S@?aN7|CXsxc{f;;MEm|Av)LoLOH zN&dd9TK(Ygnn%@kdLZ}{489D}`el?^&$2$(A698}DGtKKWeHO@vQ?Yo5R1F7%VukWSVta9T1;&d8EM`H*$8ZY1wb3}H_eA!U&i`C(af`~e>)Bq(r8hJs^tWhVdTwLttvuI?be3-? 
z_|&1kEE`Mv*z9i$V@&*!U9gcJfamz2)Mg{FU7G_kv}GD#%LC5+1oiRhbNQ!!_vk^7 znEfxFD;cbK1#pm8$e3P{@vg3N8AbjX%}8qbJX+>H32v2c_=$j+JlxxQNn21W*e6P< zCjRgz*(`SgQ6w-%SKJy2^3ax-7Qk_?9DG`2E4T-Zp<8#E@oi-?2n`B}Jq_vcxJRpEH^IMf+Z48_&r>S-w~13# zz=r_G;l}E|{hixe1Kz=&e>z14?+LzMbib6-|M8h$%n^`-q@$yn8Z|pj`&hEQWZ0hh ztPdMK0E3`twRQo|*3}#6t+9pP74cJdN)6CQ0iZtg3v1cuw+uKjm+%p1$t^qI@8;xq zjQD-cb262NJnK11$Z$x?y5yx8s&J$^BtUmwldD<0H#WJ<0D!cu)vw7GVF#OA9e1ah zXQy7To#eu90Y z=L88?G`llYUkyjv2G{2(^lc-B_ZYDeFovC0Z*X zppwzpB)IAZ3)4h2n->CyS(_qL8PH>0{*pIXPPT`v^=}sE^{XU-;zA|(4i@0LH8trO za^b3uaA445JdV*03G1{Cn3Zm#l@hdbA#K+=(CVWA;C)hnHe|7o67g)woHv6~Lu;Seex5Fk=irB>u^6#HM&=2*6X;tac0aR+EV;Ff^ z0uF14K_$0O;DfblkDI|j^43YGalhH)udRTS62<){_zNC<3)2E2A!8Onud&2bv>-{D zXS`qON~?Ka|5BCQC-*KDISZ?Ka(ul-;4t-?L<=;g4J_cvgi;@Ggx-o;^xBRGtqk&T z1k4cIW8&T`NM8%$$vj2=W3o)VJHN)tFmprt&AwgLU*Z|f4|I;L2P~FW2b9ECK6&)a zCh8i%DhlRBiv}#

fQ1-U>@cvw$D#CbO1FBJFT5TDe{F?|f%%4vMyUT%d^iOk>n8 zuv0!twN~fW5(3sDo)g|KunIDwci8^O2;CS&_~^K!<_M*Uuig!-n2#3abGt~yv*4Os zlA#gTN|4C89cUQf6dBY`%hs5Ejbp?If13!>y|j~irFj5e#LITw)EDZk_n<5%Ij@Xg z9*ErWV?c4E;!aZDVKhk4$g}2sq7AX}mf%_GcH}CiG}2|tC_8ow#7|1tJvR;luzweUItG+Ej{H00CA#ft#aV`rcV`QqS;X-SjS6Pl zk-E>eno+AVU!76*69R>My2zmt&;K%_{3Wxunb(_&%>m}fs07QWh*A-`%+@hUrIFMY2)vo{DJSsO@ z&LlC*dl`ffU4iz!TbO=R2(ILh-<1lojAKd)Fa%B72I+6ii8~SK_3m-kE)_X80 zP5K6RW^_x{O_rSZ*Q%>zvK3+}Q;?`hMBWO-8dlkzae^B|;xl9t0A;oM8C4h=dnq>v zVC%@X|3|3#ZF4-4cpOIB5a$!{1}{VE3U8%vu)q|O{gigaY^H(-zbM29oZ@KHhZjw6<-P?^LpAdUe{{ zsLJ`rnJ}AW)ciAM2%J$MP9V7472$t#)!#^(D||dZDL2-y)K|25R-TP<+&Ig|^uR{N zi5+#0uz2Sg5IQX*1!LR_7zkY@7Uf5Caq0*|X)44lf&ZX2=293r2&&|b{!u4p;Ln)H z@MbbFSA3-xx72meq{^(l;2zCts<&)--Wn~e^V4GCLAz9QXfjk4|rz>66&Eb~C^+e}$id|cG%`-k3d=w{HoVY_X5^g}M7%ywa97oKYcf zw!4sM5$F2MUL|YYi7%xUSC{>n`*vh_?Rn)jk$Cxu!*R@fg3lvC)p$L2EWS&qTYu7` zbEpr#0J=29=RehuRT23)L=#Q>Rv0E3L^gU`*)Fp+7^h5= zR|w}EU;p93t}t^iGq<^=+WkO1=&3}>*8Md?A$Bsf<_1TgzDH2S?rNf#7NfJzhJEPC zXcc-O$d?3@tAA~4l{8C~2|tBzkij%)t7oTx?xT)SP%Ix`gQM-5C+=uSl7oUieq;g_ zxAj}Hv+kVekmlmAb62YFZK;EXD#xS`h2B(fCnd}!JDmP+A)E@E)a8)AA6LOp{6%>X zC@gYvBG^j&op+ZOh?N`OEqPIMEnrBf9fxS^C9y{>xW92{ib_sr_QL+0&0U^>4ZwC8 z&Tbq-;jN>s5wfpiTOEe7$UX}Z>RIW@$u}1 zH~%(rt1Ki*2%as&zFdnOC$M|mDCXk) znaua#v7ZB@x1R=mt78~s5i9RA02rT4i`;rK%WZHkCM6|)sF<%Q**B>xTP$S5Vq&=r zx(s)oDvCTXB4I;bg{W~)71F~U1fRYL&}l=g>y(Y`qoAUxEC2&Bha#xv7t?yUI{)8= zEmRAuv)s+Ds(!N*ZX2RZZ1Z$LV*%{5<$u!b^?5K@TMY8oL#7Dor+SJ zd)$t=ZI_nrHnU9h|3tnRdMs5WxtB9O+R=1Jj$fV4IbFK^eaYu;kAo*3(C;zk?SfCt zk!V#n)3VuGA=r{kVX>fR9zYjr6chRK#BCd4KSt}?^>sFiDO|l)w)n)EteWpeUZbeQu<`XE(LW6Q z^zcoH+tXLkdhv~6vN}e96Aw%O+>;58^-5Uf*)T{bQ||C_1eM4po%jsPTPXd?(VS_O zegkTL4>TLjS@UL}TJnGgq?^F72z?)H_Gd~wYVczMT3xo>aF8G?dsb~W8U@k7UshiY zaLZ5R80a$d&CSAqmS%4rIs*iKZ;D?uTk~$m+T_m!QwOa%pkQOjMe8?;3K^o5uDm~klQisbNO^n)?hE`hZP1wy(gP3v`ooN(2J7vZV46*V|qxH#{TZ1L&uT6XR zy_Qt|JNrE;V=v*%YN3#rmxrnq(uB(aerM>rRB=|(b=x;(fwquCg-w|45>C85;Bq&X zD{=5NixUzh4J&Tb%@;-wbS|2?|NAKuRVGfVswCj++^i1~xlQdCc6dvDo5L!H2h`=g 
zr^5}ZJpE^`q4vghc8sXp&3e@|WTmbI95jrvt2>_-uF7#CzCX5V+D-p=keWP-HD2?c?E;M4pp-JmI?X z;3QOZaS|jYFX3Y@z23eV=mPQO>7vk-n4WN`{#sbXQC6ej-_7SL&&EYJ(C&6J+WPqBP z8ww7c$hny+1yr}57>f4w`MM#Idn4g;8Hd6ZQLhU}C-Dh;)X%~!f6u~I4c3`fTC>8h zuQQo9;QEW3-YJJn!^;zGHXyeqiJ4^$zA;n294>Rt3C9Fboc1TuZbsR$N@0!r=-H-QDoXbPL=GXxp z`oTNU3zvPYpF<)=mo#M6Vp(FV*_kr4yN{6=1;M*Pz%uXa_HVH96lb}sJ{GwbuHvbP)3pwPOF&MzCg$^meY)_MdVh z5w8A0Bac)vn9FUEDzqKT<+XMOE1+%Uk=J$&RIPOrhly&)TS?g zD${9p(j{vl&m+=%sFM1@OETp9C-8;sSrC({lK-pMi1`Y77J1sDmvA9?qCWg%c|(PW z`|l2>uWU#v7B)=2_RS<-!t!)aL2Yb^l6SBWQ(ix+6c`tUJ&StKG~QYn4%HOa=?ewx?Ib9vS&rDWQcw z&@UBx5oTM|RF5PgAG-x{J+IbgAnNmiNF~IZM%1aBGgx_W zT6_`v1Zx$t{tfMysux_`aW!;|SoiQN1GPgi2`Ty{a1RWr@3V9oc$s^HaU~T)A<0Z9 z3jXxKa-%Fh%u zwDF-Im5{!W?rnSArjV?2I>K|j-bpj~-?6V5h{Nmr&1$Et$@{xH=;&!F!jfxt54l(4 zY#5OaZ3v&&)oDgjq(9QTs_^nr5g;ZTYh>^S6wc@|3gsfa8ZT$Dh{C0EB5S|g=D>3kwmtX?DQCF^V2 zqS)(_+OUN>I5Fx$1z{`L*jx}4y_(}3W$rDfb|0sPkb;&EYAWxoeCTTDx2px51$SMlsLF@p$yXXXCa|ay~G8`0|YaG!oG^GQiVy6Y+X6aEf#iK!r zHBsrm;6pr;osjsjbM(tE$nC#i7c2yGjtP2G4B(G4D2p2G({ZI(X@hVhEv9^d*@gviw8UjECox8U*YK_ z&zia~%5S(G+b4z`W#?pGus{>dj8DcgsctU*AXJFS!r&OOkZ7Y)8FLk|Hq@=XJrxz| zXE7^>!-R<#eC-g&tLk~Win(fA)0%smkROGY0RF{%!G6SyFlH}2kU{f5#Eo17V-Anj z*OV0Z*dj5JcjjEt;wKsX7jkRmfSh*iAqX{*dxF|2bpRR|0eoGO;?t|mCs3!*O)=oY+k_dO!`UtzkBLfj2fXjEA zVBJt{2fJye9+Pq+LxQ*QJFN2QRXz1tU%X@9>x@5Jr)lt!`Qp6=8ofiQK!6X~!xII8 zQzj5G&aE=KO{;SJdnPh3T?9^Us;IYoN@%t-Zq9eLStf+t?CISSM89x{;P1Z77}iWl zjA6}cpI=$A+>uS7iTwlx>R8~}Gbda$3W#^~xk zb%|dD9R#$+rzjV@%>3hXB)vK}7(g^Ilxg^WB5_w2@>*>(uNNZSy38Y?g8C8&1maIj z;ad9njHA>7D`%-i!+6<*CI=GrnwOB>%}1QF26(EFQtS?qq*`Y&n8+VBsU|23i@&JP zi429_9YN8{+rl)#=@0&DHZl!|X2As6(c3!;eCpY_MBfgH1?l2qZWHCIQH+P)izmGE z`t=`}16yo9Nt1=NgiGMt#}lg7Ok0TicM>mtJoLbJlixU}?bM&|6)MKeKWmV+bB{sg zM7N@$i!c+FTnaC3rrx*JR_PioXVWwWmt|0t>%IhWBd9A+mC}v*n(hznO0TUyjnG3- zG`0(Pt%Zn{)dG%&NVC%>e-iRkJw$9Vi_Fpd+$4O}py$bueK#>1H1v=W^#I$*>JS6l z&H$I`$0I^+C%a{irs3yDKhV>-P;*sfln~G`-6ZwT;DM8A3;L%kVT~y^Sg!t)Oh@b{ z+`+@xFw)2%FG1`d;9!e{oTdi_pjQK(Vo$KHF;Gg-EKydkrO$pKA+wz;;*iUPPHEFg 
z?(zV0Nj|VmlT<5)=;8z&^wTriZU7iO7~P8W_{kHe2iQ-r3`=GSE$*GB|SR+<^_=>VXxu=})U3 zyUQRx)~t-Nz8(ZFOjnUVk-SvqWVc>z{a(BjRbSPvo5O!^V4e5AblsG*mMYS;F}&1R zq~nD%nM&_RHAV8KQ)|AhY4D5sn=^|_AHiR4T>N$zqu731EpvhlwYh%F^H9BI<0)6Y zpN#e>EPZ$mZMD)*Q(lY1(oWpjj>W3)+UXLNUFFAVnl}tZk+ZOC@9XvBV1Q9r2oUUo z&ig-R(jmlyw9r9(b!{qnjlNQ;G=(iPy;NVGm=!boCgW2A5mN}49>H%nr&^O54Z07< zWs@dw(kRlKK-i&zxR4XAuJ`tkWC8yNI6%k0PG0#Ix*vsy#mqgXJ9(zPBmZIs_AlZ1 zFW2S-u@kVa(^sgADt(Q6Xl$JWH^#$q2880fdZj}Fq|fV&I93JZe$(R#6GiVg?$LA{ zoB;=CEPUtInls#|Na2O3mgTARa{SD&o(Wb@M?H%Qok(dz3W?dlOJ7R2U|+xZfE&}& z%RU9fAx<4uWQ5E;r>}k#7%mm9?41yoSYZf|C`eLM_&W3CiN)waI!H|Iyxt=uIt>Fx z^>q7akK{CeH`Y~|K0&tq>2SauU$|#<>-(&Tr?}e>yzr<(L(E@ib7HV;AExqX{7jyo z&NopJo%otgNE2&?^8!5@l=o={DzV)O5#B192c+Gm8(34deNaUya3Op_*uy_01g+~B|se2(v6Q3Z;|-W z@s{|>L(D5{hha|A-6gu5WBVTG%iHEq3!|Dm<+@nmyl~>H`r`HT0OLGDB*Ki{h$gh zDY}N9te$X`ISR#}oZc6>Zc=NYvEI_&@6p)-ZtY)AnipR-!&?y2^=47Y98Egb_pS!| zdd`K{I}&KSX<|-h5Cn6&(Y2sWYlx*$>Sj!fW3&g7b8%OEMAe=!VrYrY!DF#UAB-bl zni!aA2qwBp02SH6W)#t`ee|_45+JpxsBWn3Fx)16#ng2agCSxwiGR{NaI0j~ZlJmd z>m5Raxbai+y_dlPhqjW78r-+c8BvyU)%UOBcn6S7T`!=ie&rGy$8jG@pD9PTnJDpk z2sBXGb_vg_^z-79!!Gd5#J%4a2?7Y3Reih&xn~)|!BrK0dE6d36VIpjY%*1?^_0Ri zU34FN5w>9(4Fx_BUaqv(zTog zCJ967=KdhsaPRn;@Af#QhVsIw>O*!gF3%Q1^I6!VkFJgr)hOZEt;meyr?PSUgf;5| zj+%ZeH>MuW8cif;5RxE>dPl>*V0(J2Id$5=DkN~dcK-M%4eM(E6W7wtFF=9QkzKDO z$X9pL4X2*5O8uouu(eX_Nd}3f1M28YZ`b9p6E)Yp{PsQSmsd&FM999 zc_V%WnHy>uSh#`cJg&o@S1wY=I6N8raGZqgNrjw%9r>x_b^Mr91eH=#yS=ibB?z4^ zMY+GSbJOfiwAncN=wV1efRj=eV)wq9qnA+N`>Vc2w_tt2a_>vJgG(hf7~|V7BeIMO zJ8%3^0AhZqAmJ$=LY>JBT&2BHalOR|~SOn?~w^Vi)Tvp~@a# zXhQ5$E`l3`yO_uw>T3!*6|e#nuG69-f(S8j^q|wJ<|SZ{pFfvHWg&*AbEqJt((>Y)H#_AA@5enS_rap1^xtS=meH&@qQ1vOtBKrU z&#px&G8!fK5L;W=JJ^9C;8E#ph?0}6U>4kQ=A!JWtA=yVqTh@Uk~OGwNl3d0GC`@6 z49Md9fg9W^v6{!-yN3!)C{ep>YG*9)crCEQKb8~qeH^8ln7=S2fIuLyDb#gUJdyIc zVYY7L6+cTYCn?hS`IT9>;lQDe0#@(2b!eV-g#8u*E%7Y>>g$=IbZ9?6thiXK8zP0C z0dIws_h_kbXhc@x+Ya7>?faKXF1_*Rx@e5X$B{$BMYf|qIcIpAjL7`)6B&ep%sie0zYEBDg+mH 
zW>|EF=j$ii({A(ba$~9SpR$tJv4t@j#ANA#_XAnU;m0--C6O`2g2kqp$^GNXA!iSL z-#bs&z1nIRXsM3gMn(2I-G>U1T~-W{DU#(jEn&^*WWK&5Dv-^A2JhmucwFY!+v^zpN-J9KsERVJ zb2SaatlLBct3An6>tK{7OYf|HvdfNq^^vLf&Ps{iX#!^TV=IRUs+as~4!z`KP$nE| zQp8z5Y4S~XJtc<;BD~%qwYHUGGMRG0iSXMKr6qcq@NoS^O$?LiBsMi>l#q;U#IdHL zK<9+)O1rG13qnkeUh6@m?jkZ%iN(=oEaB>47!TDHeUs_Y@bPtDLjpbZJ6f(W?RHk9FBGBKwDw;p?{1dCC>YU16{k))q1~V#Uu*H8Cp0Y2Fkp;f zs%W15@&zSie4*EOOyW(|Xj4z}tJELu{?tV;d>E%~(_#XE_}K+PdjdS(o1@U#nT3I5 zYt8)q#t0EK$?_nWL<5ow`XgVu28N_GiF=h@A&z-O10}wukd(QJ>Vg5{o|zR!sc&ko zFU+;^+(i7gst1R(WH$Ccab%6|gsaA;2paDti0yr|f(!LxQz)h1#E?p%_wSN?Tv4X< zS#`cN7n>9Jrz?8{1%BTvp=u&DJzYvH5B>5Q*C#CNT`_$~=BC_78AHPyspwwamWxAp zxWZc~CSsh8;#Bv%E9#W&cR;?Gb1{-{Uu9|8g zaocG*)@2GNKhjc@8njZMmb6Ld^>HdUw`~0UYp_2fTPCZ9l~C12L5Gm26f+*{hUI-A zJ^h%5QQ^dy`ti8}1r!wJ8^ol+-2`$}8x^vaMIRjds-h0JP1;Z3gG@gY_Dw)9u11!s zJV(tN|5d$%PK=TxXN6zd-QcbPJk>XKj+}ESfuur`Z+_H6{|ZQpDYddjvc2QE!Lfx4 zN|l7(*cbALW9a8!l=yXcsn56rU&MlxSY(4h6jGQ*W4Mr^&)$qH-(^eWo~z{!U%=~l z@C}#MQF6BzeWHd()(%;e4JW+7@NgzqZs5^p-PlfD>w{mr>*;6onrQkL+*iv_FC$qs zOH_J#Rj%F1c0a>T2UgzB_#*`Q+$!96Wq8wlDN)B8yrSunt^PL1@brn#L`dd{dpmZ7 zb&621*2~_VojgN)&5m<^G9z)Cr7kqo&f}9!rSLhqn$Rg4&l7%7_)6S7t23C#(cLSC zanN$y_Igh&8Vv(D>sm254Tj!nqOo7($AR3R zU%PtqmrvEtDKZ?XL-xO4u)RCd=a-4alpGTnEI&+0_^1kau6-)TIU}uQ|0&kDnQe{3 z=+{{}A`Dx3-);FtR=lk}-_OAQE2(u0k#~ud{?HsZ+Tz^R@#VlzieKDcm470Y?`S>a zyEt4OD+y`RmutP_?zJ&}D{e!wRA~yJf4F%Rvv$U!Kelsr9j#hF^6kI8w3dG|rlb_^ z3@v_weC-i)?}E143a5@g(+?m_SsNR}y|L9(66F*ifrCl^-Xh!qi(|?+Ni9oyyFKJ@ zA{v^MRqSG~z1aAF-Gv+{pbsNy8+dan2?Mg5|_>v_$(d_Y&89rHmED#{8d&|&~|E(kNXIuUJc<;3oX zwKke!7tK^CsaQZAZKbCGGFa53N8rOr$^kp^C016ssg)#$Q-_y30AT}y<`4duS+2FU z1imur)s>bHyjIU|Be5cYJ8AHN(0t}*S$p_r8h3rg=OGl2xD_}qMypQF+!F^Xe`UV)Mg)Usz0aZUXzrR`Wtab$|HYKNtmY)lJRktZE;mM@Ky5`ci{BbNI4-1jc0@$3+O#uexN z4w9eF-kyQ?>(XvvsbrWB*B$4WH#u|(kHTslM5S~I>?eua^zM8gH}v5Gq=uj4j)Egs zhuMup_4O&!eidt4nAC)a@~^oc&Q*)3POg7p*r;L1rd|4Jj>vc(THN!3t&b=(Agz&H zy3(|OR1eLV@xa!9{o=Fjp(zt5j9%3PWY0n#NON`Qdo69IT0>g>U@PUweG0`^PQMp7 
zjq;Qx=AJP92+rmuLy=+ZT9r>$|4b?&Vtm(gtmE6>unBwH=zs+rI#=q3`NF#>t2h)2 zU5$DITU6CTnC76=c}`4id)NJ(#29yYMt%?gNxf%&^lH3fy%x~ zVVuf|g8I?6`KOoI^w|3w#4e9N~IqMRCg z2r>`yz5AcgFj7CjIu*vBo_kO!4B+lf}@-H678Sc@a2(IkCFV<5{2?*B zduIzUwj72F6OIzzvUJs-i0SOS1;=fVg9@AQxS{kKGQ<|RDKw5BdzdI#=C$Lr%0pVE zSLgLf5wY%zKL4y6bv$?sL!1NefnyIwMa6mlndZo{E^d_&dBD;_n#C{2Z^`d2 znr&y)hM1riA?+ko&($Zy_w8i=E|5uUzg7>34-||)R_SHW9p7Au`Z3`)y{K`Tr7exn zSV~39cTJzgY!kwafVm`=inG>UBswx2Hk{QPm4To=aa3yb=pja>t(sWbzYS<^NEBNbF5Y| zqaytpv;^@CFYk0EF+PzEB$UdaD=F_dq{E z!V+4>!t7xj@Ag~2XiTD#>YefST1SGe)Iu8J`v}IvOu5UH4b`Q)TH^=RXI|v%nxe1% z3i5_td(YcfxGu0WY7HD6UfQ*u{dFpdFS_}n=J&)l0*dYjt+YWzF^@dJ|$)kkdCV|f;iZ!OZ_W*eJ!rB{( zbN^tyQt`B=$Dvp&olLr%4ORI}YfXXNeFc_u$`gmKf8%B5*UL8jmHm5dd@m=H7_}mYOj*eqC6Y064AZRAbVlI&Ah%kPN6N+Rh zl^4Ia)fpt#Hgie03ozqn5%qc0+X~$0Mfbx?%uyJN{qZ2!wX_%QKsklo9(ZMDXWw#| zPXCLgySz@tGs`JxZ$%XQ!!!S@T{pGPcp;_1`=b0x>%*bT?!37o<=1uDuPgod-uYO5 zPkeo0vbA`k5-%LF;`$sb!u1lQX3#aPv*K@V zS3Y{}8MNAJOz8tIfD38!yuye1%;(j%eDiL5M$-LMe&ojkXUK`#f!hzX!w(6G;9k=1 z@#S_hj{X<<@+#rI;BcCGoi1)cV;_V;ER7FKY)J+02G!O%H-!w}?IPE6KI+gbS?qj@+ZUq2WscamAsfY|x3sSn z`eKvcs>;XnUR5yHhMeBsuDfoVwY!Z?If7r#}_xjxFIo6UqL-DErPDkJvK{4Loxd)7p*9 zSiWrQQNks(?};T&)2ZwlS7GSk=mEn|0N?&l-PF!qVM9s?t)Rk>G2Mz^^fhb`y?~M# zde0xe9@@q^RXv#W1yCNse`hxO+Ge&ziVO)o*k>!|X201E=aKr@1=*t1^Lw~ke+5|% zQ%q%AgsXMI_cBh$=YT+|O_R;M(?G~%KY)`hht#YOoB|$q{H+z28S%4H(_PWfu31+XHcqui* z8Cx?NO%zacc0pFzd1vyA;ox&z=A;AM%Bu*vCqF3i?guA{2Ob)SL!O4az>K6z{n`K@FyM|-g0Tx*ti(+GYYl1dK@A1v*pPL4 z-MaP{@bi=}H9K25!SpYeA~RtP<|Q9trY+sbQg92Us_4@bLMY4qDbk? zDolUID)f4xUhf%t(YEWl+)D??1gA^d&4zJg7fR!%8TQTtxA@-wy5~O=RqV;e$6nLn zAiV;6XM^SL6Lh4~v3#qY>WW+_t$bFM@Jp@pU>XvL(k|w!b0d7j9C{5gW#A9Zg7g}! 
zt~lusTh^#HKj_U|OFq>uSnMsK+!4xZ<}H1;pm~RX^GS_@Kru#k%g?{wUmdpHAnUZ= zn0nbfN50&`JDq@V;k`K=(Km(DPpIZubg(bSnl`jcB$Db%XZXVI?`60;U?NzK64M4~ z_gJV7G`A~t-;{eFW2lLv6Zx4Fe4@)TUt^d>M!ibhPjl{t3)&?c7A5CAM(tDmHN@Rx zYiycV>Scim$HATflh@P}MQ}jEclMM3&Xf`=JqeK7v!h?S5nlI##ab7)!zqq90{Kv} zZ@9%6XnJd<`i-nUe0rVP!6K>Gk(P>eldJ2>H`0^Zar0wlSZxU1a|lsh9U8{B{0 zU=wAMqFHlo?ucHITu{$}P~+#t$;Oz2m-X9V2$ez($V0EEwC?QoX3|j&V+Eo+-`?0^ zIQE6AmRl^4Jsol#9bkEV=T0~!_8VTj8#CWob$jwd*z4aO7aFg9<4-?4z{4m{?Xp~c z`_I3n^qhDXTd=iax8I=#GSFRjgwjoVcsoQ#qy3m?8~U?Tj%#jha>{$`?B(B&HQC7j zT4K{2G3?M8UBR#%3p337tM<|AJbVJULo#lcq}ug^p7pO<09l)Xf~Xv zrw&_kp>KdO9i24l=lRfvxL0fAX_3u>@6@iqR6IBJ#t^Nt$gG`}jj@?jKaFLXSyirJ zFUE<|8gPug2Q<)Av(`w|JW?A@Aj=h8BC>H*40^yb`({IT{8nEf3OIgxfkW24zz>J%y)3j#DH3uOW3XF}O@ zlogVi?LmSVgmB6)2vg#@o#uH;7@)|8Dnx$<%4#CK1!``LvR*>aj;Qs`SytNL%TMu< zlXM2Qf`0v5%BH|j`y&6NUb^RKtvknU<>>^rfoX~euJ=e=Fc5PF+!zyGUF2M!Yr@o) zWl8Ak6@tt)(Ym<+bSV>O(qIi6?5r|qwe~~Y#6>WV8p1r7g+OvnbO~JR5KV_!YVDM? z%(y2B3B;WD6#a=cozxBogj%CztNFaFEbJ(7-h8&dv1F(T;)FdM$VwjkU{2cD*5Yqt zpxoh*)E0UNl!i~=ez<-H?H|m-NSy96d*_Vzf`EDc0c-QIc&+qt(!=e;F4GeccB9`Y zoCG#nb!71549Qzb;;Wn{q|?p{i!j+M%ti|848ln5Kov1E&?G@}_aedaT+lvTVAu{# zE#kqYt#}LIpsjCH>fd4D9bv=%*ub$HAQXqx_)muK@@N_-B>toxqqYHpJ%ZtKqUsO_b0pO%T*Qk;!4;Y&us9LLoOH_J2CmD~v;BEc#>_=NPE4xHD` z(uk{Ti>-_4g+;2Ad?G%!ctTC~lJO$nLKDnp#9|5x<-@!Q?acLJ_D!~0ZbTgmKjTAek6vp|ahc6G^ee%J2gF?-Wq}=LnLE5u@53kHqCTeZt^I{*UPqA{BoGlM^Mb zl&|#%_L2@cyc06b^yIf+19D%0?YX-Acuu zJk6LwWE4Ub<&EwNzz=gkko&s$e8|s5`QRwd51#*W2-x(?zVotuEHBU8&#X;AaWwIU z5l{Z-HesUeRPX1cM9Z=CkNa$t?ie%&xE;kFMi?Qer!JNGnsc9bfG-KOh*z)|F68177>^n&NUg{eFQ(6+75Rje%m*h zHyOy^Vy`D%a0CaAe%rvHYz&~d;$kSZo9V@7))DHzJx$cVFCQoX*%RKD08rmB!+snZ zH}j>H5 z`1O;Q7D*wjyW#zU(m1w4pQ?g=n`2T<%z}apEH+6DPDU8~k z*hMllQ97l3gkZL>qSmKQfwo=M_Z4g)^FC(7SF7AS&$~w`C_ozxV<&$nlQ_h9(!|n2 z(Uua>9BnmS>xO)3C(UTTjN{(Zl*|#66k#U4z8x2u&s;<;ld8i>%%#~$m~xAAe9gRM zpcy0%?Rjqi73r0MHvzH@9pNHRo|OBxZ;6uc5g0QD9T}u7PU47PPoxJN(47CqT^h@Q6BjAKV8=_exnSYORW!BU8iMp zB4+7FP%iX;W++C?M8{M~S86R6$ADf%$LPb$hjcPGam2$u10V#~dv^A{rd_NnwCj=I 
zWABu^f}5~5Vv$v6a-J|C;8%EHx=iBci0iyT3{{MJ z6gn@!&TYpD?ekrzk>*jo zp_(hc33^KZfKu+1gf<$yVt+dpTCMv)9rUtt$e#~<$}NKVUxZD4ALNRHUrAPy{pMovGgcAYi2B=f}w4asra9;x!-r~sh%AbeIAEnUca4qSIv-WeS?kvX=`kdS z=q^_H8Vd#E|&HndztxDf(@B* z*0Q)KzLLw|HLRO%JVd8{x?F+V8!oBZ>D5$KUr6de5&jw7f$JzY8#!899 zjOqhf&C6P;03MK#x_s$W=A+v}4Lk;c@37_Kw1ZFS2_;?*I;$t)WsLQ+mk7r!UbN_@ zJj<87^t5{_Z@!&4b2H$B^2->XC|TtGD#mON7LMO?|IW;*W0=lnAMAF8b+q$1pTH;Y z7eWZ7w^D@cHNC*nSDZZw>2Dukwi+nok0J|M(($bL2*z2njam(WhT%;{yXop9C>I-B zz^N|@YN_n7XTWgmiC7t#QAM(t6VcHRKyT3LQaD4l|6^WDpKWw}z4E)!`BD%p6~)H6 z1}hVTa#ik12j)Kg>zIe;(%9D5Leadlm{>VnV%E zQJ>;E6tqf~D=7n@KtLGJtYLpz%S@L0dR$GH`Jg~Y%!Dp0h&Mz4^t?M?e@JLAct@;$pr^N=({s?<2-RZSGvGS(lfMkwC;0*MO|kMoaTxMMb)ug^ z7>qgUs_6px)R3*B_wd$UC|wTr^9E5*Z?0ZIU)+6BWVrT8l;chm8rfiXsm@#YW@rxh zf1Ms(^efnit%A#z_lR(rh`%dm4qmg)AlA#*Gsx{!B_q#`E5mw>G^dtytk{#zc5l{;gJF z*IW$Y6|ldqQ!{}w);dBSx@+HiL9ZE^fLhQCIvOr+Rx|FqAz&O2rs=5ETrzR9YFX-D z#J%xOYqV1*Y5ZCmrC<42L-P5&_Z+gWC1SrtFR%zhk1M`)kFMqw(^t9>tg8#R zp3yEi5=tC--n4C-sHwVgYfaYIdhS=n@Q+#7P&7N3^y#cioiARif$~OGQU>i??^+A* zOwYk5n)6pjz5O#o{|PKl&!65dTO=%mY<-&!ytt(UX}y@+sG7MruW;xg;_Zyx5@@j~ zD`;Vc-Jgx3f$dIok^}33>^u)CMSaB)(`sw(3H}fJzdXkTyK@G){V{7)2^yo|Rx%QV zZd=~D%E$_uyPu$?VScV$!xKh)?U=dPLT3&xeVJ=^VS}(vI*#QHzQw8CtAU&V(c%b% z|IMTX%vP^WX!LSPo*pHj9;JM=+u2>q@3zCiP?~MUa%;Q-n*KF%K&jl5>l_KS5oKjR z(y4D=W*gR!e+7h&Yf0-|ie704NISQQ(X8dn9=K^7%AQK)zIwBDlg-nfSTzHZtl_Mu zt^7PBP4Q8?17%7Umr1PbpmI?}enpNl?zj*kgg0JBWp{i~M}XUo;?@L2oL&8vH( zsVFhrNsZlFs6ne}!2m4Kh+#EALYMR<ja(CSl@CApc<>1^liHMN=L#`kR~n^;=1VU5dTLgQvC**JYF@C_w2w<;;f#1 z|6aq^P;cw5_tFR&`)hdjhS{~IDs$^+mI#RGHU=B|h~phR{5Yt$`J_$v+np0T~tO-^SJk^erWf3h-`-Ku34 z6@Qjr8*_!*$Z^yqwz%ch3j!5`YO=-c4=Uat`)e{ZcDOJ-OiU01NqzM1w9jXU zKFGaMDj#|}Ee6(qfj{Hm#Du zmNB9dO1g^E9omK&>nXV)^(E-kuECKK{cd)RO8`Hylgll z<1l2f4>@3_>5ohNY3Mnn=Q4@>WBBfH77^#6Q?Spa399}v9~~lvfXG5aj{@1s0Xcm& zm4%~+sxjB^cVm}6oF-|p8aBTsFJ0|sW7f#wWH@$EfFogUP^PMG`WcnWd)>RLOaz97 z)X@I&(9xXc4?$qD@r5g~EThJzlOGL0Zbp35SlHs`uMzfT1b^h=_R%b)zxZFwqyE_56+CBh-T&H7^`g;R}UQ9led 
zk6rsv1m_A)FZSKaWcIJ+>ak8+kEWYGfF7v1T=$s&a=*QUn8-R`f|9u{o$Q&}_=6+O zGm;bB$B2|)uX55y9a~!v(S>rNl|x+&deGZ&AS34{v{@=kpEx|fp~tu=IX*M6X@oM# zluOLu$7Enw05A?5JRUB=R}^v@-SOK$5K=r^7^KEu@%B`~1 z<=4*a)KLahBr7nhXU1jEduF%IYL%Uf9F6A)Mh4>zog6azgh(q%+$Pd%^^(W5?CnLxHFl6I< zpc5AV695$5oNo%5kovQj2?HK)OHa-UmL4ele6c`0N_6gKrYdV!Qi=0udXa5xul$PN zuG7E)-H$))OIy5p3^dUz&KzcXuN+V96C|56rJjJ^)9it*I`wwU+0k2Hlwd!5S_d=Q z^@;hhgwI>>*SXZ#)a`?2un*@`VR}|U$Y33wjepQ1gRDsw$|jzzfh;w;^k*vXmQh6Qx68)>`k1qWg4~T% zh=<^mPvj*Zbd=P^ldT^scXEjT@7~GT&=kc3*7JyXBSjVz%0g4R(n%KCdhC)Ek zf<_yP$~q!7EMPOOR$7CNkUNBA)dns~fi|%NscWXGQ{un`MkZDuH6?+r~5sR1n zBG~Djuy`^@vbXLT@PP15T4o|uWY_s9)LyWE+|d5V?jdYx75WG`vm`%5Y=n>HpSv(l z75v;lvvz{N4w-e7wfU!j0Ri=%hxrf#%;5 z^`0<+g^tD$IWSGD_ps$d9A1OI$H?CFc7+)M%mEd~1bI28GXq!!F0El-!qqll_EnVa zVmWKPZR6*^cQ!HHAkY^RL%1z-xO|$Xh*Fs<3<`aT1srMs;sokH`CAA70sSW_r5Dd; z0nT`LRIcV!+=m{1IXy7Ve8RjAV0Ywu9#gxD7`T?yvar$6*=a`PIzp$`im)=)V^F0! zC*ry^5>3uqsPu|I+ym1nZt&JosD-VCFayJU)|S9TfP1UyBXB7?}^#u_)*R+H=Jd^HUt^wp`X17ThBVdAF3Ye z;l){?#F59;2p;nnc}?6Ogu>;ZAB*BL9qRRN zA-(aD0h|S+IXKPfp)1aF3q&e3-lPX!Iv6ENb?Mof=HN#_+6a> zy;LvQqriwnvdVX?hULiSS4C!Ju@zPEf-A=*jJ#>Bta}|cLYac87;L4mAwC|}fFD84 z`N^d4C0GjF>La3I#*xET<+Ei}$7~VvU%DQof=8GM2JO`7&h#N@!F#bG6~o# zsF$bs;O+yYUC=@z0qy#2g4lw_gYeW>;GnqqO>PSOIEjT^h8BNgwVeuQ$vXPhS> zmhSs>GF*mbqGf@DoMzR*E$h7UBLK-_g$EJADrfwg&C>{Bh8Q|Knf`&9Vm0zjEYo+c zjvP)-2HJ&Ii>@s?M*R8<^b%q;<19lq#gRwbEaPOsC;}}=9wee%S#96eI+bs#uTK7Y zy-4lT^?@u$F!NMWm-Km`eK+ym_lO1?CQC%sp$LWBDB#V;r+^(d(By8eLZ*!7>E4&8 zI9YbwB`kE)pUb=!z>w%bqysQx15NGpitdw`wo(dY9p;$8b;4nJ+Jg6Vh_R0)C#DxxXx z?ap*JXpx+s(Z|{LdbH}~Y>6)UUauzH-It}vIE-(kSJk*qtbLRxdA=^X#56=U84Z8D zX#z`uiFT>RDp?h{#@^QYi7MH;vOF!O#g4L5XZMVh3-1ZHS8(bJStr{%s{ChUHr00= z^d4Qf7YNo{nD)A^5nQFP{TgOrj@HCB(qdP-3|!K`>{rUR9?+3Wx*fFQG~ddM`r1%= zmE&*O4VY^uTOn1No!79{u8pE%y+(e3dx9r7jZ_{))kva3%5{hDeWpsTN+@*&k)K3U zLucE;paw2$rys60?=^r4o2@~%b6o?gO*iZtIJ^e1Uiw$Zx9otb$(w0z(5WjF>3vj~ zi=iV%iFu**W?&4_p7NI?w*ra#@+Lu{sXwU^Qy(>F#*!AL{$90;xZgq>^aZSg!JS%G 
zX^wy;fH-R}sCIa9-n+rJh%$|nC?AmMmd)xxB;3WjfPz~5qCO0H!w04;IxeFDEEkI) z4lc>(FlE9%5M_VDEI6Op44k+O5@rSf4vc-?Xf`DG=p!u}`Bg2ZzwB*fWq1^$ZG!5F zqc1s<&W|A+&}Wl0LaUTJI#%{-t@@K@o_u*{8MCkTb5{w0T1}a_zgJ;gOpUWrIiWER|+mVh8G( z?q7@cP=FWU=6MSKyqNn~OfQ*n`MnNaByMtih5xIg1Ds{00OX#s*tud8Pc@q^*4X0u z`8eCs_unEW&}p^4vFGYWK0;eGBde@Tz-Q_#kb2GEyZFZ0&(M21}w~9+$+j zfG8bhSSX6 z9r=LcIfWzX++b`pd_)7b_I8>zCkR{)rFLl`^W^RXL){Uj_5|jtsW<7MIBf)aM@$~@ z-^9Az-y%@3>t;TAG2C4GVM}+9W z`o_R!D_p1LLlo=X&6E$J&$V=^@Y+2ttl%b$`;F+;2Uw=%BA`X&>euBh`gkx{;zz&u z%O$fK!Uib237$!c&Y21Bc}X0?48*pOEl7ucFfjMmiJ+2Re;#O2wQD%(u9ax^Qs$$~U+ZGpyH61sj7!7 z3jyA(>In8rOY0ixM4buW7zd0SXe`{wWtb~$G$jn0x_@SkJ|~`3Y(EBVXzW>IM|H5}A410)w%?1>cwvzZlch;YN5*|2qqb$C65`f>JT-6Vj;SQiQ%`Rb zm4S2;Dj73-M2iKy#C>sm)5#_8kE;sQ$@NsYYy2)esemwEW^sa<%KbfBEvvJSav3bjgbCKY1c|?xjuBl1o~BYNhBRk0rIvbtOaa74 z@;^#9zsh-hFE8I)Bmo^pH#6Z7-UqjX{=YgMS9&f5LW27pASg#N0G!}&oY01^kh4K@OjFh0Y=6F0(AwT01 z>h)DvEJwHZB5s<5sEg({DvRcIQV7GD)yVEte{>Wdwvb_3wmvyRg+`;8s2pDk{tbRa zcz^42(~^8$c*y)xIkJRE))fa-RM2>MFd8ics3uy@BKoaK`mK5VU6Vq+*Vxq->wZwNF27k*|_fTa)1!s)WW;a~eZkvh}J<|v;k z-CDojQxcgQSP|7^u>0AAmDl@3T(g#PZb7gj0DkSF_tC|cBpQ6-O`wWH^k3ADnB=se z6QLbQb4Y2_O?O;@!V=*zzRCKZvIGXEHec*+6$qLrywBKM6~fN;c;vmDeil(I5FAxs zW`wgUO(IfbDM3au8U8^ITTd>|I|%Sgw*JdG0?21AId+3vNqS>0;v@?n1v^sL4c=XX z9v-he0!)N=fUYeEKFp7#K&&-bp-12kSI7MS892!G#*hMvy;N-@=QP&3 zHqfQNk|oN(YViOjq9_!QUn^E9)DAmqnu)kDP+*#|2s{agb)c)dB?PMsyyz#)rC5iD zRKu_zwA#$Nf!$QMa4V8pzJyp`a#&Igs4=R~aaX`{ABU58tVyY33u%t2^$Y9LwX*~{ zDy#oTy%W{4ilqt(K-)j09h}Jw!rF)W$BMV`TyFF0h}Vc+>sGtRdPsfi9>30E#Ssr` zx~}shG*Uo_X2Fg$rjY9a5HLvv#kc21qKo1M>EGVc zSMNta+Pjzv{X1N$AZ5u?<2^up*q?K+{V)>G4#Us4G%S4SQ9*_$cd08W)STRS7!mL! zI(H&sdZF#bVs*%f`r-U|zpeDqufTjL1PLkC5d|^#Oft=c=*NHSsRN~PNa+V1!-ZAx z_F?1R+LszATCUokUg+90a%e4up>@ftkFe$IIGnc%b-sq}ei8hxh`7W}Q|X@p!|?p| z8#=?$9TBBM;#)gLF&9`OGPhQB#7;9E6ZE5^d(GU4)*8o}3lQq#0yN&xdfiLy!uTBFho%^W#Pbz{4K#K#1E!w9ppdsTuHfFU{`W*;3 z6a_?lnlnLNla+eE?z)rd)HytKe!ysj(?yj=%AfxGWS=B`c#id?4sqn|ahe&><2I;U zbK#snF1=Fqi{IqWOCst=;IjQ6PcV(zH=!-;@-imeBo+JeY^l5~6(4yi)7Y9TMF^Y? 
zfjiYgtlWAdEIDqbU?HEvu^)0{rM|th8ifSZ!R|T%FAar@bO3|BzSCs%i4R0Og(Kq^ zrAxO=tGv&QsJLWO5~;xR{yr$vls-PcOM#3m3;+J98uOTBsc^K?t?S9 zGq^+W;2PZBgS$Jy-Q9w_ySoPW;1b;4WcTf>t=g^n|JT+1_3bmK`*xkKug=#~x9^Fm zzk(k(G@Q4BBXFgxc)z*0U&cj+0u{2hRA)!I%5HckY9PO-u-ggs=4-u~S5pj_%V(u& zM>@)7NVgPV=HvR#3kdZ7q?w}Damg>tn?)bw#3C#~;VI1=&fQi{VK&!~7|v@PQ7}ar zlzqfrop?w~yH@+2AAnX!S8kQJ!3fOYxkLYrcC#J&jxgW@Pbh6_dQ_5$FoHkZHdssJ zvK8Ktm4q^&E!ENzyab{2uyq`xEdU|_wlfo8UrVuEH3bnKGM-X*i8m~ge%)B+Ub(lAPHToFF7dB&`WHIRXY`_9 zyB}a5lvlHJKnymyH;ibq*pLJ;D=Ht1<3cLA$2wfsh}|zq#?hPEH*-EJuI}p4q7ypo z6EC>s-~F^9VjbutV-^hAu02D1X6|o~0u}V?^OPnNpk=B>P=B1rP+mzAVK_d!{o|}& z{MtrLh%*?g1SBi6w&ed&n+z(dkq@E7v@N_fl^8-K#MY(ZIJp-T^v|#F*{*+k*82I! zy@Sp3dzDG)!4>xk!>~e@tH%~p&trcVM@_Yg8A+BeeVW_)@1eVhzA3&@<9d5WL>xwA zcc^{MK*2P%mM#Me{r5esTlu$l&5v2#giui2tcE}` zX)h5c8=c?R-siQ##ft*YVJg~BifCf^c=?hb4>DAHT}ppuN8lD7lQh7SAcv1i6G(BR zJsq_AWzKL>di_bU+nn&RKRORSK*zK^Ce}UOY|G5RQ<2e@x!6~1 zjLU#4ioQ(?_`-UkZZjvb@LN88olIUKph&NfVf~w-|3}Nju`eqRCv-9}KV21ximE=? z{rRYHT~I^gSAv@{%gSAMqx9o7q{19k^C_?4<*i_XyoDf0n4W7=4uI!@Q?JMIl@h{Q zOTYnMLxpkYQ{7laA|pmI^BN@F>f>u;i^fE4T!W4et1j}sD&|UV4Z^0Reo4c3 zoU6O@p?SzYRm_vxK~2v7P+LE*BUHZ=MT~pgSr#lX{qb^MB`;L$9 zGD-b15Wj$qO8tl{Ep_XkwxX5ubH*8Ya+qq6hk`CfwA4@m#|>AF^c_eHnW5#qqLFX(uo@#j4vAg z+!GX~%fiy=4~Rt9IWUt#(N@DvB-n;{4&1}X%-&>b=b(nWJeWWMo+K*imH`7EyFioY z?PQUNQWi=NctU2Yg$OO1(q`z+gBwi)-Z#d0HjhcUdB+1*ROo|CM$!bR1QT8|`E|*# z^!f*1mxcvfKebm(Utvox4>hrlcdNOc853St5$bQg@Gk1m?3pyG^;DW}5CvC=V_X7A zqB`=MM^3q-4ZOj%?r+F>ZV#*qoD-n%aNh#|r#`XAVYRM!m!$3r4AlvUn`Y8=13EAG zz>m~%_x^hhFCZKUGj47ow-qBzFNdi}g}&d==bbrNKCv20JUQ;I63M_4xzUD#?Xln% z-RmFgi8+!rH_m;z+VL-Zll- zTF3j^F}-MnS0`VWY;sJ|W4tRCJ)Q0}c7yU%#io-@WX@X|qMJ5kT#Yh20AWlG6@uNi zMZYu$5}B063hb*K=bnZ;Jw zltJa}F?6w<9&taTgC7sA@pAM=Ya&ykfbZW7wbb$qx$89PGWpO8d+lB^eh9 zoP1x0ZeVpJpg&mi_=kS*zGoRB4eGQDR(&#$Oersa{?jcIX#VAAye?_keR0ADr0)$HK;U6Wqc9Lv)08(h_swv>8uyjeZ9YUJDAUs z=?>@c9)ZJVDcCVglaZ*p=at z%~34U&XKk+EOe!Gh~}S~zWyjg!VhOovK?Ni;n@V5E9A08KyCuY`_FQazerSkAo(qE 
z#$He;nKScXpvgd#bDfglh@{$M+0b}nj5=^htln&z}^ovRs;p02~^qVL^JBB^)|Vg_(Nfa+()tmUS_d*Cg&`~bc?_t z2Rd&rej$Jm<+WfP2TgKo*Zf}3E$^30Lve(` z*Z%HYH!gqgHo(slCTY%GdociScecFfP{AU9gruK#vXV*)xwn3G`APJx{$NIg(gN?Z zNxjhu$0g8FKGeJ1`yA(w+vOt`75OQ{@x<51IN(XGw8IAi*{++|Y*+39bkme&M}jz$ zsw$#&4BEu8(_>{O0mzw0Yx_LD!e-50Wd`=M8vxpEBDeEn)mr3J>_9#K)h7daX_AA=;0js7DdL4%EWW)u z*KW|0K))Xdf}oez?!vbhB{NfHJe`rV);JcLriX)}&!74y|! zX-m2M2&>Eu$YPyrQ!lKDG2k4FC|oxcDg(0rc)L1Ni05{%+=P5@J9Lb}C&W6As1EyZljOVRWGokm7ZkKDnziCyy^ z&WGR)5LLXDZ--+ zGoXOLSOhT!uIEyO{l;S1o)PloT&M!F&wKsdhBr}#^kt(&ViS==QLokm&ZF9D$ZUFV z+zJ#GHNQlTQKi34{^c0u0?@$J>JPm|`IA!}@8Q)eqcv*#Rjcjva|+|r-VQeM>edg) z(q%7DU=rDPLV}A}d+w^6=WM33xX_lS#~>?Wdu0OqMF2JAi3_3x1b(xSg9^Qf_C){k zAjF9d3w3XGz_0;o$ZVNm7O9uRYFx$i1Gs-ZA?kv*v_tON_IqZ@2j}wBSd2G3%~!I{AwNJ1 zz6tdb&xj%$2!mLRc(lcYR^3BZsM=^7$rX33nis{P!QgIUW}fP;&v%y-{C&Qs$#K_94<*w-2FrxC^e}V-n=4%I%&9| zG5SDk^uAx|xTA1x|8`q?R-K#mWoFdVPu^Vt__Q zIaw1yP>)%7R-fL7x-T?#r#kU^)uyTjCsIEAr%y-c_f-z)fn(X&f-DDs29fJ&RiWPB zT|(LS6Y4UFSHQ2G{#06mNm%nu5Zpv1^iw6F;wbvm!Q7|7vmNBJy27ba7DMcO;82*z zvL9JcsLuUx=H8GiItHTZko-|gTTXz24(H=_%sa}$hQNO-BFyn0MT7w$b}qm_7X(Ie zb1Nrf2LPkEmA;d)sIj4~kud^4KZ2u^gR#Cfg6pcaimdGp6Kcnynk!1;>c;`OVYQ&x zjKD7hE?S8e4QgLxu;D*r2(G?8-=?S5x2d4`fd{c%rDG3uKHk-5EjMnqN;*|9)bZRe z<=V$zKObPMttops^0l5{_q8lG?4CXf@qO~DbHR4jswz`y>RNNVF0({lT8yfwmJIS( z|4xiik#D2BRvl(Z8M1p91OG}4gLj(zy1%8Ni=GfuCa)l7F0gL5alanGa<6GgtnOmH zLOe6Vl@gA}PI!BE_l+l%ydkK&JFsk>6B~EN*6~jg!0pwgvw64d*KLnyRne69wI}Yz zm;G-{UIdz)r)_G^&ywlcTDtPZ0RhN`0RzGfHiy#gDBf1lapU*lvpZN$A?nmg^#99d>qb16j#;=QqCFd?0x65grGqvIvjEh2s|OE7?O`BQ?Ozi+q5IgITPsM^_3j>qmjZ z9=@2Nm~R%9(7Feqc#Hs9wavUr{5)=IxH)k#0l;#zr>^+4Z2%KF=&VlJjkb+Id4y|E zmo&pXM7tGO11-nJ2|Wptd&zLL)El!0RU+yngD8d&x57KbWzp>!c7oY40+n{m+=Wgp zq2!U4Vsd)yUdej`>>rtnIKO#CmQG$1zYUkZ$M|9Qw%K9L>gW-|tckLU3zVM1xF*Cc z4Qid#=p$9v#WL3$qZ!UmnM5Xc?=_Gx)&?>kn*t4YjMfC$jCPGi=O7Y!fL2$)0mwODg*{wxlUqwrZfc3ax-p+rsM100v$Ab{5KpIvp;9wQuUi= z0@)5LSO<6}HB3cYDp>Qh7tN3Q46^CnYj#-O>vtTH==jkkJO?#ZJqJ7McUTu2c35A) 
zd;5OVKQ;rVM@>7dPMTM^9a>koof=oE#}QN0wqP?|ziEY5=+qYW4(mhE0}@}*gG>wf zdW`+1I}ua;Qx8S}I#V=vV872FaK+x`ke~;7_D8K6<@m06|Krg2E#-DqBx0KAAv7Wa zSr&moxs@~AHhe(}~M19&W4=2TiQ$+ryYTdAEi6Dm!^a~{q`Ab#Y z)w43#>6?d%hl##rH36b*(?&v_b-LDt?Z4As3!TwVJQTM#THoURo#$^Op7|UntbA~y?rilwT}yx8 z@R)778HHU~%E>Fxwc&LZ@hKj!TwH}+^wL6xt+z)wAQhXSvTv)hm6*xb$z@>5wP-t_ zBXj43y@36Zy(w?{efduti8{l=%ZPl|#lm>_iwj7VJcLbA_)V|rh)J5&^R-oKekZ@)4`lj* z)oo#IciP{c)k^X`G3h@2ko^LwM2bolYX`XYn}#6EX&B;WS0@uyDy@3dS@RA!5S40boS^U+i$Mutp4*oT% zG)}&E5(}l=;H!uTrOtHgUXVt2a)ln!gvJOCkuj!Wnfz-yjt6Nuu88U6lVL@A-5btJ z#_b%@sD=*`AJ>UIb@DCM5ufQR@Jho+*mE&e*m06u$#XZuSv+Bkn{3oLGuWH%ZY2+! zmOo#DmVc=GAEc}>9xy$XrsX&MM$6wPPy3g<;uor(!QQm|BE^{5Gi{jJ+v%9urJWB* zuVC|}u;*Yan4sPy`~kY5Q`1VJQwiYf-3xnCc0S0wM9RbTI$-4{3%^J(20g@hsfWHR z$G>Arvp)&|OQIi8=2Q^U&cX*cp;yv~pS^^Sb~?ce#$l zRNhN%F^I)(gtz9YtGCai2qd37#&NY=ha#Ej+s2i8jW01$@?CTsJ0XTRj}>E_K4OIh_tW z9qLXG&%rr$@0&A!tBa>omCNn^#U&u?xoWjlVa+7nbB<#-lgf7W*sf|l=+@_l^M$NO zVNOi5mvfitZjMQaVRZqqIjNA3^@N_(JyuYkx$Fz~dckGY&1@BH}m0Ockpiip>r{X-RKq6vV-o95kXTo*Gk@3CG$Zh!CW=m{XhlX za;PVjUYsnm#nKwAiAA%cnWORs(hPfoliAZGaoKN*)zA(ic1|U^tm9nRw|)M*(~W`N zgxHrBsjT9!#y##Bb|auapr@gmP>@0(H}hkK!Gq!60%v}eH=ft++9mE6`vht#gP5GY z{=mmbTz;Kt>t!-{Ldp(zqtn}d-0L7Owsu(Hq8`HFCF%C-m$-DgW^6W!3X*|0iUJ{p z&A|H&VJ%^A;B4yNxZef*g6@wP;Du~>M@@x*s4UMq;ZaS+W-pJWn=RbUn}Utpq2j@1 zm67(~TvQIeFHeNp!HrPKq}RnJf_wN8WcPxt=KZ5)W-B;n<(r;gw;J3j!}vgwYkI2u zCl}b*|BVY@9c+!94Z(7f+Rn&C*;t*PnSqUofrTB2z$oPGWM=C?4P;a{cd{}D@0FYl zEQ}4Ez%3E5)N}&>rp?O90p2*ZnV6Vl{(}BrQ2qRzhM3s{^9j6 zv)%u**!*o7{RP{tN(a4ZfX!^l5HuBMMf%0BTWg7A6)pCN3rpW)^lX zHYR!|PAVoQDzKldt2#meBfEj>ZU#UzNqw8N{8ftPJ#RZ2rbV+05J# z0ABwY6FPvZv4bPHYXB<)GaEBI7YheFD?JMX$G?IFKEYTWAY<-g3;^F1E2F5ffw{g7 z5MXJi?_}m^tPgPczvpu>urh!{3s&pKHb%Av76>f=_9&Fy?2G}7@&*=4PSyyFasXD& zzgpkH(Fwr90dC3|+n73;0hrmr&+i}0|NJJ6dg4ub&lN6tu)nP9KTOlk6$pm zTOLrT$TKs8x=pu;O6=gy6kD`7lmSgmW= z*2j2W482?YsxY$+yHEh6l%1g4=6c0Dz)7&7>eC80Lq`bLmzQ<6MC~)MDcYCPL-7uN$K(cTyit<65e~{^d zGw&ErP+|06<;{0r_a!2ch`e?WB$u9{RE8}pWNQjO4m;4>F0Qk%(_uUPEXmJn@RHYT 
zsZML14{|v+m`+&OOO`qRd^idI8vN~*lcXOX?}=^P%*?EL#Y;2Loq8uDI6U~O=%;OO zD8*J*uzP8K&h{Oy*wa9xj{8O3^oceeLBW6kvf%+%J}O*EPVBL^iIZlJZ$2UCM+^Or zL~IJPUn=NVEhA|y(!>@JMd36yq63$gBaYaLJCfhZ*>T@`;3QZZa@5T*{Hzq?j@eG_ zgu2O0F-iRWE&}J|OnzuchW4M?HSNtdJk!%(H_kSYMdM_v@^8~YtI{uMwQ_&?Gdkmv zgj1=GB2<8D$kN7aJ69`eMJ`?0J5$talZQ@PsBx&EBQ?Bg;{cf04PCwBHfnwk8j=&t zVS382>YcC(fRYs?U1xSR3A~XO^qmRKL;RX}Ahhc=6#A3{U<0AsbD=77FUZXu`BUK+ z=21WF(_tEj)?@5aXQE1V4(dw2Sq9JsX-MGYQ94Oan9C$+KV#MO7I_&;MuZWV1ciH$ z{GLpnb+?p9{H7iv#+Q;fSCm$Wc=fq(YtTxB}vV5&~oGQdEIDE0iH{3|vk`Bc^=mb|ChK8RiXN0Y^bR?T_2x*JWF-CU(d>3(anhi5kN zrHHFX|K_!!cdx*uk~fxD*N&R2b9gYyLk^xAdSp|R7r*G&sz>LM>KU9%v=7SHU52rg zo{Dk^{bjwNeWhc#!2#!Z6V}~a;FoN-1&qFO`!yZ9&!kne2?jdJ-$OkdS!oeA zUP>M^d=3R3)>SFT`()Zvk;?SBa7p<0Z=E3#OCnkl&T;pT_P21W0-R}yZayyj==Ia7 zxeZI@<(3ntbOZ{5qW3SKIVvRnVfXgq)QjO=HB&s*#<)(t9G7+D6y_J zH!UL3b8!DIme08C zp{4Zgon6ItS*mL#wjaVjea+7ggp&jAMFAl2?fuzPn?cBW9Td%0^`pNwMO-9u>R{qI5 zO7tmz7WDV(7Wrr(3Xuxj_Re^Zc5lnXGnIWJ7b7BPzQUN1y`<&;_dax582D0RHBpo>b z3OM3NgHq;-4)^Z$@Z=;wDA|<*@Ubys!c)cb5cK~%nv&yBQ&f^0jeB55DuHNSF)b8M z!4f$$0LS+yGh?^fPkzZYq*Gw;P|W*mS4S`E6M|LE%`?C!=IP^@(5L_MuHopU@8IO> U01hWRh?#>8fr3IzUL4{70B%nLZ~y=R literal 0 HcmV?d00001 diff --git a/examples/notebooks/intro/input/solar-system/mars.md b/examples/notebooks/intro/input/solar-system/mars.md new file mode 100644 index 000000000..f28fc1a30 --- /dev/null +++ b/examples/notebooks/intro/input/solar-system/mars.md @@ -0,0 +1,17 @@ +# Mars + +## Solar System + +Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun. + +For more details about the Solar system see Chapter 1. + +## Mars + +Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface. 
+ +Basic facts about Mars: + +- Distance from the Sun: Average of 228 million kilometers (142 million miles) +- Rotation Period: 24.6 hours (one Martian day - called a "sol") +- Moons: Two small moons, Phobos and Deimos. \ No newline at end of file diff --git a/examples/notebooks/intro/input/solar-system/mars.pdf b/examples/notebooks/intro/input/solar-system/mars.pdf new file mode 100644 index 0000000000000000000000000000000000000000..a48c4365b376f23dd642190daf000c5e35027dc8 GIT binary patch literal 57872 zcmagDL%1jmtYo`w+qP}nwr$%s&bDpawr$(C)&G0lchrNdR>eVh3~kZn7EhE}4B-Ea|DiP{J zdo>%>sWXa}17H)czney)VbBV*p`}^S)Yy(p21v)|)umKQ$U)AWT`=ffQ+^3zwO^v{i2l?mt@J09z7a> zAtSSB^X4DJH)r~JTLTI1?Qzg`3c1EG%r4yBt%5*kZgP5pr)5z_E{WFT_oLMAI}L|v zo(-Kruxcv81LvCOQ{bmU_T-BQG`dw2=Il!pd{v^BOu31jHA&f<*@pWR{bOt?(LSYy zSNjdLDU#92vN`!m;b3(0aisE#&yH}e+mhozj# zi&yG?F1@qmI3RP~X>P)R6<|1p;$88`_T(E?BW8f!*6tgC;NxFjM zkljXsal^2ljJN)68O{+N10Ug49K|$2pMVakoud*}pIVRvzyheU(PwH7eqjmXqvZWN zc1?mBwXeSeGvrTuoyL;V?OGx6gS(h8Wi=)Rw4+9BEH_`)k;}-&o>7y|psKEQmEMfL zeXU21+Zrw*kG$?xoIS`5hX2|J{)KrI^hPY+8dt;;Ym^4bYuEJt*Z???rkpL5(MO$A zh6m4&2@M>m(+>29Ns<6H+7`dhRI?C0h2g)x_YKP8hgS$?YG?9)sQgd;e?&7g{aq#Fi!oAg1NzWv0Q*$mdnqTbLfzlGsBWhcs|z0UgKU z@&nrh#<;lPG?)M=6tK?!DO-Rr*awRThh!Q9V^ zW$g#BsI<3)sb>VE09*l>j2eWB0#F3iH02bP+MFAXl51^cd;MF9&M!=Cq+k_@&=-@$ z08l9ai=d*W`tel-((-?`7lMsf{A~WKdAfLK*U{35)R)wiOAoxg;{?C}m=*8_Z|XDs zD`J0~n+5n=+B7e3=!sq72L;%Z!Ns-Zyg1j<(Xo)t{%;c7Ygr3h+P@3=iREd)eIsLg zz;~WaVA{YxfFl#zXUMW5%6omt-+v$kre+Y$ZUDcqNDUup3zx`~bNTW;M}OAW$PittI~2CVu}?KJHdQZTRM7rWS{!_fGi9OYLut_kZozzwegVnj730oE===)x`+< z*Qb%*+q}d3n3suWgUf<Y znT_|)09S_u_b?$Gn%#lCJiogR{H5z*YygOvg~#L*{pnEF{TrjTv9|y?eoKC@kNaf* zSby#RXp|=BP_0aiZ7;wWfUt%ZC)PRq5)Zci{8!igmysMBo0AvXfc(Dp0)ETVdQ&n= zw*Fqf;L%n0_tg4+@CVG`h|FpQh&h?Xg_rybKIwbfi+;&3K5ArX0`rhR#7n;2aW(yr zUY(a1zq-r;Eil$KeurKsM1`TP^Mio{QeN2Z8Pl-isR%=U?y>0nROIY;pL70b1q$=>ZsH zd$0Een-~MnOa7_;5H~Obo}>Kb4L?{e{rMvS^8yFJz1tIU2h2Y1iNFEmto%j5g$lW;@-^auv+v_Og>Ql`*^=|-~6ir*!Zii^N&VDsCRtHH#9Q_oo8rt2EfqZ 
zBQ!QJFnapbZwUL_tp3#`$V>XR|G1L_1OU+@kAhqfsm=4k0j`mT;%9tUEi*<;s2q9(VGJ^pZe z=_%=PL=62HYgDyJ`C%?j^O@p`q>LkJd|HmqFlW*_(t-q-G#(4WxzUwZj6ueySeqEu zGOe-hK%fhChM75kO)amL9@pn~NGs(Mz~RrGHzs=V{z$7ZdwC3E93vsH_3n&cd9ar_ zlz>R1rdR&#q4W22liUAHfc8{3#CUm5p4VkLxG+*#ok*w@l9ab0r8K3dZ>i$^aNw&hp#@B`?py7MJGg-hRpA`g8|gFR8f&~fb5P|MM;;9RH_ z9e}0aQr6u3B(sVB+Vm-YE?|XbJ_oDx{*dz$4eO0`LG*ztmm_%VhFNwuLhs(G*+*+) z;*SB0{=t0t$6ZTQZAWJoY52iHpG#)0qjMt26X4uQy*eVqStu;bfF>aO=|zref$mvq zuW~WPBo~ixAjCJywK~$xc*rDP@k}0b&60Y7)8l$?S)fQ({*`_dw3KdehaM}NJSbma zctRE!+arZ+v515{Zm-ux_`o-i_E0fVj9=OH3$$_GpmV)@L0g+k4N9&P&t8}v_Tr(D zY7j>Z6FL+7lbgjI-}{OV1wk|U>;ypA%WhDvb@gCc;KY^ra!fAH))W+#!_KM*0jJi_sXg?>l zxG8Nzp{ds?U-&IMuO9k@Tz<4}Tm-5__*8-O70Zw`QQN`g+;eCK(B0sd`#y^Bp(3T#;D{a3PU2 ze?;KPB_iJNcYJ-x_Z&V0wtq(H7ivw#yWq5b?r2UH* zi6v${izIteMT}S#q~b+aR>_LUOx6MLQk!9lhYz>znT!39=t{Hbp6`T0N-d}>rdpa9 z#@ppr+l@{|%6Rt^I+XObV~*Ol_ro;@y0)UU`As`B zp})A?>G`Dh0cw!5(p>FwbAe$k_vAGtng8e9_ZT*4t~Oe1DWb^@X$R%hrRy2g|<4ay1W&R#R@)1ef@Z>v{l;_H_Cbg)%$)NP%j zDwwkUOY}JIOjf1Sp&~d*JnjSlgaZS3Lq?H9Z;~P=Rq6*wRDdbjp4Tj>vtr{Z2UB&T zsbT->Fz27I5z0)!Xcf*Wbc@Ez905*Zh~S^uKjeXhl3sSckxtKBaul@9JJQ?!v0)ev z_G@GBx}NNET6k`?84NHJuorLJ&qJSUJcwxTS4V-Y$wb(PjkqX3Y;qF5olBEPi#?$9 zhx;0A=%q%9os1fOw<>0^gR=S472-4{;%T2PQdU%6>y(*VKsYa%3zMA%o%`_8Vu^?Z zaePS4M19!6QKS>I+4kEEJW1SSR!m3-h~G8e4a^-{CaNDr<6-HvqSP=Q@P$h{a-M>5 zevor;XCf@UefFp`5sd^(ktt*CbEpVVWdCJ+FYQ;+oR)>}OBYAk`g-5p(wd#fBd6AB z>l<&iRV&>Nye|9MAs<q?Kd=_Ki#zRn8-4DB28-!T@f|OHGzjcNa@vR5k0p)J1Q9xd9?6s z)KSEhEhbp<-<<925qj)7&S96%tz;ODdTp^DPt%AlY&L?ECZ8Oc26hSchI#1@Z10Y6 zsU2ATuzC8z)0XQ|FzRatCF*=sAU#a48x( za{Aox*JU!kwk8tX3rOkrnz4$5iUyezs5OLeC3KC`O}g-;tEY;xA&EnsK75X*%(~(d zMn601K{#iJ^<6M{t~Z*B<<&Z-kcj7RBE5}&!7v-?X^~jo zcmKS(a+eybWQlXY#o$u?S}aX6Su=W*L}hr+YK#59w)m*DGORpAgzzOGg;g~c5USNy&~V`%I60%1ud;LW>~}4e-{e0 zsmCA&_1%bXYws|MLa)JK&Bcz^3-w2`B0hHOE{cCd2MOTX#heBYDhKxR5Y&&n{w2u32*&Gj-pT18u_4JB=g#&|>aB zTIOKS(icq?P0QbvkCg|QZy9dbcTQlY5ceo1z54`EH7)Xj9T5{M)Mz$rVJTaO79!EG?MNM4uPE&CVx>@XsDdj3fp9 
zE6_@P-;y;$cH#4jn#qenxJ)gNZGAc37Yvk+jK`g`LyMIDfIi>j(cVJO*C*iNR)sNn z*p9GbNK20B7sy4v3lExM+|!sT@fa?N{N7Q4T1GDcVK&e)w~eWA42<>i&VU~VJ<|F@ z6cyolN8{QjZ1>WUyRIF?cgQ-BJh@5^XET8IRvc#SU*?NIMj@Pojoaxh1C*k?myG`tlwDgR`EiQfHs{!Jpn$VD&<*%G>u^L$i>os zmgs(9rp3R3v!M9Y2}NI;$ycNSv8v1@{a8KDnZe(F&OY=rFGwl>Y#`i93yxL6i1@RQ z14%VhWwB+NeficU(Z2VTnJl=vZ4w>|6R|{5>+zRPLdxUJWMRE^tJx(poNVYt%eohj z#dX{0Cm>1@wW}`bqpX`;C@)t%B7**;Fdlo{=>{cG7y9Uo)Gfv0r6%wbpxiL8zw+g# z`Sq5qvwyUIn6R?tkJN=xs6mmzeKtj*R&b-SB?Yp_4ko!T8 zj>3EMypE4BuH!mC9>>Xc6_2wi?i#Yp0-&L%UtRH|uEv&UylCw`nM)Pmost%ge2ieZ zJH^gzDuz}PH?9T;57ZJU#QLk#>i;?~KT}-nu-tVdq%tT|W|=TEo;khfL~zbI%)nB ztL=LnXb}csZ_f9rm`X1o&Kq^PemEjKKo+}e91n9STRuRL)#t0AK+&m;r5pJvNp*i| zw1=7sKehQvksrB)0RQMzg+V);Td#JvO#QuU*mJ;vaUoe;BtSC>|K5Pc<4TTGsNk!% z(dZtrD_bK;IE!F}^xwo8)4=h;-O}ED%jfC(T3Y9#lX#Y~)<_AjpE_Ze0&zgQ*?0GM zg_7{dx4d!v`k*{vA^NA7bJoP-X0xFPNas#Znv{U%or*0(pHoJuITvS=^i*QJA%H#e}kw!RC^8Kemd#f5UeQ(CBBGX{Qz7dX;(kmJ> ze%#A&Au}y|Z*fM``IXl~tmaFPY@$noddq&A&5= z19w}U^$?}!W0-$6Hg36s&I=$EHs2!ctNsCK&)-|jh%DsMXRGe_v~$PkrbNnoF4)gs z8La%$g|={N>f#9Mlq?x>Cxo_v!K&js6Tu*+WhIyagL`C{Gx^aA*r<6qH1z}wjBaSC z@z!Z7`j)1g3b`vnO|MURU8Kq?;>cxQeUu~yEvjNEUit(^>75Zey`N0!44NQ|acTfO zmwiMAdzQ9V)8N$lJ~Znhb4I_ViF6toi^4!k@99lkyhFV}^kBFB+Q2&98X{cYYlO@= z?5IkKzMNQ01e95nB^5-mx_AvNpyG}z6e+=2g#y`y*vLOFHx3ciB?@&-vGa;W*yru4 z9Z}BfGplW$OE(0OlJO1ZhKhyut!i*pua=3#^&;Zw{^*fHbkJ#1-tgx z%0C#&JZutSuC(yMsHjidzS9!EXoRz50Jw>|4AKs*zj=|aM#9DD5iY9Dbaa~Kt}m?w7iUV|v-V;A-=RhK%4$uyESfJ2YY1Q$Ga zZ$LME!RWc<-;L%DBALGOy#AF!!@##@S@!i-%!)MPc4R0}Kr~+bxpXl44AQ};wn)5O z=TfX{;An&wqPf){_QT~g$JWPgAuk2{sQ2HYGTEn%XLI*b{+(DZ0pw93Hw~Y*rSj+O z4o_)W73oOkAwk{#Zh7eA2HD!2(NrhW877K?AWGDadv_U$+Y7Q362+6M(nEZ37Z0Yw zV2rwtql7raA~Lel*oI^{<@lb9{N5CYl#LX|b8$ik1)L@FExJs-vs?Lqs2RT4jVUZH zO+fTK#(vgrXLt5d`c>myy0h%yKnZJrPBvTcQ8n_b`fC1b!3oOVkpjJ?1LFGLO?nbI zbd86hGL{##K)-!qjHV;dtMIN;wmYOEw)>P#qJqFd9Xo2$y;20C!;fs6j>b(=zDO(r zF?*xXA#-o9ki9F__h=uq+)DG6gua?&$u++a1+qyvwsA*7gf3yvNaq&JC`XHnOpi2B z6S5|0a0#?xuvK=5Uj9WbJ~kvv`gE8dtTA_~!b2G=NCk>hX+lo8M6U{Rw3tS+hJ=Nu 
zqu5Sg-g%FdOxf9PAM#%tP0Osd_*1W-D0ZV6^>xI$v-bc*yFtKiC#LEBS@?geT3dcK z#%3#GoRV-hAgZ@YO&N$vRMR*bdNeBXzvf%vNlEl-w|tsBHRTCTG2bKJw~d*H&xW(- z#`Y$`#(|}79mSimCI*_ub3Z?ijL10b?Q&a%1sUAK+&1cSz|yN5-m-v+QFt|WQB7-} zP-+mTSb|f2ZuRFO z87ZAMylGBg@&K@y6bv&NK?$Wn!&Tq!QTRe}?=x7C3^QB5+mk3nK83V2+SD>Tg4lmH z7?7`eXrlwn`rk4%0f--|x0HrGrtQ}2Fcl{nk}~#$&lcZWzdW@(=bqN6!Ttl~Qh|sqXU3x2h=_=(L%$rhw3vMp(NB;qRdmt_-VR)Wjr1Ge!ZhdTn@; z^U{k+&a^;nK<5c?-<~#54ZZ9Ai-a&TEqBN!Uh3RrDuv$LrkjQ?I|qdjYC6{}M?+ka#N@QhM1>E@~W zw;HwmJe15=7|}2^1E_he~~YqlEh)tF^~nsNQeSg=a>SSAV(Ocik-a zYBY-whTT?OGFS5m5NVfojXd`)yuu)7?L-XP_6^P&vH^S63@p#(HqvNCiQw z4zG|P0~_NaJp3DCTz_xI*~bK6>A2Gjn1YryuW*3B(%HqRf$LD#ai;hBnGE_w0+sKx zqJt!yl?4js$VT%-go+awwu#J{%p?OXej_7_j-dWju!O= zy^B&!Cq)cf=`{4W(@XK*DWHfpMC|lmF89&*4ur{+#Brdq`VuYj#IsozZ^Ob%9`|R$ zW<{&Ss&Ui%=kl1b%)|_pT(3fN2XdR*+s0&m*o1GOE<3LomMB!H{z`?M|m5%oyUWt4J(bI1c`bJ4xiF&P z0fEwtgiD;a4%=@Bw@)SuXSiN%2RP#--k`dGe#d5IqjU6Ew9Z!&mwJKd;a9-BSo34? zY#RT#$Dz|0>Dw&YvCh=%VuGJ*qSE!8h3DVi4ZvG+y7c4`p!?~Q-k{fwZ26KknzeJE zm!6ejKWNth=gwn3@@GQ_q5m-71f99EIQm=?dw3?75W?~#(%glR6SU^h$^_oyp_tDr z_zKX=Mo2k9h(?df0+hwCrbA=SjZe7?2!zCh9r?NwM|fWXP@Yz@yMPk&bMX19<=}42 zenvvVWL<-0>u9g@qgmwahbEWL5Q?p_D4->E;v%%0#kg#p&@&C;q|Q8V)PIdb#Mh}z zGz@|q-k)D+jhi{-{lOy*(^v=yX@v%Kw!VpsynbhydUJI4*4S1 zKGS-}?Kdlcl~mtD2cq9yCq|Nvu%938c+&yvN3l~Sbud4_&A>#j$c=8?Wut^h^pTWc z$H-l(CE>}E5jnqVE+A37)`RLY#=Ws=>t^9S@d-CUBmey4*A-JhA?5j3_94{ZW&XUS zCE!-I9~g30Q7(KsFBAv;DYk9NP!izz%~fag94D<`WNNJErn5?>TOa-PsjYYr&dNlF zMP6(3!cZb(pxwF@+fRimtl2;ii=HMVN5oYa)7+d6nhoWF2!7$r3?QT19_ey}bTmvuBzwc6 zXDz-}|1Iq?f=-8w>reR5mAO1n_q_(&Cm#90aWjb`r*GnhA07{Rg zT!dii#bWkN#QKGS2h6~jb!os}ycFOv`^De89TB#p&a$gTqM&rRlStDk>(m%G_?{T@ zZ>+MWGd8xu-2wS}zG%Lq-?&NAl%H-#b&fsFJC$0Uvp-uyKn9e!sc_W`sHRk}Yj4&?naPpl|8426Xh$X!v59#V;!@kUl$#wHo_>U&zY-eD#;sWTH zJ7f*JE0HkRYG&lO8X=|GX`LJwFf9O)&uJ=kI0=P&nNCs6T^mjh4*CE@+Zsk%(Sg^b z0!7{)h0M@sn#6eK9iiIQqJISP{M}D||92(NP?aine~3}9B%V2$%2>kDA&o7u?n~>t}_2% zA^4tH@YpAR|7)}W{^DNgga{V6*vsnMPyWkSV1~YeMHOdZ#q;g`^KnkZuB06>!d}{R 
zJs26^B6}Heg4}b5&vLkON{XhMT&SG|!j=4O!0zVi9!DlJsBGns z+u19r;u#`QG)_{4q~@IFMi}^QFd6cS;X}BidDX7759riwN=(t|GHSZl;b}~PS}Z>} z0uq^-8`UJp=ckB>MhnQUfp7F+A2KUVX8aboJrx0hp zy`di^wlXVfxFTcrOvG7r`{HL*94lVyRz*Uh11CXjD zx^f2cSg4M%dKK=C>FYey9y)@JrmH8PF%kCKv|{MXU_ECe%BqVx-n5O)J$w)fu| zXNZZ@!%2--?G|f4EWa8L_1Z`p9u}u*8S`NM3A5Ev0lt5(VI{JH%=$9(I|L@9Iks3T z7qRC1&HJO9(4r;GZAGkXpOdl}Rub7y9yak7*0O0b0f~J0>ZfT+e+E<^pG;c>zkbLx zuiG_y>3tszA1n#=DTi||!-+tjmdH6SRzbtC=7A)qJAt)=vT4puCemw3dZ!C~cavhT z;n=j-(&Ki@nGv~oRY4>{X;*bYJRuilRWh2BLF2DPB&}qrCSGmPy3}i-80phA%mAY; zqN3*{lp@TK0!yvfp43HoqXFHg5-?xy>^u$>&%O427`1%);rL$UYkg`hll?^w*Lr2# zzkV8Si1egeR$@xT{jR=|^b=)A(K=uu*k@96E-*Yjf+(aNTvk>?T2dHYNK>&}h$`_5 z8Sof{o&v@%YWTq!ZJ%pAkv;qf=)aufn<%o%BOw({ZgUK+g(-Hg8M%hovx+@EOq zk5W7RUsQhk*A4BQ8PI*=q98=H>2>W9%G8pVct~iuF%I~U@I&+atEgu@Ph+MgS^>k{ zu@kdiA+5m)d)S;hJ2Wz)71(NNz4wSFsvMz_u;zJ50ZrtPBp%i)9fkwtd`*h*l~c~+ zCRA!t^Gb$ft#TBpdaEOG6S`)7Dz|tRUGW`91gk_q!f?fV>p8Z5NwU+WX;Ghq-_@k1 z>Zn{|5j-uoBqiKC4o;<$1%qD%3q~pBz+_?o6K~eK6Y_WS*<29Rpuz-lTdM=XX&&_< z1y_M+@MK|3bx@N*S zH^i_^v!oR;ccr3TJGBUaR<~J`z1wu-RYUQy+El(GAW*)jJ7WFw?e9j^?c=m{Rm!>$yDwi^6hVf|=<@6RX%r-o?qw;^ARxV<5^~;O zi<-Z)w|Ik*B71TdmP<@RvP^F^SsC2zNJc^y;MZ8=sbw1VBkV35wsvcd+Rlk7k#(^x z*^B;Q!lC8iWlojv5DWHOHa4TX2_YYMP->}y)>Ac6Ehr2!c4WhFALYwCnV60X322il zkg9^ydcPpSZ5pe}O2=E{VE!O*|4JnZ^BK1u*E}4_ez-z6-zFn7&9#JMpuDs^S9FouM~Ku5 z*AW!v7k9+Zdm@l3od(yne!D zGN>I;z?b78qXcPCE~v>S{`$E=XQ90W5)-rE%owx|kJF{vWTAT0ubgr3t+~x8le!>m zBQZRhko8IxbpDGSR@l&1WaU>ymy-~5qTa)GH7k1;unq>;PXP_FFieHTHx(XQa1dpj zM`i-^2`K%-pBvgzO`fU6?)$mU<}{K5)kQEzaN7%^DJ>o$FyY^u;uj40~7B64N6elsq?Mrfki8w7#b&CB9*OP za(bkPkgbd@`n$%1uc4guWvr~d$6XX!RWl%b-sV&1g^o69b5$R}@)ShMgY3Xh#{Aju zu+@bAy+}m2#BVXuvwZzCVw@bQVcqt@nMMZ+L&+)IQW`ABQ-=-t6Z zEi>#|jI9JgUXXt%4=lTtTdR;+DO6dCpSRY~zys)4K(p=rq3B=2r~@HZ|F*bS(!vu7 z@^GMF&46nSYbIYuAb#@KL?gwC%%zwV1zFb$H6kU*L}iJl$892)aF~NSPA`S|?K!!T zOqOm>xHy^IxB14a<6j#tJXpZVfBJ8Mp_oD?ra-+T8a`_!oo6PK$RBulX@+}=JxA-X z>;Rgiwi^AnG$n;tICjnhc*Ti*r`yZ|In8oq4hrKBehWBo7P3K!y0khzWQO{k&o!)o 
z{rs-FPc^Y>PpsaHDwsy#?gGaq_j7%>A-yg=7DfxAib2Z&D~8!oLCO}aioA2w(q9v_ zU=SxNLsN)+N9XCC@6AUd<4URkAu7aMmEkp7cE3}z55hpo?GDiE^x0vQ;IZMS7PHhp zeniCu%Ro11G%?iYZ*|T+iTYGudWV^NY*)l*a}I@5_>5HD_A@6_A9O7^hgklow%z_^ z9JI$CCNVc;8MSrS1m}})&vhXanku|$&nS@}4g*Y!gnbbQETo^N(2^pt(ZiNoGFIQ8 ziXiXQ9FCYcii?&INKn3UIDfT;rAe4P5C0G(@vrtD!#no18>)V$K;w!^!N|~!d$XvirVLXw!xRww zRp|-2qEPsUUcwnw;>~P6Cz-6R{-M(hKFa~t#8&Y#BUL6QHbfn{I-96rU{o6CKow$0 z%};)V2sP@dNNFKks1n(6V>=3d7BGSey1Hg7;S)L6`l>HMf~Xk-fw6CdSuGNne#yi8E&E-BTkmJZJT}|ISAXZ5!!YZq^SMdK{E2 zXHF*xw-PXLCiD~BxrNY;sd`hXpf~Zwi(K;Zp4LV^LNCUXhbdY>L<;}=Zi!QCn$di_ z2cJJxHm)DjHexvN23v6`UsCpt?3wXB4r_R(L{G%QY~BDlr_A_{J)nwKR-{{Gk(;e? zWaq1T*-MT1&p?c1r2+cgW>>gNRz93)U;XUxQhM`%xgr^y#U{HIivTo0`rgH5uPE)XZ$_Sz)xH1*6C^KvM2fK~UxdV_~VHCpa z&ZlVq+Tf;1HLMgYQAXFzy!hizXU`Q3nF!r^2GpS5fUa(DgT%$%U)N4J@ioUX{!WvH zEqwFg_cM{xQy{hll*wosWYiH&vy?ptGxuw^hYbA{xn$`B#R)tcW0z8#*9)dDXSTI9 z;3;?qAQ9UxOcZSP7p0VidY)mPUdJNO-yYk`*?;Z~6WM_)0L6&!F#HHy+cFoR8+~2^21#TS^<+eilPled-V?niuSf zl$oVM5cb0z>8GFh+U_jGIlnIzzUAGZX!hOu<1L;tq`$NR^Q>|c?)qDLbOx<~G7ZX+ z_9ma3w6@M7tQIW+4chc-FtgGKVe_YY6kn8iEv8^prVozC>R(peXpB*&s+AE((<~)_ zsSxHPw-UI9V{Bv%^dAG%#4q2lL`_p!Le1#fsrngnpV{=13QM7|UVs_VHvexV@|xIi zN3}SjUc8o7Bg$6RvD2`-n)~ta9~>qPN_iH|66PK<922S-1TCmQ%H%zq1SwH?w zr?zLtM(MR`fG64+3qMkt9*KSMIn#d4;0zWsK^(z2)eEl;dh2!*)C*3f8Q>$~;!h1Z zSv79<9i3pIZEaKs(YWHh1fTjiUh!J^3y*My&O3UXWYwb1T{T1+|2rVbT1Lx)3b+sxJ>4yO9&}z9FNHR!VF@=^cUS|1`Z|V9zMzi!}eGVI|L#n#*SlA zoHA{opxY4VBK+q}xqVcPGQgA7T%O64CeR$jskE7nW5BB)0PhY!I3?EEJxn5xI(+ZX z($O}lfk12ya*uI+MN8rCI{K=|Fw*aZsr3C|((8R^KFw9?!ayPs%DGi6%|{Zn5)R-C zx$~U`>0|67qnV_}I=mu2>1#;TJYtG{H~b8kqPed^Q>E6Xq#cW>uE#3G$zt*8Wvf!)47lWZ z?z~D=)T~x5Cb)RuSy;V zbMT~;8!~;gh!yu2GsB`M24rBbGd zh2Z0%APu^+>ed6tk#bEuoZ5}0vWK_7BdIVq+ole>U$v9x&W8c$9_49hD85z(s2o7i zhj6JOKXHMd@sIt`iMtb`k^H>OK1IYZktm~L4=u{s)_kbLjA7%JYF-;orfO$p+mc90 zwd?2%vQZ<(w(!L3V=_QqEys9NiEOae=@zPBN->4`bg<>GHKKY|d4l$nEj+Xu9_Ov3 zSHr?EibWb~(g^1twW0{k5Y9{73oR&&LFq#Tz^p!pSu{2LskiCVtL+}9T$h3zR6APj 
zPTiaChYR#Ql{X#W8R8-O8wUtU|G}LMy{0`wb*{{LhJOUa0~XGZkvskZdOE$(<4tKn z;F}93P4rD#FbUJhF}vZ1!1!YdzPag#Mq>jbK4m09cy4Iwj+mQP+HTa`-PFXMEkN1n z?F86uIgi0_3({mkD=*s*Ke4D&FsYdyasK@r=&P}CnX|H-C2>4RUH6XG@;oJ8`rbzY zfo!9d??vV+m?ekIOsx8^bG=7kkT9uZ?)Rf+nxf1(V{bbzBbTRTd}r1~Vyv=dUERs9 zKCT6U@+1A&ZPKoKvB#vFv8X~HJ;NIPm%5v{L^_{U;SF~CEqEihD)Fc(NH}*$peegn zlsTb?l0x^|A(|J4G?(K5#m$r+!np3*lo~&7pxFU{f<#R1?Gc*ZUttR7$u*Y_tfC{@ zn-4h7-gVM}b^l;Zj#5{@&8BHYQze%o=sNtnqTUr?)qRT8XLZfjccK80#wt_CCMGtLm8|GWlNaxa`=Ws_LDZ`{O+Yf8W^w~AUlkNzuaD>xJIwJK{a%>Npr1g!Z{3TZ zv-JJ>*4NaW#C~an5FtcvX7NbvEgY4|*6iA8cv!t3ypGa8dl$h{8?e+lx1kH z^a3CK!u8~{@QX1~I_F_?It6kpxI_E$KUz|hcm^xU6J^Y|Or_gjm)L8oSUR$N=4Wig z-hUfZ2HGWmLld>0=K#{xu_9|B%==4KfE>F>mx>oj<$7^ z(pD`k9o~*lpMx4&4Zqg#+rbR9&{8WgLK(AjyaS*Ii(+ot718v=t3w#kC3Qqx;3*cI z)*dT|DMDnY$T~>*sqF{}($IVl#J7MxI$GL!g13r&$MA5girzW8x>TyjQqpNqT4o%K zYJ{+NamOI-TNd=c;=&Y?yIVyY#!eqHtriO`LGpcEezvsTjNj(I)VxuYJC&B-xx8$u zmqM%tUk2p$3#9*cLVss|t`j1ptX7KLUVhlN<~ioZ%v^#zs(5EOw)V4w>-}CUxt~!{ zA>Unf7Li#r@CAiGbBivmY@naF_7a0B&GJ67qMk1O>#t?otbN|P=i`!;MctR5M``b@ ziHT;R;M&6>lgETE>xFBtxw}pGHrM5{Dp`l2-z*wiDKYa|nhX0B zOVV8{T(-HMt!SXr7AmH@Y!#;rE;O_=8Hl-Z2{nlV&W!LN6$ys7`5d0U>uAW44YhA( z73clb{L;v^zqqV*nc-%k9LL;?Yhz6Iw5!G}U;-Rhk}yN1Iq2d5i)nqyoiY|7*}$7D zN##MTJ{oA8DQdNThu}h#Hg6)}cvi-yG3s<&uJH&QzV*V4agu~edCvvkIqDJ|vNwvX_JY71I~wSshuNs8{KY3nuu~eM+#SVI~aHx7JEbH0o&<;$ourUN>?CheC1<<)f|%uxq-luLf$)g6WX5 zDE6ds0*3QATt5*3LeX?zUf}7!my66?#S8ah*6apUx>AC+nNQv7%*=aHh6;<}Yn=m4 zAjKX{)4H@=UpJX=qeV$#|}%w7(;T7 z`}1Uk1VST-+Wo_YmqS#hGJ8Y#`Jt5MTvn1`3sAL&e=8_L7;0+4f)vgY=|Kwa6e@p$ z)v9u63{l#g>z;_JP6h0JiPmc;p=`H{_3yau+Wh==pOt~g#4zG!wT&~K*X}y0#+=4B zZKk=6XAE=fzPmUZt?rP(`w1^=-&N)Ym#Q{i&L~^P#U(yE{K;K_373HH{P{Xg^4)1G zn`=<8M{GOz`KNcP&~b)l{@A!yv2Y5boEt$eO>=Y)m6|m@p?EekyR(PMEx665B@Q0N zb_Pj%EE1d@Twz_If2}ibu8o4T+8?$=l0V4Z2nu$+^RNbsL(&zT z2naFk_s9Ro51az&?UpD7sjYo{x#1#$^xWB1xwc2SRNN#gTVx_S*eOn1SP=>%r8AVT z3cX3GuZ9@+uIxP&RIlxXu-fEB!iWEx{Xq(bs`hSs_QugbIa#fYht3(b>I{Oy3d%RO zBl__gU#Do7EtUaO8)i0rHDaM%i63vsm6Al+wnW$vX!r0{(^3zgPPOc(SUNJZoX=QT 
zkn#~jrV*Im{cN?O@gM7-Kk`6PSHLC>_Y_>Sa!TKtMq_UFinJ*ytCpH#`kE9A-vh#I zhC}$#5ju$4WoJ0RgH)i=n{1DRz*kTaHbz|T3pQu$5hq4?tGBMt=*K|4I!aByt@g_#A%_S3|KE`@GDw@1qXfU6u)$*?W4SK%x; zcPs5HO9-O0Q-g}!jl~5d8+#_$K7J zZB(T&)KKamCQ~WeqzCFPB|ID(TDhNYSE=a@psAIiL$}Rs(nzNEc}EDwL-t9mosn<; zYtE_vS#D>fukxJ;OnpJ<FQ>^HK5DTNl zl@k7jvMkTHU1@&?roKBgPek$Ca1y7-8)jXXArK=^79K1#cS0@?u`)(AP=RV9LpYJ2 z;yA>6Z85fn==3O*2=D>@*@y$oD!HYXs?l?A6+SZ|5<+9kw~D#1y3 zO{^Oq=Euqv*;d{vcNho^t7!B2`bfG89SKo%jX1ZCZ)l7U)V7D9qg)xV`F_?87Y^mq zm(7%Tku{Sa8tD&p+B`UiW;P1znr0X7tpx2bPSMsDiSsMjS~4IW3~mnKxmCa9E%{j} zm<~o1lpl??MBKVFG~^|Q#QI$K(oZ&p!c%1_*VY|$&&sRcZpm!GXCAne%1;0$gjO`A07tjXSe)F$5KVfWJu-K(Y-jS^6)?Z0;+j81ENEr)yq}ms>AEUa0u7ct^^* zD1P_;*-by03DlYet+5qq%i$LzSgQ)lb~oR(hCCS2HmHvZhp=QO@yN9)CX418Jh0IQ z@A;J)Rv}nGZYZy%o!@}4CAL9+3F0EVL#qftQ>z1J`{G>pu3}U3hA5ksoR>8HfjK=t zk_Q}@YmIpoCC$B9Q8(^&KO@+KG_F9>iFOeQ241G34fD;WvIRZoL9n))Xif%hNxcOD z>E}J=IOXg_^h#na0NGFkkiMzVkb_<40?co3!TBwaUiHSQo6(`*!u$j@=>j<{1qHcS z2$B=?S!H2b6{PW*UB~{I0d5b_t+Tp1^#x@Yo4`ux2G)YM9ET!NS2`(OimuV;c(5Fg zq3g$=V>;>eW^}`StYUu)6x;?7hhW1ZwKDiquBggH!W5L_4f80S*+P(V>guM!tG8C6+e{Joj1)v=GRb^7o zCVU%9buKBdTL5BWc83O`} zS!`y`F84e5WI92k?d_jSnfhDE=cQQ%jAQRA(RlrLg21+b@PGpUrRq0EtY!QpgJ>fF zGr|5Fdq|UyPoN;1KHQX_wVT3AliLWkIDy(7jJ6~Gv?XG+?S>YL59S2 zrHo5`1TRm6xL76Sbre5q;<~KeEMQWh@T3tgP~Kizmoz@18*)(NX)ZH{8a9!NEK&3c z1=&vhmdEiuMpp*QvZ=+yzld7zJ_YBa+|kcN%I%9$Kt7PAJ%k4R0^A7x+A-yCS2!pJ zt195yzL885rFN%@Z`(a8w^Va9OOvuH4ivN84E-GW)iQ$tV|x#P2vgp=of8Ji$x@5p zPX#02G1;~tPV^MBm;0AQvW^>#(16xPI|ICRB74ogk;39cKgS%_&s&5MvV){^NY51% zbOoOALjvOfR_sXkE&rAG+uA46bxm$G22{l@S5F|baQy_WO#~dOjnY&ND97{#g}6oF zPyb-K35I@TIG@H`nFTRS1BGg`G4kNPmO%-9zE~!nmcm;WU#&cNR zY{nP}9$BxaE6B)2UAhDG#-Erm1fKdG>w1)Ipq}tcg{)712H5mF4n4uFPvA-`?Q!0Q z*9p&BL8P`#qUE_xP&ykYt2-(FXdQ^l>LbUMLz9S5j0XJ@TOrMqvxfTXdYbWxujS6? 
z;#G~sN9B>s#9CQIW)u7G6~o-yI-uLvST!l4*|sfXV1iAy7pa-!y0v$%me~hYWa=kV z&US@BIv{piE0*-HB4mEV79{XqbE-3wl1ScRvr z4qegv!OSB^X7D|6#QC(-JLOQ~&iiEmY4_AI|AaCOaCS_G`c67}nwi&=-z=h0!)OM6 zbbLHge&q{6uF`r;pNfyPC0^-wzE16s-W2iroS)p7kpA{e77kw1eRw@|we3PPg(Be@ z^;}2+9JsXy3Cu042h1!ewK`b+a#g(wW@Do&F1m?%uI0N{9VXF#!LN zZfjgXkzY^x4)qmqY_<5N93)v0r4;lHBPOo)mwa0)9($pha=_^pzY3T!ABv=j!;k;0 zwyHBX0gy?ck<*&nz3d}}8V+Ue6}MaX3EDy2+}x-z$!~8Jx21wW|Ivmh=G)Ktv@^XO zdb1&bc`6(-dW8)>6x(N<^iS)Z(^TGk#NIagt|Tz` zJhad_T1Z^*)$HK6P6i8FYLPjng{mjW>LEO2%mZ8}NXALXoOoo>b+F7qE=cgfOBbeD z-S`8R6eT({WR`65+FhTiqgBHwB(X(fe{;PxKZ0*ka@9ciRNh_Ppp(MpoF442hjjAu`mubG=x0c-?;n#OuV6S`!6mrn2R9>lb^7!|#h0G%D8&a-hEAs_4 zhLQvoN#h#0+r3+;ng<5~Ub?s%-&N<@XMP#$leJ6+n+B~UwW-((AQg!W&rW8YDlzbP zu}M=x3*oxW2Y)9RL#t`9HK)Yt&ppZ2a@id5m?GYNYc6BGydQ6&_hh_BkEwz&UVZwA zMaK{JBUKSmD$_5qzGTQdAHQs)^YFvpS`=65;3&*TCZ?ka?lb2Mc_>}MNs8}hce>b- z9jel8tk$MZ7t@rdFLn3FzYN~$4`v*atC?!fOUS9lx0U_JT?qm;rpv>eFeCZ#IiQq$ zC#t7c%WDSEhQ9rarf&kVH>84Nx8607uld^vK%|M5UC<@({z#RtIo}mhh!rx9G|uEG zpDO-nH+Av*qPPd|(8A3Kd4+20H$%Kizw1!N_!94-yf!Tk)K{oM{rcXv@yq4Yvpjf8 zQ>Cs%O@2;vO@{GKvny#|ozOv;v+bB;-fcJ}w<6;ZjAhZ5IoS2(M;Bmxs`Uu@7Go$w zm|W-MbL0V$Bb*Id^Gg-9ik3$bNT_M$lQ%&*xJO zQt7P>pPm0$#=a|qtqvhPG-eSMpA!w8SBPHsqqeNv1o)~LpGAypnqkuCRm;2L8-j~Kcqi3zshSU2CTzaJX9)jt5 z55clC|5@wz?$(@5qOA^SD?VkJG7B8BkTofjPpevB1+W~$>jB4$?m9nUmWGDu&dIaI zJy3uLGJbdwlp1?|2Tu~o999r$0tTAlyy2>QBnuKm`{sN0JVMz6qt7is0@ChI z+!aG5=wR@;>3hf%W-IHZT4e<@Kq$FAQ&U3 zZRNt{*4sZhJZEMi*5roQ6zU*Y+X5Df@YoakYT9S}S9G*74#t&5NRAvab(Cv;7$t9=nsKqAK zM+q#O4c41S|djG@dFRpfwlRocK*Y`{5hP%k6n4U7`_YC{P*C zCT_bJNdCjR>@^<8-o7i*?rRWco8c87oCLNaITGLV&Z~Z-xt#7C2H)ie`0bRrh|Ywi zqyeO~e%!GaHN_q6;R%@ZL6g0UVb<@4Y!!M?+XGyta?*92pEXH6b~$ofN6-4TT14@%{pF7-js`~6?v?t%==lXOWZf~Y{#t5py&^N_d2dJBo;K%oCTtXn z{fVz8QaXADmlSU^0+Y@o9?9v2j#8QcAR|v4iT$(M5(h{G#3O@zW<%%|p=C1rL?K?ch zd`TuA#ppAQGbzH0k4@jU^D-F7wH=W4xaWae0wHD-y$L$Rg7T%jrKB?iu(1AO#1>(p zW1GAHL`H!#Ki)?DJ4O5%4n^$1Un^9KHLSV(w3S^R^JZO0ZF<=FNP;%crpO>cI*41p 
z-bm7c=FtZHcmbvwGHPh-HkLKKCWB@UKvcFAm@^cCAA)9qVfvM^FGP|3yzP{%L2bIsp_d&knK6ti@(gaF$DPX7?$%C zix12eq73JJxjBQ^o(izRP}OYTBsK5C$Z+@nB)fCV6u&(ijuv>q#lhW6P&Zr)<0C zM~E8N9qa&x(<<-}BP#1v9d|^tBOWC7cizP(DyYCT!q<-O07s7tRwDdTX7sSrFeP=h zEYB$$L1|zjy-|z_!`!fSoksCdq_zk;xM8Uc#~fc<%#Qq-keL5kx1*XFIMCtqumzif zD)o9Q9kb4^>VfNA0=25TyigZcr_at)@75KQ-SQ$lBYEU!u}vb!xO~C%bw>;OR3U^U z>m6Sc4DO0hB4SLZX0V*5-XBb=;sNk@#R8ZUWfM$&0F8jNznvsq2SsQg%l%P6ETp#x zV|ml}sGjm@fX(0Xmtp+P?TTQYD)$tfvR4vG+))Dm`b**q|8-)~y12Z0SZ9=;%o1!ILs53 zB!omDu5|mM#~Z9W>jUA-l%Sh9H+|8 zit+w}a{iIv&;g+A<;WiyZ0>1Qs>AFjyx5phw?o7ry>XWr zES`WHah*(lF*!MHLvv|zA&PUs2A_R*`MTyN7+c^u6wF06V32eTP+qgV-ppGHPr?BRlLO(oiJ3 zmggB#c`!*XwB+nHjkcGrraysgEkMAI79V=Z2hwO*7@@Bu;>5 zU+sjByn6#`Q{9JsI^RHSh!u#CkO-viJl4ikQmkI2;UXu0bQ5YkxWjXeM3wDRoTgh* zhOq=0?Xh|U`M4s>PgylNeU!Oe6<>ei9HJ)m-g}eJC>q+(MO(<6>T#dL4CASr`(juo zzhkeq5Ue6s;vfO*R#pDan2)M&}$; z%q^RREjn+lXS+!_O<^byz%~twp)u@{)sTF|DM_&pJ2U)(AFJM zOzVOy7G=+7bU#E(ciT3v0&Y3c=Zd`pT{OlVKR%&ctO&Q|$NQOu_On-F;p!+A z|9FBg*!nEKS+bu!SzLgC+fE^xuz_6xtvHT5{1L%Pe+2rXRR{d$sb|(TsH4yiJ@Zqg zIKgB zm;e)QAA#}JGq7vM#^S9veq>U+K1uvd)$1holdT1sa*ftcorX1nhT|7g zZm}p9Pe#j6sLmQ@96kY$5>B}8Gi#d>_}$#@*>M6r=uF>t70hGb+a#wXT@n6kK|x

\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfMars\\nMars, the fourth planet from the Sun, is...
1mars.pdfBasic facts about Mars:\\n· Distance from the S...
2earth.pdfSolar System\\nOur solar system is a vast and f...
3earth.pdfSolar System\\nFor more details about our Solar...
4earth.pdfEarth\\nEarth is the third planet from the Sun....
5earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "1 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "2 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "3 earth.pdf Solar System\\nFor more details about our Solar...\n", - "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "6bdd3515", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6bdd3515", - "outputId": "00705442-b6ae-4238-b0f5-c94de690ecb4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 1------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "2b34d9c6", - "metadata": { - "id": "2b34d9c6" - }, - "source": [ - "### 7.4- Understanding the output\n", - "\n", - "So we started with 7 rows and ended up with 6. Fuzzy dedupe removed the following **very similar** chunk.\n", - "\n", - "These are pretty similar chunks except for the words 'the' and 'our'\n", - "\n", - "**earth.pdf**\n", - "\n", - "`For more details about *our* Solar system see Chapter 1.`\n", - "\n", - "**mars.pdf**\n", - "\n", - "`For more details about *the* Solar system see Chapter 1.`\n", - "\n", - "Pretty neat, eh? 👍\n", - "\n", - "### Configuring Fuzzy de-dupe\n", - "\n", - "You can tweak fuzzy dedupe by tweaking the following parameters\n", - "\n", - "```python\n", - "# fuzzy parameters\n", - " \"fdedup_num_permutations\": 64,\n", - " \"fdedup_threshold\": 0.7, # (default 0.8)\n", - " \"fdedup_shingles_size\": 5,\n", - " \"fdedup_delimiters\": \" \"\n", - "```\n", - "\n", - "In our case, we set `fdedup_threshold` parameter to 0.7. 
\n" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": { - "id": "5370950a-2a3a-4143-8218-f9b4808099ba" - }, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode text for the vector storage." - ] - }, - { - "cell_type": "markdown", - "id": "85aba685", - "metadata": { - "id": "85aba685" - }, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "e1795167-9fac-4b7c-9417-f655c30848a1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" - ] - } - ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "c97545f4", - "metadata": { - "id": "c97545f4" - }, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "f4c2cba4-aed0-4eee-873b-d1a8abf60cbd" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "18:40:39 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "18:40:39 
INFO - pipeline id pipeline_id\n", - "18:40:39 INFO - code location None\n", - "18:40:39 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "18:40:39 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:39 INFO - orchestrator text_encoder started at 2024-09-18 18:40:39\n", - "18:40:39 INFO - Number of files is 2, source profile {'max_file_size': 0.009204864501953125, 'min_file_size': 0.009014129638671875, 'total_file_size': 0.018218994140625}\n", - "18:40:41 INFO - Completed 1 files (50.0%) in 0.003 min\n", - "18:40:41 INFO - Completed 2 files (100.0%) in 0.003 min\n", - "18:40:41 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:41 INFO - done flushing in 0.0 sec\n", - "18:40:41 INFO - Completed execution in 0.032 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:6 completed successfully\n", - "CPU times: user 816 ms, sys: 204 ms, total: 1.02 s\n", - "Wall time: 2.53 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": { - "id": "b734852c" - }, - "source": [ - "### 8.3 - Inspect Generated output\n", - "\n", - "You will see a column called `embeddings` added at the end. This the text content converted into vectors or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "id": "7b1c1d09", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 205 - }, - "id": "7b1c1d09", - "outputId": "86c49244-9f9f-4116-fb17-c27ff6c29bc7" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (6, 18)\n", - "Output data dimensions (rows x columns)= (6, 19)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_idremovedchunk_hashembeddings
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...6[]-1[0.07728295, 0.024970993, -0.043180738, 0.0580...
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7[]-1[0.10598018, 0.025460618, 0.023627337, 0.03905...
2earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...0[]-1[0.0077404436, -0.02055944, 0.026426593, 0.011...
3earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...1[]5[-0.062105548, -0.0053322907, 0.031277698, 0.0...
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...2[]-1[0.072435796, -0.058001805, -0.019771898, -0.0...
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...3[]-1[0.091821924, 0.015197902, 0.07716932, 0.01711...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 earth.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "1 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", - "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... 
$.main-text[6] \n", - "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", - "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", - "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", - "\n", - " removed chunk_hash embeddings \n", - "0 [] -1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", - "1 [] -1 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", - "2 [] -1 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", - "3 [] 5 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", - "4 [] -1 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", - "5 [] -1 [0.091821924, 0.015197902, 0.07716932, 0.01711... " - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "aa667c65-8421-4d4d-f57e-47ccc4ea41ad" + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "āœ… Copied 
output from 'output/06_embeddings_out' --> 'output/output_final'\n" + "āœ… Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" ] } ], @@ -3836,7 +3299,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "dpk-1-basic-022dev1-py312", "language": "python", "name": "python3" }, @@ -3850,27 +3313,26 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.12.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { - "0a1ed94698ca4e4291c553929e0ca66c": { + "06f9b33494984e4885d5aad813d1d2bc": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", + "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", + "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", - "bar_color": null, "description_width": "" } }, - "2eea7bc810e54eaeb325136352b71e66": { + "1cb3bbf7d724411cbe9831543a4aecc0": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -3922,46 +3384,7 @@ "width": null } }, - "3077f04af3a9447ab98717bd3131cd8f": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "4f63bfad92b64e7bae18e720376d402d": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - 
"state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_709685da1c6c4164bed658357a2191bf", - "max": 7, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_0a1ed94698ca4e4291c553929e0ca66c", - "value": 7 - } - }, - "5dbc6889a9c243c5a922f8cc5f1a704c": { + "553f3c16839a49d79591d0fc4862bed6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4013,7 +3436,7 @@ "width": null } }, - "6957a659451b46dab702c1c62fa9cdd2": { + "7053c9606a414e978636a7e241909504": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", @@ -4028,13 +3451,51 @@ "_view_name": "HTMLView", "description": "", "description_tooltip": null, - "layout": "IPY_MODEL_5dbc6889a9c243c5a922f8cc5f1a704c", + "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", "placeholder": "ā€‹", - "style": "IPY_MODEL_d6e520e4da004c818031ccfcc3588e5d", - "value": "ā€‡7/7ā€‡[00:00<00:00,ā€‡221.60it/s]" + "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", + "value": "ā€‡10/10ā€‡[00:00<00:00,ā€‡349.38it/s]" + } + }, + "724778729161445c98b187031ae4f67c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "97b603697cfa4b4ea4e6735b6768ca35": { + "model_module": 
"@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", + "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", + "IPY_MODEL_7053c9606a414e978636a7e241909504" + ], + "layout": "IPY_MODEL_da0787b239764847a731083997780a85" } }, - "709685da1c6c4164bed658357a2191bf": { + "9d184ed175f0403fb03c2e13dfd04e0a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4086,50 +3547,31 @@ "width": null } }, - "7616f1b493e1461c9fd1319fae3bc10b": { + "b78aa40816e44f7fbebcb24ca68818b3": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", - "model_name": "HTMLModel", + "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", + "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", - "_view_name": "HTMLView", + "_view_name": "ProgressView", + "bar_style": "success", "description": "", "description_tooltip": null, - "layout": "IPY_MODEL_ebc626c0750c470db6789b26acf15f60", - "placeholder": "ā€‹", - "style": "IPY_MODEL_3077f04af3a9447ab98717bd3131cd8f", - "value": "Fetchingā€‡7ā€‡files:ā€‡100%" - } - }, - "8226b2522ce446f6bd3a36c4e227370c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - 
"_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_7616f1b493e1461c9fd1319fae3bc10b", - "IPY_MODEL_4f63bfad92b64e7bae18e720376d402d", - "IPY_MODEL_6957a659451b46dab702c1c62fa9cdd2" - ], - "layout": "IPY_MODEL_2eea7bc810e54eaeb325136352b71e66" + "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", + "max": 10, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", + "value": 10 } }, - "d6e520e4da004c818031ccfcc3588e5d": { + "c0eb5bc8f6ee427ca42204b3c56f9a4e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", @@ -4144,7 +3586,7 @@ "description_width": "" } }, - "ebc626c0750c470db6789b26acf15f60": { + "da0787b239764847a731083997780a85": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4195,6 +3637,27 @@ "visibility": null, "width": null } + }, + "e87e8d3262c54cfaaa8768505edacda3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", + "placeholder": "ā€‹", + "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", + "value": "Fetchingā€‡10ā€‡files:ā€‡100%" + } } } } diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 7ce746c67..6a14dedc7 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -13,7 +13,8 @@ "\n", "Here is the 
workflow\n", "\n", - "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n", + "\n" ] }, { @@ -27,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -42,30 +43,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n", - "\n", - "### (Optional) How to create PDFs?\n", - "\n", - "If you like to play around with various inputs files, follow these steps to re-generate PDFs.\n", - "\n", - "**Option 1 (Easiest): Use a word editor or google docs editor**\n", - "\n", - "Write your content and export as PDF\n", - "\n", - "\n", - "**Option 2: markdown -> pdf**\n", - "\n", - "First edit the markdown files using any text editor.\n", - "\n", - "Then use [pandoc](https://pandoc.org/) to convert them to pdfs.\n", + "We will use simple PDFs about the solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", "\n", - "```bash\n", - "pandoc earth.md -o earth.pdf\n", - "pandoc mars.md -o mars.pdf\n", - "```\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { @@ -87,11 +68,7 @@ "execution_count": 1, "id": "1fe354b7", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "6fe04a4c-8092-49bb-f4ee-ffdcd42b6c11" + "id": "1fe354b7" }, "outputs": [ { @@ -128,19 +105,15 @@ "execution_count": 2, "id": "3309799e", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "5af8cfbc-346d-41bd-c14e-c917d0f403f3" + "id": "3309799e" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input'\n", - " !wget -O 'input/earth.pdf' 
'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" ] }, { @@ -158,12 +131,7 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "93aa2df3-0cf5-4b04-84bb-6803bbf46df6" + "id": "1fcec577" }, "outputs": [], "source": [ @@ -219,7 +187,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "8a316776-582c-4d01-80de-cd530081a080" + "outputId": "54e232da-b2a8-4f3e-d983-94259505dad3" }, "outputs": [ { @@ -250,7 +218,7 @@ "base_uri": "https://localhost:8080/" }, "id": "33345487", - "outputId": "47dca359-2740-493d-83eb-1291617d3db1" + "outputId": "c14c3a3d-c074-4535-b75d-19c5effa7d94" }, "outputs": [ { @@ -272,10 +240,8 @@ "\n", "MY_CONFIG = MyConfig ()\n", "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", - "else:\n", - " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", "MY_CONFIG.OUTPUT_FOLDER_FINAL = 
os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", "\n", @@ -339,7 +305,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "704d5f45-5d49-43b0-afeb-1dddf2aa326d" + "outputId": "fd42f265-445f-488c-8c62-b293424f162d" }, "outputs": [ { @@ -404,14 +370,14 @@ "base_uri": "https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "5ef25857-46d4-463e-f847-369d18cb2d8d" + "outputId": "f4c02b6f-effd-4d04-8547-f270f721f8d2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" ] } ], @@ -443,38 +409,38 @@ "base_uri": "https://localhost:8080/" }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "7a069b9a-1159-4993-d2b0-b26b16235f6b" + "outputId": "2cb0721a-1526-4129-a72f-77c1beefafdb" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:49:32 INFO - Running locally\n", - "18:49:32 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "18:49:32 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", - "18:49:32 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:49:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "18:49:32 INFO - pipeline id pipeline_id\n", - "18:49:32 INFO - code location None\n", - "18:49:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 
'max_restarts': -1}\n", - "18:49:32 INFO - actor creation delay 0\n", - "18:49:32 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:49:33,959\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - orchestrator started at 2024-09-18 18:49:37\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.135861206799746, 'object_store': 4.06793060246855}\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m 18:49:40 INFO - Initializing models\n", - "Fetching 7 files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 7/7 [00:00<00:00, 167772.16it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - Completed processing 2 files in 0.14 min\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m 18:49:40 INFO - Initializing models\n", - "18:49:56 INFO - Completed execution in 0.4 min, execution result 0\n", - "Fetching 7 files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 7/7 [00:00<00:00, 38031.25it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n" + "22:45:46 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "22:45:46 INFO - pipeline id pipeline_id\n", + "22:45:46 INFO - code location None\n", + "22:45:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:45:46 INFO - actor creation delay 0\n", + "22:45:46 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:45:46 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "22:45:46 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:45:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "22:45:46 INFO - Running locally\n", + "2024-10-16 22:45:48,783\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - orchestrator started at 2024-10-16 22:45:52\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.14609298761934, 'object_store': 3.073046493344009}\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m 22:45:55 INFO - Initializing models\n", + "Fetching 10 files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 10/10 [00:00<00:00, 103563.06it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:00 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - Completed processing 2 files in 0.033 min\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m 22:45:55 INFO - Initializing models\n", + "Fetching 10 files: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 10/10 [00:00<00:00, 126716.13it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "22:46:12 INFO - Completed execution in 0.43 min, execution result 0\n" ] }, { @@ -482,8 +448,8 @@ "output_type": "stream", "text": [ "✅ Stage:1 completed successfully\n", - "CPU times: user 4.1 s, sys: 1.17 s, total: 5.27 s\n", - "Wall time: 28.2 s\n" + "CPU times: user 4.46 s, sys: 1.22 s, total: 5.69 s\n", + "Wall time: 30.4 s\n" ] } ], @@ -559,10 +525,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 254 + "height": 255 }, "id": "fe59563d", - "outputId": "9ba799f3-a183-4467-d50f-44dbbc86d19a" + "outputId": "40c31bad-d00a-4da9-8169-9db1bcc47704" }, "outputs": [ { @@ -615,12 +581,12 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", " \n", " \n", @@ -630,12 +596,12 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", " \n", " \n", @@ -648,16 +614,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 0 11 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "0 0 11 f20aa513-8473-4bf7-a746-a66eb28b722c pdf \n", + "1 0 11 b4c44875-3612-4c5a-b387-2f04c63d1276 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:45.937701 1.966178 earth.pdf " + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.131556 2.001925 earth.pdf " ] }, "execution_count": 10, @@ -708,7 +674,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "e759dddf-64ac-4b55-a9bf-d0722620d6ab" + "outputId": "fd259342-158a-4a33-f148-d8462e2f1ca2" }, "outputs": [ { @@ -860,7 +826,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "d9eab8cc-79ac-4f5e-99f3-596e357a2e39" + "outputId": "68cdc0c0-3bf5-45a2-d2bc-99aa79e3e0d5" }, "outputs": [ { @@ -1034,7 +1000,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "d680cc28-2d3a-4793-9373-c56635a308c9" + "outputId": "7a800f4b-bc80-452d-c3d6-170e19f3422e" }, "outputs": [ { @@ -1075,32 +1041,32 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "7151d997-74f1-42fd-90a2-0124c6a68c84" + "outputId": "e6f06879-906c-47d0-ef34-b018e4efa00f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:49:58 INFO - Running locally\n", - "18:49:58 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "18:49:58 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "18:49:58 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:49:58 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:49:58 INFO - pipeline id pipeline_id\n", - "18:49:58 INFO - code location None\n", - "18:49:58 INFO - number 
of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:49:58 INFO - actor creation delay 0\n", - "18:49:58 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:00,178\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - orchestrator started at 2024-09-18 18:50:02\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.085193634033203, 'object_store': 4.042596817016602}\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - Completed processing 2 files in 0.033 min\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - done flushing in 0.001 sec\n", - "18:50:14 INFO - Completed execution in 0.271 min, execution result 0\n" + "22:46:15 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", + "22:46:15 INFO - pipeline id pipeline_id\n", + "22:46:15 INFO - code location None\n", + "22:46:15 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:15 INFO - actor creation delay 0\n", + "22:46:15 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:15 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "22:46:15 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:15 INFO - Running locally\n", + "2024-10-16 22:46:16,484\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - orchestrator started at 2024-10-16 22:46:19\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.136235047131777, 'object_store': 3.068117522634566}\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed processing 2 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - done flushing in 0.001 sec\n", + "22:46:31 INFO - Completed execution in 0.271 min, execution result 0\n" ] }, { @@ -1108,8 +1074,8 @@ "output_type": "stream", "text": [ "āœ… Stage:2 completed successfully\n", - "CPU times: user 917 ms, sys: 285 ms, total: 1.2 s\n", - "Wall time: 18.6 s\n" + "CPU times: user 1.04 s, sys: 360 ms, total: 1.4 s\n", + "Wall time: 19.1 s\n" ] } ], @@ -1171,10 +1137,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 893 + "height": 897 }, "id": "d8138d43", - "outputId": "3cbc98f8-1dcb-4a32-9259-f801a83cf241" + "outputId": "3e040b55-8c94-4f97-fedf-d2dbead55a72" }, "outputs": [ { @@ -1184,7 +1150,7 @@ "Files processed : 2\n", "Chunks created : 8\n", "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 15)\n" + "Output data dimensions (rows x columns)= (8, 16)\n" ] }, { @@ -1212,17 +1178,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - 
" document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " \n", " \n", " \n", @@ -1232,17 +1199,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 1\n", @@ -1250,17 +1218,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " \n", " \n", " 2\n", @@ -1268,17 +1237,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " \n", " \n", " 3\n", @@ -1286,17 +1256,18 @@ " 1\n", " 0\n", " 11\n", - " 
528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " \n", " \n", " 4\n", @@ -1304,17 +1275,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 5\n", @@ -1322,17 +1294,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " \n", " \n", " 6\n", @@ -1340,17 +1313,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from 
the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " \n", " \n", " 7\n", @@ -1358,42 +1332,33 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -1406,14 +1371,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1425,15 +1400,25 @@ "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "7 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox \n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... 
\n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" ] }, "execution_count": 15, @@ -1481,7 +1466,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "fa82f54b-53a3-4447-a4ca-2fe92dea452a" + "outputId": "4c3b6461-ae8c-41d9-8c71-e1bbe634b9ed" }, "outputs": [ { @@ -1584,7 +1569,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "87a8d7a0-0bc0-4735-9edb-57e9c9e5a8e1" + "outputId": "3dc3ec5d-31d7-4081-db16-8bb6051ea80a" }, "outputs": [ { @@ -1644,7 +1629,9 @@ { "cell_type": "markdown", "id": "20217298", - "metadata": {}, + "metadata": { + "id": "20217298" + }, "source": [ "## Step-5: DOC ID generation\n", "\n", @@ -1659,7 +1646,9 @@ { "cell_type": "markdown", "id": "66811f5b", - "metadata": {}, + "metadata": { + "id": "66811f5b" + }, "source": [ "### 5.1 - Set Input/output Folder" ] @@ -1668,7 +1657,13 @@ "cell_type": "code", "execution_count": 18, "id": "1f747c0d", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1f747c0d", + "outputId": "765daa01-138b-4bfa-a75c-bffc80f9e246" + }, "outputs": [ { "name": "stdout", @@ -1696,7 +1691,9 @@ { "cell_type": "markdown", "id": "18aa0fe1", - "metadata": {}, + "metadata": { + "id": "18aa0fe1" + }, "source": [ "### 5.2 - Execute" ] @@ -1705,31 +1702,38 @@ "cell_type": "code", "execution_count": 19, "id": "f6e9e145", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 883 + }, + "id": "f6e9e145", + "outputId": "fe3d0a3d-0575-4dd8-8564-e336a6ddb68d" + }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:16 INFO - Running locally\n", - "18:50:16 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "18:50:16 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "18:50:16 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:16 INFO - data factory data_ Not using 
data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:16 INFO - pipeline id pipeline_id\n", - "18:50:16 INFO - code location None\n", - "18:50:16 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:16 INFO - actor creation delay 0\n", - "18:50:16 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:17,977\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - orchestrator started at 2024-09-18 18:50:19\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.074102020822465, 'object_store': 4.037051009945571}\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed processing 2 files in 0.013 min\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - done flushing in 0.001 sec\n", - "18:50:29 INFO - Completed execution in 0.231 min, execution result 0\n" + "22:46:32 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "22:46:32 INFO - pipeline id pipeline_id\n", + "22:46:32 INFO - code location None\n", + "22:46:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:32 INFO - actor creation delay 0\n", + "22:46:32 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:32 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "22:46:32 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:32 INFO - Running locally\n", + "2024-10-16 22:46:33,897\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - orchestrator started at 2024-10-16 22:46:35\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.126107025891542, 'object_store': 3.0630535120144486}\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed processing 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - done flushing in 0.001 sec\n", + "22:46:46 INFO - Completed execution in 0.227 min, execution result 0\n" ] }, { @@ -1737,8 +1741,8 @@ "output_type": "stream", "text": [ "✅ Stage:3 completed successfully\n", - "CPU times: user 107 ms, sys: 137 ms, total: 244 ms\n", - "Wall time: 15.1 s\n" + "CPU times: user 122 ms, sys: 153 ms, total: 276 ms\n", + "Wall time: 14.9 s\n" ] } ], @@ -1783,7 +1787,9 @@ { "cell_type": "markdown", "id": "4954402f", - "metadata": {}, + "metadata": { + "id": "4954402f" + }, "source": [ "### 5.3 - Inspect Generated output\n", "\n", @@ -1799,14 +1805,21 @@ "cell_type": "code", "execution_count": 20, "id": "1911179a", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 373 + }, + "id": "1911179a", + "outputId": "b82445e8-ebba-48fa-b1c2-26a9e0743ef9" + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 15)\n", 
- "Output data dimensions (rows x columns)= (8, 17)\n" + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" ] }, { @@ -1834,17 +1847,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " \n", @@ -1856,19 +1870,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 4\n", " \n", " \n", " 1\n", @@ -1876,19 +1891,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", - " 1\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", " \n", " \n", " 2\n", @@ -1896,19 +1912,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 
2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", " \n", " \n", " 3\n", @@ -1916,19 +1933,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " \n", " \n", " 4\n", @@ -1936,19 +1954,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 4\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 0\n", " \n", " \n", " 5\n", @@ -1956,19 +1975,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " 
b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", - " 5\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", " \n", " \n", " 6\n", @@ -1976,19 +1996,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " \n", " \n", " 7\n", @@ -1996,44 +2017,35 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 
528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2046,14 +2058,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2075,15 +2097,25 @@ "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 " + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " ] }, "execution_count": 20, @@ -2105,7 +2137,9 @@ { "cell_type": "markdown", "id": "852829dc", - "metadata": {}, + "metadata": { + "id": "852829dc" + }, "source": [ "## Step-6: Exact Dedup\n", "\n" @@ -2126,11 +2160,7 @@ "execution_count": 21, "id": "4c7a1b94", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "7998935d-3f72-4617-ea03-fd2a40ad9f23" + "id": "4c7a1b94" }, "outputs": [ { @@ -2167,36 +2197,32 @@ "execution_count": 22, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "aa460fea-a393-47d3-b084-59d47f26f0a7" + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:31 INFO - Running locally\n", - "18:50:31 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "18:50:31 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "18:50:31 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:31 INFO - pipeline id pipeline_id\n", - "18:50:31 INFO - code location None\n", - "18:50:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:31 INFO - actor creation delay 0\n", - "18:50:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:33,176\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - orchestrator started at 2024-09-18 18:50:34\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.064273834228516, 'object_store': 4.032136917114258}\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - Completed processing 2 files in 0.014 min\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - done flushing in 0.001 sec\n", - "18:50:45 INFO - Completed execution in 0.23 min, execution result 0\n" + "22:46:47 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "22:46:47 INFO - pipeline id pipeline_id\n", + "22:46:47 INFO - code location None\n", + "22:46:47 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:47 INFO - actor creation delay 0\n", + "22:46:47 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:47 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "22:46:47 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, 
random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:47 INFO - Running locally\n", + "2024-10-16 22:46:48,851\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - orchestrator started at 2024-10-16 22:46:50\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.11034622322768, 'object_store': 3.055173110216856}\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed processing 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - done flushing in 0.001 sec\n", + "22:47:01 INFO - Completed execution in 0.226 min, execution result 0\n" ] }, { @@ -2204,8 +2230,8 @@ "output_type": "stream", "text": [ "✅ Stage:4 completed successfully\n", - "CPU times: user 99.9 ms, sys: 168 ms, total: 268 ms\n", - "Wall time: 15.1 s\n" + "CPU times: user 125 ms, sys: 134 ms, total: 259 ms\n", + "Wall time: 15 s\n" ] } ], @@ -2266,20 +2292,15 @@ "execution_count": 23, "id": "d824ebf6", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 358 - }, - "id": "d824ebf6", - "outputId": "89f1013d-6dcf-418f-a0d7-5f78b19b74ac" + "id": "d824ebf6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 17)\n", - "Output data dimensions (rows x columns)= (7, 18)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", "Input chunks before exact dedupe : 8\n", "Output chunks after exact dedupe : 7\n", "Duplicate chunks removed : 1\n" @@ -2310,17 +2331,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " removed\n", @@ -2333,19 +2355,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " 
$.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 4\n", " []\n", " \n", " \n", @@ -2354,19 +2377,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", - " 1\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", " []\n", " \n", " \n", @@ -2375,19 +2399,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", " []\n", " \n", " \n", @@ -2396,19 +2421,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 
7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " []\n", " \n", " \n", @@ -2417,19 +2443,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", - " 5\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", " \n", @@ -2438,19 +2465,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " []\n", " \n", " \n", @@ -2459,19 +2487,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 
189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " []\n", " \n", " \n", @@ -2479,23 +2508,14 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2507,13 +2527,22 @@ "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2533,14 +2562,23 @@ "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
\n", + "\n", " chunk_hash chunk_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", " removed \n", "0 [] \n", @@ -2576,12 +2614,7 @@ "execution_count": 24, "id": "82cc9bb0", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 112 - }, - "id": "82cc9bb0", - "outputId": "293489a5-a840-4d5c-fafd-245db30d81c0" + "id": "82cc9bb0" }, "outputs": [ { @@ -2674,11 +2707,7 @@ "execution_count": 25, "id": "cc61dffa", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "cf6393e6-c4c7-4606-87e5-892c26b28801" + "id": "cc61dffa" }, "outputs": [ { @@ -2781,11 +2810,7 @@ "execution_count": 26, "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", - "outputId": "4548fff6-f86f-45d4-a812-49aa061fdef2" + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399" }, "outputs": [ { @@ -2824,60 +2849,56 @@ "execution_count": 27, "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", - 
"outputId": "1164345a-93db-4f8e-ad34-58a1c3d0c116" + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:46 INFO - Running locally\n", - "18:50:46 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", - "18:50:46 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", - "18:50:46 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:46 INFO - pipeline id pipeline_id\n", - "18:50:46 INFO - code location None\n", - "18:50:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:46 INFO - actor creation delay 0\n", - "18:50:46 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:48,381\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - orchestrator started at 2024-09-18 18:50:49\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.067702485248446, 'object_store': 4.033851241692901}\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files in 0.131 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files (50.0%) in 0.131 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - Completed processing 2 files in 0.215 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - Done submitting long buckets\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1219171)\u001b[0m 18:51:05 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - Done processing buckets in 0.011 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - creating document snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - Completed processing 2 files in 0.098 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - done flushing in 0.001 sec\n", - "18:51:22 INFO - Completed execution in 0.592 min, execution result 0\n" + "22:47:02 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", + "22:47:02 INFO - pipeline id pipeline_id\n", + "22:47:02 INFO - code location None\n", + "22:47:02 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:47:02 INFO - actor creation delay 0\n", + "22:47:02 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:47:02 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "22:47:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:47:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:47:02 INFO - Running locally\n", + "2024-10-16 22:47:03,977\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - orchestrator started at 2024-10-16 22:47:05\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.128299713134766, 'object_store': 3.064149856567383}\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:06 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files in 0.104 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files (50.0%) in 0.104 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - Completed processing 2 files in 0.154 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - Done submitting long buckets\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - Done processing buckets in 0.012 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - creating document snapshots\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1008950)\u001b[0m 22:47:19 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:20 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - Completed processing 2 files in 0.153 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - done flushing in 0.001 sec\n", + "22:47:40 INFO - Completed execution in 0.632 min, execution result 0\n" ] }, { @@ -2885,8 +2906,8 @@ "output_type": "stream", "text": [ "✅ Stage:5 completed successfully\n", - "CPU times: user 174 ms, sys: 166 ms, total: 341 ms\n", - "Wall time: 36.7 s\n" + "CPU times: user 212 ms, sys: 201 ms, total: 413 ms\n", + "Wall time: 39.4 s\n" ] } ], @@ -2965,20 +2986,15 @@ "execution_count": 28, "id": "e899ad60", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 222 - }, - "id": "e899ad60", - "outputId": "70d040ab-b1d5-4797-f725-11982ef82413" + "id": "e899ad60" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 17)\n", - "Output data dimensions (rows x columns)= (6, 17)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (6, 18)\n", "Duplicate chunks removed by fuzzy-dedupe: 2\n" ] }, @@ -3007,17 +3023,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_id\n", " chunk_hash\n", " \n", @@ -3029,19 +3046,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", -
" 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", + " -1\n", " \n", " \n", " 1\n", @@ -3049,19 +3067,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", - " $.main-text[3]\n", - " 1\n", - " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", + " $.main-text[5]\n", " 1\n", - " 5\n", + " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", + " -1\n", " \n", " \n", " 2\n", @@ -3069,39 +3088,41 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", - " $.main-text[5]\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " Basic facts about Mars:\\nĀ· Distance from the S...\n", + " $.main-text[6]\n", " 1\n", - " [132.87440491, 500.84011841, 477.48345947, 534...\n", - " 2\n", + " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " -1\n", " \n", " \n", " 3\n", - " mars.pdf\n", + " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", - " mars.pdf\n", - " Basic facts about Mars:\\nĀ· Distance from the S...\n", - " $.main-text[6]\n", + " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 
2024-10-16T22:46:02.131556\n", + " 2.001925\n", + " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " Solar System\\nFor more details about our Solar...\n", + " $.main-text[3]\n", " 1\n", - " [133.2026062, 482.90710449, 237.04431152, 493....\n", - " 3\n", - " -1\n", + " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", + " 5\n", " \n", " \n", " 4\n", @@ -3109,18 +3130,19 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " -1\n", " \n", " \n", @@ -3129,18 +3151,19 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " -1\n", " \n", " \n", @@ -3148,61 +3171,61 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", -
"2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath 
\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", - " chunk_hash \n", - "0 4 \n", - "1 5 \n", - "2 -1 \n", - "3 -1 \n", - "4 -1 \n", - "5 -1 " + " document_id chunk_id chunk_hash \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", + "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", + "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
2 -1 \n", + "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 -1 " ] }, "execution_count": 28, @@ -3227,12 +3250,7 @@ "execution_count": 29, "id": "ab7ea52b", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 81 - }, - "id": "ab7ea52b", - "outputId": "13a1847a-bdd1-4dc9-a281-a8faac59c3a8" + "id": "ab7ea52b" }, "outputs": [ { @@ -3269,17 +3287,17 @@ " \n", " 1\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", " \n", " \n", " 2\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", + " Basic facts about Mars:\\nĀ· Distance from the S...\n", " \n", " \n", " 3\n", - " mars.pdf\n", - " Basic facts about Mars:\\nĀ· Distance from the S...\n", + " earth.pdf\n", + " Solar System\\nFor more details about our Solar...\n", " \n", " \n", " 4\n", @@ -3298,9 +3316,9 @@ "text/plain": [ " filename contents\n", "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "3 earth.pdf Solar System\\nFor more details about our Solar...\n", "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", "5 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." 
] @@ -3319,11 +3337,7 @@ "execution_count": 30, "id": "6bdd3515", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6bdd3515", - "outputId": "5a214fa3-c420-42d7-dcab-574b661e0cd8" + "id": "6bdd3515" }, "outputs": [ { @@ -3336,14 +3350,10 @@ "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", "-------\n", "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", "Mars\n", "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", "-------\n", - "-------Chunk 3------\n", + "-------Chunk 2------\n", "Basic facts about Mars:\n", "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", @@ -3351,10 +3361,14 @@ "-------\n", "========== earth.pdf ===========\n", "-------Chunk 0------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", "Earth\n", "Earth is the third planet from the Sun. It's our home planet.
Earth is the only place we know of with life.\n", "-------\n", - "-------Chunk 1------\n", + "-------Chunk 2------\n", "Earth\n", "Basic facts about Earth:\n", "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", @@ -3437,11 +3451,7 @@ "execution_count": 31, "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "1c7835d1-1f2c-4545-8533-d9ab7a3ad0aa" + "id": "20a153fa-fd56-401e-86be-4f7617affcc8" }, "outputs": [ { @@ -3478,36 +3488,32 @@ "execution_count": 32, "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "91dd893c-3056-4d2a-bffe-49645e584a12" + "id": "228df6b2-bc62-494b-9697-03ece98d7853" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:51:23 INFO - Running locally\n", - "18:51:23 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "18:51:23 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "18:51:23 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:51:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:51:23 INFO - pipeline id pipeline_id\n", - "18:51:23 INFO - code location None\n", - "18:51:23 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:51:23 INFO - actor creation delay 0\n", - "18:51:23 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:51:25,784\tINFO
worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - orchestrator started at 2024-09-18 18:51:28\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of files is 2, source profile {'max_file_size': 0.008937835693359375, 'min_file_size': 0.00830841064453125, 'total_file_size': 0.017246246337890625}\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.01370926015079, 'object_store': 4.0068546291440725}\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:33 INFO - Completed processing 2 files in 0.084 min\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:34 INFO - done flushing in 0.001 sec\n", - "18:51:44 INFO - Completed execution in 0.334 min, execution result 0\n" + "22:47:42 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "22:47:42 INFO - pipeline id pipeline_id\n", + "22:47:42 INFO - code location None\n", + "22:47:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:47:42 INFO - actor creation delay 0\n", + "22:47:42 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:47:42 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "22:47:42 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:47:42 INFO - data 
factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:47:42 INFO - Running locally\n", + "2024-10-16 22:47:44,003\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - orchestrator started at 2024-10-16 22:47:47\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.101744843646884, 'object_store': 3.0508724208921194}\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed processing 2 files in 0.011 min\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - done flushing in 0.001 sec\n", + "22:48:03 INFO - Completed execution in 0.349 min, execution result 0\n" ] }, { @@ -3515,8 +3521,8 @@ "output_type": "stream", "text": [ "✅ Stage:6 completed successfully\n", - "CPU times: user 611 ms, sys: 194 ms, total: 805 ms\n", - "Wall time: 22.1 s\n" + "CPU times: user 422 ms, sys: 241 ms, total: 663 ms\n", + "Wall time: 22.9 s\n" ] } ], @@ -3572,20 +3578,15 @@ "execution_count": 33, "id": "7b1c1d09", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 205 - }, - "id": "7b1c1d09", - "outputId": "9e695b9d-f196-4cb7-c56f-3789251e7860" + "id": "7b1c1d09" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (6, 17)\n", - "Output data dimensions (rows x columns)= (6, 18)\n" + "Input data dimensions (rows x columns)= (6, 18)\n", + "Output data dimensions (rows x columns)= (6, 19)\n" ] }, { @@ -3613,17 +3614,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_id\n", " chunk_hash\n", " embeddings\n", @@ -3636,19 +3638,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", - " 0\n", + "
44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", + " -1\n", " [0.0077404897, -0.020559434, 0.026426662, 0.01...\n", " \n", " \n", @@ -3657,81 +3660,85 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", - " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", - " $.main-text[3]\n", - " 1\n", - " [133.18510437, 570.83258057, 374.99838257, 581...\n", - " 1\n", - " 5\n", - " [-0.051861413, 0.0035226392, 0.030617053, 0.04...\n", - " \n", - " \n", - " 2\n", - " mars.pdf\n", - " 1\n", - " 0\n", - " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", - " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", " -1\n", " [0.07728298, 0.024971062, -0.04318075, 0.05809...\n", " \n", " \n", - " 3\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " -1\n", " [0.1059802, 0.025460616, 0.02362733, 0.0390564...\n", " \n", " \n", + " 3\n", + " earth.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + "
18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", + " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " Solar System\\nFor more details about our Solar...\n", + " $.main-text[3]\n", + " 1\n", + " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", + " 5\n", + " [-0.062105577, -0.0053322953, 0.03127779, 0.04...\n", + " \n", + " \n", " 4\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " -1\n", " [0.0724358, -0.058001805, -0.01977186, -0.0243...\n", " \n", @@ -3741,18 +3748,19 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " -1\n", " [0.091821924, 0.015197907, 0.07716932, 0.01711...\n", " \n", @@ -3761,61 +3769,69 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0
11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 
1 \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", - " chunk_hash embeddings \n", - "0 4 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", - "1 5 [-0.051861413, 0.0035226392, 0.030617053, 0.04... \n", - "2 -1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", - "3 -1 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", - "4 -1 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", - "5 -1 [0.091821924, 0.015197907, 0.07716932, 0.01711... " + " document_id chunk_id chunk_hash \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", + "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", + "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 -1 \n", + "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 -1 \n", + "\n", + " embeddings \n", + "0 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", + "1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", + "2 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", + "3 [-0.062105577, -0.0053322953, 0.03127779, 0.04... \n", + "4 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", + "5 [0.091821924, 0.015197907, 0.07716932, 0.01711... 
" ] }, "execution_count": 33, @@ -3849,11 +3865,7 @@ "execution_count": 34, "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "e6a04d78-b8e9-431a-e9f5-1f9ad1aee3a7" + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207" }, "outputs": [ { @@ -3877,7 +3889,9 @@ "cell_type": "code", "execution_count": null, "id": "dc0a6728", - "metadata": {}, + "metadata": { + "id": "dc0a6728" + }, "outputs": [], "source": [] } @@ -3887,7 +3901,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "dpk-1-basic-022dev1-py312", "language": "python", "name": "python3" }, @@ -3901,7 +3915,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.12.7" } }, "nbformat": 4, From 96d680867676672e998b7e3cfcb0abb7c9101452 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Wed, 16 Oct 2024 23:51:22 -0700 Subject: [PATCH 04/19] Fixing URLs Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/dpk_intro_1_python.ipynb | 6 +++--- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 1049bf8d6..a6b2efff5 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -42,10 +42,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee//blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 6a14dedc7..631b79926 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -28,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -43,10 +43,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee//blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { From 970c22a33a26fb090e737e03c14d343cb0802964 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Thu, 17 Oct 2024 00:09:41 -0700 Subject: [PATCH 05/19] fix colab url --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 631b79926..b39e30d2d 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -28,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" From 4c34038b912eef57c40d781b3a37236a2d3f1809 Mon Sep 17 00:00:00 2001 From: TAKUYA GOTO Date: Thu, 17 Oct 2024 16:29:21 +0900 Subject: [PATCH 06/19] fix 'IndexError: list index out of range' in header_cleanser --- .../code/header_cleanser/python/src/header_cleanser_transform.py | 1 + 1 file changed, 1 insertion(+) diff --git a/transforms/code/header_cleanser/python/src/header_cleanser_transform.py b/transforms/code/header_cleanser/python/src/header_cleanser_transform.py index d9d0d5a3f..711171344 100644 --- a/transforms/code/header_cleanser/python/src/header_cleanser_transform.py +++ b/transforms/code/header_cleanser/python/src/header_cleanser_transform.py @@ -83,6 +83,7 @@ def check_empty_comment(code, ignore_lines): if max_index <= len(code_list): max_index = max_index + 2 + max_index = min(max_index, len(code_list)) for index in range(min_index, max_index): if all( From d45274f1f7cb68e55845a940b33a4b1b3de307c5 Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Fri, 18 Oct 2024 06:25:57 +0200 Subject: [PATCH 07/19] update docling to 1.20.0 Signed-off-by: Michele Dolfi --- .../doc_chunk/python/requirements.txt | 2 +- .../python/test-data/expected/metadata.json | 17 +++++++++++++---- .../python/test-data/expected/test1.parquet | Bin 31246 -> 31223 bytes .../pdf2parquet/python/requirements.txt | 8 ++++---- .../test-data/expected/archive1.parquet | Bin 22714 -> 22800 bytes .../python/test-data/expected/metadata.json | 15 +++++++++++---- .../test-data/expected/redp5110-ch1.parquet | Bin 9262 -> 9286 bytes .../test-data/expected_json/archive1.parquet | Bin 22186 -> 22262 bytes 
.../test-data/expected_json/metadata.json | 15 +++++++++++---- .../expected_json/redp5110-ch1.parquet | Bin 11130 -> 10965 bytes .../archive1.parquet | Bin 19566 -> 18288 bytes .../expected_md_no_table_no_ocr/metadata.json | 15 +++++++++++---- .../redp5110-ch1.parquet | Bin 9262 -> 9286 bytes .../language/pdf2parquet/ray/requirements.txt | 8 ++++---- .../ray/test-data/expected/archive1.parquet | Bin 22714 -> 22800 bytes .../ray/test-data/expected/metadata.json | 15 +++++++++++---- .../test-data/expected/redp5110-ch1.parquet | Bin 9262 -> 9286 bytes 17 files changed, 66 insertions(+), 29 deletions(-) diff --git a/transforms/language/doc_chunk/python/requirements.txt b/transforms/language/doc_chunk/python/requirements.txt index d532510ba..2db4bd1f1 100644 --- a/transforms/language/doc_chunk/python/requirements.txt +++ b/transforms/language/doc_chunk/python/requirements.txt @@ -1,3 +1,3 @@ data-prep-toolkit==0.2.2.dev1 -docling-core==1.3.0 +docling-core==1.7.2 llama-index-core>=0.11.0,<0.12.0 diff --git a/transforms/language/doc_chunk/python/test-data/expected/metadata.json b/transforms/language/doc_chunk/python/test-data/expected/metadata.json index f9658c2d8..9960b2860 100644 --- a/transforms/language/doc_chunk/python/test-data/expected/metadata.json +++ b/transforms/language/doc_chunk/python/test-data/expected/metadata.json @@ -5,8 +5,8 @@ "job name": "doc_chunk", "job type": "pure python", "job id": "job_id", - "start_time": "2024-09-18 16:05:04", - "end_time": "2024-09-18 16:05:04", + "start_time": "2024-10-18 06:15:55", + "end_time": "2024-10-18 06:15:55", "status": "success" }, "code": { @@ -24,6 +24,8 @@ "output_jsonpath_column_name": "doc_jsonpath", "output_pageno_column_name": "page_number", "output_bbox_column_name": "bbox", + "chunk_size_tokens": 128, + "chunk_overlap_tokens": 30, "checkpointing": false, "max_files": -1, "random_samples": -1, @@ -32,12 +34,19 @@ ], "num_processors": 0 }, + "execution_stats": { + "cpus": 48.2, + "gpus": 0, + "memory": 
25.65, + "object_store": 0, + "execution time, min": 0.001 + }, "job_output_stats": { "source_files": 1, "source_size": 50276, "result_files": 1, - "result_size": 31246, - "processing_time": 0.071, + "result_size": 31223, + "processing_time": 0.084, "nfiles": 1, "nrows": 88, "source_doc_count": 1, diff --git a/transforms/language/doc_chunk/python/test-data/expected/test1.parquet b/transforms/language/doc_chunk/python/test-data/expected/test1.parquet index 607bbd21334815df4e88c22a081b67d9d85cfb07..06089be7843d34407556d42f4fca4936d90a6f53 100644 GIT binary patch delta 530 [binary patch data omitted] delta 545 [binary patch data omitted] diff --git a/transforms/language/pdf2parquet/python/requirements.txt b/transforms/language/pdf2parquet/python/requirements.txt index d959b9e38..d90658fc7 100644 --- a/transforms/language/pdf2parquet/python/requirements.txt +++ b/transforms/language/pdf2parquet/python/requirements.txt @@ -1,6 +1,6 @@ data-prep-toolkit==0.2.2.dev1 -docling-core==1.3.0 -docling-ibm-models==1.1.7 -deepsearch-glm==0.21.0 -docling==1.11.0 +docling-core==1.7.2 +docling-ibm-models==2.0.0 +deepsearch-glm==0.22.0 +docling==1.20.0 filetype >=1.2.0, <2.0.0 diff --git a/transforms/language/pdf2parquet/python/test-data/expected/archive1.parquet
b/transforms/language/pdf2parquet/python/test-data/expected/archive1.parquet index 7757d57bb13b5656ff7efb1c5509260328d7080d..9975c36080eb7bebeea022b6a6ff205ab73d1598 100644 GIT binary patch delta 4626 [binary patch data omitted] delta 4588 [binary patch data omitted] literal 9262 [binary patch data omitted] diff --git a/transforms/language/pdf2parquet/python/test-data/expected_json/archive1.parquet b/transforms/language/pdf2parquet/python/test-data/expected_json/archive1.parquet index aa1d5f30e80fe802fe978486e892184e5963f11d..033452371f6caff4728ac9e8067da5dce9532c78 100644 GIT binary patch delta 7151 [binary patch data omitted] delta 7338 [binary patch data omitted]
zQjrV3Pv0>5qBV0K?Jj(?g`}kao+Y&;;Uq zgUbPiSgsJ~R(N<^a!~KQWpyB4``j%B3TxO1U{!J}WUZIeVVYWdFiF#tH!NqFZ*Ba< zI>(Tk9)owQiMGoUM)l>m76A8Of4#(^>9@BfKC!637*8qBB?Ll+O_R*awP8l2M_ z)+X71%Bw09Y8%evO)9gLgo)35A-u`WIU($F@?7hnxa)U;`&|ObGQ8577igHZUAH{B z*Axx4`w_5Lx$e}PmR>rOnRSBweWsV50P3{+kftd9yDfH2)>~XAnjPFPsWU~U63}QD zd|;aoUcCcGfSRHPdaLcUdUP@O4)O8`O3B&1%jB-_5cBeR<#_rbgx3Hc2P@OGm$4TlDvVNV685YwkG(M(wnIlraaea)(_JQsfT>hgKmOuc;- ztW`St;A*DJS@f4D>K#%Z!iBa6YJOvpzo2D5r+5Ter~*=X^F>?OQnt)i_;hUccaSb% zAf?cFXWXI=Wj|+iyvB?_NDq1hB6{XQf2Q;L*iE1V{hpUuaH*j+X7b@wZ+BLxNp*wH zfG|0@9LJtR?M~mwv`^31DA1hBa_6?se1Rg#T1hb}vbR;fWkrFSI6p-j`?UL2>TrpZ zBNL)3b@Xj>Elr4o#}5?wsyB!22QAyi(z`XdZal;G2^Za58FWwUJM&zz^wzp)3#wCd zK$wX+yDCLu5{A1KjeRPK{AfOOVrVHb%Oll?>IEhH+$Cn&pRvt~>V$n&oz4zT@=LTo z3;X~zBz?ADR82LsMi^I%ZlrVQ4l0()olj%GQPfzVxtaL@Igr(@Sen;1PvTx8PvHt+93nMY?pYyk9#acd z>TcQ*U~lqJDL%%J*{_j$t1pCi9awjD)0!E%F}G>yQr6?yH-x^B(FWDX)9=jQtJ#I!-uFvIWlcm9$=&nf`JUHT`|>uz zyWH(H&klV|_H67znlHSldDmNzB#7uybzjeY7{2pCTx|%YH(u1SolLSf&1kWve!41s z<=IvK4dvnRZd{-zp*d>I2HApFC1m}xxS{B{QUJkQW#^W}nt=(rgB~$FFgdvq6xu*lve3uor^_I?C zKhPVoV>HQuZQLW_pDldOKkZ&`b8}bn!r3Qsk7e2pPJYo=&UOe1pB4X5T-A7ZIsHgS z;`FZ6E?(Ha$wVVy3SZG%-OxxqR_gLmU@zIa-ZW=Ga<`LhoxxAu=^lTxm5vwQw&M)y z9bV6QH<;IP*kNwXGM2RS?l;v@Pd|pV0OQ@a62yqN)uV{20@Z$XcB19mkgm6nPx&O6 zXWP9|-Bb^A`(gTi#I1GU;`z`wv%9b8rY$elS80o!8-X@PkF^^7v{m8y$S%(pfK)m_Z#O7U%pRHwcl=R4eAlPE#@kCHk5ehCytukew2)Vq zCZ>22p)_IU8S?v^wqB2sbEl`oWW1B0nb+nO~%8Sw?ci= zj$U~BS|pS0a>UIKipRL~?}YmIJeIq}mZoi4s8A=$sd*#G|9EP8%vo_OEsfaIZa8i| z2gGeyC|BoW-b2F+;VM;Ms=&IuAl#jTc(WQl$jLB*JtM5|_R7LBYF7u`MJezm9MiAy zsm+^ZTcln^O7{`iy6qXK0$1y7>BLhNSG-m0=@g1ia3U+O@q;l(TX!}&9@;TEGOn=S3K1tNsXXvBq^&AGMB3Z6nLz9t6wDz^jmU0qP0aLL65}H163tsFKz}D=)UV`!1pJ zHSFuO5wNRMX(oq>P2fn{ggmPhPDTObsMapl@I~NJGSMQ|ywBr-@tyo%29voj&9*?U zKr8ZVPd$$`GFr-i-kmv-sft31C-``sO0ONy5=jSxbW4=vs97v;&|V2>9YbkPEnSPQ z&x%Bw0Np?ydSs8I2G7j*Y!mHzt0uRxo6&87qz|j}682SHHb>Jc;i8)~Z?!F(XA0Mk z^dSSVVX%q;*K_RTN?HWZyW(=L){ifinmfS+H&?6XDbe@0r9G=btj`p>(c(|bt0V>j zQUGCl6Cws6#0N>#8Yy3ec<`isOfpHKl 
zcW8ec&l9EBIvkaEl6&eM^`MeF)hb*{m(0a$^epknnMyIXbctXj`r_PNT6RNVJlGJ9Y7e=uGN0Yu<8Ef7(A5#kM zUsQOm&)4PE<6QLGeI)tQdzw`)yChY!zZpAL^;(9dBwyI*3P;ZNNmDcC+4m+Gq2*?(RbUKkGManZUjEk}iF@2`5biu4U0e(j--|Diq zuQ5BL!qJ4f(@k63`Z(~)dnA5QIv@|xovK>V`NA|Sc=*Ofe1@>e+pw6n#CowSul1D+ zZ6d~+dbuBnB|0NWGyCXzaJuLWu(OP{l0(ez_$8bW1zJ6&^U_q-eY9B*In_Fg-% zLpNoc@$e!`X`J5IB;81VTdyu>>(SqCb4(C2Gur3YsBLt$EZ0$ZM%n$CZiM>CbF8p= zm~?I?fCmrH9HSV%J^ZxQ8v1>xIM*PdO=vPk@;7}!sbkq%hKMnjMkkAY*yWuUXI!r&_az-##xN|t_n!_}mRUPi zbe%zBKI1;TY*o{QKm}~o*+X&FU?|4YObc{LwL9^bx$^AxO z&%)Vz8{;gDim!(<>T{@E@gIG=g*M6&T$q;O7}2y}TW~=ac^kuK_m*;EWUqxx;!`3FaD;ga5!7@P z+r35}mtTh#VG$7&QG_rbR6rseA{|rGF}X%c5}{Hr6e2nYfTgCLmJsVAj>B5P#y z-UFCKtjh6y;*}$Y%w3*@2py!av=3;~*JIrm39(}!FeIG5ctl}ep=~4z{F?%Rp?e(z z@AIo<&-^Rj%Xl9~_kM^Js4cI(UNAjn=X3&AuzZ5H?&n^JCG26`V06Z_bu&COASIzvpu@x!5aM#!+~+ zzAeT^s^@U9sum42bMR`yF}V-9x4)l1{?f4V@C}QATk`7ZEADY`O{)wx-hy;N^c0K; zm4ZY$Q%D#x9I1{a!(bF=6jB|Db|xa!DHt4@f`Y>_By|iCg+r>NiDU{4j-%ix2&@wc zOLiiWVQ4XJ9FBs9W5_TZ5{ZVZlTjEP+=<9Uppg_5%2}O+aiU;x>R2ZX8i~Xp(O3$c zh(^LmWG81h7R!pJIPHfuimr*3qTc%i0KfRKA6D)E1<&~Z=-N-uM)60u-@=$Lc+78< zpYDkilmC4y#x=_SH-#vR3fCW{^j8Qe;QRz#+FTOSF%AGnj02>^{tevX8lm5!$@4DYTWyTb0 z5HaYqs`WDpN+R=GfA6gr7gMp1*kA2qe#ZFE(tYQK=jooumEp~E0PxD}&q)6n_Yc^k zcuD%`aS||fiSA(m=1f=wfE&L8==>V8^m+@yngR`9!2b`A|A{&M=&uKd(4U5e7AyGg zr-t@_cSjat1^<0JmIH$Sul6gZxBt1;Bk|9!a;7K$mC|I=|FSNHxMx2BMcKeyqV~xk@0m5Ly=&H<8D_Q()(#C}gUCa4 zwnM-WkT3`&->AAGC0Yss34%bNaFTDR1{#f0CI+BYLVf)pAc!yojs#c*Asi462~Ci^ z;IBNe-zb10rdDoq5C{xZz$`3a5Mqu+0>USz0pbu8gK_zhDWqUR5DCHu1w*@7pl?_Z zo(QhLt6vNWU;shDAmAnlJU}3zPF-e!%sBlF%bR;^R5d6Ld|FAZHq&g1Bezflab=0} z82Q#p80&ELd;q%$o6qA^l(xKV-mu@}FXXA*nuM*d3|<+`Xzu6*mxA-b64xwLHWiKJ z#Bmkw^D!{cK5~zK_n{spZm)YnvwhRs=N<7M7xZ6eSDJ=!x!JK19MtHLbp5@Bt<6w2 zLXXPFyPbA+eV5gUvTRuB%4=cwm8^Jda?*Y_sM%^4e~*>j)>!aLsw`*qb%WW!6B3vJ zpK$g##iUyi;cv__JP*RS-ohLEI4^z4$bI7w96m;?EYF$R^IF6*>-vVzu(tU8@yW5_ z8D*T#28nrD-D1O1R7Tmk6!OwiK04HJBP!s8&6}8ImCbv@AnD2qoq`4T^GVA2By@6p z0nIG_mXJ(b3z3T(tadrLy!>C@P-0{t|be%G?Ttj%(Hkrk$Eb(lj 
z`6($vogfc1yXlS|Eh$+!(42Y?e#=5s6e%Gx5(ZPLI?#3|REmFF0sl6^*1AwVCfv@ z6>hbmQl+=QC@#};(#{~IpF{~7z^WRH!_Ek@KEkMdgmG@Y(p;uWG;r? z4$C}B>UWbcxkGJ#ZeHNCVH>`Gi&lce&A&Y+qYm8(}>pI6=t7pA6tNX**{RRmDCEYhHxK=+V zStxk;!j3Sc#NY^5*tr+6_A&=0z~)p^98v)(v3k=K_2#7u1a4Csy1i=@OBhZ2!=N!T z?`d6j4$`)7)AQInb)hp3D|Ok#QE&1KHUSa2wv zuHV*}y1cx6*u%Gb@Nw7BaXo+Bk$hRs_BLn|o`;0bR93>D2^sEd^X!<+P?BC9Zax#z zv~tMHV6aGKyL|Lq53>9L27i5X6vcc~g&5(KKAzF ze2aw0`l%SD&7@avtlx|%-s-TI+-qzo;(9q{D|tQ-O4j+Hem^ zQ5d!>;E<7r>!THj1^e+9#0)`sMtD~>HHj8JOvI>c()uL8#yov!vSTHZG zV~b^7Z97V3Ls@MjilXc?|73C>mUXZD;zCutufpjRQv$EsjvG#8sDmv+2Dl#Xt=H7Uk$JoAPp9FRtqiJ9D@m*(HuFt(B z6y;wUJAHP!mbzL2=inKlR#vAdTVfyixXQ#1nV3UQwFr2ZhLtJ39DO zebzxf`pwEKs|209UCL+YHec*>>dvFOAPWQnc>by z^BU(jdhcj9$U?@}yq*l}D^m2=(6jlHVu{-rjJ(ueetnZ2H5_i(*zZN`CoMj=pz1y` z>!umt+u{NHM{hJ248HpmtNpS0_{s5ZNIL&!h?fq|K+f7i+P$bZAV4mmV&rs-)6+?* zQ(e@pH4;|BS5odYJc1-GIqAT8E||Gcx7Cz$updEl3wpfJET%0B-F;f~QisPRWd`WX zO-@5MiH80+ooGv89tImri&f*T^GZ}!v#t%PiatJcw?F=n*g9ldJ&KQzLGEVd2>{W- z;yi{oNT_EsAdXaSaf#Y}IP2=>6akHudWuci1suDLUb*45b{`JQt4lRtaApILFX@#Ltvg0A=j6T$I8V~!nN z$u5)jTQAGK&KWabnSW0o$ys7yj%=~l$g$1fsJd}cV7i{!n7vSc-;v+=%i&*INnr^DYWe4EIn)K%Kq z_5R`q%n_EM+Dbf%*Ou03Q;^J|Wr#16qbAqa`YB{w6a|JoQ(IUd4;6_ZXDC14O6fW6 z-Jk4!BL=yI9X1eaHQdi7<1OEM{5Ib(Kk0H-o4Bd@qdo3&y29+XDzA-u=0D6mj8VJ8NFI?MWja_o!75m(Kp~62Zn06n$nuD%G%=A?7}YF23uh56_rKhKCq!k6 zLOPp~*j>A+Yc#W!MNr04u<>&f=}tarkoHhSasWP9coL-qj3g#~=27x;qQ z&7GwOZtf{q`dsGxrsA`%6-%+A?IAIK1>Mk%lPR5?Ne0i+xFK4u=Oc5dX+RR8nY2NU+aJ@SxF&!Moi9TT*PDsLPt}noVMp=QmV;HduXF zb9m0n42!(ywNk#YV7aIs_5Q_d)m+Syjs0lzD$QzrU4dFdp6LVGHX-bKOSrh3KNV>a zSS?4rpWlcaHY%tWHGNR(lMuN;6UqB>cj1%NwK}+Z(AY(_4COCZWzMdS2~xaZ!J8SJSNK(;G-$J zA$_BMc7DgoVuQ-!h0QY#9j%Exv+Bp;F?!f+Ngy;X?-%HYHm;Dk6!g{A}Tv?fwk1 zlX5t5zJ)E(>a6!I%V*P<(~qo|9lRJdq5tLN2Bi)uEz zn~C0ga`%E?fa3TvD6Iz_EJf=hRonxlvaTg)rR9zk8kT|6T#83CCH5C;pkc=z%tc-@ z>zOl1^O3Sk*zBz3Nz%)oE6;!U(5WF(iw$Exm3n!;pNoTY2qq_HzPCbiB-WF!xv9MS z@m|~$Zys6i!%~g~j`aaiC`rXX_!5Uz0%!DLzf`W-CfTYYunl1cp&8R%^?d}@zRSXtlvU_@1&liXlj_wIxJ{iieRrbA`B 
z+EkoiPp;Z!r;>SjH6MN8=e|%2GmRROOST;n4X!!4ZMp4<>{FTZ0_cagiv(e+J_@bd z;O4&NToa#~OoIiw!Z!!JGLJvRuFkShC&WJ(o_|3S1(DPe-r8d#s~izNW!lIqEq43i zaf!iOGqThsbaoZju7od`q?nk>IWBk$JH^gjwrwe)f4!P6mp9#A8E>*kdFSG4T%8WO ze{ZVQM&1m8nR_1!hBZo7TGMVE?YSe60iJM$XLBAruO9VI#OScj9vQv4w8;0foT(nY z)o<}nFjve%dJ`JhlC)wl2lXTq>`=FNm%l;!z2w^QI7;Bth1+=!VxoUwf|`!2^2}rl zYMNZTbb2BP(K)ReCtrBV3UxOpIVuL{odkEFN2Gid*WF8r#PiA?m3uw)F}Pkm!}AH( zv6W8eWIM+_x%`Sd$2ZwXh|xBo%)aDY8VtlxGEZmMqdO0b!#g#D3O`;3XVnVnyzWiO zxFXzqgUyr+H0dd{Fj*Yy`|#4c;^7-n-ELA|&#oWzg$gWf>$;Jqb+x9DgFK=8>SSVb z%}mM%ET1%slIsNM%Uik;X2@~B2`hcKTu^KIdtFl4e0VzU=%JwXmxk#n>|BoIF>NRPgRcWru{{Fw{6{Z-xe zpt*Km-WJW1Uv5cU)@m&-D}2|f^SC-_h~42Z_B`8TY3hBG3j-wDsm}zKxA>ULtM{&J zTMtD(s|Z?tS1xKeUIz;EGF2>V+*7Nu^|rvNWb+VP^mOlXlTD;qk+a(M;oZwS4~@YQ7@4&f>voU!%BG`#X z0CxYtGlfV)Z4SypldH{!fMqtJm95h)#REh8%V5&0&%o9VyICzR@0RAV zad6Djqzd`KF^XO4qe$q$`Jy-zsba)YZr&&3bWc~dqQ~wx-gINbKVJXXNXXw9(p$u( zz5HZ<*#(T15l5~0K8rruIoY{#`DdR;WPIqOF{)ClU09#?N)z}j{fzoq0AN?*JTZP-nD@QR4 zkO~F!h#S4U(j9-(pnL+O@u|M!G>Jo|Of;9Tu&*{#si8uiYyXABY4MAtj#P){L$6*6 zgTwb8(4?gfPg{=DO&UVXIuRg^JP?R>tbqMYqjX?QpZfDF^1*!(th=+`Izk?4M8F?U zIHBFy7rjk$loO8$iKK~+KJ+6_Y%G<_6IH6=^8zvhL*w!>*$^PnZ6he!$w0hrPr%@d+DW0LyNe4=EL&5be{l2l; z)Z*2pf-HV59bGF!DU~{(DsBZ|!}Ij|)_SCck>s54?M1J*EXF4LJ4=U*ok!f^{eDR; zD<|e2-|9JGv?cpUi)YV6nSo{XvJZLQ@Ja=ZGdM4EvD2S6m0W}^1SXq=QS0(q-Y2T- z?3{0Ya8=ZbZ3Rz?-6hu?wS+^O<+Go@&O+&&u3x{{T3qQn;D+^#ej^&{5x&}fOJP^o zRf+kWZozCJHJ?0^gGJ!;D(f=s_@_C?1|{c@7qQLWD~tEMCUGcq7Y)^VZ6`^o$aO0n zKFHo_eE=_*Qe@APk+EQ5I>YA0o7%>^?z-2RT`xI>vH$XR{+PKuD2t zO1ZvwHl$$F%vkg5C^yjqk&pP{>FUEyev2s157_?qGN(i*aW)}{(uG5>q1|xS#E8Mm zw9`8NbiHG$yvEz~mAE&`(24ix@V#ltFkkd$} za^#(Jb>2-JtP?t8LlcjL4ncA?B5qeS6(oK%-a|+JG1w$}zs04io*aXC&(-M|#>s6G zm=cNJNjWKvAHw-X6W@L|DM7`SKt1YG0}wc}zOvECFY1t*sWn2M_X!o=h?;V`7Ss{-&ei4rR3P7pb#C(6SeCoe0sF_ zo!I;)L~azD?)uB34|sU$ucl;DjwrEs%kFozS1q`7Z@)jx0`W{|skn1-Mq$0I)l;7J z!v@|eJTU~l`{H(eR!LPK8=@h8Ap%Wjs7(&1=-*H}hH#yhsr`=>?NN^-6I5L8{n>@430)6I;Xmdr>vH}~*pfxv( z1tBs&?-WaMVL>2vz!Ky7)dd*W_m&?eC`f({Dj>YZ#`lDW1cdF|#y59xRFGFF!JqWi 
z*CrO|2n(Xi9g*%X@k9BK(qEYQ+mM)e)`axE1;+Kg#IW_f{DTTm)_{~r_yLKB0V$H; zL-`Y?X)i>r7l7$^jelY1Z(~yZrt4`pAt|W%L0M}j-fH?Bu$bwO4%2!?W4R!Qxl_1 zz-VKX)wgU>M{A;d)d@bBUy*BqoKh7l=$0BP?EIU*;><3Wkn> z5$XPjOh3jE&fhZgH=S8CcmfcqKLw8y0|K!Eey@FE#R`yNjN97CDkwjYv66O0G7-f= zi&+qDfrykq2B04r)^sP01rq-!f3`KUfQ*v(*0pMIQY#w03(St{2%TMhBdMXcPWxEq zLi71gpHnQ1q76*!IznSFI%I8se*D5gVIwZ)f!41FZiWODQG?{`i_+B4B&ur>)X{!u zbqzH&f;L*q*N;Th)JCgo`uJ(6`=PZ+8h%=SBs5V|LxZ%11Tgxjt7*XgV-0Huju!SK zaBKmDhyZ5fCRw4gtcccNL`4uIbpNsK{}0};*=cLe00sb3;QI9i0|w-8Ek8<7#tiTq z6%bxyxSWjoF|@EnkmJ4{dwv~GdLt3pk&MCrH(h^WVNEo85UW3BpyYcHNE~oR4W+h4 z892{HYdfGcbx;_cEn8I7)U?z!QNLmY`bzw-xncjNZEeg$u|Kkd- ziMCIv_SbISzXx%1_t*LG?^#ScLKuz^F{uu zARnh&6@xnL+5!38}LEDHP+F9GAj+Syq<>wKLb zU`?E=j7uA9jKyMYfgj_t#A0KxM*dh+EY=@u?9Zeb8DX(I7$tzwobd$e-=9D&hCv|$ z>{v4f*(wHTHV$NHbj6xrvD<8cIv@|!w*UpKC6k|7v}V+qrnd#E`l+e{6(b`%CXs2G zA^7dGVLXXy7&I}+isGW?8tQ{{io!diDOjtW7yyK$YV=om>@KE62;=^fJQE)gXBXt6 zW)`Ea9+xhw8D3Pu}3xF4J9M5=uz5I=Azw7%8f2jnRg4+Tx4d#`I717k)D)(+_@QCcj-YUfn6u z)ma1OqDBeCTE_wSv_lxF1F#!?!xzX9{RtmaiBGUoq>CCLkH%1d{$lDNGIU~%@!#aX z;WY~UMIZ4OeL!Afu}ppHb^-XHT@;3%I2%BUsqZV&pzn2CbEbU@hR{#=jTuy&2HM97 z6N7j5W$Iw$`8H-g6PYgpQ-qQJz=i&ozOV88Ej|n;93!9XD8Mg{QN`|X0G!gc#TwZ% z^x=SBX6R=IhuP_j^0)qBzC=I!$7sX-l-rCt&<@1)U!qv7zCH*h6-o#Xiy}oT6GKCl o3E|-(F;c*iIf5J#EQQuoL93v|dV#gx|ML%|4t&^V1ANf>UlB8)ZU6uP literal 11130 zcmdUV2|U!>-~Sl<*qPB_vP9DmnXzxlj9s>|HYCOv#xnLTTUo}^VoPWt`w|rr>L!_x zqLNgKWNSeRp}qV6&Xix>oBMn2>-GCT_j#W4GT-kxpYz$?pR+pOV~e%Jz}R3Yn2se3 z3Adnyk1PUd41*>B)XeAN_vn|-m8wP?2!{ihJRzVmCj02$oLJ9tv z2mT8MP{h>AZ4Lr~fdZI?1q?zivmju+V(K6cK`}U&w{HMBkVqxNctydYy)2@ySde>= zTz^-;m?D4y1OtPBham6*fq?R45*Sub_!*W*;|_8Y=#q)NhOVGL63+?n0f}wd?6b}E zHrXcMyR?cyjautEF}*JRE1zEY?K0}Rgj=?0@iywweEUKFt16&*zepPEc2)(}>~b?H zr9Cx^dn>btbY1e2J$>_O+C0~CO^An+49?tE>`e-D2OaD<`4Ms~=5i5q`sMu5#`ZCs_$aVJmU4%lia{J-kjsvRCwQ~Jq zB?RPCoc!a+)8Potx0#j~<6aE$Y?Bt<)By6^5#Odk2^r!Z%dr+pSXs62?Tk3EVB`P^ zxE%H&T}s^~UgFG)ZBmhTPDA>3lt)Fpwx{Yg#(b!;-$}c3UL>X~i|?Y{``XqJv!aog 
zkhl1!I;e=4fS0j5s5;nk6wftLo@>J0bOig}%TJ(S`gmSxCe|%zhf}}x%M8Qh9l^oD zGgM7#>iyD!f-zj3JeyaDmZdp7RgF)V-vnG^JKI$vtTI7QVb_&YliuSb)QLoMUxTv| z`RQmQnNk*f;werd7RQ>@*?>bEIS+90VjQ_EQwW;{YuuhK?>E@xu|d$3723st;sTYi z2+8l3rdABhwg^irno$hG*{P|m<}$-H?hSC`dx%4lx3o_pZ<=89NWQ(Lrh}9+X`3k{ zvofCQ%4?`gs1i8~(Ny2uE!yJM*gkOWB*Cl9Hvc)E{VvA=Nt#08bV%mrx9P`mjk^TB ziG}QgS~=Seh{z8SP#JzhdDWXAdkJIv%Y&g_jp^A&Iky|pLPL5vxu|@yxdW`bY;SRj zxbdJOl>NNo@hyIZ#)G+u#nTk)yq&K3Bwx5O^q5?_0(YLdYv%JCh=-T9lCZl zYe=poO}p^A`KC&3p)RA0dz&1gwPrMEM9Blu?L+2e4$yRtE@+si0osVok=x6Umoo_s ztFV1y!o|bKl{uzTGs4xm%`$snf4Z=43XP*{YTl1{5F!IFJ7woYN|#AeFk;2k)d}$? zQVdW>ZNOYSz{mwF225L_}E51 z0(Kz_p~oV%sP6Kxw)DDCCM_AQNPzbsc-1-#_l!eSS0!T}X{$><#G`NAMR(I1XIu;>=$6iKaM_J zh0d})lTFxB={W>{JJ1(z=~m5U)#M@mT#?r>`}u6dr-~7;>H9pJ=UaW8&-Kk$513?) zNR%x{&eqXurOI(!WvfaNy#&2()#$R$Be)`QADikP!iyoTsoKT4iOjS;Ll2fLVJHz&n|4pxtG>MhotnVNOJ&azv1>w%HCLccGrUOBAxn$Qu@ z3%l^9cc`?$XXAu^2`U437wt_P9#qTLe?!5)IT<6hbv`<98TK4DJLH2u16j4QawX|M z;&zJ}86cxB<1&X5&2zhm)<+tqgW>n2xs7rfyzq*Uu)-91;TE-92uux|Jgc9>`{5(a z3%4=Rp4iOk49Lh1Rcv73QdxP-Ii(woc*&!b-j=ccCK9DmRWX72&m8n?=(ZEC)tq`l?*>Quwx=!=_Hmz|>NAb4%9f?+ z7)Ni?mFc~!8_0gx1|g1ciF31dbu$j}pVPi~b7uOgNaMLf^KrH#ZkKZO+T4qz*ij^wB$*NvuA<2R8#b=4ty?I~p9?<@}NmlyI(d zpt2Pz{+s)qw^>r|M4ji>XpnC;J-_jk%CfS+hisF%kURU?ge0}>o$&;$fVRu{OnnRA z`53#AjuMsSY@LGnk0(Dl;4+9S)lYaNcx}s9;*yK{t|#=}kKIbygi9?}Nl@=Uz>?&z zo%yD9H)h#sz^0wg?38tGcRYEz0&2Lc&?^T8>AoW4e3X4cAPetGS}-MKN<}9bh2&Ne zMY+JoN(-D#tZgMep3UhF)JaUc^9G_Cx?t2?Y?p7dWs9rZ(r{aldq{I6*=T@O>vw+Lh zyxZ-2BJZIFTW>U+ULxxb9Z+tMxEu2@%xo%hPj@3 zqqo1*z02;2gV$FN$+QnU=IM{!ZaI5;DOS56t963~JW*S;%KW&jZt2`mq~FQzNE?@J zVj&iJ!#3TYr&k_X;ai!>lS+kGWMwknwym0jMv>-&1GZ2 zd;L(WbK%B|?d~mySIv3a#}pL8Mc+~FiUvnGT{dpGgsb|RH(?NlvM2*)PIQks!VJU; zM~x?fvSKgnR)MFFgE)kc(Uhy;ju;sm>a&qc*H7!f^n;gQWNGTUE3C9(Ms!u8rk}f8 zPmp%FJ%l&wJY5O>0-ar6)rruE@PJmyL0nM~(s8*&MMQScsr17MR=M55v#k~{s?!Q9=iU}qRPrCi_08}Wr;h?+qv*Q1p^=W092=0p zcEHFRFI?ekV97^i?acO?P?L_yb=s0bd!Of$j;*Y(hd?`-ToE=JNmjP`m0121VxD&n z+N-g_t5ckvxIr+0PUGfmvK6k<;&p{m%q3)o*;0v;qA9++n$Z2eaO%wIj(E}QKBjGV 
zr>@8k*vJ>?7~58ky4>k;sBzXwg(XYr-4$KE*OOf|Q10aX^rGH!E^%jiwbSf}_gBY> zNn7~&s<^D_isF%qysnCT{&}1n0wRhFh5PstpprbbLBH;{H`NZYwfJhM)Uc5@`sMBdR$v;nY()2m#uFLkh2a4*B8kJ`{m{tE>O8M!GHx|c!{EG1VsB}~b4o?) z#?~ZOQ(+4*0aZvNz*}s^AD}XJR$m#N>J=zXpWS^$DWOfk@n*r$+r%xKXoc}F7b|U& zU!-cUXh;?t`0J7JpYmgF?wa`+Zatq{*@ii#`5x~v?!DjP>G=YesLm^U^mpa66X4wh zXsWz$l^JxJt6Z9=2MsZV*5k9&0(f7m-J%Q1Nnv(y^zz<*CS?fE-r8IdKW-oAwIAKz zh&ILRC>0#yFnqDZ=hk!a)IE_jjyqw7kCjW=)LlOFgjS!IqrK3NdNYIHXTAWq4$q{@Lyg(}7}y#cd<2^wrCBy;GZ=kHHQ^4a8oOAs z-3ezxis(MgO>U)AGtNQ%Y2`1biW9MuC@cEHP#Pj6Hejp%_ID&r9?xCf4%G{M@maQF zbI^~hoSa#E3tb#PCba9(@*D2NXR;(G)_}wErlGi-C>q5}&T5ba%q`3@NfGxszRDtO zAf0V{5{c!F?bF@;dj_4eEk}Hx>^_#J zyO~0z1~uXoP@$6&`LfqNlJtHnU9!o@QhWHZFfiAI*1@~8A;364umy`V+4~sV+A}q0 zWoFFd&+1piX^}JNZFIF@cT2+#GfUjvBlMngKCu^B*nDP`}`y#jGPn0v*AnpKXeM-VXOYfpuF93FExyuN#2)4akp5pbP%s!>b?Ygjp= z;tl+IscFdldvF zSVmðfH*nX@SkI<@ZTUcC}%(>ge-T+7zz?Zj48o}z!pLA{-1tBHg9U#QW}qrQ1PyHwo}8+I)OLN1d} z@br(?z0tdE_^yVSRUNPF=`>+Ac+IXBe~RLCdM4+Xyh zapV>A-F#KIgmx_5u7Q`e#Ne)Mq52q;JdAEU{L*$~g{LC8Nlv!;NQIxl+1(;R`k3@LMli+v?d!a&t)o)|z zoOQCM9A`!@qQs{-N1JWUYYj7ahuO#)`;iIfe!nO*pE!5*NVE36bAg9z59i-6u!CFk zIhofuT|u4bhzoiflz9U@Qe^X0Lg5X@T;sb6GIrXfw!~xp49lYgwd>8(z z$zj+=#eu#=`E9f~l9n(1$$qGZl>0f@iHLL`n}zouA}M#!B2;6&BS@X&CT7QzL3Vmq z)M#pu?Lj7tn9YTpMCFxkkx}pO@5jYsT|7j*B=~|8B=934crkry?5U2@zH(uYNt^33KU+b{NTl1bTcZqaQ@{X{>!sx!$YHZ z_P&3_9{lx)Rm#S>farqlH{ZS9k(5qNO^jWn?SaNs7>mtd1&`?*ISuz&mMz!3=A#8W z_Iu@`%i+lfqT9yJx^p1ZhaV+((hBZ7aK3kkJsnhj(yy)hRw?bev(Hx6 z=qb}8&{j3?BZHcBky#!YqvXUw?`#vWZAL%kB+|?~IQ-JsL79+rH3RdEH?#bSs`-10 zr7zfnlm<#Z8V}nhcxtL-sB1(zrY>BG`*cI0jlEDR&(=BdT_=c3$YnSTEfMAzTGzYW zo@jUaNzT`mY}d)|(czfDmUL^1Xd~N!uL`DtPZPohpZl)daI(9M7PkWH3Yf1vPfK%{X379(nCt;TH(a zYGJi4xi;qt1x4(hRU_M=%xn3TT&sIM%(O_0aU3%V*$Zt9{bg)aSp{nOp^ug+b`n6YgI9>P5J}V~L zZPFvmA;%YZPU5vX_u)qmJH`v#l(-wY>YjQZ?|N(sb9ld#w6eWUU|@sOseN@HpY-00 zqczHhE#1W|n+)YDra(TMxi9C#>;*jQvhMp`e!H8y{6&@r2$MZlM}M5#1j#+c!A}k* zr%ZpEL|;N5lllFHvS-basvbp4bN-2u)17WC%{2yd=Oy?a@C!_KI!l>2YYX=t3>rt- 
zJDPg+Bq}e2>L-4DCHHs{`?5&dM6mtXCEX2IRojeaRkz+1@H|X%gZMkBN7GGtOS6l8 z;R83S5sCw^dX`4cS`J6!&zv}Boqb_Eqf;{0oBZAa|C)P+lgOnU+~&Gn9O zXbjyjJ~CVkTx<6H-NoiXzsP-Ou#ZM{oM^j-^Zv~c4YJwE|Ir?ev!XS@(7sfeCPs^WkA2{X=FH4`l+b2=Gw%x#+8-JbQT zgj|YvIejwI%)(PE*59mOV!w-wwDcGERc~|8o1>*M%L-vqX(@M{>Jj11wz*oP{Ro1H z{53I&9HU#8csH<_FXH)%N?5j08XbB~`ZG+S3iQD{VH#bK1HrEvxaZ^$G@3}iS6!~& z<;#Ab<5Zr`$OfxZQ_TlM&4aL6XJMeO6?f32^y|U5wV6*VlR`?SJ3$Zc&liNFk`(NN z&K2t3dQCYRfzHSZouf#8l;DnYy~H4yo*S;}rdYySkzvflCI(Oa&vCrt@mPEkEe z%D!^C;n0ISOuf@JHQxK6Cddjl&z<+X%|wAiX-~Odig$=y^`GSlih1|om_DT_vUK|M zsWZQxhXcn_>zsA>yZv<5O$rJOCkKXyG0(cMCWA#!x*)&qMmDel$K9ZHkFo_}n|?mW zmgd5OKTFWtPt>u4E0m?d%dPzSZK^Tz2iJrjZ&5Zvy zOs74OuZRGqUp4-PoxhDq^}EVnn`x~E&b5|f*jmf~pstT;!TSd$K?WwTAd)*dfK2^Y z&-#XJkc`YD1IT{W`xn;MMP>{0{6hysVOcGFEbQI1^EfXfX@& zD-}tnG64P1u&z7lEU+Y&KR9I6%>u%)@d=iw3p{OBklbccuQ>7houe#OvIp}-mBc^H zt3!0RMnNXhat_u>9_6pDfR1&}lS?_(R1JQaegOLhDr9dJFKrEy1{v+Gsp{oP)WD#% zRkhGWk|s%83*(JeSJy^+d#P!tscWK1XtcMtrbOenDx;=W*4?F-|*UvtC%P2KvK^WuK=4n8kpIN6;@3W@LqK5sL- zNeWoB28?_VgM77@(M|r1`hRPB-C(TOw1ATytpx%Gzc%e!4Rg6-@-PDbD|LOCllu@b z$oKVK5SV7%f$c~nIW*io-1qzE_ixSIm4Hl5__ic6ZTy$k)jyA`A|K@oonHxXZ8^WQhRU z;2Uvm)3-)s12}(duYmgsK>4S=$v46;3;Q=^Kd)$O`WTiO>%q@u<{Y<1U>INy>WV=UGP50n9UpsWUDusfOj%&ZNg%rtG}uk5X?3>0uUJ0_87nIZT+ z*)rZF6>SWOY8|jk#Uuk*O7EL`H%6WIh;mrcP>JSR12VH)QhLMG{nQl9&5U9}s3_UTnfD}{TH>A|HGTxkN--03Z6MjPm)l?nh>7*S+ zaQ0&AVDxh%W`8Cz2Le-s(fxsh`Io+L@%$w|3?@@Xf3gbzeld(9*2)2JN(+z0;Tif& zfmUYdX9kDa=#2dL_F)d9pZ#Ohnf}zb8D*dzi0i*ZvDod~L2&6{VrWPNIb4Yp9IQkP n4GoHt2DZ&%zCnS~7>#Y1ZD_H6V7K>Q{(;mXAdoF^>-PTv+sV&< diff --git a/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/archive1.parquet b/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/archive1.parquet index b67a3f5c2a22a518b3cf9ba4082e7f5048f94fa2..58bcfcf6f93e90ec149ad8b7898db02bfb0a6aa3 100644 GIT binary patch delta 3996 zcmd5;3piBk8lE+7JrWRq)q+>?Y_6q1s5O`*oM z*lD|$ODHv{v|SP*wM9|wIxQD-;v6V8mBiGPqsYB5W8Ka|qCI7mmAa+JL9 
z=T8={g1V+IlDf{*3U3DQWn^qM`eEIK`z_`&jYS`H`*>TC`Mt9Q=F=5U)uP-VJ zJYdw)<>Q>7TKas9<}u~_ZNYcVP9BMBZ|1L7-MJKr$S!IzxVcc`(d{5rH=&cvoqQZ` zEkYT{x;9$kALZQnpK6DY1wLWAl|SiNXp&Ct8+HD;3ICwcK2Xpw~m zAIpk;kmkS@nH)DMM;t3clkCk;C|or$w)^0{O|QH2}S2M=MmFgUu~Hlg@V zj5`s^JhT+NXzbDY{xuYEF|{vEO94AjSAy;S!|xhT+R1UidhJzb>*Hs z7*WF!yI^cOg+J1|vrtI{ozgt^TcIdT&D_&X$IKXQKV1)c$)%yS@9%eD1!Ch$;>2=|03WYrTCYz^N%Q zHAFv}5SJNL?iK-hc}Bfi+!YmaPxh74nIb!&(J{4#!0ncwR5JEw*vDA zlJBVglGP27wN}l@RA!~KAg71Nhq8$&-qrYu<_Jaqbh$Q~>}*^|3H}?}cIP6=$!tY66cx}p1{-dc!^Fh(3JUe< z&DV&8;gs1|InL|Hk0oh@`Z!BhvA30$-npi3UB-$@R|<4K zv(ahQbkA5p(qI5TIW@EIf`Ay@*Gsi`kuqpd-lCiwY3=Xs;?>snQnQ$Ni|*pzrhc?< zq0?~jNxSMyh3ED%KlAntI>qCT{bkdqFAEl&);m}b zMmn;ZDfKTh{e1GqYFbSua^6-J1+;rUa$pa*c7^{fO#e0rum4Kiux0a_p_k0Sw&RX)71T?tfLhovF*Iu zSxkgV0^%???Pgk~y{Zsq-MTclnHl8To)NvXnT=W(AoEwoLddxG*?ozN2^(#uH0rtP;g{i?*Vl$iI9iPDTGY?}a0SVc z4=*Zg$IFS0ORFdhDCxGnrvkcddW@g3*}`nh${<&+{qsVMv%0&ZLSh>ornKH75Ch&cbi>>YBYOkr;hxBnv~SdI6}Jq z`tc5#MB$_FholSbCH2(b)^URR@{vpp9Q%q&z+BEW!ZZD1#pz@86K%IduXZD^iaDgh ze0PSd;y!>^dDos-$mR^(-KBp0Ej)rRG%sssBhw}GQuA3zS7VxgB(~v!W=jY^wh$gT zN>NRd5*RIJS`zN|aH~4#%yMSH)`y*hweoRLhSVX;W&zP;!pb-y(Dy(v);z7jAxj$B zyA6h`t@OY0Rx|=NirO3@JT^7ZaqKw$9eL#gE^z%+M|OQWnjQN-!Pc+u)V7R`Rfgv) zOXg$R#n#-wo9U@H8?%N$OQWm{6WpDNvP>6D@c6EDrRmu+RmTW~(o^fXq)VIfJSKHH*{X^#MWmvZ!O|C3R#8+@ zOcYIq={}OISV{aAUZF|dB_v-dD`BRA!(se)GyFIh9Bz$-1e9s2C$wr6rlQvLwHTC( zQiR|2g!m|wT>W09c6Y2m1Dh)RC-aCSp#8c*@cwLQMi(Pmn+0IsLC~}%9oo1$2%eY= z8Cq$GJvk0wdB*`LZnXiVVughC^upnd`OtAI9nrHV0IVPn>d~|2rUU=@`+#ws2Vn3d zX<5FuOkydr3Fm=oxbZ}pDMI0kQN(ca+WR*4rEivI3ma<;b>H20uE0G$#@z_WDqGtHcAWhB~ch0 z3JpgiQE50Lg9PGez7!^oLSZo(WEPvvX3(Iupc1^Y5E|59xiw1&+jR=So-P1B%J)xp z`O}-kS;<2?4cvgUP_2Qg$XNi}Qv?kdtl^#}$z&B=lpC~@3q1Hi%o6T3@zzRc8=-9j zaKXAv{6y!-;%x4tfH8^VL!#oLt9UY|#;~z41uN|h24=|JTb@x(*Qxmz@n5Z)jSe>bL?+n%O+BpOfs9zp>U{dV?0RlWi0_2Ael|#7=t7NmByri zWD=Ri!INkt3YA765?Bn7xuiFZ$)p{2{j$zqW%l1DCjA++j|Zfu zr)>2|1hzbgh+prYS04)T+X@QtTdWdH_mf$uM^{Be2#(Fa47nMh^-BRPzZ5|KOGr6A 
zG&e#q;FA_I;{Y%#xatxg0!Sd@@E{IP-2{?NK$;02G$a@=sZ6CBQ;f+Z04C~613ss` zRLZJfL;i|EeZD5~ds>JND#OoJK#MjS*3EG2L>Yh`;V+-r=b&ZGA0+%e9~Jt$%nnO0 z6T5f&1pTAP+3KV8&pD$$%D)`3{IneNQRXjamuo8Z3gm#-Axu{R*n+B$Qhyou2k4K4 zs_}vFYqij0TdZhx4S=1v3d!0TSHu%SfWL065{NHMb?w&;RsIVzr3+#I2galTpZV04 zG=#AKuRW>&|F?QdMF{&Ju3`oFUnx0-FUpN5k(UA}31bmT@j(^Nb|_)QP81C0YOw+T EH!E?>IsgCw delta 5239 zcmeHJc~nz(_RmWo>^lTd3Xd(UiCIa4fS`a1B8w=SP#(z(geAlzC{i&{HqnY=WYI!c zWK#i=Z2+k?Ac$H7TC3K&uv?HSBBQunoV<6x_jB*(-uwOD z`{wh2C&fUMA|R@F(d#nVq9><^&_*Dv+t$2`EL1ObL3HIRV(^!4v~y6_2XLWK>uL*P zoQ~G!buWn~f}#VbZHm-4yhN)8yAX z4^~d$*Od<22m3XldQ?4Kn%_=i%DP*g)8FNFeW~CO7x1kp*M_h&nCHU3S{bR}(I+o| zj*9<=;&j%-ed6sC6G z_O*?k9_-lXgbtaY)eL&|%_O%UnR;Y-tB+Q+$F@UMKzbBVf7UoPMH`Qr}LDJpzI1Ecv{NBpfLJJXyLeW^*x+A+LgLny46^6}CtWgxKN7|BlZ1T zj+1tZ{!$bmmoc_TpXZdGkJk})N(AZ^O-H_AhUh}3t z*XzQ1jl5y|xw^D@m&cIWPT*O2kR}MO8uj7zrbjRipEi{P5fyI8GdIkpbJy1zADLKp z+ZfJYTj6taT<5{w9o zo}Q7KmFr(_IdUn*(LZ}S_15L%0vjEztxtImT64&6%IkB^HKZ)?QX-m56ff=__L6|m zvp#UZ$YiQz5qN4}YiC^ARb0n2f9K{e3ZgY0xt&tE`&4Jep+`F(6ZKX5&948rG1HI` zNzgudZO3Ma9ntUv%P+9{fYu{KKt3m(Uoeh>w%A!;E zd$CKqk0jbBr{1=$zkNr(?3CSeRl1LfG5j3tI#7Se*`h6S=nV*k%w46^WP-u!E}h-w)lU9HVcq&p zm#yy4o7blJJUCG`HF10waqX`963;==Oh<`J$dFX3*a0bOa+iQi8r|D@Vbc3WY@C|U ztRyEi?pK2a3zPHVJ&|-Q=Z$xCHS_wH`seLv#vub|ZiT`j&R*-8@rQG(^`IR53;UQ^ z;P#A1y?w*<*be8ztADvF)$EpbjZW#o3}Vc#Uy1Lr+(CX%mU*^2?F`?FK}A*(V< zE*fLH`@F5HEnBy5nA4`|4&ia4@iiI~-Q=^r*^+rL!;F$RRhk#VA{L`nfA0!t02nLkje6NXSv_CI1GU6*B|hIvZU==K3-GccXd-h;DGTPLpklR zI{$L>ss3Cg?MD`0Tr;mpTOFRvL{=Gn<(0l2R&++>YavP?SpL@zitD8`(dSZj!O=Fe zdp?qD-=)8a?ll=Vlh^U?8hc5^s9#qNnYkKnqb-_51VXP-i>gK7LDLd7Uf)R6>1{5; zE7nG)H60tD+cr0&8y4TrP3VVod;4XbUwCnhQ(Cequ-ahL z$qgs{bxxd*~g8&3inP^7rb6x9V0~G@^+0|ZS{mOIj&%%f6E0s#Pu- z^(?@P-4M~S_)8Bwr6R>Xb`;qo3H6=Vvo&)=y{$blW|z+fSG8Kf*?U|G$13o5+jrhN zaeb&NB2r{Csm~me9xCm9h&$gXXMEb>{9DDz6nWiNWp2&`469V9K&Np$`PtmaxxqF2 z1?i~?n<;*+o?j||y|lgcxrU?ENdtNuarxqKf!Fg4M=#|@v>rlFThBtnUN-tQ_0W1~ zBL{@_6ZMufMFCP22t~!iAQ2C4^793GFrS^sPk>nj$u3!bn$3;N`8G6O2=dvX}7{&+(AwDyPBZQg4BtC4v 
ztk2=fv=X`TdqvBZ{F0(%h^?>?3?q@k@t_BrEz=c%iENo$RuYrLjb8C)zATA{_QHI@ zGC`QjOk~MCEOTVR@9zadC>mZtF4Iept;d6cn8ZCYJR}4K2~d2z46z~?EE5s^oyq%X z{~kXW1q+2R-yVboJebLmVfHTjI>5SWm}i~(af(J{+z-TXiSGm$S#Sq$_(ftoAHWhN$ZWe#}|U)VCQoQOh}Yl@FJ z*tV3Zt12SzMu{6R23k$2nBfTQq&fDW-2dvgT-X1K`n^;T&sn4+Z=Dh+T7u|~N&x%1 zLj1&Ko!A#AFaF6g7WrkZn6SvA>mv=qT{} zw|3C)9e_agscV*7YFMalCltMDe+j(!TdWH6{IS@ay=COrc5M@g9)J4%vpQOCh4(^x zv!^w@UYCzZGG=e0Akb zL}70<09Z*qu)J~qv7i61oBU_`;>Xrofo8Fz%_{k306TnM9Ae`lYO>U*%G}~#$@1Pk zr-lMiry8KZsakqj*u!vaMQmZqJu6IL2oy9@_ek(+r2Y-jOgpmn1!h_^?x3?H*WINo zC;EanYQOw&>GzLy3HNx2TMTtACFGyw%oU=O2~3zopi{{XbO$1tLMAe~G#Zgap^-_{ zC>o6llj*c5vg~3IXb_XYrZcH51{6i7u-MEfHr>FL0FhZtm<^GmXaq8WO{CIgnRK8; zQHTtfNM^$SSTa*Eg+Pv?P#^~;n@JEi;V{VN4snUCk*E#8F1D;> z`A?bsyNRiN#7vy-rYqa#`pR4d#L5R$6n}>`Fas6eKN={0Sim5x6@M=k`&$|=@1K%y z#3Ob_b{7C_(*?lD;3J*)Ha z5-5&j3Z6otQOI-$Ji|d&0Xl#{doX}MIU%b)#eTpT{BBM2kA_z&bVe$E<7VnHfW*#y z3t-1Du2kuFs1Kk&v-n}{15>dneziE#PQyzU&nTiw-c|O032P|l`bW|R%c)+`Sh-u# zSx(uf(iM(*ec}O}6=}W~z*hDyC;U^re}Fc{??+x55Le)_=&maOcCNo=5MK@aZ*=`P zzj~~UE^QR{kF(45lbN*+h5h5ul8OH9-Cq}lee!YdH$`FpLpvb`_-A_AD&+gN;r_2^ x8{)|IddPqNh(qgjKR)8(K|EOnbMQETk;5ut^7gd&Z7`5ioHRrr{M-VF{|5+&ilG1i diff --git a/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/metadata.json b/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/metadata.json index a20db3e30..c276aa899 100644 --- a/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/metadata.json +++ b/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/metadata.json @@ -5,8 +5,8 @@ "job name": "pdf2parquet", "job type": "pure python", "job id": "job_id", - "start_time": "2024-08-22 16:04:27", - "end_time": "2024-08-22 16:04:42", + "start_time": "2024-10-18 06:09:08", + "end_time": "2024-10-18 06:09:12", "status": "success" }, "code": { @@ -29,12 +29,19 @@ ], "num_processors": 0 }, + "execution_stats": { + "cpus": 25.5, + "gpus": 0, + "memory": 27.42, + "object_store": 0, + 
"execution time, min": 0.066 + }, "job_output_stats": { "source_files": 2, "source_size": 605137, "result_files": 2, - "result_size": 28828, - "processing_time": 10.41, + "result_size": 27574, + "processing_time": 3.448, "nrows": 3, "nsuccess": 3, "nfail": 0, diff --git a/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet b/transforms/language/pdf2parquet/python/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet index c6ea75b2163e7dd3ac4e407bc8b01a98d318265c..52b40288b21d67e3035505b78e9f5838a17fc9f8 100644 GIT binary patch literal 9286 zcmdT~2|Sc*+aE(rQb}eqnq-~q%oth$xB_iL!+hEwYy4 zs6*N;MT(B3vex&^$g5M%`}O;s?>*o5-9Pg@_kG>h_P_4yxtHf*POwBlIiQkIRTC%} z3gQESB(Eqe3iIcKK)fIj$d^j-Rz{%|<;XOYyf?)a3WD-M#bp6jUMMG&6M+Ru^8Uyp z@Dl}4#L~Lf5Cj4P4KN!U7=)Z=LqK`p${=ORS1N9*F%#Y&eYBBuXmN*MLvN% zuBk~#D@n<~+BMe{Yu0=}uP4{WcR> zn8B#>;UHn?lkTE$hM`Tb`ta$R8c2#a+2Qbp7|Q#mho@hT!S+CE-6%P?;+|Z$6h&T{ zn?ZTrKV3cIKG>F4+&*{0(~!PNaf@zJZP)uNFD@qfj-IP)t=%50%-CmohZHrAznmJ` z6ytFk#F^JJ&A4LNT;pnExPUJ|_F=}DR1hE?T2-Rd_U@w55d6ldc%-6Yl+B{MlgLx6 zeq^m}5@Wlr=9;)&?VG%5!NEg470(LpJdvvzE#KZVmXSM5jM<2Ey4*5SJ^66NQh5fo zsJ92d{rn?%iflqSzrpAH0VSdBJ$t4*6EE>tI24dPrOFDP6ct$ZUC$>({eho07QJZY z<7Q2x1<&N`9ISs9oB9ROg27;{gR8I|&s(NB_z-$CF1PoM8ZLCnziOJaP060==?iJU zeKZ?)dk1*xcqO?}XJ7kqi4vHfpt_at+nSbrQ^GqrAA9=hK4^IJG|xRuJ=d~$;rIJl z#cA984mve{=px$5mdY)Ph()&T&z5?g4clEJo54PP9<2fi&EROg z;aa;x3=BBfg$)Ga$l#U1YtL7-!n13mOEWaQEM}`rZZNKnXk3uL7v1?;G8GkvyWQVQ z7Mwq-Y8%UIrn~P+0<=gr%iXl@1EO}ru*HG<0sH(q+$R%b#UcbhC+3Z1s79I7Ixh=O zIu5%!$)@^W>#vZgn4ihSa4w!`)Cr)keFCq}sd_yBDLHXp?y$AT`TSjD^7vgD_!gDk z1=)lY4Kr)GSHX*EMIx^D9o5me%9^K*SyRNSl5u?RK~?S=Gmk4cC~C@~J(BdX!s!Y2 zkKaZQm~8(*`!tvKY2&L2RQr_R2)X6q$Y+eSiAg?wT*@=+?H#2&J!)}bfyzx|(ViQQ z=O2Wfv^okn1)=%eYZ_?7AHy+YHZn@@0$;)&3y+x<=y76xW)~0x092Go%*Z6thwFN#^ z|1IlYR!pC}x4G=Ix#b08(UT7D*JvsN?Rl`2Qt3J4tIu<0&AAKT>?VaTI9in?n;XP` z91G5U&yP8AV&I6C85zp?^r04Q{io9!t?)UC-hwsZ3)MWZwx0li(Jo{LPUgi zx)*a`D!jZ|dd>mnZ(Y4AVwaJ{7$ 
zLvPPgKPR{cqEGC+wC1>2d;Y^?GmdXDG2A23rz*P#joq9a3zR*LNtGAjR(aad&ER8^ zU>;%VRGf>HpZSRa5D9$f4kxE%&)pPriM#ICj?bWS)uCl_6FEwi`6@DJ120lLBo@zl zS%|b;s#(w|*&JcAwLYGU!|H*wR8Q~I0(JebUFt9VlDEmi+ECbfv$MsU@_nVvdbWpg zwTW2W-#pur)RDDQR8q}#=#oVpALUv@YsRIsTT+Cd-q&6~Iwo^2XD8qMnw;+ihrYY!9}Sx@y%KQL(|q^1ZRXM??7C&ZC#qa$jN$r z<@)AAeyk(!v0H1lXd-SN88x4oDXS=WVVrY(mt!WDafMf!8>w{1t?;3;@c}`Zcf5^Z zud~}JD6XV6mHjnmniO$659#P|z;WiH587P}&ofLd6f2&=zq+`Y4_Fn;mdr%2N553Q zTUpdZElnNR0C^iL7G@+7m~>9LcDHwoRNtJ`$E);A%fbT_It1LoswqU>0w1-0Ks&u4 z|CvEhNE7Fys>Q;+k#(-*EBX9qwVMWOJ#MYNulU=iu5g6vrEv0i|5%R2*V38`SDx;h z?>F(-?^rSkFM5r=j73x59~YrJ`Fkh|9ZbUSZUBK#y^qAHzvqd%DQBOMo0zt3i1xuT zo%HcmpWjn$8M}t6Ie|wL6T1sC+%;uk@#qGDS|1??!(O6Vz}*8Ith4$W15^BFcTDiN zW!mig7*_1%IKO)swGBaZ;>*%O?!8o`EEC*&_ zFuN#3oL7K9wYVO)oln#Jwm32yEhY*SL&G$llWgTF4pZ)_UXKqfR1We#=%&7&3(DTs z?)ygE;H*0`ficvyt5GsLsU&ILru~h)6*n3_MLjn0UK2{+%()%oa%Vs=11INIO1os& zmUQ9d;^#26Va@!vxg6!22YL4vy*ug{rle|yHNK#@7I%G5*S*m4Q$57)1bVXLJ)x}; zqH=cTXe>l-+GEH|`*VJ6RkD}e^uU=B$wLbTFOE-Ait_K$T{ho?MFwgc&B}icZLI*$ zCLPYLfR8!p?`nI&Z!|EV*-x*d=bT(CSpDJi>*6fyz2k?TL`OYJl~H?iXJHSyORhcB zH}?yjOGnpb=ecvPCoAABVm-WOu+AHx%Vxzf_h$p8p|+4gSjY~^2)p9gtv=%$o$2Ly zPm`X9NleNVcS>N|_lC;pKh#;wTb~{7;JxvW$+KJJw^{G|`ih}>=W9>u;pa5#US0~n zJ413HJN8BAbZwASa&lM1C&&lv!mzLix2)ECA;cFha1y;zWq-n!nwgU_nK|<*4@WH&_+O|^3tg=QN_5~261-+Z|mr4w6^J+Q&|3tmj*g#Jmb!aI#t%4R*Ah~rXK?uLFR z_o9a(cZ-J4_)sOMzG{kbMh<*B4&M?uZP?J&Q17pWHNeI-T9^R6Mqo&5ROeap? 
z%4tgCX9qj6uQQb|kod3OZc>*sdKEPJt-DXvZB+=pG5XHbL=uux=1B5Y%KI_#x?1oColzhD`J?7=k!LQVJ5Ht-3CFgjka&W`-}bFDM}p$_ zCqCa8SR-83Aw7`PA})LgY-*G1B=MxL@#W;OEbME-n+ve)>>PV!?wQ7^=$ZBPcs$<~}7w{f=fyjNP#_&gfL~V0# z+Ha|;@hlijP^G*z`8OUWZo zTq#et!c|ssRX^*xq0Y0wUCza&I1?T&7)W)#Ca3mjcRxRRz`1xGg>RrpE);2I^|E~I z-5Q}OeKF3Z^?Rha9Dur|*kN<_iYsYJ>XaN7u%^&WD#Yh1wvp^w?O-y!B7 z@7&=VJ9HJZMI?y7fPwv7lO@toc)K_^zH3yaLZ)zX$JJ22k}?Ud5Ak6M=Ir7}uYVQX zCJXJ#EQIqnB!-+QO?<3iToZ>rGP;F4lz7YY?fI?Kl9AFodfI8pt~F)GwlVW#R4pr^ z-q-lk^1F=vZ*lN*WnHx)!l)0d+btV~^5)r7gVfLN7KbK$JkcVX(|6wL!A+>OEpA7^ z9}sq*Zo_m&_SEfH$~#Dpj9M#>%&bcidm-dzj1A#Si9!f&#vkt-yBE}KMo&@*UI;Ne zAk{ZmZ-3|Jf%Prd~4kAjb^;z;DYcGK=4IV^ZwuY`zNn;=XV%v5xLgWO3$L^A^YV4XX1-r_;9|Q4%Y1_NR@SHImiadK?O@7@0d!ngIi!m^WUhf?r zJ$mpD6DQ%a>%VR0lMcEj_3Y$U@f6WZ=ZZBy$Z$CEPTJmd!l4JJ`wpetI+tVS&n0qH z`VwV+*G;Dc-mSXL&i3f{&qtI5t=+G?Csx0Ded5(`DWwiaX;g51W~6mF-(uz-VRj&Cm-)_MhaI(rk{s7t=qvB93QA&a$; zZwQDL2wA>>AS*|f{E!S750A&?O<9aKQxS5dri6r`;Fx}Thrs@Ki ze%AOGc77R?!Z(#a57Tmw0A#tww6$Dcp{|Z8!)OH)FB21mL3XCnshd3DNz;Wh-5Sgg{P8~ zQ55ALkz+xU;=dv%@Jn2)N2kkd<#}5NL?(b47H4+~>yf(&hP?zM-1|* zKg%2z(N2^P^3(JK*w<9TlPP2@MTLYW5m7`cm4sDPbfr>pWITzCCgSjD zG+7akRRS^vk5a;7DXusSiKIlNsGx`_6osrT@V7Or9=iqGmDq96h<>!C8JHb5!;U;} zjf}BkM(=Oi{vYsu)p1*G5jZ3PAO*^g&zJyYxo4#Y17??>sete*GiI0L%Iw17=N`KB zrg>>bsdPY2+XFMo&$|A?!m4Q0p=N)|L812`kRad;S`m$r174w{aMp@gRYkli3M~&L z3`P0>Mp#9y( zWQp+8s=lP``{is|AJZ~(G5EgCn&6fROarXvcuB{f(|ehT8({h=BbEey4C6Nu7#NO# z-}}#gn?nAby)*Kk*v%sUk^Lp|AKdXF|L^RdkbiT-gZzKI=p+Bb%P{hPYso|YgNqRI zpIHuIjU{X_FcJbgY|Rgx;t>Ko1WU`^dsLT}2Z1^+1!ln!vBr|9spsZ5V#~y2xdtznAh**S@_8MmY#NKvp|xU1D)(;=}OV3`#S(V0A9dxd**$q zxM^7|>-!6T%$ST0`g)2Ews@Mok{*g^XGC{k5_PR*6cl6#%kl&olL#mW%6I+%fnnmI zW104WAbEV_ClG9yZSA0C`ET}Dl_%=!g%X+U1Z^^tA8oFppuh~Tg$+w9Fo?_s>&m(? 
z`}A!w6oL-RKZPIo4OmPo{Msyj%RqZ2TYrZ=%8GVqx(8u*D1Z;gV5SbhuJsL{2UGMr zd@Lm-FI#^*G$4<{(}D3~=^!(86145V$$!JEgTNADraw?n|I)V< z&!6JMWYTBmlO-Mag)*B2Q)|E}oCQJ4f~ij*7-gn@R&ZE@&a8hMAJ#+ky?@L;{qJ&{ z*#`Q7xc*C&K+w&$>(i literal 9262 zcmdT~2|Sc*+n*U?>=R}hOGslvP0UzFwy|%?zKb!&lCfk@LY8AGijyUY>|{x@wGl;< zU7@m6))qxd(Q>|LMvhLL_dUPg`QGz>-~BVsbKlo}E%*PruX}rLMg$WqoE|O;S1^D> z;2=&ANVHbss}OfC2m}LxK)z&W9~mrG5<_ys?)7nYfrH?jaMT`v6$WR3Gw{oTL}5Sk zAU99|MKrBUx*!k)C_unq2nhWi%nxVfkpVHlc#u#RcM92y=t+jNA|Qx%Fyb{BeGm=( ztNMBN02n}U2ncwFnIsycu72njaf!zMb3>2(1dDFka;iKB}g=foFFN6EN zvyPIdm1a2EBZ7VAI4;~8$oN(vTRJ>gE*`JKDC;W_u4_ zhqDmbkL_O4s}zmTiqG7x3Avpta=8ccLKzX-HbGiRXteH2-Jz#%R*oq&_y25W#(Yji zQD|N%HcZz!VDB3?CJv6RDredc)L)17=iJVSjb&u=9@KjDBqOXBOwWi^9Db#kAo#d2 zly|HH3lHbrCt>{w3)lW6tJ-K1*qPqG`)akuLnp&1R~F~d>8fC*wEH#akt0dA#S0}$GYXgRQ_{h$Ey*g0M>Tu=^(s__Yo;xC zd>L%u;bBFMRt8dZUn@6?Bob7LB#gJ)wgfU%OjDK`kLBlC&d5K^be}zNO~0FioV3s- zqoSWjI>lwKk3ZNn=^kgDhO0m&m1FAP7KRR&iy!T-&FPM;7_h+@Y9`*WKvh{C+cQ+x zb^ZOaz`$L_%$tn`U)7HsHK@Mv%W?*`c3CNrnak9T9q8sMg@wdoV;=F+S(f>f_V;ho zUPz4>J#chdEJ(wJ#_!}Hdo5<%<@4;EU?U5Jb9#p9dF_e6(ReY7gSgXnrkK-X239?; z93nR0)|{*YO;rfbWD4%j((0Q_aLV(IUr023WnSE@wdfS%&y;=MqfvQfDI+Y(+dbmm zg;yl~a?ExGukf;t{i;H+a{nAH#b20MEZ+|ee(8Q<&s|==$g6+#jLO`t+qZ8=K{zET zZlFk}y$$oAsLQHv@!UK7Yno8X(p$N>1OkqAxO1P+Ps;U)wpAJ`u{-##?#WboZ|;3Z zp;Ltt)S7tbTiQjcL-R^VZT;40iOMD-nIfl>jNw7bAcxL9)fJc(L zjdO_DR8XD*d*ZhAyLE0sgevYTqi&eODkNRS~7Aw>=hmK@VjL9>~DrjES7zp6sr`?1SKlrHZuJqf5YbJrp zjC<>bu5#OdJ}|)*n>wU8n&08hp6koMPg&2py24Qs#;{}2s)OG^_2qy7^Bb8fjyu)r zCiSyTfYE%V%4I0BBYD#3rOua%0A95%egwluv-5>=m6*QFIuY!nhbP5)2*;aMMHNpI zAdmz|8csVpff*$(ZDa%@oIMyAEpBc21k-xz#bl|Qr~6P=GTTjSo?2Us>E0KD0VkMc z0&Az<8lV3}dfdBTLkN};VxfDsaT+m&Ux}Sw&i1H@XjVc_WYqs~>k5v&qP*mKJ zNY;1{g>nVaR_HU(!joFZ7^#yrO8$beNgm_K>w(u?9;lRg9>UZQHOdZTK28(=(k6DT zt1QvLI+P6M+cwJ2yS1q`t5`V#G!~~SlpUNV`82Kkj-lo~_QN&H74DZ;++O3~WAEx#XIxAg8Ko zO9A>UZl^Gibahi$&w*2tY~Hz7g&{-6^q#TaI)9@CDtCeXo|!M&ZMS=6oIo>^;v5lM zR?>8ToA5CE+a{%Nkf@9Z=BWZ@f5Vvzr<#`;lZWk$=j`%eyEp?rP90x*NKn06#%e6{ 
zTu?##>-Kbc(&QO_Ta#RNUwL0-=ZYg=y*?!VHP6}J94?YD=FWxSlkC>zj1NOKAAy65 zx(-!5!Z@R^-0$!_m?TuD+9F?LBNYeG4jzQFo6}RkU2Jo*C z&!A?Od{^slr^1~~Wfr|u%bq&Q1Z)xImHw8Rf={gmSMJ_fkvc`{cZa$iiZnVS@Mmg9D*lz>K1Q*&2f7; ztSh@S6Hz`85?c~=n_+6Gr>@>;Zpytnm%{sAMttHuNgWQ;(rcXXdf(% zn6K73kqj-|l6ia4Y)nt@?W5ZAZzh!oM0QxdIJ3)3p<@}(6RK=v%Ftf(9s-q&hjwyTR)h>a zvCFfV;%(J(J<6EOIg*#Oed2KUNN4e5tqa%gQ*7Lgrc)~83UK{&!ACP9>GwTFgB)wN zkfT{}ubj6QEQxPFY;3}34?7m~Me6OLnf!p5*8l?CB`qD64W3U#MJ#>&Zv{GbG_a{ZR7F~g@!S}Xg6;xpeZwa5q9BJWS5 zyQ+2uoc>lC?#q#>64RTU@9u+CtNDm^V`59We=-ZEbaUY#Vz5IcX(C6jTV+J4!*6kg zA?~G^4Oh^N=E6|s*Qk5)%{v5=^uo`+?L9|#v;A~KGPmIwVRtDKn_&o~y-zDEPT)On z=?I184occs~vs!OylMQNK?mGIUbdBf1BDjq2Dz=mT7pL`A7T7hYMoY}^?XW4wL z{LS?xZn^1*m`fn05b34|YymRGmjm}%oAxKoYBf_X{EB&9%IL*Hsh&UzTBt`M^+m6w zyK!pQ{l+eypUXGmSDpI~RREiEbDurojJb&k5AK{#9Bw~X{PFyMBq{>2RylO8h6q*@ldY30*|}vCEko7 z&MqQWk-qj`!54~)Rh}Hi(PJ0KmWJC)fuRNUGt)~4hd+wBS=2?y2F&)vb~*4Yf#{sV z9CEZOJ1>rAZD06Rq)>2P^d;m>^k*^6o)f0rc=ENV<8fb1DjP#T-u)0!L(q6=;G;V` zz{kQGJkGwm`ZtvqxRLN0Z+wRJiAn7mY~(UgrvE{ePQxeBtJdT_ceBju_p>F`wPhI$COs+>8Bjml z_9~(8Df3jFQqhx>qM!i%>7KZ)U5@%$W-TJKty-p!Jx2U2(wlVslcVF+AB+@9CRC4J zEEjX&>k*?fkT2@nma^hVG@wK5_v9*d_0>%+%<^j(if%qV+c{~gb!6dPCfDIZh0m?D zew&_d_dGSsd0J8OZdI&S0_x>dkm2>`yOyR{gug`(urv2Icp+qc4n{q*Pd~ZzrV949 z4w}CZ9+P@WemQeA*=jzyeEWO^$Mnggi;-Dfm1CFDjmiB(>P$(@S_%g&?F5vP=_JAp zu9cotj}~I}*9|`7m&R|$Qzsld{-uyktcAq(%XKk-bH9@NOoVIE1NIll+BMwCrGxO( z9>IG*augf7k8@CY$RMQU(dyV4tVTaU4cIPlI(9%JbsP@%9h z;l216|Ne3xtc`E)x#>mJqZx6J)Sx6+)x^M}B+6X*WNGt}*nCcBv3#cA;xd=nFP=5t zv$HxaLrv>OlHz`1uA139BVq4v!2x@Y^VNUp)7iCX7I@KA=Rl2ZN$+s3Lauz}*6c@m zp$edA&|}74odhr*tgJqxFy9mZLiV&IZ!TB$74Yu7_G=CW@vH^sEkk0X^+vr6KR#x> zXL6>thPypXB$j1moVhzjip#J^X_nrO`Gus9S`8AqmZREQ z%db;6$MjNn9TSX-$=RFaK&FsAe~+xy-s;prC+PslHtPKoYnvi7f;;^o07VI45DI9R zoD`nug2T#Uq@1MWG16ob7DJSmmc_{7U5L)ovQk)iXWY-oWkI5-zaoeHOI(|Kr^s-9 zys-eD381>g!QGiQQ3v#Img;X;3K9wQ$3v;wqy^N)da z02xVrY>w=)@j5a(a8NAB5l6HybOg6!Do6Bjfz=r;^hz_=<- zMwUpHl9iStmu(Wt2w-a2NCkv9nJzEbu1_uWe(oWwUzAs;lSmu%9cy4Z*{JJJENqHK 
z5pMX02ozWVfp~$&;3RR<7)f~y4rhjyRgjcckjCxBO5-IZH=9Iw|C}57FX}eO9Otke z9c<1-f3gFSbU45r_%KfklV}K{1cL6cM?bZrhRMHB|1V8%>Wt007HFi;v76kg8@qO` zhPGDGc&LuwNZlOfpc6lw`TODy1Lhb9U@sC#_Vsu0cmIBIu-eTHMD&W|>Uv1C@o%kd zitN8}C6ei~-jN)B-T}TOvco1fBhh4Zr8BiFH{7&ra`&g{{$W?LO4zWZuPXa-EnCw^wM<iA=LuQ4$LOdBF%Rp4hgeiuPNc=&lb|Fd^c=)bdZM*kCAS@b`$xkUei z+dcIEoy`;aZ|-)`|Bn}a^nZ96M*p{#JoG=f2%-O(bbwM;Xu;UVE7ApsQB z)J$kZnq{is_hdwUlW=$}$#Hki1M>oV6+bHb4!47ihRO^}U*0)r{Bl{YiTasf;$-T9P&vcsFY) zEiBPWk77e5YMSkqkl0OFlP9=QiGXq-{m>r3F;qMhS*m@&Ngm(%2?TR$Sv_b?{=5B6 z<%!x_Aw()WL7hb9#~H~(H3i`PFIs zCV|#c7XCJtGLlv}iU;992!K!Cn;JR*yV`er9#qjE@X?ewd0F^d;Q)Coo&xk2O$Ujp zlb~+>UH&^>HIJY4k$%z##3g}1(Q`@6nXe{S#(l}Vc#PbL)L7eXx(_L~8X zk~b!(8B_IX1HDYuPje2f)2aFI{X-i>Kemrrr~N~0Q_Da-;MdzTQDXz=qk+-P=nDE4vrFSCXd_*y{b;KM;-?1Tq5tQ|NyH)LM+T diff --git a/transforms/language/pdf2parquet/ray/requirements.txt b/transforms/language/pdf2parquet/ray/requirements.txt index 1577d024f..af70c0354 100644 --- a/transforms/language/pdf2parquet/ray/requirements.txt +++ b/transforms/language/pdf2parquet/ray/requirements.txt @@ -1,7 +1,7 @@ dpk-pdf2parquet-transform-python==0.2.2.dev1 data-prep-toolkit-ray==0.2.2.dev1 -docling-core==1.3.0 -docling-ibm-models==1.1.7 -deepsearch-glm==0.21.0 -docling==1.11.0 +docling-core==1.7.2 +docling-ibm-models==2.0.0 +deepsearch-glm==0.22.0 +docling==1.20.0 filetype >=1.2.0, <2.0.0 diff --git a/transforms/language/pdf2parquet/ray/test-data/expected/archive1.parquet b/transforms/language/pdf2parquet/ray/test-data/expected/archive1.parquet index 7757d57bb13b5656ff7efb1c5509260328d7080d..9975c36080eb7bebeea022b6a6ff205ab73d1598 100644 GIT binary patch delta 4626 zcmd5;cTm(>wx*i~T0nt@ZkpiGK-0oV=p4v7XHXPO9VKT0MW=~MQh^@=f*?r&1A}BR zfk;wBq5@(|5-}1*Kn7QRJGHYDs} zaM}4TZjh&CyQd*G{y4moQS*pPzx_UJOzSpOMvQ0rAyD0Y6!+kX7eUMJU;<1bb~^lk zsN_=e7Fmfqo-HkPg;xQFZnHC76tddBv!1&U+-+_4|2FCC-ekZO)2!I3bJhGI!jc!GOd*X4#xVQ z$Xh-|9A%m>oJ`wQF-M#>v#DSrm(j~QQ#slF$6q;lEh z1#|WT$7#Xx;+F+Xfm0?PE9_2HkEHpw_SqHu9aZ`dy8Vt7hS9BktcZeUHSyDsXr0uS 
zaX>&q;_$qod92$ZNBK9mxbpSo%8|sia6bz>%l)A#XNv23q`eQgjRq|Q`?sk|tB6Ll ze|UBwq&hL^i>LXB#ifSXvNpM~%d7CM#50Zxk?BR3dF#nO9x)IQRNN*&d1P{l^)kyg z{X5OeNoVdFozEtUH>*Dm z?!cbAhPSd9yOWqFaCaGd4W3mLXS0;#_$7LDm}ZG=ek*?OV+WW&gVBo1dv_V?s=4p= zqdiUo-aFqfmlWGs9cc%MHq9dZBQXSFa5{aC}PE*OF zGvbm-rYx}H(e|{&1W(LGLrbDqMyKgC-MHxJ>it&JCDhuEoX$8UTw5@7y3`xkzU?VN{L*TQ zGJlsLK2;R`cQe(wAyeU55dGNhEUyi&JLP%I!QD46N4rd{Ty1^-lKWDTki)YL3xdg; z0jUq9v-Ee)h4)v5l$p{CmyA*zr$QbzAlzJ%lCzm&4dHtudvSSJ%`|ptZ_7X8eoF9^ zIW+sG<8^M)#Dx@k>8E_znc;dGLh4|);)_V}IN3=ebcLkgvUab{22fAxD;aHyO%grs zZ;}`4szW-dYW-ehuan9^W%DXi@jktii*g7!A*yg-PR##Nz(vXaK&tz5vYYC+?z^!( z%sAQeFwUUH4;?Gz0NYls$I1vyfv9h$Uak|qzy(1?gikT&j@`=@;JL0cnC%qoXqNM{ z#OQ#xr%UF|pfKDWX+7_tM}EjxE(2}SJNHavn}tdJfWD6vZnxbVdb*kja1rc0?r_$5RTb3;ovwkpZs?IrKwZYZmF4wD6 zJ}^yR`lVtjNH`E}$;W3(FpJuV5AQMgvGC#Edou<($`b?WuDnUMaL-1y8o=Zw;R zs}t$`V~7A(W(Ri6x6L5u%D`q&4LZa;nG53t?Ch%8>YWmbF*|;+%mH5!LYCHF=-82F zpqV%pGl+B#+^%){_zAHC>ZQUy*R}GrIem+*$*p~dR%Ami)%EM0z5Fuum*~vPJ=bc( zMn|UA^=3ack9kWS7WeD&qTL))vLxpg8AyGIckrLs@YK|{|C}T1!X4*b%4<(iF zpIePU!dtc`Cpj$J3egX@jJ4~%P$VtvmA>iV9O;{$6BiUz=($(u$nJ=c6K`clNcFR~ zR-|4XEA#DX+g&eIuH~C*QB87lkF@!DjVpI5%V5t^@Z;d)nAuz?X-YfZChbP-V7*ab z!nq}*0FJAlO6Tp`8Oq?Up0gY+_ZnPf#B}ePz(@PY-RW>S{sXeiaMxGrbu&SvBXL9? 
zQQzmv_$DaYu?#2UDtKN-MnWPDZPkyAhQy)bP+=FuU{Qgi1@Wmvp6ThvE5nZ=-1LEL zL(lCRd^-@T>#<-uaw!rXq zbmQw!*GFPIHnrEGo=y2$-8N!bTis`7Yi=1#mYXEWc7sd=Dn_ZKJ&U{onT7|m{8#)% zJIgyZSXt2-8ysk5r^rTDb?+@qs9fnG}B+BrGbq~)vU5z4~&;+FZ5>I&s+;Wl(#8059O$*VQ;qWzDF6Y zFE*j5ErUV!CWefYNk4d1{L&}C)lIQ&_Kvz7SVL8j^En!Th{A7LzgsUA3rodK?HS2l$IXc z*{wytdp;tdt5#WZ*hm)pwB^8xL--k>hvz`TTyQz7!N2(Kq)Tp!nv>E#$%fj?yd!R& znm#^o6Pd}r$eM(je4?Btn0w8#uo2ZBTyIa=aPF<0AC4cT+ddet>M{y+pKMYJM~~g# zwCnPKch@>Kf||Tjz3f@`=F&proVhRV)l(K)e44$1hl#7@=F!5ri*f^jM^OSZ4^so@ z*kpLl72@5pVyU-e*t}eFKm1qh?vPN83S5ZR)<>f~|%tmQw z=nD%2e()dpm6cJ|S#VM4&Vq1xxTp+?Pf%P231tTOhqC=cL*Swa2%;btgvboRRH}}pFhCv`0M6_HBpgvft8E}mQyU1d#;F72xM=Xb2f$HBMZuCT z5N4(mC~!0Yy71xPs2;$WutRXN8-&T|1_B6X+;q@y|EMTO0Wcp}Kun^tc57UA#-PMA z9wWCB#(!MM)6G21x%X!W#(?P1!!-Cw1*VoMTJs{MMd+Y_@3pmE{E}Bez{I#I8qD`6 z&w^$Uuy`^Sht(jFu>=hwg~8IGVX-(3Jdr_TW2qzthryvB(R2oh#bz+68XP=_qCsS` zI2tq#8>_*<(&%(D5l^R(Xut#&0bc0`oSYQ)X7giG`#>1pW6(Fc|FX)zQ6ySd5)dRg zf-9c@FPvlqdxk)m^dVr=NfF2s%OdFo?5-GKYyLvZ*)*o9xXXk?}Ydl|{spuy{O?gl7`*3@iu7q|)(N z_E$3LEIbQO!LgBsL^6RvAdty;CKHDvFqjlJhsGvz*fav2!6DM{G(3k)qTwhECIic2 zkm)oMgYdNwflXqPSvU?*NR|dKKLeV`a@_n85T@?wcdGvi*?$t4*gqq~0^i=K3omVg z5aIj2h|vGBzi&V!^j8lg^j9AQpXzTnZ-1}Tfj;)xpnxcg0l z@4db_4Xjbo#x+wQOv&VT_J1GouV(u@Vb))1GL$y^WboBSNgbLu_#411t4!4w(!;=KpdGq`?21o|XFNeg)>W f2_lH~8z2M(14WD_SD$y4fk2D4^YJ+vIph8YGYRuo delta 4588 zcmd5qM%uK=!= z17&DH^w}BPZQzVWVqwZKm{Ha01#b)G1$)>_$Qw`Yy11tY^%i#vN1_j|NV#0)a%(E> zqyNo!f!z|(mg=O!x|dt1!!{h6c}Nu?Z*c?&0U+QIP+A!OsralHr-#hsS?QJL{i4!S zy&x^=l{MF?Iw+@t9r7mVNL8Ts`EL!prX_msq_{pIj+$YvnUWPIuGl)vbA^w!SOq?SuZ7+LFT^c}MC7j%B+3sAH&3^(5N+8tX=Tjv8^Gbj?`h@p<8gKg?_UPt8Qxc+J@OvAFgyBWXp?S54+7P(2nQu?|u>4 zv`!4+Mzl$R_l!Eol)JOJ3D>vYnN421aotU-N%D8?2eq?^UjRZ=L;k*Y?#zS0r@M#4 z#}Ie@+zRDw0@BSqU$Z=dn3n~^@!|@7^zOalfYoCu_n$sd5>6v!G!@2U;}Gj&ybYux zV^XC=ZWyI0DS1BmkQA!<{KmH6Tr-Nu4ifc;^+xtu**(Yac@&vHPA}3@BTaaU_k-cE z&VeYekN_h1TFb@g(*;#EjVfl`Bh{O+F)dRscuq>ayhtN;S8w+$MYU5GfC%4qP zI3^^N4JQIFIhzg<%=EZj7%p=98PwAR&pcC$RVwUYE02wif9a8YiV*^?&&{< 
z(VBTRiFaO1heO&>v&#Hv5c!50soK%y=N?feazS$Vr^8)wMf>Us4iozPWZ&nU$6sUP z!Osn?16F(e6vuj=@1RPW`!YvI4@|m5!0ixD-#__Ws+nxx&UP0Y^gYmE^YcP9 zUVdM&y%e+t_v~&xYMh-Q)uJOuShnFaVq5qt*T_RRENbXCqdM%GMtnmbIp=Q;-#fLw z#HLYaka1$|(IT{{SDRwXOaO2qYmY~Wb1<<~d{g;B(uhbN?qd{L4vuL+?*zS`j!zzi z>l;Rv$5^Mnu}a&%BSfR+L*{`g8(ELJ%dG*8>HB;=Bf(LpCn??T1Krnac6mq|DmAbK zbl^Y~enM>0NFA+CK}DtqH+)b%n;R4mu0B!xVP^C6=;O4iW07z7ZvBBtiDYvl(?4iO z5@y^Y#}0|EGHFvSoJjZY=Txf3AN#~KjnrF9NQZ-Co9;i-OAQJ$o^`(vYzLKkP9bG5 zy09myPhrz)L#!_sB3DhIbKm%yY%e?fs^Bs%_tG7d56*MY*6rL~VxPmV<&SYbY~Fc% zK=r7W%VxhXOwSXn{AS-lFUJc1i14EwdP#5O`W#wBp&MHo4L{Fqc$)5{cL_sJ!0xbf z(%(HCr{1`U}qj zW}L56$K8iNp{y#hkqh}jr1=qSN7^{I=kn@Oa8~;DwyS>ElOkLqXaSR~6ih|+-sGz0 zB4ucKi%3M<%?~b~HEvygK%6s*!m9VrH&z{ok9d_=m0B40*EY*mSS(k_OR6us@Z2)( z045B3Z{Ooz7u=Q4hn9cRtGZoy!3M9Ub~G#0)i9pge?^k9GbFCV-!)j0cz1lq-q48j zkBz8cnpa3c)s>l?o58*x;^?d28{651*Co4BC7VPws_NJ5qP!4^_N?Jwg|>Z+*vDS} z(Hpd$P5Qp$+B|}%*eT?CxeZ@7T(PZO$I^*6nf5ZL`N0}Vaawvywyj}|(*=SS8PaM7vDLMT%1-DAUEM_t!UQuI!0z!y;{qP8My(_1E3q`KI(c*0?C*;27r_106pr)56vzUb3D_AomAiZF3=i9b5U^t)jumC@$;xIp`6c-ykm za%{g~*x}%&!FGq7BkGv}){gUV?E^9X2`yP zMV&>+bXWcmdvJt#*h%Ow+zmUx%|(uo@yLyK>2jS^?@&e!%a5UdX^$9*$Y=1hHp z=UfID>&D23XHTYGj?f*K!)}ejh6l<1tXgtHCIdNg=JoKh?0R?IuE5BTqtLZuXT&PG@iu2&V@}t|S%X`2QM<4F@$zRA<4pbs^fBbJcAqa=D$5Y!Ei3pU z4>`3H+!zzS(pB*buySnzh1{kIUlV^;2JvgJ^NnF?^K za-S>$SsAE6giGlzHHf^y9D1dQfu3RX&{U=_*@OksAw?x~s1>uUa*L7)5}t>FOze#- zz3kshL)W#iD{qBqV&&E8Fcd}|kLCpi91sK?2*JwZkhtC)0C#miao&h{6sgI88U07u zWYEw~Fc6;E0(F48sCTyjV$Ch+n41Mer3Aw3??9by1nKWv0pio!0K}oJg61g6@Xz-k zrn@2PNgF^s(FT=KZQRp<-(MYd-DUw8JYGSutQN?}#(AYaqMOmUN8XG?@QtF9LU&!L z)ZF*%`HVjAPN{Jjp|d;wg_cS$Hw1O8!`_ys?pA)$61D^*&wyvjp__sNA&Dknl9xIL ziA5DKNE|9%Amo$z0*-)!H=wa-6b6|~BMIp|K8eN!xg;iwW=i6LEHYIfB!dhJ2ln?% zt*(p%3`2y8zN3F%TeLj5+UcwrgISOH3Yy50$q|5bI)f{q z@wr?wMIc~sSsbB|N~TgwSxh>G!4YzK42~&{OJVW^0udPvIHPiDBzR1bUKg60tI|NpSqZax|obi<7xHWVqsWitih_q zClLmV!lE?s=-({x=-+H`Fr$B2g~~v6=f#6HFbrs7k^f*XGQ=-j@Rhl(0B#Z>UL696 zlRpFhUcf1eCX~h0A(ELS3WW_a%*db_ooPa$F{q|A%Hqs~Oa;kICICyYW&umIaa~R! 
zQ)r}rI{eNG|Hp5?3%2;zOYgovP-JopTEj9}FBt`h=SIG6+8;ywo8f=MnY);$9{Enj za&hz)*q|)G#rZc2{L)T*wOPXX+ttODKdHED0x~f(q*RPAiq^9wl)rWS^Kf4)%rWza z$G(E3&55Y_7XVT60xD-yD~0C!f&Z2w%c8$gW6M8RVr2yJk5X)b_-}}s>NkQ$KoI|L z>8RRj z1A(NjDlUrf=Yc@HAP~rpM)kp>P)hO?I!eKZ>Iwxx`JfVV04pz)6UvFefuwkUU26yefq?>;jSUP!PO~APJa8s?L=I*! ziaiu80)5h57~ySb)2lvwrn(xE>O*ljv>}%IzUkqamt(NqkQz5?_O19QH!Q`FSLbF> zp7+mGjkpiC9V=>|JLzf2*rc>YH@T+k{nZzjlKe)`*S6LO$6>wqn%*Htj}xw>MK#5G zoB?s>woH3pHEgbSwJ}^ElpXyrV@%EulnJXWR&IND$!G|Eb5tTqNh#W9(cMY(sZ~F+ z#x~hoSXXmR{Lc1GKJ<`~p`P+*`FEbkSC5tn_l#xa43lCvBAu?Zj8siN9I?dCpceIZ z6NE23f~U$QM(`Va&Kpn`67JbO-I;Wm$HF0>>?vKE|D-VAvhPM7G5Qa}w6WMFD_=Kj zIz412Pv=10v$(V`h!zY6V;xe7>v-NW&B2Gzn{m0lchqp9OW{@1q-|=}Oiy2E`|TrH z_}kmTQ^zYP4LW<*&n!B& z&HsQ?!-p=Com`3hqNrGmb3)mqWO?`1%O5g6N&0T7i}(&VQ1wGoYr3io0RvX9l4~rFTIt zF;&CNTK-kY;;}+eSNo2t7<@(b(}v6`Qf2Wtq4$6)ceRk1#InnfdmPa;_e&s32dZ(O9hKrsIVN z;ioJQCU&MfbWDn-s%Kp5qzJiHp^hkLgf19X%}QsLOH9ob`|XHLVdR`KD#_7qJ23zH zya|O%|BHzAn+11;0YiU5-5S}X#16?8W!2K%Ih^p$hDv-!H5pT`U->e$uh->e=9e*W&53oWPDuH*cIJ46sj0!2QeP3%zm zz^yg>al6 zx|ikC=kIMU{cLV|(OB%1gZp*5ia>iVEVV>t&iLB%>{)Z}f;YR!5etr1#VO_n2_MHo za^CY}PM#b%Y-L7)az1^iMPL8vj7BSbPO|p^<@#&Y2X9*SJznry90;Nq})AS2z=`!ru&|7(}}3;&dDaqp0A&nJ(|v57_x`Wh0!THu3h zZu@s&r0;J~7g<-Ldf%w zT)>ZWBnVHh^{1?XA$9FoOzqUqeu;|Lpb&{M`^NZkM1n&rgX`-pYY51 z!r;=;b=h(Lyz8lQc#C)suNkcKCg_S;QSANMAQ`AFWDpj*T`JP9C~m9o_(o?&S?<&1 z=i!o*vPGSenD#wk^7;>T7IW8UML76u{A2Rm7KLrrd%wQ&*1YqzC+*O4x^*uvmEYZ4 zY9K4_Mdx%)uyjgFSL7$i2i(H2h$y$5)_Wnu7cOuzqe5k0;+E=}Q?i>Q>4NnJ)g57$ zK{>BwCIZweH@zVy2L;Z&u^1e_chnCjzJMv((0Qq{lG?w{zkESb=-lvu+~3-@O2+x8 z{Bc|;fkKWmrt=CiX)rOR_}P@5;G9alUq9$~g_jTJZe87u57X(k4QS{2 zowAnxI4x2iX|drsR$$TFxT>VPAa2{3jcR`C$bQBbDXnA#>R=CVas3ffWBLgvPQQw2 zYSL#1JMphG6)%v4uRdm>T9Tcm=U3t_@ zuS_35=>bYFy)#hOk4ezgg4gSe`tr{oF^7vja}nEqD!ou7t}T_!6D;wzZ=E?3l&~-9 z`Np7Xk;)F4f#enmk%M4Un;a*}Cw&bsCx_)=UlZS4gk@!A+xzf2-{k0Rl-#mblqiYf zEg)VqPSKbn{H{Dql6xq^9=7X^vE11w8^`@Js(B}jC~Szxy=v2EG_6Y;qfVELW7DDD z2}*A@XQqnz*Bb_aC5|*n$I7KUW0JqtZq2vS7YmQ#68CQn9ql0=H9f= 
zQd8quV#W=VW;E&ag(pLiH%wlVS}zzx1Xzd2YyxXF^2|8f#FGb2%&&^LYJx5ZRzXDS z($vQ?Jc1Gq-_AVgb|*(;pUVyd%-ZBd(T*0$S0rVX#4lC64(#guFE=z*wjNE*rA%Bc zOSi&TRB%;2>$<7Vv%p=(#icY85g`~vbG|OG_Gni>KYGBqXdRVrpin*xX=e4ZZ0y|{ zp(*3Vxv8hUzq5C9ZnB}>mfUn^hkRM{e31RoFOjC1w+<9Vw79s+r=NMyvSFmP3vbn< zOYY$6=$v}^sj4QxV+gTH{A4|1YjnV~I2GOMvvHtz5G_XtHkZ73cE5MWbq2gEG4Ks+ zA3dsUQ@w`u8htLy*~68X8;I~W&q8Ws{*kF9>JxC=2$%J4N`z~C{zajWS-0OI<{$6a z?iV+74YNfwm_Oee_qjS#w4>m5QBFeFs7kqP!Q}R9VSL4D49mm1r~&X3Wwtb}@B z6V52?G!D4M!OxX>&4vV{J+N-KY!J$wXHN@OKetN)n)vZ#i(Gc!1*->*P-|QK_P{?N z?7rQG>E2mWw_jnmlOGwimLHy3mn{B5$julR%9k3A5Zp{S-Z^$JxY>-6tQfKoYPMgx zZ?Mk(PUHUdE!P#`BhM_fZ)l4hZux3sn?EG6FUXX9tFY`qTdO};Jeljb`SIT`7|pfAW9}Pw;8Yr^(vK^b{A(8 z_egQ{O8Nwoq}2K>d5|@iL6-*)Ag@?M*Qgw0AOB+8b}3TyB!b88vY%?)nlIxOUm(|n zj1FmRle=WvaXqE#X=Pj0n-nS4iy`=IdQ&q`Gtn!1Hl2FvQi;uTwOlZ^BWtMiYUQy) zu~*x;FN)lSm=1{dkE-2pJy0Io=HSI0`aW#RE;f$a;$*yE=!eorSsV&YQ{Lh^_bQ(a zn%T>Ad*y{u;@`v$oz)9*m3==3;sw)%yTl2cu^dV~dtOa`-41)AsY;JEFo;?2lMpj{ z;13fgkF*!x1_STz4YMx{Pn}#BO6{PgKU3 z=+8U+)pzfcn&0i-Dc)WIG_L@E*6w}$1RJbT1DUUhEMo^Y@Ss&U=e$s{?|0@R5F!Y) z2C&3TKe`|QS*}^h!9Y^0Pyyjp4jNjt5m1hA2Muez0zI95NN%(xUu)T5PuY+~TF5s< z#ESBj++UdaWk@VMt3q0?5r8b`n6{SlD^!583Z!wp6-eAnNC6}=<4>6G>mt*10Zcz@ z{0lq3j7jmE%AcEQxkdo8oMYNr&aY5c$CP2Tf{B-jiRw*prZH%q|LR#wt@X(OS!)0w z`&sW_SX&jDIh6dT4k$%r14#gmDHE`0nu-cp9*xElnQqF+Rvo!w+CzQ<*@aQgBoiGMY?6k!UnBPD#m?M#EDGWD1&uC!o<3 zB?3+vuS_GLlyNw!D;`58E0d@yC=v=qrCG0;6M!t&tmI(8=<+ia5ME`(>~dTgT{!&R!n|*55cdVz90%*=8if^DhaA42sCb3{x5g0kQ2bs2q*-ie9 z`hRPB)nKgFw1AV1epBl-e}v7Nq*yg8IcDIO`B$i`!#qSmK)Juo@4Ub;;|wfEf@ppL z&H?VA`PyrB>M-sepH6sZ!K>!V~L|k6+EfYBa&Xu(laQy<5zgw6r z5q_H0my~@!oh|EQT4qiL-G*T^US{G3n11RJO9DTJ@tX(?3`fB4 z{b%o{kbh_GjQl57v&esBeTn=BSA59-JL@Op-(2t@{~u5K$p7#(jQro4@{s@FB!v8D zrUO_*F&hkwguo72^8>qhga8lG(sI{s)urh{pq5LKneaqyB9UkT{Fup@NDL-wxe@h< zL^qXrDe$?vMe(Nza?|#n}Q~wD4u2vJG6rjS>HC$em9Ci zG}}i2K-ee-Ey)u@SQ6gM=Xdffd=!04Pdl_(5ZTLtLGiM5rRp;R9Do`CFW|U6^SzYZ zv@DkO{e?efOhyNNJtYTQ0^MF&4@I&wVmL5~y4JFain7FIc_N)j1e62mJAZ({F!3;O 
zO#48PJihT0i8jo#cJQ+NH~Xu~ll1k%NKAI3HigNLHdj$pWQNzmhNTr~L}r2YWIdR5 z`nDJ}kiD{PfWvOAk{z1iLEIGv;KO?}yAHsv^$ni~Q}jE0 zEG1+w+WAdez2fc9ePpfGh3we7#jf5WTg@q<3f5Bh+9NhGrLDO=L*Jwq5wJz?g6 z6ieR{QqSeGg(1tn5mV?p{Mt;aJ{Co`B?Q~=rm}P}`?(IQKT}vA0!xJ1{egt~m%gQV z{uCc3lRmRQSu%iM7_&$;wFaERTM)G@nELdAR%YsF1&7t>%>1|ZVSPm3`^T))|E_N{ z%RoI4*MEr;i5eOp0TCaPpKl;7K%U~`BTw@4^9~jP7R>(c-d-XooB~Qg34RY)>ixHW OAT&1!WDfiX(SHHsPqgg- literal 9262 zcmdT~2|Sct+n*U?>=R}hOGslvP0ZMaY-8V&eHUYlC1c5+ge+N#;>nUkcCsYd+K3{_ zu25MjYl|YKX#4INdG(a%{rdf$?|r`SJ3n*Z=UnGn&i^{seO5*U6D*t_E(%vLfJ5LQ zP7p}6R^p2gcP~4ELIXja>MTSadv@&;GA&OE`SvVXMi*C%YsBZD=>OL8v48X zd3FI9KyU~Mc))-c2n3W1Qo-;d#UEignxOUugA%G|`F8i7(ivcAxBa~DWzn3hPho{; z9_Oj-+*iCAU$&bc8EL+i+!6?_3ZgJx;SD~ajJyOc0at3q6I@l(G zstYQK`ca4V9#0M)SeVOaI0ahRaRm(H*t{=Pq0aVn@U_+@{w0#axoSTRpNQtDmk)-s z5ZRCHUeK!)jn9hD+^Pw=l`L|p2l8AQ5!yCET1jZM?n~XKr*Bq{DKz*0WM;;ERz*>0 zUMV(A*EwMKYc?hhj?F5k+xOL9gZ1a!%7~3+Wbz)=diXdatQSnrh*TVYshA-6s4$dw ztOE-V=iMV={Sgb-{wS;3XcE|&-oEonwZ;P{!zfo4=h5k^bj7%ngut_QeSQ}e>RovS z_Z4%CH`OkRTyVaVC7Aw3(VrK|>Y;x!Bg2Q43(<&@y*e0>-QS3MtycdO0=?-d1Uo#3 zR0o4Vbk5N3R`->tH>uZizYI^D=(HtRbO_X!e6}tcueV#wHrwLTJbf6^o}tfAoO_8) z{Gz?JObuErd~xpfanrPWHRzE;Nw&ocB}y|27v-m?EHxg<&$FDtKge{S-FH>Ln}eLR&?Tdy zpGP{$Wv(y3ziHAv&N>ZOfl4aJ)W0bV9WEC?++CZ~9a%A8gE7=hyl#Q2vO2PBsIKeU zyJdlaJBpb%8VkOtA3AJMef`*S2DWxtDUq4W)QuhJ<|&1R#A0I}@zPnA`IPo|Z_=Ji zjThZ_bXqJ(!-d9g=OBA6X58WP?3`dD3xsoehUt0jiof1?A&Y~!-FBv!(_;o!J+2%g zHsIErtO89{2+w2+?$6Tdn@e!Y^Nn9fG<<1Z+^n_e6ywj7ea@p%d1WahEXvzG;_msE zB>i&CRt2x{vW~r~La=iG94*CTOe~h~h6X=(Kep#CFJI)A0-;ovnO%Q}EwI-EcXkoQmcUf`rFU=kNd8d+-Lgv$4ut96 zB;MXI)+4C<$ZOwrJJCZ=m{OK!EwR%w@;zYIT$P z*(Si(e7VYHD6%7Y(&&ZG=ZXMcwM~8m!-uo;g>sdczRWri?865q#Cixvn^i>>PZ1!H z1V|cAJ34_GB`$4b1R|W-9~doeZTJ||dh+>Xshg+!P*yVA4QrlSTa4-M=Yj#pnPmcN zr`{N!`$&4!yH`U9mJwp1d!}(3F(tneJH4FkQ4`UunslizfLAjSYTmxhw3}o2^=JbK86A6dRSQ+)6J?O z#8j@(tyg^ia!J2)FOM*}j)@_P%C{xh?wGa^gGNDZOR6IC!~04X`Y`CiD)~qzrrma@ zL?d{5a_1LpNt#XR(mTWJc6p8+*k;CU;1V~kQ|ghnB)xxRTj+AH)E{@BF(X->Yds0E 
zWPBK1>uq`_ex2kQohsBoeUWI!@aq#NtK@{|#mo@qp+_Ux4s}((dhWv`A<7RSw)#Zx zE4}S?sb+eKS$jCLBF7}>o{N~~)geRas0q29JxA|J7#FbsAoy#x_S=tD5#Cc&+>uDu zcn^hg1<_XMQ_#ZWTE`fv6E#Zyg0V>+wsRaTHwRkfu6 zeFnE(7)ZLhDXeF~DM>bO-K)ZoA!B+^S#O@ZUILXn&wkg;7wxvwy)sUqnMrYuh%GB= zy0=AmnEg$Y(pN}SMg;RzfwI5h%=wee%Z$mxcE)pdd9WRv0UxH0E)0@LZ62{y8q z?#x7#4}`>)MBQST8tSR5H=3Jrug<0LzLOCjy}gUsHkeYw*KmE(aHSPCpmLb*3UR)- zoGy-;_z4;-qBJf%{>rbWSKrd9Yy9?({cWB}?<-6A?K>zu$K-2<&W8A(#TMEJOC#p1 zb&e-P3pZunS~MHe(|hx_&1TMrtW@Y%zT#C(=|b6{rgAJ^JEMD~2(tL|xWP2KOf-Bugxy43Qz@Ikx5 zv!Spt>_}xnxSwbd+_6I4X^^pUG9o`HL91N9rE1LZ36s{!9-;Wmw@WSf09)j}X>?cB z_JC7gOT&FRGF4)Flk?qukZLs_ux?CjDfdof!IW++>_-fCs3c9~=yj`%D0TQPt}w*C z5VPS5n$cVs%KQ>_7vH>1AW1L$%$we`bT`^hH6(Kzo)&hOBC#2UK-&AXvf>2Z@s^HI zSZ<@FjYSTUS);l{`%;v)cvT5+-j+8EeWK!l5)W)Rvj6dy(9IQ?wnv$5oV%9I$I4${ zTjG|Rj)=JkVhWLNy3ZCMQ+z3KkF{xk(yUf9<@_I*SEY<*(;|(kwo@mKcDpPD#mI6Zy>Sv~x_78s$bF-+6kPVpaiS2UWSpv~Hg*oJC zRd!w&&Dy%~he)B|yyy$a>F7^lnmxx&x#h`MqmIUXHmPh3{cz`fL=8dXfq{?i>;NAN zYw$Sx&gyR}&v7H+HQw?W*2gEcYp{{aM4A5kRXPnHMXy-nYZd2OM-=wlxdU~RDAqb} z6I8sn2{XGbLOF6UO}*=am7sse*?IdmdVi9ALktfuj4^Ipz@qsQU1@Ne81ERXvWDl5 zj=(rf>H0I#vkA_;??`YlJs<+1}1%ZmBV#Lec3MQ`1L)^BXYKf9-oAfO^BU3 z(XXQnds()9zFWEf=z;u{D16>Ajq~7V+GFAN?A(Baab3i}WU4|K#X+_4^}5k_pwL7s|yP z_9XMfaF`}7k_ud85h>Y(`x z;W4Qf@ynT`$yW2p7DKX(ynROeYa;aJBS= zdbAL$zi#kpzchY3o;u;!@y~^HVl5=LW7ov|&HYO5F%hmt57?h4Yu9immkz>Dc?37c zr(S4$!r&WmrRm_?C*xbu5V|yT!$ZUv6{*2-?MiRdEj`92;=o%kJjT*Nph97L!aMOX z{=MZsSR3Eov(t;Hhcn_HsXO`NtC(r$?lHy)suA13fBVq5a!2!FD^3{Lt)7i0T7I?u_XJ3tNN$+s3LN2~?bM`~MPz6vl z=n-SDP6AjStgJqxFy9mZT=tYCZ!TB$W$@0t_NxvB@vH^sEJI?W^+vr6KRjZ*YjV1_ zhPypXB$j1moVhzjip#J^X_nrO`MIQz8p&3n_YoB545>TRk&Y}85KUjAe+e?QtrJu(f9 zNK-<8R6$qK0ef}OhWl_BT=>ULxDb>80&M~;QPa;&Ktk7Q)^iAu=mu0kc!M*AEOmZ3 z{r58kCa(Zb2OpvSlbYp5!~rd0VqlUgHS-joMoKJ zGICBN3|7WT1|yBflQ4KPUKWFsBayJuPEI&^+(rjpsNa!8{ubB9-YGI%|K3;t&je82 z;^6K~`{Zsy5R(w}O#^g`KJ|3$x6J%zV>WaiJ6z~b&SR7Zj#hwnZ+sX?2au7}$HvGW z8LuOwBM1A_5QQMJ!RU`h=<7yQKj9Xw-s1YSFB0Hv>?UL0*IB(8LIS 
z&Uc=I*`d=cBeNzkUnedl*lQhleCV!CHxV@fD#(tVK7L+9gZ`Hf6)>)flaVEorDUb$ z$nro?$dcro@FX%$o+KwN=S-B6$4cRxW#x%ZBxx6CJPu2gmz0*lxkyTqu`YNQDQ6cf z@^5R{*m0c}>yD#$Bl@|~JokbQ@sbF#t8Hk?_YBx|4rS-nByGQql3+v z=udVak`4yA0|)c8Fo}jBN+9SCd-M}KYMA^R_5arNhR)ciYk@}gN=ERg{tTNoL(yv1 zb5zH#@vl=ihB@fO4`=>9xx;`l#sOH11d@II9sJ$DUmUD?M9M6ah2+80_awB^Z;YbjiRRpqr z1tS77;0HQjmAJOtTO-l~oa+lK;F<*}|F9}qCHyj_uPXa-E?d(_wM?A}ek{`lxHSUR z0PQUSnbgn0|?fRe_(~_+116;o;}${LkJ&q5sap8U0TzWzql0;u8H2F89#? zcNS0Rzq#5$|39Ae(f{FT82!IB<)QzyHZV$16P}<>AP|g!A2sO{2tfojSArIS;7U+;rP0*X2!y@V9KfhceFNpSH&6n-+KqoI73dzgF#aWx;Zv)f-c!7pnQ{St}Rn2%!-(R%HkjiMIttDw=A@61_rG+J0 z=}~N`L`}1u5)wNJYw`p)DiKf)q#xP?IEIRcB1^RoILYHXKY?IQEvpBu$$z)Mp*&Gr zD}+d8C#aLC{5T^y2??rujm>FVflj0rXiwUMTBmIx?M%?1wNK(FejOUqI=?!N-z3ml z%EI5qQby7WNAV!+3jy%qy{VxCu&aH?=Rp4X)lYK{t<$Ob@BKsjh<}acKX-30A*}3f>~2Y(PGG6`-~NGc%pi~v@Sj5e1@Fd>1ONa4 From debe44956c01dcb661493c9d0d0cfb45c76da58a Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Fri, 18 Oct 2024 14:06:24 +0200 Subject: [PATCH 08/19] update doc_chunk results Signed-off-by: Michele Dolfi --- .../python/test-data/expected/metadata.json | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/transforms/language/doc_chunk/python/test-data/expected/metadata.json b/transforms/language/doc_chunk/python/test-data/expected/metadata.json index 9960b2860..7eeaaa279 100644 --- a/transforms/language/doc_chunk/python/test-data/expected/metadata.json +++ b/transforms/language/doc_chunk/python/test-data/expected/metadata.json @@ -5,8 +5,8 @@ "job name": "doc_chunk", "job type": "pure python", "job id": "job_id", - "start_time": "2024-10-18 06:15:55", - "end_time": "2024-10-18 06:15:55", + "start_time": "2024-10-18 14:05:09", + "end_time": "2024-10-18 14:05:11", "status": "success" }, "code": { @@ -35,18 +35,18 @@ "num_processors": 0 }, "execution_stats": { - "cpus": 48.2, + "cpus": 27.9, "gpus": 0, - "memory": 25.65, + "memory": 25.75, "object_store": 0, - "execution time, min": 0.001 + "execution time, min": 
0.021 }, "job_output_stats": { "source_files": 1, "source_size": 50276, "result_files": 1, "result_size": 31223, - "processing_time": 0.084, + "processing_time": 1.266, "nfiles": 1, "nrows": 88, "source_doc_count": 1, From 6908816f0948d28662471b8b494acf144b395ab8 Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Fri, 18 Oct 2024 14:23:15 +0200 Subject: [PATCH 09/19] update ray results Signed-off-by: Michele Dolfi --- .../ray/test-data/expected/metadata.json | 17 +++++++++++++---- .../ray/test-data/expected/test1.parquet | Bin 31246 -> 31223 bytes 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/transforms/language/doc_chunk/ray/test-data/expected/metadata.json b/transforms/language/doc_chunk/ray/test-data/expected/metadata.json index f9658c2d8..7eeaaa279 100644 --- a/transforms/language/doc_chunk/ray/test-data/expected/metadata.json +++ b/transforms/language/doc_chunk/ray/test-data/expected/metadata.json @@ -5,8 +5,8 @@ "job name": "doc_chunk", "job type": "pure python", "job id": "job_id", - "start_time": "2024-09-18 16:05:04", - "end_time": "2024-09-18 16:05:04", + "start_time": "2024-10-18 14:05:09", + "end_time": "2024-10-18 14:05:11", "status": "success" }, "code": { @@ -24,6 +24,8 @@ "output_jsonpath_column_name": "doc_jsonpath", "output_pageno_column_name": "page_number", "output_bbox_column_name": "bbox", + "chunk_size_tokens": 128, + "chunk_overlap_tokens": 30, "checkpointing": false, "max_files": -1, "random_samples": -1, @@ -32,12 +34,19 @@ ], "num_processors": 0 }, + "execution_stats": { + "cpus": 27.9, + "gpus": 0, + "memory": 25.75, + "object_store": 0, + "execution time, min": 0.021 + }, "job_output_stats": { "source_files": 1, "source_size": 50276, "result_files": 1, - "result_size": 31246, - "processing_time": 0.071, + "result_size": 31223, + "processing_time": 1.266, "nfiles": 1, "nrows": 88, "source_doc_count": 1, diff --git a/transforms/language/doc_chunk/ray/test-data/expected/test1.parquet 
b/transforms/language/doc_chunk/ray/test-data/expected/test1.parquet index 607bbd21334815df4e88c22a081b67d9d85cfb07..06089be7843d34407556d42f4fca4936d90a6f53 100644 GIT binary patch delta 530 zcmeDC!ub6&;|7bs`a=?;eat?h8yH0ygcuk!w(9>)P-K0}%J8XET55u6CUZh1%c`!T z4b~?oMHf9?rmM9rjqll2rnQ>aZ<*yk?Md@m@##oO(83(Qe--!r@@L8m_O`XQ96h+_ z{@nBH%%|_WxJ|J^h0i@`)w#uMGj#gRUWGim+2yr=ksrH#bYO1^gJ&2IV+ZpA4G~vG zj_4JR6?-!8A6V>UB_PEpDEG1Iy6;NKrE~UtPWTo%?baoJ_NPqxSHzd-@NLkw3XtBM zm@eJJKfQQP?CSL?s@|95cIGAhT64>Ty<_46|sWn|-n!GSlXY;LSeqP2^lXn-Zuy2{q zthVVr)8zl@9h37)gf_3wJi|eT3pVD8pxf?%Y4PM+(LzjW*en+2I`@cOYy+bjhuDnG R-6hqm9Kiw%3;~Woh5*r((u)89 delta 545 zcmezVnX&H+;|7bs`cIOgOPGB`H!zAa2r)2dY}NmpP|51b#*oA%D>cD1lR2T1WmWd) zoagV@W^3=R{;u{-x$fxdEw5);&CY$z+nW}1L#x`?cIl=i*Dl)EZ@(N{_V3J|Eo+vP z_-V-r^7HmKSMA?p-ORx0A>(=T)636mGJNKzKhgBH(KeFpwo!ZYaZO@It_rJ1kpzPU zi$L2V4K~K86^1plw5!<H!hdWylc zT~LU-_vo!c8Pn5-FDz;%>~-PX^OO8(B_-be|Z=$Oun9>!g^~yv)YYKnHij9+b@D?_hk8EVXRgQBilMDa}Kvy+Y@%N Z4UB3WV%s*ClvJ~FgbOe*1ULp60ssVE(uM#4 From fdcf1581cdf10f325782094a197980301dd9991d Mon Sep 17 00:00:00 2001 From: Shivdeep Singh Date: Fri, 18 Oct 2024 14:05:11 +0530 Subject: [PATCH 10/19] Multiple fixes for semantic order transform - Update Documentation - Fix kfp pipeline - Update boolean(True/False) options to be specified as strings - Fix s3_ray sample - Store full path in ray store type and basename in filesytem store. 
Signed-off-by: Shivdeep Singh --- .../kfp_ray/repo_level_order_wf.py | 12 ++-- .../code/repo_level_ordering/ray/README.md | 28 ++++++-- .../internal/repo_level_wrappers.py | 71 +++++++++++++++++-- .../ray/src/repo_level_order_s3_ray.py | 9 ++- .../ray/src/repo_level_order_transform.py | 40 +++++++---- 5 files changed, 125 insertions(+), 35 deletions(-) diff --git a/transforms/code/repo_level_ordering/kfp_ray/repo_level_order_wf.py b/transforms/code/repo_level_ordering/kfp_ray/repo_level_order_wf.py index 6c14abfd6..38a829fab 100644 --- a/transforms/code/repo_level_ordering/kfp_ray/repo_level_order_wf.py +++ b/transforms/code/repo_level_ordering/kfp_ray/repo_level_order_wf.py @@ -68,15 +68,11 @@ def compute_exec_params_func( "repo_lvl_store_ray_cpus": repo_lvl_store_ray_cpus, "repo_lvl_store_ray_nworkers": repo_lvl_store_ray_nworkers, "repo_lvl_sorting_algo": repo_lvl_sorting_algo, + "repo_lvl_stage_one_only": repo_lvl_stage_one_only, + "repo_lvl_sorting_enabled": repo_lvl_sorting_enabled, + "repo_lvl_output_by_langs": repo_lvl_output_by_langs, + "repo_lvl_combine_rows": repo_lvl_combine_rows, } - if repo_lvl_stage_one_only == True: - res["repo_lvl_stage_one_only"] = "" - if repo_lvl_sorting_enabled == True: - res["repo_lvl_sorting_enabled"] = "" - if repo_lvl_output_by_langs == True: - res["repo_lvl_output_by_langs"] = "" - if repo_lvl_combine_rows == True: - res["repo_lvl_combine_rows"] = "" return res diff --git a/transforms/code/repo_level_ordering/ray/README.md b/transforms/code/repo_level_ordering/ray/README.md index 42e68f13f..84367636e 100644 --- a/transforms/code/repo_level_ordering/ray/README.md +++ b/transforms/code/repo_level_ordering/ray/README.md @@ -7,14 +7,31 @@ testing and IDE set up. ## Summary +This transform does repository level packing of data and arranging them to prioritise semantic dependancies. This +was done to prepare long context data for [Scaling Granite Code Models to 128K Context](https://arxiv.org/pdf/2407.13739) +. 
Quoting the paper. + +>To create long-context data, we develop a new approach that packs files from the same +repository together, arranging them to prioritize semantic dependencies. We identify these +dependencies by analyzing file imports and create a directed acyclic graph, where each +file is a node and edges represent API imports between files. After breaking any cycles +in the graph, we perform a topological sort to establish an ordering of files based on their +semantic dependencies. We then organize the files in a repository by placing documentation +and build files first, followed by the ordered set of files with semantic dependencies, and +finally the remaining non-connected files. These non-connected files are arranged according +to their folder structure, using a depth-first search to traverse the repository. Finally, we +determine the dominant programming language of a repository based on file extensions +and presence of build files, to organise repo-ordered files by programming languages + + This transform can group the data by `repo_name` and apply additional transformations like( sorting or output_by_language or combining rows) on the grouped data. This transform requires the input data to have at least the following columns: -- repo name: Name of the repo, it is used for grouping in this transform. +- **repo name**: Name of the repo, it is used for grouping in this transform. -- title : Which is usually file path. +- **title** : Which is usually file path. -- language: Programming language of content +- **language**: Programming language of content The input data for this transform should be in parquet format. The input data is expected to have code data arranged in rows such that each row represents a file. 
The required columns in the input data shoud correspond to a) repository name b) file path @@ -151,10 +168,11 @@ python src/repo_level_order_transform_ray.py \ --run_locally True \ --data_s3_cred "$s3_kreds" \ --data_s3_config "$s3_conf" \ - --repo_lvl_store_type local \ - --repo_lvl_store_backend_dir '/tmp/mystore' \ + --repo_lvl_store_type ray \ --repo_lvl_combine_rows True\ --repo_lvl_sorting_enabled True\ + --repo_lvl_store_ray_cpus 0.2 \ + --repo_lvl_store_ray_nworkers 1 \ --repo_lvl_sorting_algo SORT_SEMANTIC \ --repo_lvl_output_by_langs True ``` diff --git a/transforms/code/repo_level_ordering/ray/src/dpk_repo_level_order/internal/repo_level_wrappers.py b/transforms/code/repo_level_ordering/ray/src/dpk_repo_level_order/internal/repo_level_wrappers.py index 6bf9abeb6..f328884ed 100644 --- a/transforms/code/repo_level_ordering/ray/src/dpk_repo_level_order/internal/repo_level_wrappers.py +++ b/transforms/code/repo_level_ordering/ray/src/dpk_repo_level_order/internal/repo_level_wrappers.py @@ -3,6 +3,7 @@ import uuid from typing import Callable +import pandas as pd import pyarrow as pa from dpk_repo_level_order.internal.check_languages import ( get_dominant_language_repo_packing, @@ -20,26 +21,47 @@ SORT_SEMANTIC_NORMALISED = "SORT_SEMANTIC_NORMALISED" -def semantic_sort(df, logger, title_column_name, language_column_name): +def semantic_sort( + df: pd.DataFrame, logger: logging.Logger, title_column_name: str, language_column_name: str +) -> pd.DataFrame: return sort_sem( files_df=df, logger=logger, title_column_name=title_column_name, language_column_name=language_column_name ) -def semantic_sort_normalised(df, logger, title_column_name, language_column_name): +def semantic_sort_normalised( + df: pd.DataFrame, logger: logging.Logger, title_column_name: str, language_column_name: str +) -> pd.DataFrame: check_and_update_title(df) return sort_sem( files_df=df, logger=logger, title_column_name=title_column_name, language_column_name=language_column_name ) -def 
default_sort(df, logger, title_column_name, language_column_name): +def default_sort( + df: pd.DataFrame, logger: logging.Logger, title_column_name: str, language_column_name: str +) -> pd.DataFrame: return sort_by_path(df=df, logger=logger, title_column_name=title_column_name) def get_sorting_func( sorting_algo: str, title_column_name: str, logger: logging.Logger, language_column_name: str ) -> Callable[[pa.Table], pa.Table]: + """Get a sorting function based on the specified algorithm. + + Args: + sorting_algo (str): The sorting algorithm to use. + title_column_name (str): The name of the column containing file + titles. + logger (logging.Logger): A logger object for logging messages. + language_column_name (str): The name of the column containing file + languages. + + Returns: + Callable[[pa.Table, str], pa.Table]: A function that takes a PyArrow Table + and a file name as input and + returns a sorted PyArrow Table. + """ if sorting_algo == SORT_SEMANTIC: sort_by = semantic_sort logger.info("semantic sort enabled") @@ -74,7 +96,26 @@ def sorter(table: pa.Table, file_name: str) -> pa.Table: return sorter -def get_dominant_language_func(language_column_name, title_column_name): +def get_dominant_language_func(language_column_name: str, title_column_name: str) -> Callable[[pa.Table, str], str]: + """ + This function takes two column names as input and returns a function + that can be applied to a pyarrow table. + The returned function determines the dominant programming language in + the pyarrow table and returns the filename with the detected language + prepended. + + Args: + language_column_name (str): Name of the column containing the + programming languages. + title_column_name (str): Name of the column containing the file + titles/paths. + + Returns: + Callable[[pa.Table, str], str]: A function that takes a table as + input and returns a new table with the filenames modified to include the + detected dominant language. 
+ """ + def dominant_lang_per_repo(table: pa.Table, filename: str) -> str: """ This function takes a table whose rows are documents from a repo @@ -137,6 +178,28 @@ def lang_distribution(grouping_column): def get_transforming_func(sorting_func=None, superrows_func=None, filename_func=None, language_column_name="language"): + """ + This function takes three optional functions as input and returns a + function that can be applied to a pyarrow table and file name. + The returned function performs some transformation on the input table + and file name based on the provided functions. + + Args: + sorting_func (Callable[[pa.Table, str], pa.Table]): A function that sorts the + rows of a table based on a column. Defaults to None. + superrows_func (Callable[[pa.Table, str, str], pa.Table]): A + function that creates new rows in a table based on the values of other + columns. Defaults to None. + filename_func (Callable[[pa.Table, str], str]): A function that modifies the + file name. Defaults to None. + language_column_name (str): The name of the column containing the + programming languages. Defaults to "language". + + Returns: + callable: A function that takes a table and file name as input and + returns a list of transformed tables and file names. 
+ """ + def my_transform(table, file_name): out_table = table if sorting_func: diff --git a/transforms/code/repo_level_ordering/ray/src/repo_level_order_s3_ray.py b/transforms/code/repo_level_ordering/ray/src/repo_level_order_s3_ray.py index fb42b6b81..4d65abb76 100644 --- a/transforms/code/repo_level_ordering/ray/src/repo_level_order_s3_ray.py +++ b/transforms/code/repo_level_ordering/ray/src/repo_level_order_s3_ray.py @@ -49,15 +49,14 @@ } repo_level_params = { + "repo_lvl_sorting_enabled": True, "repo_lvl_sorting_algo": "SORT_SEMANTIC", "repo_lvl_store_type": "ray", + "repo_lvl_output_by_langs": True, + "repo_lvl_combine_rows": True, } -repo_level_flags = ["repo_lvl_output_by_langs", "repo_lvl_combine_rows", "repo_lvl_sorting_enabled"] - -d = ParamsUtils.dict_to_req(d=params | repo_level_params) -sys.argv = d + [f"--{flag}" for flag in repo_level_flags] -sys.argv = ParamsUtils.dict_to_req(d=params) +sys.argv = ParamsUtils.dict_to_req(d=params | repo_level_params) # for arg in sys.argv: # print(arg) diff --git a/transforms/code/repo_level_ordering/ray/src/repo_level_order_transform.py b/transforms/code/repo_level_ordering/ray/src/repo_level_order_transform.py index b3152c44b..a43feda87 100644 --- a/transforms/code/repo_level_ordering/ray/src/repo_level_order_transform.py +++ b/transforms/code/repo_level_ordering/ray/src/repo_level_order_transform.py @@ -18,7 +18,7 @@ import pyarrow as pa from data_processing.data_access import DataAccessFactoryBase from data_processing.transform import AbstractTableTransform, TransformConfiguration -from data_processing.utils import CLIArgumentProvider, get_logger +from data_processing.utils import CLIArgumentProvider, get_logger, str2bool from data_processing_ray.runtime.ray import DefaultRayTransformRuntime, RayUtils from data_processing_ray.runtime.ray.runtime_configuration import ( RayTransformRuntimeConfiguration, @@ -27,6 +27,7 @@ create_store, create_store_params, init_store_params, + store_type_value_ray, 
validate_store_params, ) from ray.actor import ActorHandle @@ -108,6 +109,7 @@ def __init__(self, config: dict[str, Any]): self.grouping_column = config.get(grouping_column_key, repo_column_default_value) store_params = config.get(store_params_key) validate_store_params(store_params) + self.store_type = store_params[store_type_key] self.store = create_store(store_params) self.group_batch_size = group_batch_size @@ -126,6 +128,16 @@ def _create_batches(self, data, batch_size=1): batches.append(batch) return batches + def _normalize_file_name_for_store(self, file_name): + if self.store_type == store_type_value_ray: + # we can store full file_name consiting of full path in this store. + return file_name + else: + # since this store type uses filesystem as backend + # can't store full path in store since, + # store is currently flat filesystem. + return os.path.basename(file_name) + def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]: """ This step is used to do groupby with respect to `self.grouping_column` and update @@ -145,11 +157,8 @@ def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Tab grp_flow = {} for group in batch: # This supports only flat folder structure, so all - # files should be in the same folder - # since store uses filesystem as backend - # can't store full path in store since, - # store is currently flat filesystem. - file_name = os.path.basename(file_name) + # files should be in the same folder. 
+ file_name = self._normalize_file_name_for_store(file_name) grp_flow[group] = file_name self.logger.debug(f"Updating {group} to store") @@ -286,10 +295,15 @@ def _prepare_mapper_function(self): def _prepare_inputs(self): store = create_store(self.store_params) - files_location = self.input_folder + store_type = self.store_params[store_type_key] + p_input = [] for repo, files in store.items_kv(): - p_input.append((repo, [f"{files_location}/{file}" for file in files])) + if store_type == store_type_value_ray: + p_input.append((repo, [f"{file}" for file in files])) + else: + files_location = self.input_folder + p_input.append((repo, [f"{files_location}/{file}" for file in files])) return p_input def _group_and_sort(self): @@ -361,8 +375,8 @@ def add_input_params(self, parser: ArgumentParser) -> None: # See below for remove_from_metadata addition so that it is not reported. parser.add_argument( f"--{cli_prefix}{stage_one_only_key}", - action="store_true", - help="If this flag is set, transform only builds the repo grouping and doesn't write output", + type=lambda x: bool(str2bool(x)), + help="If this flag is True, transform only builds the repo grouping and doesn't write output", ) parser.add_argument( f"--{cli_prefix}{grouping_column_key}", @@ -402,7 +416,7 @@ def add_input_params(self, parser: ArgumentParser) -> None: parser.add_argument( f"--{cli_prefix}{sorting_enable_key}", default=sort_enable_default, - type=bool, + type=lambda x: bool(str2bool(x)), help=f"Enables sorting of output by algorithm specified using {cli_prefix}{sorting_algo_key}. 
Defaults to SORT_BY_PATH if no algorithm is specified.", ) parser.add_argument( @@ -413,13 +427,13 @@ def add_input_params(self, parser: ArgumentParser) -> None: ) parser.add_argument( f"--{cli_prefix}{output_by_langs_key}", - type=bool, + type=lambda x: bool(str2bool(x)), default=output_by_lang_default, help="If specified, output is grouped into programming language folders.", ) parser.add_argument( f"--{cli_prefix}{output_superrows_key}", - type=bool, + type=lambda x: bool(str2bool(x)), default=superrows_default, help="If specified, output rows per repo are combined to form a single repo", ) From 5605b3b29183e80397284d0b9295c332430c1575 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Fri, 18 Oct 2024 09:36:56 -0700 Subject: [PATCH 11/19] Update README.md I wanted "an unified" to be changed to "a unified," not the other way around! --- transforms/code/code_profiler/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/code/code_profiler/README.md b/transforms/code/code_profiler/README.md index cde2baa6c..691f6ff4b 100644 --- a/transforms/code/code_profiler/README.md +++ b/transforms/code/code_profiler/README.md @@ -1,6 +1,6 @@ # Code Profiler Transform -This module extracts the base syntactic concepts from the multi-language source codes and represent these concepts in an unified langauge-agnostic representation that can be further used for multi-language data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form. 
Our framework abstracts language-specific concepts by transforming them into an unified, language-agnostic representation called universal base syntactic representation (UBSR), referred to as a concept, which is consistently encoded within the proposed schema structure. The current version supports the base syntactic concept for importing/including package/libraries, comments, functions. +This module extracts the base syntactic concepts from the multi-language source codes and represent these concepts in a unified langauge-agnostic representation that can be further used for multi-language data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form. Our framework abstracts language-specific concepts by transforming them into a unified, language-agnostic representation called universal base syntactic representation (UBSR), referred to as a concept, which is consistently encoded within the proposed schema structure. The current version supports the base syntactic concept for importing/including package/libraries, comments, functions. Table 1 outlines the fields of the UBSR, which maps AST nodes to a structured schema. This schema captures syntactic nodes (based on AST node types) and the relationships between those nodes (derived from AST edges). The UBSR framework currently supports 21 languages, grouped according to their syntactic paradigms. 
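Note on the `type=bool` to `type=lambda x: bool(str2bool(x))` change in the `repo_level_order_transform.py` hunks earlier in this series: with argparse, `type=bool` calls `bool()` on the raw argument string, so any non-empty value, including `"False"`, parses as `True`. The sketch below reproduces that pitfall in isolation. `parse_bool` is a hypothetical stand-in written for this illustration and is not DPK's `str2bool`; the flag names are likewise made up.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Hypothetical str2bool-style converter (an assumption, not DPK's code)."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # bool("False") is True: any non-empty string is truthy.
    parser.add_argument("--naive_flag", type=bool, default=False)
    # An explicit converter parses the text instead of testing its truthiness.
    parser.add_argument("--parsed_flag", type=parse_bool, default=False)
    return parser


args = build_parser().parse_args(["--naive_flag", "False", "--parsed_flag", "False"])
print(args.naive_flag, args.parsed_flag)  # prints: True False
```

Using a converter also lets launcher scripts pass booleans as ordinary `--key value` pairs, which appears to be why the S3 example script in the same patch could fold its flag list into the parameter dictionary.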
From 469a90eba1c233ef20c427d7c92a36cc41b7db50 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Fri, 18 Oct 2024 13:36:39 -0700 Subject: [PATCH 12/19] intro examples using DPK release 0.2.1 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/README.md | 18 +- .../notebooks/intro/dpk_intro_1_python.ipynb | 7308 ++++++++--------- .../notebooks/intro/dpk_intro_1_ray.ipynb | 1324 ++- 3 files changed, 4550 insertions(+), 4100 deletions(-) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 07b63f513..14d56e8e9 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -7,7 +7,23 @@ This is an example featuring some of the features of data prep kit. The code can be run on either 1. Google colab: very easy to run; no local setup needed. -2. On your local Python environment. Please follow the [instructions](../../../README.md#-getting-started) to setup +2. On your local Python environment. Here is a quick guide. You can find instructions for latest version [here](../../../README.md#-getting-started) + +```bash +conda create -n data-prep-kit -y python=3.11 +conda activate data-prep-kit + +# install the following in 'data-prep-kit' environment +pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 +pip3 install jupyterlab ipykernel ipywidgets + +## install custom kernel +## Important: Use this kernel when running example notebooks! 
+python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit" + +# start jupyter and run the notebooks with this jupyter +jupyter lab +``` ## Intro diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index a6b2efff5..91bb79060 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -1,3667 +1,3667 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", - "metadata": { - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" - }, - "source": [ - "# Data Prep Kit Demo 1 - Python version\n", - "\n", - "This notebook will introduce DPK and showcase some of it's capabilities.\n", - "\n", - "Here is the workflow\n", - "\n", - "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": { - "id": "b15976e3" - }, - "source": [ - "## How to run this notebook\n", - "\n", - "Two options:\n", - "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", - "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", - "\n", - "The notebook will work as in both environments" - ] - }, - { - "cell_type": "markdown", - "id": "eb8b0d5c", - "metadata": { - "id": "eb8b0d5c" - }, - "source": [ - "## Step-1: Inspect the Data\n", - "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" - ] - }, - { - "cell_type": "markdown", - "id": "39a0ab6e", - "metadata": { - "id": "39a0ab6e" - }, - "source": [ - "## Step-2: Figure out Runtime Environment\n", - "\n", - "### 2.1 - Determine runtime\n", - "\n", - "Determine if we are running on Google colab or local python environment" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "1fe354b7", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "markdown", - "id": "8e7c104b", - "metadata": { - "id": "8e7c104b" - }, - "source": [ - "### 2.2 -Download Data if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "3309799e", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input/solar-system'\n", - " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/solar-system/mars.pdf' 
'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" - ] - }, - { - "cell_type": "markdown", - "id": "a5dc2b68", - "metadata": { - "id": "a5dc2b68" - }, - "source": [ - "### 2.3 - Install dependencies if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "1fcec577", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " ! pip install --default-timeout=100 \\\n", - " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", - " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", - " deepsearch-toolkit\n" - ] - }, - { - "cell_type": "markdown", - "id": "243322b8", - "metadata": { - "id": "243322b8" - }, - "source": [ - "### 2.4 - Restart Runtime\n", - "\n", - "After installing dependencies, be sure restart runtime, so libraries will be loaded\n", - "\n", - "You do this by going to **`Runtime --> Restart Session`**\n", - "\n", - "Then you can continue to the next step (no need to re-run the notebook)" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": { - "id": "e8b10be1" - }, - "source": [ - "## Step-2: Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "356c66f7", - "metadata": { - "id": "356c66f7" - }, - "source": [ - "### 2.1 - Basic Config" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e4YMZrBuFycl", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e4YMZrBuFycl", - "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - 
"\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "33345487", - "metadata": { - "id": "33345487" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "## Configuration\n", - "class MyConfig:\n", - " pass\n", - "\n", - "MY_CONFIG = MyConfig ()\n", - "\n", - "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", - "\n", - "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", - "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", - "\n", - "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "b15e6827", - "metadata": { - "id": "b15e6827" - }, - "outputs": [], - "source": [ - "## Add parent dir to path\n", - "import os,sys\n", - "\n", - "this_dir = os.path.abspath('')\n", - "parent_dir = os.path.dirname(this_dir)\n", - "sys.path.append (os.path.abspath (parent_dir))" - ] - }, - { - "cell_type": "markdown", - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", - "metadata": { - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" - }, - "source": [ - "### 2.2 - Setup input/outpur directories" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Cleared output directory\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "\n", - "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", - " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", 
- "\n", - "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", - "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", - "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", - "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", - "\n", - "## clear output folder\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", - "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", - "\n", - "print (\"✅ Cleared output directory\")" - ] - }, - { - "cell_type": "markdown", - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", - "metadata": { - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" - }, - "source": [ - "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", - "\n", - "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", - "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", - "metadata": { - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" - }, - "source": [ - "### 3.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" - ] - } - ], - "source": [ - "STAGE = 1\n", - "\n", - "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", - "output_folder = output_parquet_dir\n", 
- "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", - "metadata": { - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" - }, - "source": [ - "### 3.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 657, - "referenced_widgets": [ - "97b603697cfa4b4ea4e6735b6768ca35", - "e87e8d3262c54cfaaa8768505edacda3", - "b78aa40816e44f7fbebcb24ca68818b3", - "7053c9606a414e978636a7e241909504", - "da0787b239764847a731083997780a85", - "553f3c16839a49d79591d0fc4862bed6", - "c0eb5bc8f6ee427ca42204b3c56f9a4e", - "9d184ed175f0403fb03c2e13dfd04e0a", - "724778729161445c98b187031ae4f67c", - "1cb3bbf7d724411cbe9831543a4aecc0", - "06f9b33494984e4885d5aad813d1d2bc" - ] - }, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "22:43:02 INFO - pipeline id pipeline_id\n", - "22:43:02 INFO - code location None\n", - "22:43:02 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "22:43:02 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "22:43:02 INFO - orchestrator pdf2parquet started at 2024-10-16 22:43:02\n", - "22:43:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 
'total_file_size': 0.11101436614990234}\n", - "22:43:02 INFO - Initializing models\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "e92bbc86f5e34ee4ad7dd853a5136c01", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Fetching 10 files: 0%| | 0/10 [00:00\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
 | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename
0 | mars.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | 07bc0c9a-f863-48e3-9aed-bd289af040bc | pdf | 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... | 2800 | 2024-10-16T22:43:08.048035 | 0.827872 | mars.pdf
1 | earth.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | e141f7a4-3e45-4f04-88d3-60e0a81b195b | pdf | 18713f970989055625bef22209b6f4b6830b9ca22046bf... | 2686 | 2024-10-16T22:43:07.205350 | 0.921915 | earth.pdf
\n", - "" - ], - "text/plain": [ - " filename contents num_pages \\\n", - "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "\n", - " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 07bc0c9a-f863-48e3-9aed-bd289af040bc pdf \n", - "1 0 11 e141f7a4-3e45-4f04-88d3-60e0a81b195b pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:07.205350 0.921915 earth.pdf " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "e5058a21", - "metadata": { - "id": "e5058a21" - }, - "source": [ - "\n", - "### 3.4 - Understand the output\n", - "\n", - "Here are some interesting attributes to note:\n", - "\n", - "- **filename** : original filename\n", - "- **contents** : text\n", - "- **document_id**: unique id (UUID) assignd to this document\n", - "- **hash** : hash of document\n", - "- **pdf_convert_time** : time to convert this pdf in seconds\n", - "\n", - "Let's inspect the **contents** column. See how the text is being divided up!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "f870e624", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "f870e624", - "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", - " 'filename': 'mars.pdf',\n", - " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.35137939,\n", - " 654.45184326,\n", - " 169.88169861,\n", - " 667.98492432],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.09541321,\n", - " 630.68127441,\n", - " 210.66503906,\n", - " 642.34405518],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.84518433,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.02520752],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. 
At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.18510437,\n", - " 570.83258057,\n", - " 374.99838257,\n", - " 581.07043457],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about the Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.22866821,\n", - " 542.98168945,\n", - " 163.86282349,\n", - " 554.45288086],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87440491,\n", - " 500.84011841,\n", - " 477.48345947,\n", - " 534.55810547],\n", - " 'page': 1,\n", - " 'span': [0, 196]}],\n", - " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", - " 'desert world with a thin atmosphere composed '\n", - " 'primarily of carbon dioxide. Its reddish hue comes '\n", - " 'from iron oxide, or rust, prevalent on its surface.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.2026062,\n", - " 482.90710449,\n", - " 237.04431152,\n", - " 493.07443237],\n", - " 'page': 1,\n", - " 'span': [0, 23]}],\n", - " 'text': 'Basic facts about Mars:',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 453.019104,\n", - " 477.48171997,\n", - " 474.9703064],\n", - " 'page': 1,\n", - " 'span': [0, 78]}],\n", - " 'text': '· Distance from the Sun: Average of 228 million '\n", - " 'kilometers (142 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.79351807,\n", - " 431.73287964,\n", - " 451.2142334],\n", - " 'page': 1,\n", - " 'span': [0, 64]}],\n", - " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", - " 'called a \"sol\")',\n", - " 'type': 'paragraph'},\n", - " 
{'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 429.10913086,\n", - " 365.9559021,\n", - " 438.83737183],\n", - " 'page': 1,\n", - " 'span': [0, 44]}],\n", - " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.51646423],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "import pprint\n", - "import json\n", - "\n", - "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", - "# json.loads(output_df.iloc[0, ]['contents'])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "e1a10c2d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e1a10c2d", - "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", - " 'filename': 'earth.pdf',\n", - " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.30961609,\n", - " 654.45184326,\n", - " 174.04208374,\n", - " 667.93347168],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.12528992,\n", - " 
630.69073486,\n", - " 210.66503906,\n", - " 642.27935791],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87112427,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.04595947],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.20942688,\n", - " 570.81555176,\n", - " 375.57919312,\n", - " 581.08459473],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about our Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.15542603,\n", - " 542.98168945,\n", - " 167.32983398,\n", - " 554.36669922],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.91053772,\n", - " 512.46295166,\n", - " 477.84887695,\n", - " 534.48431396],\n", - " 'page': 1,\n", - " 'span': [0, 107]}],\n", - " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", - " 'planet. 
Earth is the only place we know of with life.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.30151367,\n", - " 494.86206055,\n", - " 240.17156982,\n", - " 505.07229614],\n", - " 'page': 1,\n", - " 'span': [0, 24]}],\n", - " 'text': 'Basic facts about Earth:',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 464.97409058,\n", - " 477.47979736,\n", - " 487.02810669],\n", - " 'page': 1,\n", - " 'span': [0, 79]}],\n", - " 'text': '· Distance from the Sun: Average of 149.6 million '\n", - " 'kilometers (93 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 452.86901855,\n", - " 317.90722656,\n", - " 463.24041748],\n", - " 'page': 1,\n", - " 'span': [0, 37]}],\n", - " 'text': '· Rotation Period: 24 hours (one day)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.71496582,\n", - " 396.66357422,\n", - " 451.19915771],\n", - " 'page': 1,\n", - " 'span': [0, 52]}],\n", - " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.53633118],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": { - "id": "72274586" - }, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "In the previous step, we have extracted text from oru PDFs. 
But we have the content of entire file as 'one row' in our parquet output.\n", - "\n", - "In this step, we are going to split the documents in chunks, according to their layout segmentation.\n", - "\n", - "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", - "to chunk according to the document layout segmentation, i.e. respecting the original document components as paragraphs, tables, enumerations, etc.\n", - "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", - "which provides the required JSON structure." - ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": { - "id": "96198fa6" - }, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "305f00a3", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "305f00a3", - "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": { - "id": "369f2cd1" - }, - "source": [ - "### 4.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "5b7b18d5", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - 
}, - "id": "5b7b18d5", - "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator doc_chunk started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:2 completed successfully\n", - "CPU times: user 1.07 s, sys: 180 ms, total: 1.25 s\n", - "Wall time: 1.55 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from 
doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # doc_chunk arguments\n", - " # ...\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "213afdf6", - "metadata": { - "id": "213afdf6" - }, - "source": [ - "### 4.3 - Inspect Generated output\n", - "\n", - "We should see that the documents have been split into multiple chunks" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d8138d43", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 897 - }, - "id": "d8138d43", - "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Files processed : 2\n", - "Chunks created : 8\n", - "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 16)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "9e9ca75c", - "metadata": { - "id": "9e9ca75c" - }, - "source": [ - "### 4.4 - Understanding the Output\n", - "\n", - "Here we see 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries': paragraphs and bullet points.\n", - "\n", - "Note how **source_document_id** is carried into every chunk. This lets us trace each chunk back to its original document.\n", - "\n", - "Also note **contents** is now plain text (not JSON as before)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "3090c950", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - }, - "id": "3090c950", - "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "5 earth.pdf Solar System\\nFor more details about our Solar...\n", - "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "d5f151ae", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "d5f151ae", - "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 3------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "7ad1c60d", - "metadata": { - "id": "7ad1c60d" - }, - "source": [ - "## Step-5: DOC ID generation of Chunks\n", - "\n", - "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_column** to the name of the column where you want to store it.\n", - "\n", - "**This is a pre-requisite for fuzzy dedup** in the pipeline." - ] - }, - { - "cell_type": "markdown", - "id": "1afaa0fd", - "metadata": { - "id": "1afaa0fd" - }, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "6ffd6f54", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6ffd6f54", - "outputId": "1784c80d-6309-4913-9f55-c018b978968f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "# Input for this stage is the output of the doc chunk stage\n", - "# Output of this stage makes it possible for the fuzzy dedup (fdedup) component to run on the data.\n", - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "f78a51b7", - "metadata": { - "id": "f78a51b7" - }, - "source": [ - "### 5.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "5fc77557", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5fc77557", - "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" - }, - "outputs": [ - { - 
"name": "stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator doc_id started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 10.1 ms, sys: 3 ms, total: 13.1 ms\n", - "Wall time: 11.3 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " # doc id configuration\n", - " \"doc_id_doc_column\": \"contents\",\n", - " \"doc_id_hash_column\": \"chunk_hash\",\n", - " \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "a9a8c1fa", - "metadata": { - "id": "a9a8c1fa" - }, - "source": [ - "### 5.3 - Inspect Generated output\n", - "\n", - "You will notice we have two extra columns\n", - "\n", - "- **chunk_hash**\n", - "- **chunk_id**\n", - "\n", - "But still the same number of rows as before" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "da9adede", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 860 - }, - "id": "da9adede", - "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 16)\n", - "Output data dimensions (rows x columns)= (8, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", - "metadata": { - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" - }, - "source": [ - "## Step-6: Exact Dedup\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": { - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" - }, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "4c7a1b94", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": { - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" - }, - "source": [ - "### 6.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": 
"a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator ededup started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "22:43:09 INFO - Starting from the beginning\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 12.6 ms, sys: 5.26 ms, total: 17.9 ms\n", - "Wall time: 14.6 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare 
the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # ededup parameters\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"āœ… Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"āŒ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": { - "id": "eaf1c3c3" - }, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "d824ebf6", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 815 - }, - "id": "d824ebf6", - "outputId": "68f55770-c750-4607-a205-ba183603019d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 18)\n", - "Output data dimensions (rows x columns)= (7, 19)\n", - "Input chunks before exact dedupe : 8\n", - "Output chunks after exact dedupe : 7\n", - "Duplicate chunks removed : 1\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[]
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[]
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[]
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[]
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[]
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Python version\n", + "\n", + "This notebook will introduce DPK and showcase some of it's capabilities.\n", + "\n", + "Here is the workflow\n", + "\n", + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook will work as in both environments" + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 
'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit-transforms-ray==0.2.1 \\\n", + " deepsearch-toolkit\n" + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so that the libraries are loaded.\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + 
"if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "33345487", + "metadata": { + "id": "33345487" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + " pass\n", + "\n", + "MY_CONFIG = MyConfig ()\n", + "\n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", 
+ "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step reads the input folder containing all the PDF files and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format, which makes them easy to chunk in later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", 
+ "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 657, + "referenced_widgets": [ + "97b603697cfa4b4ea4e6735b6768ca35", + "e87e8d3262c54cfaaa8768505edacda3", + "b78aa40816e44f7fbebcb24ca68818b3", + "7053c9606a414e978636a7e241909504", + "da0787b239764847a731083997780a85", + "553f3c16839a49d79591d0fc4862bed6", + "c0eb5bc8f6ee427ca42204b3c56f9a4e", + "9d184ed175f0403fb03c2e13dfd04e0a", + "724778729161445c98b187031ae4f67c", + "1cb3bbf7d724411cbe9831543a4aecc0", + "06f9b33494984e4885d5aad813d1d2bc" + ] + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:39 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "13:34:39 INFO - pipeline id pipeline_id\n", + "13:34:39 INFO - code location None\n", + "13:34:39 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "13:34:39 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "13:34:39 INFO - orchestrator pdf2parquet started at 2024-10-18 13:34:39\n", + "13:34:39 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 
0.11101436614990234}\n", + "13:34:39 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "750f3b6951094b2eb68490c7f5f98148", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 10 files: 0%| | 0/10 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...10116e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011efbdbcb9-f0af-42f0-b191-2f14ce3ddc7cpdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdf
\n", + "" ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "82cc9bb0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 269 - }, - "id": "82cc9bb0", - "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nFor more details about the Solar...\n", - "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "2 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", - "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "4 earth.pdf Solar System\\nFor more details about our Solar...\n", - "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "6 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 pdf \n", + "1 0 11 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:43.410297 0.794765 earth.pdf " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#output_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : text\n", + "- **document_id**: unique id (UUID) assigned to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': 'Ā· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': 'Ā· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': 'Ā· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we have extracted text from our PDFs.
However, the content of each file is still stored as 'one row' in our parquet output.\n", + "\n", + "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", + "\n", + "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5b7b18d5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" +
}, + "id": "5b7b18d5", + "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator doc_chunk started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 826 ms, sys: 101 ms, total: 928 ms\n", + "Wall time: 923 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_chunk_transform_python import
DocChunkPythonTransformConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We should see that the documents are split into many chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 897 + }, + "id": "d8138d43", + "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 16)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...
\n", + "
" ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "cc61dffa", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 1------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 2------\n", - "Basic facts about Mars:\n", - "Ā· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "Ā· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "Ā· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "Ā· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "Ā· Rotation Period: 24 hours (one day)\n", - "Ā· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... 
\n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see the 2 PDF files are split into 8 chunks (4 per document). The documents are split along 'natural boundaries' - paragraphs and bullet points\n", + "\n", + "See how the original **document_id** is carried through as **source_document_id**. 
This helps us trace each chunk back to its original document.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...
\n", + "
" ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "383f40ba", - "metadata": { - "id": "383f40ba" - }, - "source": [ - "### 6.4 - Understanding the output\n", - "\n", - "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk is removed.\n", - "\n", - "If you look at the PDF, the following common paragraph in `earth.pdf` and `mars.pdf` is removed from one of the documents! Pretty neat, eh!\n", - "\n", - "```text\n", - "## Solar System\n", - "\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "85309751-8556-41c6-ac32-84acc941bc8d", - "metadata": { - "id": "85309751-8556-41c6-ac32-84acc941bc8d" - }, - "source": [ - " ## Step-7: Fuzzy Dedup\n", - "\n", - "And fuzzy dedupe is only available in RAY version. So we will skip it here\n", - "\n", - "See this file [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": { - "id": "5370950a-2a3a-4143-8218-f9b4808099ba" - }, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode text for the vector storage." 
- ] - }, - { - "cell_type": "markdown", - "id": "85aba685", - "metadata": { - "id": "85aba685" - }, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "šŸƒšŸ¼ STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" - ] - } + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies.
Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "7ad1c60d", + "metadata": { + "id": "7ad1c60d" + }, + "source": [ + "## Step-5: DOC ID generation of Chunks\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column where you want to store it.\n", + "\n", + "**This is a prerequisite for fuzzy dedup** in the pipeline."
+ ] + }, + { + "cell_type": "markdown", + "id": "1afaa0fd", + "metadata": { + "id": "1afaa0fd" + }, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "6ffd6f54", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6ffd6f54", + "outputId": "1784c80d-6309-4913-9f55-c018b978968f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the doc chunk stage\n", + "# The output of this stage makes it possible for the fuzzy dedup (fdedup) component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f78a51b7", + "metadata": { + "id": "f78a51b7" + }, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5fc77557", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5fc77557", + "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not
using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator doc_id started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 12.8 ms, sys: 3.7 ms, total: 16.5 ms\n", + "Wall time: 13.1 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access.
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"chunk_hash\",\n", + " \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9a8c1fa", + "metadata": { + "id": "a9a8c1fa" + }, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "You will notice two extra columns:\n", + "\n", + "- **chunk_hash** (the hash column)\n", + "- **chunk_id** (the integer id column)\n", + "\n", + "The number of rows stays the same as before." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "da9adede", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 860 + }, + "id": "da9adede", + "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"šŸƒšŸ¼ STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "c97545f4", - "metadata": { - "id": "c97545f4" - }, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:10 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "22:43:10 INFO - pipeline id pipeline_id\n", - "22:43:10 INFO - code location None\n", - "22:43:10 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", - "22:43:10 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:10 INFO - orchestrator text_encoder started at 2024-10-16 22:43:10\n", - "22:43:10 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", - "22:43:12 INFO - Completed 1 files (50.0%) in 0.004 min\n", - "22:43:12 INFO - Completed 2 files (100.0%) in 0.004 min\n", - "22:43:12 INFO - Done 
processing 2 files, waiting for flush() completion.\n", - "22:43:12 INFO - done flushing in 0.0 sec\n", - "22:43:12 INFO - Completed execution in 0.039 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "āœ… Stage:6 completed successfully\n", - "CPU times: user 671 ms, sys: 230 ms, total: 901 ms\n", - "Wall time: 2.8 s\n" - ] - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... 
\n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
3 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", + "metadata": { + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" + }, + "source": [ + "## Step-6: Exact Dedup\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { +
"base_uri": "https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator ededup started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "13:34:45 INFO - Starting from the beginning\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "āœ… Stage:4 completed successfully\n", + "CPU times: user 17.6 ms, sys: 997 Ī¼s, total: 18.6 ms\n", + "Wall time: 15.2 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": 
input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # ededup parameters\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 815 + }, + "id": "d824ebf6", + "outputId": "68f55770-c750-4607-a205-ba183603019d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"āœ… Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"āŒ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": { - "id": "b734852c" - }, - "source": [ - "### 8.3 - Inspect Generated output\n", - "\n", - "You will see a column called `embeddings` added at the end. This the text content converted into vectors or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "7b1c1d09", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 760 - }, - "id": "7b1c1d09", - "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (7, 19)\n", - "Output data dimensions (rows x columns)= (7, 20)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \\\n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] \n", - "\n", - " embeddings \n", - "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", - "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... 
\n", - "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", - "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", - "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", - "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", - "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 
\n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + }, + "id": "82cc9bb0", + "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\nĀ· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\nĀ· Distance fr...
\n", + "
" ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "āœ… Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" - ] - } + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." 
+ ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 2------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk has been removed.\n", + "\n", + "If you look at the PDFs, the following paragraph appears in both `earth.pdf` and `mars.pdf`, and the duplicate copy has been removed from one of the documents! Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedupe\n", + "\n", + "Fuzzy dedupe is only available in the Ray version, so we will skip it here.\n", + "\n", + "See the Ray notebook: [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode the text for vector storage."
+ ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", + "13:34:45 INFO - data 
factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator text_encoder started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", + "13:34:47 INFO - Completed 1 files (50.0%) in 0.004 min\n", + "13:34:47 INFO - Completed 2 files (100.0%) in 0.005 min\n", + "13:34:47 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:47 INFO - done flushing in 0.0 sec\n", + "13:34:47 INFO - Completed execution in 0.034 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 615 ms, sys: 146 ms, total: 761 ms\n", + "Wall time: 2.24 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. It holds the text content of each chunk converted into a vector (embedding). We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 760 + }, + "id": "7b1c1d09", + "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (7, 19)\n", + "Output data dimensions (rows x columns)= (7, 20)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremovedembeddings
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...[-0.051861435, 0.0035226212, 0.030617002, 0.04...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[][0.07728295, 0.024970993, -0.043180738, 0.0580...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\nĀ· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[][0.10598018, 0.025460618, 0.023627337, 0.03905...
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[][0.0077404436, -0.02055944, 0.026426593, 0.011...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[][-0.062105548, -0.0053322907, 0.031277698, 0.0...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[][0.072435796, -0.058001805, -0.019771898, -0.0...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\nĀ· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[][0.091821924, 0.015197902, 0.07716932, 0.01711...
\n", + "
" ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"āœ… Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 
\n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \\\n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", + "\n", + " embeddings \n", + "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", + "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... 
" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "dpk-1-basic-022dev1-py312", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.7" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "06f9b33494984e4885d5aad813d1d2bc": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "1cb3bbf7d724411cbe9831543a4aecc0": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "553f3c16839a49d79591d0fc4862bed6": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": 
null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7053c9606a414e978636a7e241909504": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", - "placeholder": "ā€‹", - "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", - "value": "ā€‡10/10ā€‡[00:00<00:00,ā€‡349.38it/s]" - } - }, - "724778729161445c98b187031ae4f67c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "97b603697cfa4b4ea4e6735b6768ca35": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", - "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", - "IPY_MODEL_7053c9606a414e978636a7e241909504" - ], - 
"layout": "IPY_MODEL_da0787b239764847a731083997780a85" - } - }, - "9d184ed175f0403fb03c2e13dfd04e0a": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b78aa40816e44f7fbebcb24ca68818b3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", - "max": 10, - "min": 0, - "orientation": "horizontal", - "style": 
"IPY_MODEL_724778729161445c98b187031ae4f67c", - "value": 10 - } - }, - "c0eb5bc8f6ee427ca42204b3c56f9a4e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "da0787b239764847a731083997780a85": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e87e8d3262c54cfaaa8768505edacda3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": 
"@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", - "placeholder": "ā€‹", - "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", - "value": "Fetchingā€‡10ā€‡files:ā€‡100%" - } - } - } + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "āœ… Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" + ] } + ], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"āœ… Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "dpk-2-basic-021-py311", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 5 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06f9b33494984e4885d5aad813d1d2bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "1cb3bbf7d724411cbe9831543a4aecc0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "553f3c16839a49d79591d0fc4862bed6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + 
"grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7053c9606a414e978636a7e241909504": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", + "placeholder": "ā€‹", + "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", + "value": "ā€‡10/10ā€‡[00:00<00:00,ā€‡349.38it/s]" + } + }, + "724778729161445c98b187031ae4f67c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "97b603697cfa4b4ea4e6735b6768ca35": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", 
+ "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", + "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", + "IPY_MODEL_7053c9606a414e978636a7e241909504" + ], + "layout": "IPY_MODEL_da0787b239764847a731083997780a85" + } + }, + "9d184ed175f0403fb03c2e13dfd04e0a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b78aa40816e44f7fbebcb24ca68818b3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": 
"FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", + "max": 10, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", + "value": 10 + } + }, + "c0eb5bc8f6ee427ca42204b3c56f9a4e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "da0787b239764847a731083997780a85": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, 
+ "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e87e8d3262c54cfaaa8768505edacda3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", + "placeholder": "ā€‹", + "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", + "value": "Fetchingā€‡10ā€‡files:ā€‡100%" + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index b39e30d2d..04af8ecd9 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -68,7 +68,11 @@ "execution_count": 1, "id": "1fe354b7", "metadata": { - "id": "1fe354b7" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "6665c654-baa5-46dc-d370-9931e0e9eed3" }, "outputs": [ { @@ -105,7 +109,11 @@ "execution_count": 2, "id": "3309799e", "metadata": { - "id": "3309799e" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "00d7362e-d675-4aaf-8c87-d99027d9a06c" }, "outputs": [], "source": [ @@ -131,14 +139,19 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "id": "1fcec577" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "48cf233b-f04e-4b9b-9605-423f87693f10" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", " ! 
pip install --default-timeout=100 \\\n", - " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", - " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ] }, @@ -187,7 +200,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "54e232da-b2a8-4f3e-d983-94259505dad3" + "outputId": "1a1d5f01-0856-40b6-8b1c-8187b0c38d64" }, "outputs": [ { @@ -218,7 +231,7 @@ "base_uri": "https://localhost:8080/" }, "id": "33345487", - "outputId": "c14c3a3d-c074-4535-b75d-19c5effa7d94" + "outputId": "f3e71a25-4864-4f8f-dfce-4af3d7e08a8a" }, "outputs": [ { @@ -226,7 +239,7 @@ "output_type": "stream", "text": [ "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", - "MY_CONFIG.RAY_NUM_CPUS: 1\n", + "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", "MY_CONFIG.RAY_MEMORY_GB: 2\n" ] } @@ -259,10 +272,11 @@ "else: # local run\n", " num_cpus_available = os.cpu_count()\n", " # print (num_cpus_available)\n", - " MY_CONFIG.RAY_NUM_CPUS = 1\n", + "\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", "\n", "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", @@ -305,7 +319,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "fd42f265-445f-488c-8c62-b293424f162d" + "outputId": "ec5beb05-027a-49eb-9a96-271471619d81" }, "outputs": [ { @@ -370,7 +384,7 @@ "base_uri": "https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "f4c02b6f-effd-4d04-8547-f270f721f8d2" + "outputId": "f8383739-a4fb-450c-dc37-5df32aab8212" }, "outputs": [ { @@ -409,38 +423,38 @@ "base_uri": "https://localhost:8080/" }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": 
"2cb0721a-1526-4129-a72f-77c1beefafdb" + "outputId": "14a36e73-a186-4431-a755-f46ccb691130" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:45:46 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "22:45:46 INFO - pipeline id pipeline_id\n", - "22:45:46 INFO - code location None\n", - "22:45:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", - "22:45:46 INFO - actor creation delay 0\n", - "22:45:46 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:45:46 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "22:45:46 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:45:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "22:45:46 INFO - Running locally\n", - "2024-10-16 22:45:48,783\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - orchestrator started at 2024-10-16 22:45:52\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.14609298761934, 'object_store': 3.073046493344009}\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m 22:45:55 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 103563.06it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:00 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - Completed processing 2 files in 0.033 min\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m 22:45:55 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 126716.13it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU.
Note: This module is much faster with a GPU.\n", - "22:46:12 INFO - Completed execution in 0.43 min, execution result 0\n" + "13:30:44 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "13:30:44 INFO - pipeline id pipeline_id\n", + "13:30:44 INFO - code location None\n", + "13:30:44 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}\n", + "13:30:44 INFO - actor creation delay 0\n", + "13:30:44 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:30:44 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "13:30:44 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:30:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "13:30:44 INFO - Running locally\n", + "2024-10-18 13:30:47,436\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - orchestrator started at 2024-10-18 13:30:50\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.872821807861328, 'object_store': 7.436410903930664}\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m 13:30:53 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 110376.42it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - Completed processing 2 files in 0.145 min\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m 13:30:53 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 73713.60it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU.
Note: This module is much faster with a GPU.\n", + "13:31:09 INFO - Completed execution in 0.421 min, execution result 0\n" ] }, { @@ -448,8 +462,8 @@ "output_type": "stream", "text": [ "✅ Stage:1 completed successfully\n", - "CPU times: user 4.46 s, sys: 1.22 s, total: 5.69 s\n", - "Wall time: 30.4 s\n" + "CPU times: user 4.41 s, sys: 1.39 s, total: 5.8 s\n", + "Wall time: 31.1 s\n" ] } ], @@ -528,7 +542,7 @@ "height": 255 }, "id": "fe59563d", - "outputId": "40c31bad-d00a-4da9-8169-9db1bcc47704" + "outputId": "d10c022d-524f-4a13-ebf8-6431114e9172" }, "outputs": [ { @@ -581,12 +595,12 @@ " 1\n", " 0\n", " 11\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", " \n", " \n", @@ -596,12 +610,12 @@ " 1\n", " 0\n", " 11\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", " \n", " \n", @@ -614,16 +628,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 f20aa513-8473-4bf7-a746-a66eb28b722c pdf \n", - "1 0 11 b4c44875-3612-4c5a-b387-2f04c63d1276 pdf \n", + "0 0 11 62e5639f-f922-4ccc-a041-3cb02f1cfd83 pdf \n", + "1 0 11 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf...
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.131556 2.001925 earth.pdf " + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.494027 2.015123 earth.pdf " ] }, "execution_count": 10, @@ -674,7 +688,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "fd259342-158a-4a33-f148-d8462e2f1ca2" + "outputId": "9142246b-988c-4674-99d7-e2f3fffbaaf4" }, "outputs": [ { @@ -826,7 +840,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "68cdc0c0-3bf5-45a2-d2bc-99aa79e3e0d5" + "outputId": "ca74113e-6fd3-488b-836a-60bd58299fb1" }, "outputs": [ { @@ -1000,7 +1014,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "7a800f4b-bc80-452d-c3d6-170e19f3422e" + "outputId": "689f1531-7007-49d9-9a27-39c39f8f2c50" }, "outputs": [ { @@ -1041,32 +1055,32 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "e6f06879-906c-47d0-ef34-b018e4efa00f" + "outputId": "0146bd91-2ccb-4e56-c649-f415a38bfcf8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:15 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", - "22:46:15 INFO - pipeline id pipeline_id\n", - "22:46:15 INFO - code location None\n", - "22:46:15 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:46:15 INFO - actor creation delay 0\n", - "22:46:15 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - 
"22:46:15 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "22:46:15 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:15 INFO - Running locally\n", - "2024-10-16 22:46:16,484\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - orchestrator started at 2024-10-16 22:46:19\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.136235047131777, 'object_store': 3.068117522634566}\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed processing 2 files in 0.0 min\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - done flushing in 0.001 sec\n", - "22:46:31 INFO - Completed execution in 0.271 min, execution result 0\n" + "13:31:12 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "13:31:12 INFO - pipeline id pipeline_id\n", + "13:31:12 INFO - code location None\n", + "13:31:12 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:12 INFO - actor creation delay 0\n", + "13:31:12 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:12 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "13:31:12 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:12 INFO - Running locally\n", + "2024-10-18 13:31:14,121\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - orchestrator started at 2024-10-18 13:31:16\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.963891602121294, 'object_store': 7.4819458005949855}\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - Completed processing 2 files in 0.032 min\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - done flushing in 0.001 sec\n", + "13:31:28 INFO - Completed execution in 0.269 min, execution result 0\n" ] }, { @@ -1074,8 +1088,8 @@ "output_type": "stream", "text": [ "✅ Stage:2 completed successfully\n", - "CPU times: user 1.04 s, sys: 360 ms, total: 1.4 s\n", - "Wall time: 19.1 s\n" + "CPU times: user 982 ms, sys: 291 ms, total: 1.27 s\n", + "Wall time: 18.9 s\n" ] } ], @@ -1140,7 +1154,7 @@ "height": 897 }, "id": "d8138d43", - "outputId": "3e040b55-8c94-4f97-fedf-d2dbead55a72" + "outputId": "e1758b0c-5f22-4368-c3e6-ff778fc9ae82" }, "outputs": [ { @@ -1202,10 +1216,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1221,10
+1235,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1240,10 +1254,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -1259,10 +1273,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -1278,10 +1292,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1297,10 +1311,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1316,10 +1330,10 @@ " 
pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -1335,10 +1349,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -1371,24 +1385,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1466,7 +1480,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "4c3b6461-ae8c-41d9-8c71-e1bbe634b9ed" + "outputId": "3f542446-2cfa-404c-c642-3732f7b74568" }, "outputs": [ { @@ -1569,7 +1583,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "3dc3ec5d-31d7-4081-db16-8bb6051ea80a" + "outputId": "4616d648-0852-4ecb-cef8-f5940e176de0" }, "outputs": [ { @@ -1662,7 +1676,7 @@ "base_uri": "https://localhost:8080/" }, "id": "1f747c0d", - "outputId": "765daa01-138b-4bfa-a75c-bffc80f9e246" + "outputId": "e42500b7-5d1e-41fd-b53b-34d3393f36f4" }, "outputs": [ { @@ -1704,36 +1718,35 @@ "id": "f6e9e145", "metadata": { "colab": { - "base_uri": "https://localhost:8080/", - "height": 883 + "base_uri": "https://localhost:8080/" }, "id": "f6e9e145", - "outputId": "fe3d0a3d-0575-4dd8-8564-e336a6ddb68d" + "outputId": "2add5f0c-3ab6-4336-8a7b-ac8b1b76ab73" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:32 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "22:46:32 INFO - pipeline id pipeline_id\n", - "22:46:32 INFO - code location None\n", - "22:46:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - 
"22:46:32 INFO - actor creation delay 0\n", - "22:46:32 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:46:32 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "22:46:32 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:32 INFO - Running locally\n", - "2024-10-16 22:46:33,897\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - orchestrator started at 2024-10-16 22:46:35\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.126107025891542, 'object_store': 3.0630535120144486}\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed processing 2 files in 0.003 min\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - done flushing in 0.001 sec\n", - "22:46:46 INFO - Completed execution in 0.227 min, execution result 0\n" + "13:31:29 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "13:31:29 INFO - pipeline id pipeline_id\n", + "13:31:29 INFO - code location None\n", + "13:31:29 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:29 INFO - actor creation delay 0\n", + "13:31:29 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:29 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "13:31:29 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:29 INFO - Running locally\n", + "2024-10-18 13:31:31,792\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - orchestrator started at 2024-10-18 13:31:32\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.033103181049228, 'object_store': 7.516551589593291}\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - Completed processing 2 files in 0.012 min\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - done flushing in 0.001 sec\n", + "13:31:43 INFO - Completed execution in 0.228 min, execution result 0\n" ] }, { @@ -1741,8 +1754,8 @@ "output_type": "stream", "text": [ "āœ… Stage:3 completed successfully\n", - "CPU times: user 122 ms, sys: 153 ms, total: 276 ms\n", - "Wall time: 14.9 s\n" + "CPU times: user 123 ms, sys: 145 ms, total: 267 ms\n", + "Wall time: 15.2 s\n" ] } ], @@ -1808,10 +1821,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 373 + "height": 860 }, "id": "1911179a", - "outputId": "b82445e8-ebba-48fa-b1c2-26a9e0743ef9" + "outputId": "45e83e2a-1f70-46b9-e311-c50f025419be" }, "outputs": [ { @@ -1873,10 +1886,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar 
System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1894,10 +1907,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1915,10 +1928,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -1936,10 +1949,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -1957,10 +1970,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1978,10 +1991,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor 
more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1999,10 +2012,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -2020,10 +2033,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -2058,24 +2071,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2160,7 +2173,11 @@ "execution_count": 21, "id": "4c7a1b94", "metadata": { - "id": "4c7a1b94" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "40a119b4-44fc-483d-9ad0-da178a2a8eb1" }, "outputs": [ { @@ -2197,32 +2214,36 @@ "execution_count": 22, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", "metadata": { - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "bd0f3f94-8c48-4c6b-b911-858e389243f4" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:47 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "22:46:47 INFO - pipeline id pipeline_id\n", - "22:46:47 INFO - code location None\n", - "22:46:47 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:46:47 INFO - actor creation delay 0\n", - "22:46:47 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - 
"22:46:47 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "22:46:47 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:47 INFO - Running locally\n", - "2024-10-16 22:46:48,851\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - orchestrator started at 2024-10-16 22:46:50\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.11034622322768, 'object_store': 3.055173110216856}\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed processing 2 files in 0.003 min\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - done flushing in 0.001 sec\n", - "22:47:01 INFO - Completed execution in 0.226 min, execution result 0\n" + "13:31:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "13:31:45 INFO - pipeline id pipeline_id\n", + "13:31:45 INFO - code location None\n", + "13:31:45 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:45 INFO - actor creation delay 0\n", + "13:31:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "13:31:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:45 INFO - Running locally\n", + "2024-10-18 13:31:47,001\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - orchestrator started at 2024-10-18 13:31:48\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.010423279367387, 'object_store': 7.505211639218032}\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed processing 2 files in 0.013 min\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - done flushing in 0.001 sec\n", + "13:31:58 INFO - Completed execution in 0.228 min, execution result 0\n" ] }, { @@ -2230,8 +2251,8 @@ "output_type": "stream", "text": [ "āœ… Stage:4 completed successfully\n", - "CPU times: user 125 ms, sys: 134 ms, total: 259 ms\n", - "Wall time: 15 s\n" + "CPU times: user 136 ms, sys: 154 ms, total: 289 ms\n", + "Wall time: 15.2 s\n" ] } ], @@ -2292,7 +2313,12 @@ "execution_count": 23, "id": "d824ebf6", "metadata": { - "id": "d824ebf6" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 815 + }, + "id": "d824ebf6", + "outputId": "9173efb6-1b95-4a7e-b531-1a611841a4d0" }, "outputs": [ { @@ -2358,32 +2384,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", - " Solar System\\nOur solar system is a vast and f...\n", - " 
$.main-text[2]\n", - " 1\n", - " [132.84518433, 588.96014404, 479.40917969, 623...\n", - " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 4\n", - " []\n", - " \n", - " \n", - " 1\n", - " mars.pdf\n", - " 1\n", - " 0\n", - " 11\n", - " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", - " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -2391,10 +2395,10 @@ " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", - " []\n", + " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", " \n", - " 2\n", + " 1\n", " mars.pdf\n", " 1\n", " 0\n", @@ -2402,10 +2406,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -2416,7 +2420,7 @@ " []\n", " \n", " \n", - " 3\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", @@ -2424,10 +2428,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -2438,6 +2442,28 @@ " []\n", " \n", " \n", + " 3\n", + " earth.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 
2024-10-18T13:30:59.494027\n", + " 2.015123\n", + " earth.pdf\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", + " Solar System\\nOur solar system is a vast and f...\n", + " $.main-text[2]\n", + " 1\n", + " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 0\n", + " []\n", + " \n", + " \n", " 4\n", " earth.pdf\n", " 1\n", @@ -2446,10 +2472,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -2457,7 +2483,7 @@ " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", - " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", + " []\n", " \n", " \n", " 5\n", @@ -2468,10 +2494,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -2490,10 +2516,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -2512,7 +2538,7 @@ "0 mars.pdf 1 0 11 pdf \n", "1 mars.pdf 1 0 11 pdf \n", "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 
11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", "4 earth.pdf 1 0 11 pdf \n", "5 earth.pdf 1 0 11 pdf \n", "6 earth.pdf 1 0 11 pdf \n", @@ -2521,71 +2547,71 @@ "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 
\n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", "\n", " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 
\n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", "\n", " chunk_hash chunk_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", " removed \n", - "0 [] \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", "1 [] \n", "2 [] \n", "3 [] \n", - "4 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "4 [] \n", "5 [] \n", "6 [] " ] @@ -2614,7 +2640,12 @@ "execution_count": 24, "id": "82cc9bb0", "metadata": { - "id": "82cc9bb0" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + }, + "id": "82cc9bb0", + "outputId": "e043fa01-ceca-49ae-b764-8154219c7b6c" }, "outputs": [ { @@ -2646,22 +2677,22 @@ " \n", " 0\n", " mars.pdf\n", - " Solar System\\nOur solar system is a vast and f...\n", + " Solar System\\nFor more details about the Solar...\n", " \n", " \n", " 1\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", " \n", " \n", " 2\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", + " Basic facts about Mars:\\nĀ· Distance from the S...\n", " \n", " \n", " 3\n", - " mars.pdf\n", - " Basic facts about Mars:\\nĀ· Distance from the S...\n", + " earth.pdf\n", + " Solar System\\nOur solar system is a vast and f...\n", " \n", " \n", " 4\n", @@ -2684,10 +2715,10 @@ ], "text/plain": [ " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\nĀ· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", "4 earth.pdf Solar System\\nFor more details about our Solar...\n", "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", "6 earth.pdf Earth\\nBasic facts about Earth:\\nĀ· Distance fr..." 
@@ -2707,7 +2738,11 @@ "execution_count": 25, "id": "cc61dffa", "metadata": { - "id": "cc61dffa" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "aff7a0d9-a791-42a5-d5b7-ad643f59f261" }, "outputs": [ { @@ -2717,17 +2752,13 @@ "========== mars.pdf ===========\n", "-------Chunk 0------\n", "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", "For more details about the Solar system see Chapter 1.\n", "-------\n", - "-------Chunk 2------\n", + "-------Chunk 1------\n", "Mars\n", "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", "-------\n", - "-------Chunk 3------\n", + "-------Chunk 2------\n", "Basic facts about Mars:\n", "Ā· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", "Ā· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", @@ -2736,13 +2767,17 @@ "========== earth.pdf ===========\n", "-------Chunk 0------\n", "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", "-------\n", "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", "Earth\n", "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", "-------\n", - "-------Chunk 2------\n", + "-------Chunk 3------\n", "Earth\n", "Basic facts about Earth:\n", "Ā· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", @@ -2810,7 +2845,11 @@ "execution_count": 26, "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", "metadata": { - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "d53a92d2-0f1c-465f-f11c-b9bc2931f651" }, "outputs": [ { @@ -2849,56 +2888,60 @@ "execution_count": 27, "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", "metadata": { - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "1e63d364-3944-465a-ff7c-6e1dc750b2de" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:47:02 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", - "22:47:02 INFO - pipeline id pipeline_id\n", - "22:47:02 INFO - code location None\n", - "22:47:02 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:47:02 INFO - actor creation delay 0\n", - "22:47:02 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:47:02 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", - "22:47:02 INFO - data factory 
data_ max_files -1, n_sample -1\n", - "22:47:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:47:02 INFO - Running locally\n", - "2024-10-16 22:47:03,977\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - orchestrator started at 2024-10-16 22:47:05\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.128299713134766, 'object_store': 3.064149856567383}\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:06 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files in 0.104 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files (50.0%) in 0.104 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - Completed processing 2 files in 0.154 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - Done submitting long buckets\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - Done processing buckets in 0.012 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - creating document snapshots\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1008950)\u001b[0m 22:47:19 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:20 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - Completed processing 2 files in 0.153 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - done flushing in 0.001 sec\n", - "22:47:40 INFO - Completed execution in 0.632 min, execution result 0\n" + "13:32:00 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", + "13:32:00 INFO - pipeline id pipeline_id\n", + "13:32:00 INFO - code location None\n", + "13:32:00 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:32:00 INFO - actor creation delay 0\n", + "13:32:00 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:32:00 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "13:32:00 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:32:00 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:32:00 INFO - Running locally\n", + "2024-10-18 13:32:02,246\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - orchestrator started at 2024-10-18 13:32:03\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.000544739887118, 'object_store': 7.500272369012237}\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files in 0.064 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files (50.0%) in 0.064 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - Completed processing 2 files in 0.197 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - Done submitting long buckets\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - Done processing buckets in 0.01 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - creating document snapshots\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=16602)\u001b[0m 13:32:17 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - Completed processing 2 files in 0.113 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - done flushing in 0.005 sec\n", + "13:32:35 INFO - Completed execution in 0.588 min, execution result 0\n" ] }, { @@ -2906,8 +2949,8 @@ "output_type": "stream", "text": [ "āœ… Stage:5 completed successfully\n", - "CPU times: user 212 ms, sys: 201 ms, total: 413 ms\n", - "Wall time: 39.4 s\n" + "CPU times: user 270 ms, sys: 200 ms, total: 470 ms\n", + "Wall time: 36.6 s\n" ] } ], @@ -2986,7 +3029,12 @@ "execution_count": 28, "id": "e899ad60", "metadata": { - "id": "e899ad60" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 677 + }, + "id": "e899ad60", + "outputId": "fcfda84c-ebbf-490f-f478-ceef7ca9e83b" }, "outputs": [ { @@ -3049,10 +3097,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -3070,10 +3118,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -3091,10 +3139,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\nĀ· Distance from the 
S...\n", " $.main-text[6]\n", " 1\n", @@ -3112,10 +3160,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -3133,10 +3181,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -3154,10 +3202,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -3188,20 +3236,20 @@ "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", @@ -3250,7 +3298,12 @@ "execution_count": 29, "id": "ab7ea52b", "metadata": { - "id": "ab7ea52b" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + }, + "id": "ab7ea52b", + "outputId": "e38754ee-777f-4ed7-ebc0-9299ee122662" }, "outputs": [ { @@ -3337,7 +3390,11 @@ "execution_count": 30, "id": "6bdd3515", "metadata": { - "id": "6bdd3515" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "e6e3f2c0-5b23-4336-bc95-013921f0724a" }, "outputs": [ { @@ -3451,7 +3508,11 @@ "execution_count": 31, "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { - "id": "20a153fa-fd56-401e-86be-4f7617affcc8" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "530e65c6-7ceb-4c73-cb87-50da46c78add" }, "outputs": [ { @@ -3488,32 +3549,50 @@ "execution_count": 32, "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { - "id": "228df6b2-bc62-494b-9697-03ece98d7853" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 914, + "referenced_widgets": [ + "8b7571c585df431eb901fcdebdf8177e", + "06107a2f48b3491f91bbe84e46e10ba0", + "bd74356eca18423aa0373c808d9097e3", + "7e13e8779a81400f996d4428c74acfaf", + "a75892696be546a3970962bae7bf732a", + "68997339f13240a4824a9e416096bee4", + "919b086abd314077bbff75687392bd91", + "b4c209371e7a403986991a786cfb296d", + "6c08de2dd9a2402c90b1a7a645db9b13", + "91fff81a1de8487c9009e872b751edb0", + "ada62d24cbcf4361acbb21808f334d33" + ] + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b10eecc1-cd17-49c1-e3b1-b80e0e1bfa86" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:47:42 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "22:47:42 INFO - pipeline id pipeline_id\n", - "22:47:42 INFO - code location None\n", - 
"22:47:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:47:42 INFO - actor creation delay 0\n", - "22:47:42 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:47:42 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "22:47:42 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:47:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:47:42 INFO - Running locally\n", - "2024-10-16 22:47:44,003\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - orchestrator started at 2024-10-16 22:47:47\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.101744843646884, 'object_store': 3.0508724208921194}\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed processing 2 files in 0.011 min\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - done flushing in 0.001 sec\n", - "22:48:03 INFO - Completed execution in 0.349 min, execution result 0\n" + "13:32:37 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "13:32:37 INFO - pipeline id pipeline_id\n", + "13:32:37 INFO - code location None\n", + "13:32:37 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:32:37 INFO - actor creation delay 0\n", + "13:32:37 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:32:37 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "13:32:37 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:32:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:32:37 INFO - Running locally\n", + "2024-10-18 13:32:39,609\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - orchestrator started at 2024-10-18 13:32:42\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.943363189697266, 'object_store': 7.471681594848633}\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - Completed processing 2 files in 0.087 min\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - done flushing in 0.001 sec\n", + "13:32:57 INFO - Completed execution in 0.333 min, execution result 0\n" ] }, { @@ -3521,8 +3600,8 @@ "output_type": "stream", "text": [ "āœ… Stage:6 completed successfully\n", - "CPU times: user 422 ms, sys: 241 ms, total: 663 ms\n", - "Wall time: 22.9 s\n" + "CPU times: user 607 ms, sys: 226 ms, total: 833 ms\n", + "Wall time: 22.1 s\n" ] } ], @@ -3578,7 +3657,12 @@ "execution_count": 33, "id": "7b1c1d09", "metadata": { - "id": "7b1c1d09" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 659 + }, + "id": "7b1c1d09", + "outputId": "70612634-b336-4ad5-ddb3-782ca0676bae" }, "outputs": [ { @@ -3641,10 +3725,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur 
solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -3663,10 +3747,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -3685,10 +3769,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -3707,10 +3791,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -3729,10 +3813,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -3751,10 +3835,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about 
Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -3786,20 +3870,20 @@ "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", @@ -3865,7 +3949,11 @@ "execution_count": 34, "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "d151e618-6528-40b5-fdbd-1c67291a7279" }, "outputs": [ { @@ -3887,7 +3975,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 31, "id": "dc0a6728", "metadata": { "id": "dc0a6728" @@ -3901,7 +3989,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "dpk-1-basic-022dev1-py312", + "display_name": "dpk-2-basic-021-py311", "language": "python", "name": "python3" }, @@ -3915,7 +4003,353 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.7" + "version": "3.11.10" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06107a2f48b3491f91bbe84e46e10ba0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_68997339f13240a4824a9e416096bee4", + "placeholder": "ā€‹", + "style": "IPY_MODEL_919b086abd314077bbff75687392bd91", + "value": "" + } + }, + "68997339f13240a4824a9e416096bee4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + 
"align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6c08de2dd9a2402c90b1a7a645db9b13": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "7e13e8779a81400f996d4428c74acfaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_91fff81a1de8487c9009e872b751edb0", + "placeholder": "ā€‹", + "style": "IPY_MODEL_ada62d24cbcf4361acbb21808f334d33", + "value": "ā€‡0/0ā€‡[00:00<?,ā€‡?it/s]" + } + }, + 
"8b7571c585df431eb901fcdebdf8177e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_06107a2f48b3491f91bbe84e46e10ba0", + "IPY_MODEL_bd74356eca18423aa0373c808d9097e3", + "IPY_MODEL_7e13e8779a81400f996d4428c74acfaf" + ], + "layout": "IPY_MODEL_a75892696be546a3970962bae7bf732a" + } + }, + "919b086abd314077bbff75687392bd91": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "91fff81a1de8487c9009e872b751edb0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + 
"justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a75892696be546a3970962bae7bf732a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ada62d24cbcf4361acbb21808f334d33": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "b4c209371e7a403986991a786cfb296d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": "20px" + } + }, + "bd74356eca18423aa0373c808d9097e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b4c209371e7a403986991a786cfb296d", + "max": 1, + "min": 0, + 
"orientation": "horizontal", + "style": "IPY_MODEL_6c08de2dd9a2402c90b1a7a645db9b13", + "value": 0 + } + } + } } }, "nbformat": 4, From 27e7134ef20f19c1ed0132820a3a187c5f21b229 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:02:01 -0700 Subject: [PATCH 13/19] Update examples/notebooks/intro/README.md Co-authored-by: Maroun Touma --- examples/notebooks/intro/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 14d56e8e9..30c7a7b24 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -14,6 +14,7 @@ conda create -n data-prep-kit -y python=3.11 conda activate data-prep-kit # install the following in 'data-prep-kit' environment +pip3 install data-prep-toolkit==0.2.1 pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 pip3 install jupyterlab ipykernel ipywidgets From 71e0dc2bb4d40a87f8b09900edbc31dcb575a24b Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:48:13 -0700 Subject: [PATCH 14/19] Update README.md pip install in 2 lines --- examples/notebooks/intro/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 30c7a7b24..4a45cbbad 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -15,7 +15,8 @@ conda activate data-prep-kit # install the following in 'data-prep-kit' environment pip3 install data-prep-toolkit==0.2.1 -pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 +pip3 install data-prep-toolkit-transforms==0.2.1 +pip3 install data-prep-toolkit-transforms-ray==0.2.1 pip3 install jupyterlab ipykernel ipywidgets ## install custom kernel From b3acad2eef97cb110f6b049af9f025614ce01921 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:50:09 -0700 Subject: [PATCH 
15/19] Update dpk_intro_1_python.ipynb Python only needs data-prep-toolkit --- examples/notebooks/intro/dpk_intro_1_python.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 91bb79060..f3659afcf 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -149,8 +149,8 @@ "source": [ "if RUNNING_IN_COLAB:\n", " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit==0.2.1 \\\n", " data-prep-toolkit-transforms==0.2.1 \\\n", - " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit\n" ] }, From b236dc062a6d7c6db0f5d0e73fb3c32348c25ae9 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:52:02 -0700 Subject: [PATCH 16/19] Update dpk_intro_1_ray.ipynb We still need data-prep-toolkit, and the ray version of transforms --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 04af8ecd9..5bf90522f 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -150,7 +150,7 @@ "source": [ "if RUNNING_IN_COLAB:\n", " ! 
pip install --default-timeout=100 \\\n", - " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit==0.2.1 \\\n", " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ] From 4d070ca8ec30af78ad3c9da9983ed8a8759cb758 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 09:44:48 -0700 Subject: [PATCH 17/19] Update dpk_intro_1_ray.ipynb We need transforms only for ray version --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 5bf90522f..da33a3499 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -151,6 +151,7 @@ "if RUNNING_IN_COLAB:\n", " ! pip install --default-timeout=100 \\\n", " data-prep-toolkit==0.2.1 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ] From 4c831d01342b512f8471109b78201f38506ea6de Mon Sep 17 00:00:00 2001 From: Maroun Touma Date: Mon, 21 Oct 2024 16:25:34 -0400 Subject: [PATCH 18/19] fix link to pdf2parquet readme.md Signed-off-by: Maroun Touma --- transforms/language/doc_chunk/python/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/language/doc_chunk/python/README.md b/transforms/language/doc_chunk/python/README.md index f962717d6..fbacf4ade 100644 --- a/transforms/language/doc_chunk/python/README.md +++ b/transforms/language/doc_chunk/python/README.md @@ -4,7 +4,7 @@ This transform is chunking documents. It supports multiple _chunker modules_ (se When using documents converted to JSON, the transform leverages the [Docling Core](https://github.com/DS4SD/docling-core) `HierarchicalChunker` to chunk according to the document layout segmentation, i.e. respecting the original document components as paragraphs, tables, enumerations, etc. 
-It relies on documents converted with the Docling library in the [pdf2parquet transform](../pdf2parquet) using the option `contents_type: "application/json"`, +It relies on documents converted with the Docling library in the [pdf2parquet transform](../../pdf2parquet/python/README.md) using the option `contents_type: "application/json"`, which provides the required JSON structure. When using documents converted to Markdown, the transform leverages the [Llama Index](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser) `MarkdownNodeParser`, which is relying on its internal Markdown splitting logic. From b297156f0ec4c201164122a283506a658acdf9ba Mon Sep 17 00:00:00 2001 From: Hiroya Matsubara Date: Tue, 22 Oct 2024 12:29:06 +0900 Subject: [PATCH 19/19] implement subdomain focus feature in data-prep-connector (#725) * implement subdomain focus feature in data-prep-connector Signed-off-by: Hiroya Matsubara * refactoring Signed-off-by: Hiroya Matsubara * bump version Signed-off-by: Hiroya Matsubara --------- Signed-off-by: Hiroya Matsubara --- data-connector-lib/pyproject.toml | 2 +- .../src/dpk_connector/core/crawler.py | 6 ++++++ .../src/dpk_connector/core/spiders/sitemap.py | 20 +++++++++++++------ .../src/dpk_connector/core/utils.py | 5 +++++ .../dpk_connector/core/test_sitemap_spider.py | 20 ++++++++++++++++--- .../test/dpk_connector/core/test_utils.py | 16 +++++++++++++++ 6 files changed, 59 insertions(+), 10 deletions(-) diff --git a/data-connector-lib/pyproject.toml b/data-connector-lib/pyproject.toml index 402a2642a..737234bdf 100644 --- a/data-connector-lib/pyproject.toml +++ b/data-connector-lib/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "data_prep_connector" -version = "0.2.2.dev0" +version = "0.2.2.dev1" requires-python = ">=3.10" keywords = [ "data", diff --git a/data-connector-lib/src/dpk_connector/core/crawler.py b/data-connector-lib/src/dpk_connector/core/crawler.py index f024e63b1..491806398 100644 --- 
a/data-connector-lib/src/dpk_connector/core/crawler.py +++ b/data-connector-lib/src/dpk_connector/core/crawler.py @@ -74,6 +74,7 @@ def async_crawl( user_agent: str = "", headers: dict[str, str] = {}, allow_domains: Collection[str] = (), + subdomain_focus: bool = False, path_focus: bool = False, allow_mime_types: Collection[str] = ( "application/pdf", @@ -96,6 +97,7 @@ def async_crawl( user_agent (str): The user agent string to use for the crawler. Defaults to "Scrapy/VERSION (+https://scrapy.org)". headers (dict[str, str]): A dictionary of additional headers to send with each request. Default is an empty dictionary. allow_domains (Collection[str]): A collection of domains to restrict the crawler to. Default is the domains of the seed URLs. + subdomain_focus (bool): If specified, only links under the subdomains of the input seed URLs will be extracted. Ignored if `allow_domains` is specified. path_focus (bool): If specified, only links under the paths of the input seed URLs will be extracted. allow_mime_types (Collection[str]): A collection of MIME types to allow during the crawl. Default is a collection containing "application/pdf", "text/html", "text/markdown", and "text/plain". disallow_mime_types (Collection[str]): A collection of MIME types to disallow during the crawl. Default is an empty collection. @@ -140,6 +142,7 @@ def async_crawl( seed_urls=seed_urls, callback=on_downloaded, allow_domains=allow_domains, + subdomain_focus=subdomain_focus, path_focus=path_focus, allow_mime_types=allow_mime_types, disallow_mime_types=disallow_mime_types, @@ -155,6 +158,7 @@ def crawl( user_agent: str = "", headers: dict[str, str] = {}, allow_domains: Collection[str] = (), + subdomain_focus: bool = False, path_focus: bool = False, allow_mime_types: Collection[str] = ( "application/pdf", @@ -177,6 +181,7 @@ def crawl( user_agent (str): The user agent string to use for the crawler. Defaults to "Scrapy/VERSION (+https://scrapy.org)". 
headers (dict[str, str]): A dictionary of additional headers to send with each request. Default is an empty dictionary. allow_domains (Collection[str]): A collection of domains to restrict the crawler to. Default is the domains of the seed URLs. + subdomain_focus (bool): If specified, only links under the subdomains of the input seed URLs will be extracted. Ignored if `allow_domains` is specified. path_focus (bool): If specified, only links under the paths of the input seed URLs will be extracted. allow_mime_types (Collection[str]): A collection of MIME types to allow during the crawl. Default is a collection containing "application/pdf", "text/html", "text/markdown", and "text/plain". disallow_mime_types (Collection[str]): A collection of MIME types to disallow during the crawl. Default is an empty collection. @@ -198,6 +203,7 @@ def on_completed(result: Any): user_agent, headers, allow_domains, + subdomain_focus, path_focus, allow_mime_types, disallow_mime_types, diff --git a/data-connector-lib/src/dpk_connector/core/spiders/sitemap.py b/data-connector-lib/src/dpk_connector/core/spiders/sitemap.py index f24d4088b..de18ab596 100644 --- a/data-connector-lib/src/dpk_connector/core/spiders/sitemap.py +++ b/data-connector-lib/src/dpk_connector/core/spiders/sitemap.py @@ -28,6 +28,7 @@ get_content_type, get_etld1, get_focus_path, + get_fqdn, is_allowed_path, urlparse_cached, ) @@ -42,6 +43,7 @@ def __init__( self, seed_urls: Collection[str], allow_domains: Collection[str] = (), + subdomain_focus: bool = False, path_focus: bool = False, allow_mime_types: Collection[str] = (), disallow_mime_types: Collection[str] = (), @@ -88,11 +90,15 @@ def __init__( self.focus_paths.add(path) # Domains and mime types filtering - self.allowed_domains = set( - allow_domains - if len(allow_domains) > 0 - else [get_etld1(url) for url in seed_urls] - ) + if allow_domains: + self.allowed_domains = set(allow_domains) + elif subdomain_focus: + self.allowed_domains = set() + for url in 
seed_urls: + if fqdn := get_fqdn(url): + self.allowed_domains.add(fqdn) + else: + self.allowed_domains = set(get_etld1(url) for url in seed_urls) self.allow_mime_types = set( [m.lower() for m in allow_mime_types] if len(allow_mime_types) > 0 else () ) @@ -155,7 +161,9 @@ def start_requests(self): ) def _parse_sitemap(self, response: Response): - yield ConnectorItem(dropped=False, downloaded=False, system_request=True, sitemap=True) + yield ConnectorItem( + dropped=False, downloaded=False, system_request=True, sitemap=True + ) seed_url = response.meta["seed_url"] diff --git a/data-connector-lib/src/dpk_connector/core/utils.py b/data-connector-lib/src/dpk_connector/core/utils.py index d2dfa760d..50a9c9981 100644 --- a/data-connector-lib/src/dpk_connector/core/utils.py +++ b/data-connector-lib/src/dpk_connector/core/utils.py @@ -57,6 +57,11 @@ def get_etld1(url: str) -> str: return f"{ext.domain}.{ext.suffix}" +def get_fqdn(url: str) -> str: + ext = tldextract.extract(url) + return ext.fqdn + + def get_focus_path(url: str) -> str | None: parts = urlparse_cached(url) if len(parts.path.split("/")) > 2: diff --git a/data-connector-lib/test/dpk_connector/core/test_sitemap_spider.py b/data-connector-lib/test/dpk_connector/core/test_sitemap_spider.py index 826963f2f..308c4ff89 100644 --- a/data-connector-lib/test/dpk_connector/core/test_sitemap_spider.py +++ b/data-connector-lib/test/dpk_connector/core/test_sitemap_spider.py @@ -1,13 +1,12 @@ from pathlib import Path import pytest +from dpk_connector.core.item import ConnectorItem +from dpk_connector.core.spiders.sitemap import BaseSitemapSpider, ConnectorSitemapSpider from scrapy import Request from scrapy.crawler import Crawler from scrapy.http import HtmlResponse -from dpk_connector.core.item import ConnectorItem -from dpk_connector.core.spiders.sitemap import BaseSitemapSpider, ConnectorSitemapSpider - @pytest.fixture def crawler() -> Crawler: @@ -22,6 +21,21 @@ def crawler() -> Crawler: return crawler +def 
test_init_subdomain_focus(): + spider = BaseSitemapSpider( + seed_urls=( + "http://blog.example.com/", + "http://contents.example.com/", + ), + subdomain_focus=True, + ) + assert spider.seed_urls == { + "http://blog.example.com/", + "http://contents.example.com/", + } + assert spider.allowed_domains == {"blog.example.com", "contents.example.com"} + + def test_init_path_focus(): spider = BaseSitemapSpider( seed_urls=( diff --git a/data-connector-lib/test/dpk_connector/core/test_utils.py b/data-connector-lib/test/dpk_connector/core/test_utils.py index 096a4e194..54f15a70d 100644 --- a/data-connector-lib/test/dpk_connector/core/test_utils.py +++ b/data-connector-lib/test/dpk_connector/core/test_utils.py @@ -7,6 +7,7 @@ get_content_type, get_etld1, get_focus_path, + get_fqdn, get_header_value, get_mime_type, is_allowed_path, @@ -83,6 +84,21 @@ def test_get_etld1(url: str, expected: str): assert get_etld1(url) == expected +@pytest.mark.parametrize( + "url,expected", + [ + ("http://www.example.com", "www.example.com"), + ("https://www.example.co.uk", "www.example.co.uk"), + ("http://www.example.com/path?query=string#fragment", "www.example.com"), + ("http://localhost:8080/", ""), + ("http://www.example.com:8080/", "www.example.com"), + ("http://www.sub.example.com:8080/", "www.sub.example.com"), + ], +) +def test_get_fqdn(url: str, expected: str): + assert get_fqdn(url) == expected + + @pytest.mark.parametrize( "url,expected", [

8AV^n$4~yq7tEKLDqGYPR><80!yP`XKh&P~~v`d~x z$f3tF)kG}#(9by8FX1S+$Q&%$lA~Q`f;fxPow~x;ZLh(eGD`1Ro%@ota8cfeAyO*( ziQ3*(K+R8>55({a0k#FRsjHM)S*2yCW3X@=`0(k2ObxJQ?S;b^4Iu^c z>FqFd;lQ`K^L2Cybz%Af>?a$nHTE(ldzF3-(`c`IH$CFoYcRACdEi97o*ypY9WS%S zH%I4&H-gz6HFe*86%9Wbp2eZq?qk8Eaq|~m{r&yO|3L|5{y!+8jLeMe|H}gXj}pqx z#>nu$%m0@W%0$n=!utP93GHxE&f03_jqd6~+TJN>??Q76MB0w*`fs4U5r!0wL@8~T zxG%)rI^Au2Wq*CWVPQ;UnO!oSWm?rlBTzD;xR5G1fh*-F1COUf#AKx4<>!~5>e~Ss z8kr~pYj$pAumi7WXlQUCl$Q_Z1d!IQ!s0_Bsp0eakF7xETc3X3e%TI8PKNP=#)Ad= z&)!Ko`B$LvBb-|n9-KfJ1J7rEgDPWdYiCHyXz#A@$Ct1`h`-$u-tM2DSld{fe3F5I zBQ52L`*n|`13HqLw!-4lwypvo#dGk-Y9wIfAMZcVHiOM``U8~mQz3Ev^A7^$2QG)1 zS58w_0vE5Mu&OLyV(0@`b#rodw*SSWJ-o!sT#N>QR3L$o0JtY2;@4EtIhrd2gMXgg zOGL;oeyIIRI@`IA>nbb`Ef0;2B*vP`!vwGfaLaFsddX}48;SYM!PvL6ysJf>g?;rS z0hrh9=;VBCs&{vHr`PK2XlLYN*J9+_{27rOn%o54KiIea=kCV|eBaPw;dYYzzYM--D1F{RfF zMM((-;DuxrPX%ub#OTH5%;MnU1c?4e;^hZCWAQ5j4*~Dy*Nq zOSRPY^vd$^euu&Y*wNwrDQ5TEWX%K~AJr07HPuH51cwcd&k4j2FSCuCp%!`Sw|Y0zo-^@R`D+GY{&sBCNZd~=f9tP z7u7n&)jR>Gl&iI-s(+R@eaDTpq<2qrP>s#Ny>wH0gfo0vlea#udvkbKd}C<-5iv<$ zea?4A7N-AzI@!b}*bPJKFTArIdc|2hvU#2ve%cL>e&h~5`ylVS4fN6MikSY1YTvz@!n)o?J@g^U z&hjC8ef1eNHuNen{ec?;*8}{;-tT7gfgR(4-5t~XhF;%c&{Y4qz8AUJJN<_JczLA$ z`QJfT|DHbe!m8@Vs`AhCQSX|p{7k|@}?iXz6MK^kIXMEQR zi@(ped}oaFr`q?$nc1B;?8+b2cXeX?g;(+9*L>j9s(~Fay9)bfj&sJPM8+2Omb9{b z%V${c@c3KvF~@FgefRS&*0#PI<@V8)btj2tHRo5pxAc26{W810GIW4<=>}%|-yJM= z{s8x`-rmP1d%XULn7_OVbo~bYwi77*Y)krLwO0Gu&i}O6{0v;w%)qtsQ(Hd1l(ER;g@w|Z=yik! 
z0F@G7+m=QnPZ6qhi>2TQ5zrg>NEA{^nEgpL@Bqh%tmF~cP#6!npxg~3r0TTv6Rw-t zWpwWueE)q;FB0L6wZz4Sk6Oxi9$#Gc8+6i;_?7lNleJvaPthK*$_WU=1t@z;@C&*g zL~NgliR?~Tk7~7r<w^KX#(Ov4|I7|@+jJ3=IWTZ{u1 z?HkHW@=sThjJ}+VwpCx}~ld#<$ zXukHtid#+Ey10Ld`8fvt30|E8xWwnfThJ9V&wIs(f{1fecRWT)5j^%CXgGdlR_)uS z#cdY4svMq}Vk!$It>`N9x;m*dREG)}VPWnY*!~+gA*QaVS&XCah{zjp^xc|`w2$ht zK2Ys%MIc5>m!^+3K3;hLu(@-#P*lSz*(Vx6mk^8z0N*k7IC_g%Sa5swsg%N+8&MS3iAbE|=VIaU+PP5w>Y2CZw@jnB^?}M^~YzzPBa*F$!1>qo}?k z*l!bllvZl8Jti33md*!>xfZ01t;Ns|^JJU@_rNvh%;9s57D{lM+jc8^@rlLEpU^Ad9#!YJO z_|i!_4P0_9L4>a>B+%Oo|2FzA_zw|tXG|qJKu<0}{IEF~1`a%M9nxeA#3{rv6l?%m z7CAC1urQ;L%*xpA5;v#600uA#V+KG%1Vj|igE#iJA9MM?aMFs#(unI09!{*S_)?S_ zC3)3~a*$%5O|-?oh3kyTU`YdTU-9u+82q=jwKN$?M!u- zwXFyQ@rdV@KHJfaac zh-&9Y_2AodDT^$^Tyr>ug_{@|hv!FMSHch?<|#JK2M5-GF-^y&JO?dBd8hifj8RLEXe5$Wzs1q2Ou zgKYsP4N>NT3wDMc*~P3Hyed~XBtw|`9M>*ykv2m0>9jOa6I?O8<*M%m%Uvy61e1S9 ze#<%}{2xj$EZq_Ft6eBh(B9Gr?s-Q|S6WlKNC4y?0`E_~mW35-_p<)t6IC2v(<16M z%JP)+Rimxq(mc0SA13{aE6`w_IRz^HxacEZ-b&D(}%Bvstb7IksREU*0DG zRi|h%b!qt}i4QS$zQ@vk-7_>*>@Pg%PV}(dOx2E8cIWd5KcI;!kEfKj5qUN@N6!%v z$AJ5?Gqr^s5s_I~aoc8aoQ++C$$TBBC55k|>K%1BCl2Ei-JJ~4ylMVEb_nfcMx{si zSf@_Ev2?2Teqz)o=^KGQp&Qh&zs1?~@;zJ>`Ildj6K!1FC|Nb0Vi3C+*~zTSRf?wu zwq?sVG3{oB&}VoKp)2(`5&xMYd>ea?Vh}%EM-tsA-AieZc>a?g;Q5Su(4`Th)Z*j7 z>@n_}Z9es!VRYCb!c?;Pt}j#^qxr=_Pi|(nAMsecQh_6WF?-gWH2-RDY@;;)Mf-P| zbZ?dYGKsZVzJj@LItsoSm(6XrCK(Z z$=vIu>R4MS_nkb=<3#HcB}y5O%0sEwzz_u565U4k$erm{^X&B$DEeGhZkM$&Iwy=W z38vvmWCoY%VhOJY4X-Lh@Wv-~$?Rwh?ClBmgEgeDz8_3Q7ROUQ+pmZF%7l4)|; zj$4o%(=0GT#cj!(P|9z#BSY%xFqSIJv2)mZuMgrXNkthSoo(Zf^?`Pcy9q=P{BqrBC29jggIhhS@W5cHP ziDmO0u?7juWsiAd6R$NI(JtpkYG2!C^g|ymyLjo-j&IxXP?y48X+F}#R_7*?8UaExL;wxC z@@oKdmG3g-bR=%Q&QQE-0-?CMcxX8_;@iVBrwA?|y7pt6dJlH5J;7$N45SARg)BGK zpOj2iYj}{oK{)DQvO))xx##4}7BhJthWle?!D%qqgqphHaGhw2K&b-!-5ek1L_gRl z%!?QYu~|p)Rdz$v=rND=7Nq|{Pnvy4rpHr~hF#WWLbLs~^CxltJwc|Xo?EXZ77D^y zJQ8aHGjKlP`f?0abl-8R<)S47j#m%z6xjAV@(UReV(#Q%`%#gK8e>eQ z@A{wYbF4x0Qx4;6&qv&<{;okZFPITfd7ejv(NILf5-(@|c;O5?;3!rf-EmIxUM&M+ 
zYt}el`GnB|ME`~$V-1oiogKRHswu!XMMYVB=aQe>?&i)gC$9-~F6bWIGAld@NBriD z-H3{(nZ<@Ad4P|_^n}yUacs1gZoF(n(MG2ywD-z+%Ti~!fqtrB2vh9Zn4cq2GNy!? z`o;(0LzC6#jn|NF(cH*14ak2pLBp1!1EuljZiq`}=fk>SYd3&dp)#E;INLfM<;iSM zUSyboPC7hDOHl>+bYE`uTRXf(ge58MSX#2~i z)xql)yawOrB>}DUutg)H`QX+|P0kt8n?tt;{D)~fYNcQiNRT*#M%M^rg z;l;Gsl^#5mc`0#IBT3WA5RWM%4t{7gy%T7iyhgOQvZ+(R;JU{)z|v4gY8$+q!GSzn z>dW5aRPFhM6a*OS>ooClEi1)|?inG-0$7Z2V6`ep`aw&3W^tU(d0X1$gWEc{VyS8Q zy^kIUFLH&Jd=Yx(10URJSzn`W(A*1d+pjC53l}66ee@f=n+TbvLJU#q^&?vIft>c~ z8UD;@YL?0}B<34C<@$%SSMqh+2UAM46Apxax<#D|JxAl`{Gnt#DxS>4xr5N^ctcGG zqY;SMNO#52N*%t6I@9ire3}FIe%AClC~^c3T6_Sa1OnD^=viU4qUR?|&w0vu;#Vep ze4@he?StrG!*m!&e2C{dCPVI!Sj;z-Ye|8U8u5rO(n3I<*CrJ+Zx(W?HtA<@a9JLeeB;?AM zle`O~>V9$|m#d^4k&7}L$wwB(E1(LSL;}_%Bfda48;8PaY#x0)RkRh5`dO*7sb(7D z6LF?+-kI&oS8t5p-$Pe#ih%F@;c7|&tp@$I;1Z>~rLYP5m$kQH5)(Pcp>v4&@=NYe>opw{v1Zvy8nDj+EpPz;43C;Rl)3->=@&Tj;JFMUj} zZYDtDK+5{}Q$h$V!8C>8?OD(Tn7Oy>eoetiNBR6~D?*f&)9*`52Ca`AMZ4;jrClwF zKr1$v1t(nfBpb4`WG_X}pF#?B$|FHSh_DE`UGyToiqY0mOZWUMSiHy=#t7_qsuR%EP8K_UVk5FdpbrbTf0(UAA4Zb1q9?ry*mYq zH@o48&_6MZHO(lD<*=BvpzHjW2*i&Zm*U2s+Od&a{1?GIlH-OC4()pD9zQG?vx;W` z1KRmfz(1X$!9X_c5z^%>^0jP-#=>i1xNaiF7Q9)5UvbkdrnCqa%6^O~rTfIcW|UO1 zrH6;Ef&y;nt}Ih!dB2k-36+p_CunUJXGM?IxPGzox~1mZlw?;yIC@*4%Bs~FPXz&$ z^PmktV!4VpJ`OB-XYdj}R>&AbW436VZ9=*3H3+>Fj%nb$*D=QkR071L7+)yTzZ2H% zOUR;A8DM(Pb$XPO#yYs;IQJ?wW>TGSMAoB@_q6U4aEL9z26awXs$pob1Hx~}RVU=s zP$%fl>7U2VWW}9Aji5=`rgBZ&;X3Cu9J?}IR{F1e?rrl??19A=kf~z1O&`9AQ@U#s zgwZvs&PfZ77D#V7G%V1*v$Q2?iUicGzDzao9EkDKvthK13C|k&BH&9MTisvi_bR7f znjl>q)6A&?cUuf-+aYM9V`c<4%6qAObGlWdS3S7`QJ2ORP%okq?mD7^5w{)SSBsOG zT>!2719b_kQAXDb>l(k;juaMy$R`a0TC#KIP1J2XrORa-dSnF9kzOv5GHedOC+;l4 zWPjQ1v72VH=ErLWB^DVd-n{W7GXQwkJW!OsuFxB#G?fLeI~2Nh6(peP^5*oX^9Ri* zdD~2E6dt-$6I=6N500=?A7x0_I|Ta03j0AgwZgncf>>bg@u7 zPvMyUL@YeyQ>g9WQ;`_Eqz8f*f(?_dZ%NE?C*KIZvq*(-rVwS!MX85px6}@Duc1B6 zBNNqAYNFT^m4rKcMo;2m=Wp33DmcZKQ(aC_W7Nb}M{e zYTt!>g&TDBB8YmC2v#QW^W!9f=98g-EE^4zVIBwJh-?mT>27g{ov9O{<_g?@+0WM5 
zh36jki3j;`*u&dsc*d0%#&y4ZMc}kLl1qc@2S%#z(3&-ESV{e*43k{mBWyB*=0{E? zx<-gb-QALenW|8=)y%!uO~^SboRFc?hLWtz;1Bu(&FFs5IJBHzNze7slWsDdy2y# zV>MY`hwc6|`Dla#<#2yJUvC@chyIKpCtay`-=`0jU1laA_!JJpT`tQG!Dpq}^H8px zw4Ywi*tiwQB67Z0kL(bTz<}evKIe3|X&PE>ciGjK zfO;^bN;t*j0MZL;<5|^|@~t@t2-Pe&u1_&iZnZnLVEMXr&mXdXSx>MMDb-&H^1gdg zGi`ToEMedM5nc`S0Hd>;`k5HJS%%>;H2pite{8Gm7Nw5nb zTsnV&#I;+RjWz021A}uoTkELhv0^CilsDa~bhk7SBiFK6C3=&#oR`%J*W&Owk4SvJ zQwE%0`&Sm(NCmdofp1-C*q3O(F*p09B16i58>W^({aPmn_{wnGq=RE2vUfRmD6EcC~C$cA~eJ0N}}b}s=#zZr{JpSx)ej1mEr?Gb(5QV zX|5jr&$rirpE47Eza@@QDW2%8T@A_ch!TU@6OATa?y~3!IQzl6W7kbm$&>o1`sEgL zElaC|1ephEwPN}G$YugGW#lR9%c>@)GSr&=M@IKcYd;_|wpHtx6ve6O$+yC!$mAmm z_xR76+6@*vG$bgv@(88IUJ}WDG6#zmYD|zg(2DYO{qj@DvyRK^%>Y3Hoa_kf+0k`x zBANX!@b;%^c1G7a&(q_fb-`Da(FjCgibv?ZeI8*OOPAKx#=wPPJh;DU%PGh8Qh5fA z{1rRA91_5&o&2hmStbk`(7O&51U(`ZSEoD&^r`H!^`lUfxG#vvEr^=uk}#$Ez14jq zVBI%-Zfa5in)eb!+S;$kg|G~QVIu#R&*ibCO*-?j*X$=6TIt64;mXH4WitCoBTa$! zfA4{oJ>HL;saSQ;ZcBQ@c6iOm8m*Q>2a{^`7T{~w4xoOQhPg{P!NYF;&DkUVP8R*o z`>fu|xMHGTI}kZ{nZp-JPxU0CmxTRmNT>%S|9t*TLJ7zrY)zfK=3}d#K zv$TDlln3a~dbiHeP6;eAck@R$`PPMWGx>RiDZt->R8q(=&wy|w5l${T29Yp8Q`-xH zUWL2MksOXBA0~W%E#L#R5Z`W7GhISlZ&qbfy(gX2=_p}4?F2*NgV9LbY|yd}$kM_Q z;5Ab+5`OT-@Y*a!-z#=Nf4!4r3{LkJBu$_)^?T^gb*mv76RH(n z@_NOp!%SkX@e6eVus1ZH^Vyy%flGOBt-%TbFUbjUc7|z)-vPPLvkC$WD}Cf`FPDgu z=BGnan#R4H*potC6{?saexP;T!#rm>#;X4{7cZbs@4APyE-J@* zn-kLDDekFVgDJ8KuXO*sR$k4dsBYA83Y4$;aBYG?>}(_cUgBO}ph1}S0BI-}N zf?Kv0u&V`X5NlgWuSAtizssuNhe{gfui8Lt=$(SKuH*=a70=^yaBElb$PQxjbu zmFo(Aw!(MqOMKTDLkzjpK~KpK8H)7`hFa7v=s%BFt!UCPB4avZMIU>W?bvM$I$hw# zikrihlYbZ$O}FA*K4bGC1hfCbbrTDT(ms_yv#~Pu(36aYmY`kh^xg<$TgI<1C~s2{ zaCp45aE#DrPS9xX%F;UFg-uxU%+Y(Ct*RN(c?y)%UamFgBqtf^IF*Aw^SI3@d- z!1bigsTpJD+=Ay-pmv7==32{LDuDL!fN|Au)k-x=1Lw!g2Nx0Mn+sj>7vsz}JhE?* zQi|yr2q<=Vv+(siyiP4k`$V1Vrb#dQF!ofjf!7)cD7OdKyhYUA0Q4G`n^cb1FF|{H zdN)9k#ggjMeaHPY$z+=G?tCzlED+s)QS7Kg(s*D5`i$)a*mw3+UFI#&Aw`=}16%AZ zW7{y#T^#W6wYZVvKzyvJzp{eIO-2|*=)3~8MFCtU)0?q8uKM2}W`3skJ5&K)f@CLv 
zc%X`r*T}}Np;Rq50ci`uTkH3@9w##UD2jV%{i#j@2RFg_UR?@{lqSqYsdEE)L!gi80Qro!O z99}G~bQo&Z(-=B%FkZ{}0rk}!Jb)d+iONGdq|vxwdlmQCyV&WChh)_N{!4<1biE3o z9vzD;Bvdkb4Js({Qg!3e3TcJ?Y+1s zk8>bA_8?mA*hij}k1dbDpDRV#PNBLRT&sJbck45v$zSYXOHVF@l3YcA#Nyb42*^4L z>XbrS zl$BLHFd-HBieJB)|HGZM6zhue+X;>)cDo9|{7i2x)+Q@g2+U-%rWKwp8rv8EKLyeF zj&e5yAHXVe`T1bA++h8jqSnQ~A*(^0O51}Ug9|*0zM%LF6$#px{7~ib7e7ADcE4|Tz$hp} z0RCXtp}g3*B)L<5PVe?y?ewNTA~@M^^|&^(4maHr{;%c2>yyPC!pjsdP9NVaTF95O}Prqw9ln7mlFL#p36k|hJHL8zS*zB#u;m~iaFzN zaARr9fiK=%)=(gXLz^=BQE?obl~$8UpayPtHxoPXQEJ3T;95Gu)?W82WPud5RrDQ`o&(P^X;NK0LOSUufFbZZ@OUNA?a?}lBD$cR zm^{XLC8Bnl4{(T!4Z(y!t<2HbnzB+trl5Hj?cZqtt{XI4^q=!@EM@VgiF>~()1sCB zFh_&#VX1Bu-x*Qj=M2Rr34sY*RYcJr?>QCP?mw`*hfna9)B|vP6@Z7Hw8AR3G>59Y z021-+Hmd(Dwf-a{(T~PMcaH9)M)F*&%h}bciy)t;r!t#FQb~ua{ZfeW7mr9$fE3fy zk0cgosVHGTrfDvp#%-*ML6&b3z+G+DUCM!Yviugq8!DsMBp%Qy=Bu}?9oaY_1M(am zFX?C)>b)`|NIj$WkQinRT%MXsB=Xes#F>av0X0pWPshZ@1S=})U`~7g^$Nx3$u4J% z{!=_OzjQ8S1snFOPFu~d*-21^DzVVKxM#L7qz2Gaq&E0m*J?a7tEmm>nCH)DWP;n> zPMx(pYaxq=NQ|=w_;wb@X)$$~@d=OKl>Eoe>ikH17-UT_%#wDHcvXt-26YEb-I({w zH<04IYw=Alehe7?%n1uTyv#-2G%|zc+7c$-X9-XcN3$@6WlIr`VdZAp}-#UN^2n6{mr zaVaF?B=8&WgxRC^LVe{{4PPUc{Y$2KmBq z8VAJDt)F{M03MagI+A4uqA|P)hxQ>b4sHN_=m@fy_oQ5PX|!KXJ!OETceR2dY4Ns9 z!i0`w@|{p@W+4Im;ZWfoeQ#y0!L!5T$+=zcVr{0@nYyZg^Jr$In;EnUePGsKl^&Vg z0+aJSdM>X+%spqzxbV%DPSr-e#)C}AC0b9W-&gS_GsKsD9#?3O(C}QOQ2aV^C}Y+D zr=5cUZq0T;V6Eg4?u#cm|1`jsn3HT~fMSv0U2wVzkh*L*gk&O?8Gq=0%h%X1$>*!A zbkx2tGl~*H9)cTy?Jq9FTMBxiOOgxkaUq;UlPNz}A~$cG zGSC-3y`k@6sds`jL!ZSnfz~S#AzaeB(HHzoKeDHz!nh=P4`%k`d>h&MGjHxzv@1;t zglV(1+C}mgkvaHW2+1pw$;KiNrXEXy-4=GzT2H{YhjJ!5v?3@rMOa1oZRBD3GtRC* zrCx-W8^F?NP8&%VPGunIDLLL_<7d%fu`nNNQKf-4Vv-aZ?c+p^W4Ax1(oOWt{%4^1 zi%xHBYASf<>uKrm4=5XqlplmeN7CIer z^<2L=1v z+}Z(Nkn|PV19Cf1vV3%WW7l^=^MZ4;<{tj{h`rpnSQD4b+z?p80q!-S)u%)tGbpIt z{{T!tv%gCFuE0VBE(8RdhVYFs`3mZai0NevvA}%e#65m!!c^MlnDx(Pd3Jz)ZyGPv z)G{n1-Ix>eliXd+_=&t^ElW&G^}fPB9%{;`9C!yvK7ZR;2r4d5^#3%6;n+#{dH7>9 zX+2eLv5hGZXcrOR1^0C*Ty%iscL%jDc1UR86>;-pQ>|5xu=clI1M@e8Q5zTT1EiE- 
z24dJ`)XQ6msTh4Y+fyJ?c#SpO^q|C}7-?q;DrbZCWxU1jI0uu9bi$ceUXX335jq!0 zirz0UfQH~2_VU; zRP@0cD+EE>8;~C3$;}^Ez`zU7Q1!f_*$3nY7D4NQn9LFwaa+BnLhA-~Rh;v-H6`kE z(b9I_URC@keN$07{J&He0f7?=GREWmYn(JNNQrm{n%H*GQyxfM>@7~gYP(aCVFIUw zhB4LZP*|?D90c1?*Ci?iI7X3jGu$5uFD^$w$e^`S~@Jl>3X@GS9aDy zk>^|p-;auHT1y))q@9u3~0)jmjkQG|LleBM!P$_-1ngC{TbFo=1w zTwFgFWVL$Cm0NY|E!pJcpM#XxZytFNUVvbt%>wet{vyIe<=OsZ+URHco#O`s$!t5s z*@0ZIqvq|FS=wsTodo(AhTLV@0RmWr1S&BICB3DpqQvI≫z6k%Q51@%pB(=SgO^ zFRa_3L=!Jms9T2(`W9K(Pjn0}r#1ukpOU+t2!d8kb@j=Y*SHB#-x!BSGp%bMTRh>U z^Wv>}+Rlj+8b0LvyEiKxQGKhyaKnd@$wj@}HTnt90U1xYt7SpyCfViJgD`R-{lM6& z$CkX}4ujQq^XsTF_Bo7Cr@$89@N-GVK8i_TmGWhTF$v-Xa2X-fC6jKwwgmH>D^kuz z4{N-UTO&rcmMI<^>8h*yBT zI@JsJ23#3`8B33YiAxSI=)ts#zODryIntL zVLalNjy_tK)1JTM(b>G$2HBj=59vLvr%qm6ffw>`ztC^()oX_(rMtuDCkT9JRbY&S z#`GnFLv%r?k)vl{aE{VGZm__O_OOqZ43>dM?AP?U+$DFK@dYiZM`k?U=nxEL73*SX z+$o;DKGWHM4nC4Cm%eH^8OLd-1b#%E)x35YRr;|QTWd#@m7B!p?~cF&b^D=WeniEh zoDB8V0{0ApM$BFhQ;pC2t2`QF2ZVsni+6vv>&FTKTsThab8eshmzcX1ScjA*g59%2 z3TM1RCd1tGjuVPAY!b2xmlCL&uHD{D7%dND8u{iTwUxl5COaEk2RoUQRH^FE*>!M0 z`LlyIYn- zEt-YL-67p(O~X3DN51T~ug@=eIDV$q^{16^%KHYZTkW*X|FivSjWP|=3UKb6|0$G0dTf0qJ0Z-1)eAHYs zLxaFfS=T+LIidj1>f!d;sact(){>&&wzf0;M=X%#p9&dn31phb&+QW^>S@c7wMXvUw z6e0|igQtlz)mE9rYr>Tq+HF8AORmmzopTc$)%PKx8jVXzPZq5R^)Y4_v**XOEg#xp zBMKcAij&x1)E_BzyxXcDm0YlA=wl2etG12*rrT?k9e@}Nyv&bU+k-Kh7`$DMn-rYD zJSm1s!Q0Exz}po>;Lw;rnBoZ(9scmdFjVj@&5z(DxhJ*yYa52p=(J;ebJ4`Xm}a7> zZx^@oTZ@T;f=58pC&HDO0sLXJ;yWYwXb=Wo0$-3YXvPKDI(fpVaf}+ zvcy%h`iCW-&V1iony_}kz%?w3>QWRoR%04u1Ur9n^GqnPZc;C2wsqu-JOfuvmR4_d zek6w@4Z4*nQZ2!)@N4A_8u{~%R-n>#F`GzLp=C#DDU9^2{iftAu6zEq0D0%9Ps*te z4QIrZbT5eW6x7|?M}VX4-{W*6N1xeiA>KaUhhmGQ6ZC?$dxSG*whd%wdZD9f<`e-i z)90=XY^?n6{}{u#FH1rO27_a%0?pV5g~CC$R>+e{-u1Y(7P_lPUm@x3L0xD@r9hDR z;Xa`P5|s&FLzr(-%_HCMy9?sLVc7Kn=}-3E)9u@-*KcW?x0YTTvW3__zNFjmXwSvE zg)AEa0=HC3I$sXk0dB(xwL9`hFH(r9zVi5L_|g6jnDC7&sE*Fl+h0;_LyUx(j1cb_ z%+v2YN!cR>fsT{4h$fnQfY@}KRFnInlByzKw?nXI3JcO)OEi;bkHw?FngvS=w9#kbrbzD&LzjtUb> 
zu2m&L9I|R-0Ap)kG)DOoQ1G($MK;z|M{cL4;SA8s{k$egq%vaLTP*~^VIzh2=E$9H zE&e+KZRNv>$Zo~}_?V}%dYlSOoXYt&&+H8SNDXPdT?>lgC39IV%2^d&MNX2h(6sfx*@2%Nfm%>dV>bAagbjk1E z*-l-3$wQ|ncb!cI!@fG6C{I3!h@n1r7V^CQma;s%r+x{kl`TH;>WXxWuDLgI>Zgxa z@8KYESnCqL?ZR>zR;%>up%w z=7z~y(|Q3=NKq)XwZ;4YaTnSWlgc$f5wJ_2_*r<y@cQsSDMr2(>JGrn6*yI`L=h$D$@AWn#dfCi5C|5XekjVEd zqz(lKz{&pNd@(VjDh^(Xqk$sfeLf_xrHHcySs$e&&?|`PRk|PdF`#?s&^6LonjI|S zd=~VLWL`=Nb!Hn55%wRCz1qwy2OpZPh2yl`oa(haEY$`gn=GFgtkRxruid?Qhrm8k zRxnb~@+SNw{7U5~hr+g+iBh>OjsDs*;`sWpk-rkfvp$m1s+BT&V+96LCD*$RpS(N8by{Fl_{?||0;}Z_267pf zr)oJMX)M;ae9}VhgmTmYaa6Rd+&?H!*sd9Iyf6ehIpI4;vTJYpuAqE}{yk*)Fb%9` zaj&D;-4@oON*YP!OZ=>elrVuWa?^xaD%mGENipHXgIq#vkcy?vDw&qT8d=GX;8i|; z)mXNt3G3oigQ_xJowYkyNIK>T(hFg#0!(|V)hA)(<7j{708CE()e?wvqtmn8G3zyb zzOghZvb(P}Z zYLf>|i~U4Z=hRF5j?`bt+cd`rgCUd>%~|iP{p|{Ioy)!A^6En3llSUsB1taOg7CT5QD$?8-6qj$;>WT&dvRrYPZ!zB_IK| z^4ce4Mto!KiKf>5gDVIRYvo3HRQs?T_p`;E;jzzAh5d~VYcB-~xFI^;)3l6@Usc^L2>qRmkT@E{+N3;YRA*~bQbW{9HPz@pRh18L~jp9GD^$tJo_FSD7I=+1HN zqp5mpAJF`{7^^qBe;2XjE6Y}5gY=I`dyFJ5x22c0QjH}WP1rD>x1fm}xPnmE=CC=u z&%x1rAikYJ$M{H=R;{osp{0@pv`%z?-?|2Gax4|^AkOoSLiE2?=Pg$RP&A2ioHcCh z)U1vhK6V(4z9be>{=TiFJ(L(8JigzEjIpQgg(Ra72uFwGYsjVb5hbnWj7bT#^Z!~{ zQG#BY&S7XfZW;tFlqlpk4z?&iq5}FMBck$Yw5RYR1PqMUexDbunt*s~v%KGqZScnmR~TFm2fP`}yISWJl3avkPSI@^ z)@o`Dm_FT}QCsD)kU7t&1V%vi8>%mlY!rwkYG8ND|5_M?+ZG$^7P<0(70Zk4@o=!A zSO+!}lXksfC#;oQ<(e>%qn_E*2s)NjRyP$JO&Ij};C>|tCuC>MtJDxq1 zs%+5ZZvj_oBJtqsKBu|xb`g}JakSNxM;;$YEI}p_cnBO81U^~$XMkrYl&pe~My&ac zl3k{He6Mh)(WQEJVY3DEMx7Tj0h&}>_5xpOAatlE_n88i8Y|vATBlh^xPrSnX&1HgNejp6$r-Y^Fl>~>( z+2SU4XQw+wp-}R&0FEFW)G{Nw54+Mbyr@hU$13&CE_Z_sR1^GpmeXXMwj%{fZdq1$ zfM-05^VHt0j!ZQjI3`?^4;|#)8+Uoit0FE`H$vfb>7dS3rrdlF7;uJsSNT=#{7pSi z=mv&&xVm=*94`%yS`h}{L)06|C9?s-r5|YLEXOt5YTVh)@0rrj4J$lj6Son=OpRfl zu?b$$e6R^Vwwrq-F<`VSo@@oSc;ACM@fPeyG+mk-QN8o+jpZ%4pB;Em9wGjK^aFoL ziX)5xjXGh%b)AG8T)|LY!|qfIkPgai@WOsh0jM13cxC6Uc9LIgA zYv?~eDpjWz9rG|X;@Ds}&d^cyGx*ncEPbH}lN?Y#z}%|pnfaFYi0pKDogn~81UbEV 
zo&=XE;0`%R-?0vsB_4yK!fE6#&;9h8JdyqAK#$MuYl|bteww<<+?syYJ$Ond4AVIp z$Q|&b%kVSTnu{Y5xexV+TmtD$nUL25k0wu|xG$Ji62+wXmw?B}w_nGi12mca!n4dBL^k?n zQZ}`TSS3JSO4ud;@$2D(_L7E;HW<>3qqc<5$$nA`#`n0+9rG9kBG3DeCcdb+^LK>W z6(@OdSBE|6+-|f=xRPJe8(v2p(;AlL5e;5L3XIDC9WUqicEmfvQ-8VQ?sEr98S~1F|2{M zTS7J>FV6hBs}b6_eG(3gL`T=vNyZl6R}WN6q%rPf-A1w1{ofup#dR3)1S1^s2f=gAyFNyt6j8eT|$}-SuWE$BBhMl6-l-cR}{)@?r8g2GciTxOP>cCYR$` z_Rw%7%^$I#)O|NjCfjZbY?4*mT6+Zc5}@Db*2+aD6M~wpXvsgBe$1jxgzb81D<`VH z8f&2HeX?VlGCTA3@JpAYj96aha$d?nd^{UXYDn{kU3CV-Bt`qVHqP$xb_`oq*_WPf zL&ou}xMltFSmU+8A$WMrC&`!2D=w$DlXR!;z;S*14UpAxEUWnKkByXAW#}J{Y<%N< z7*qTUB97k>9+1Ub)w7XgUxwBh5g!c4zAw@mF|?`6u^g)d8*~sK9q@cjhi2E#TGBdXt4G>od2wTJ zUjgS?v+|=I=>kWP{!MCJpT5@w^qKn?SG`-Z{BzFvV(nTEri>5NKf)W8Q|Rn ze4#k{>`)xHSK0PLViBM3D1dup%~>=NL~dZ?`{^6R$v1cJp9I1f#j6FoVK)RBkn)H_ ziu6kGM?wm?QeUiGjGQ27C^d<*iqJT}b3&`yGbwYna;Y~Jk_3ZuhKtK^MTO1hD_}7e zR7iCYb@+b#!sJ(o^?L1F3Tt}Z7WVTfr*$BU;R74B1L#{pL&A5&sZ|xx+z(5-2bI)* zKO)_0+|&wwR!1#bd~O#ct80*NgD1IA$5(nbtu8R$-*_$*k+?EKkspf6`wC!z+vP0z zil;559dtg#ZQ&lz>-?}I6R~9dTbA|wM{zTqJea-z+)7+pe&byT9fp{!k{TNk#J6Gv z1YP80byxn(kPB`qMPmJIM1ZlY0=k4CIwzm9DPglDR+gom%wl_gbYk1 z-3}$6!s{e0v1Nsmv2u^FpsMc}4l=m=8O)P0PudVD02jmIQwT){%1Ij&@oBAud3EPa zJ4B2|W)Kwl@2~s>wt@^~Y%s01bgHk@b0f+IrV*w#1&(tYgi%;8i??mFt!LCCtr~@F z+eF@Uh|j+}t8--v1FpB}{TgAr=JroK_HFdjJ$?}tzmk1Jr`a;a@35d&b`eyGOg5IG zP@IaRm;Hqp$O&_J)?yxPd`1e7dg%ad0>ipfDUN0e;(GeTx*NRM&UnaKyMxoo={M<` zAae0w6CRSlkNAY5!h4R7WQ@&^qy#n{U;|ya{{OXcPCKFi3IZG3wr$(CZQHhO+qP}n zw(U7%@7-+f+x|t9roEf<_AM#e_=r!^D*$nu0sn|5`2nK1N<`wCTjZyn3;;#T&r!}5q4vr*S@)*?2tHzZqNhYmC{*Lkn3~#y4n1CXf(y6ClLpb0IYsC6BJd6$w{eyD z$?9lQ{j_<6$$$0W)CC)9OI;@n6U3G^%X-t6mCfA&SUL$Su3xdo|Db)Kr87y zNi*qI54JwtYUVxa>}7$03`G~^;|Nx51{yqI!V9hAu@p2JF7U8dNTFHus!)4j^L&&1 zTL$H?mEm0FM?PTUS~(LlM&@M`t*S^WR&yroE^y|v7~-LWfcyCnWr3JGW*-*=D%j4) zb;!;dRoDL6YEzDZl->yG*pNqP6xv96k!%8(M>Zd>a#)RAX**10V~j>^?93h@CFFT+)O)sEva#`xv?7> zZp*&EQ%1bFEx>jlXjvuy5o3{gY+0~7<^r;|R~k5k19EHsISnTTxqEM|cSN_G@u z`ZJJj!yW;4?m{((aKoy11TDt`Uwf?0NAz5|C69-Y(>Tj&3}of*sN1O;a($@O7gBIf 
z(3MESx?&Cl47BuBUqR)^k5ju34o(WR382_3ZS! zc&s)EDsCk`&W;(jRF}mte)ohPZ>BsrzSLHS_Y>-&l*6z9($fa@z{eBPZj@DP8?#j?$&i$CfRCy~M6+!=l8-Yga_S4lNnX z6NYgRU5ut0po9H_&K*hK7_SQM!jK`^`bbeRKd>%!X(T7#yVLIcGbnoFVS)$J+LiWc z;^<$yKrz_-;A?3sEUxqgjF(X7ARc~Na3;uAQ_`95VJvDivdWqrHaN2zOuW;Ke*8+F-$%LTcPI~s60_7tP;os6F65WS8JH?!*te!TufMiDx4to(Xip1-rBYP8QA0X$O)8x&?6j{_IOfA# zDGd4@9(}1CN4f(es!g~ZQeu$np>O&<9@U>a7lxo?Ul#*jioUL<814wWL~bT1LGEO$s4m z^xYx4#gU^3`+!K42wpb{0Qj8&8AzKv2<0Au0*J^qT_iDKI_#--<{oD8B~RpE4;uV= zpno{i&ie~$V!{2U->JXCXqi|#3IvN6d&yWG=rRwCaYD+ZdVEw_q$*sOFzQqW!R&=+ z^t`nHj;)!^;?q&*^TwSF!h+{P?m`7HDej=_c}+7`h)N2_8&`FI1t0`!ZE*Wrw(RO7 zCvBUUxjNX1ug*V0zzRseVTYLNxAjp%a|O;C-BtXcUj1Z(^?+IkNw8Ug80OS-wWYzr z(gkDq5}ruhn@^GYBTx+TF0>E&bSi6Ak0&a96t5Nt#(^o2hZ;SLA$@G^xoxOB8{Z~# zT@907LQkQ(4@Q|q$Rl;t3J7x`#q4Z(97dqR()V~o1Qm3WaVj`4!Qo1H7JHZ_L}`#Z zO{)wvZ0taY3!LcoLH$3q40tSg6%--&rZDAVRQhb@Be;JK#;CWQ8OYYxBF-7A$CJB6 zXT?3b=(daTJq!1rSl0AWi{X76sR+NuMx4qeO91*^u9%nZ+};->{%YoNS=Hdnj0X%rWG9g1#o z2uQ0!Ke+l}QC{ah`EccCWj7j3Ypj$@D@7EQIGsl9kZpp+ynr2Scj&yZrf*jyXHxEr zvFO50i~#z$NZ8U|I}5}wG*A79LNl@#Bf=96PwK=qGWpdWtf&_BPy;;ZaU;Y{&LY=YRA8nI`9ML^=)R)wJB?Bp`)eWw8 zFi0OFb|Y~WeBpRi`lXo&K~R;xjb{8>Q>z9paRr;r8(>msi=O-Ue?M0pZd0H{_*0+1 z*wTkBP7bo*2$l8QW3B)LW-Oz8R)2^QLc%K6j};u|&hitiNKF}5B&p6%jydknq-;Mb_9?o0fII~gk9)F=Ix zYK1etpRO7IDGm!CGJ_bOC`p?l84hUd7m`y8GIE&J1|M4pxZQW+Cf>{h0&If2C_dtV zHLkEb?sHZxUs6{tAo$tQ4*ss*UVMX&l%rh#RzNOjH`U~)HVV}xNip^J|a+_3@lXsLRj3+l}v3akVYaz9=kKVh__)wml7qb)Q&p|$3DcjooD5^ibN zz1J;wNF1?>R+k$3;Z!o8|3&gK9jbIBFuJq9^3TCSz{pKfrLrTgXxj(17GEhbe&>mY zNzvXzw&g73qU@|6IIki4}Qq7^|K+26K$gT!qEjShX(&He(6nAk8`8b@pD~IaEsfx-NF*x5N9Ts zWcYa8FU0~0`bzX5)d*Dxu3#k`be{UwvT;~f-h{)32J{kx4<`J zn8RPDuu-XxrC&tz!JALX0G@cSv0j%{>affo>xPV2joYoQEh6-tuvmY*ge42dtgJ`)V9}LXVmutzjB4NaHZknT+Hy0$9wR zC=fL=TG7qmVbT~5b<5^R(w+*`6G7^>!e9d?2pK=V3+@>sG<efJfGYmeF_FQXtpgmyud>`|gK3oYv}zF#FUx0U%@q zPGa&c(C9YiC{&EkEEtlDyY`!+`;xB_0imc*Y?M8YFGtmwEFA9)MVu8mMMC8_xY3<# zd^ptT1N>Z(elM=jf{d)-@0V$P;3%$8{z6Gc@~8MT5By|w9gG$?61kDfPO1>{`_|Eh 
zu&bOH#(-x@#VTe+I=y7y#A5hqz+1e)#=}u`H&av17+Pl-5FW;%)=(7g`c%KJ>I*9R z4ZF|dlXcUC5r90Odi-h0YcEv2+$cjFJmG9JBtw=3NoY!(bKkLi!AeXNzws@@%n6=$ zdqh(>4{10QQmj+RDauRq?@QJg$5xs3c0D*YAS;i-=MP@{uG_-_g=8>KzFCwm?CIXh+|vc zEnCbdJ#^sRogMDk*Ani?x!r0$U#YkXI?CFpBF==^0%)A+CGhxOt+5fLXcP4~JNqyz zOmI7e3=-X*dM5%?h6!;WmPRh!3U|2|kuJW($zXi9T{4JFcksPLe?y4zU!?i5KGVD1 zt*^GSZH8DAwQC;}I;Ww3I23akp-m*5npL+KWRcM++9Jg6?b`Z2%lG&^rpNpDJ2sOx z)v*}SuM?yqz0P|I}w^-yUxMr#AgWb(B$8= z*^FDEF7|cFn-$dR{&v|-DwRBYO0+Q&&{lX3&x>1ULeDiisbWz;@MWQm~Bg$k%23^?Dh z;Ps2`x@zxyTppQ%dx}ObyvHyn*nD{9Zg379aO*I3AH@iZkK08a#n4Lq-6)6ysz671 zjOlqz`SkK_JE*0efRAb&hoktecy#8CGWv<4UrOaFE&Nrgf$`^+1;{&MLFqLny#gpy z9uJIgb%YPIGDA#L3-((mjR8XxO&~P^jBXu~nr|MymcyfA7am&0&uv~rb#vRX$5~BvuzA5GU);H*_1cn=pGQt zk8}fi3&$DtrYxo;N{cW!UYdBjJ_4ig4#;ZjDQ~F7c_IeqHR#04(zNVPmZm3yg;w^R zOjlSeE7cA{sYx048-tvlWU`NRhtVf{$)!eV0 zF7D;CYL$%g0=I`gjY-L=uWyNwm`jbf3O&Czq*QqDLj__OHd?7f4}@H$ZAYIeM}k0< zr9r7T?=O1B(az7(*NSle0hGR7FO=_X8Ab|wEU=8`A4~7&5vMJ02@u~{D+YwA+rNS)n^{%Pe&Q$Uz_GdSH2_Uo(MPRoKSyl@1JK3S&kD_5IIZx2<2VF zP-9TAW_l&c_=(O)6ZN=n4t#eq)Jrp0msT)8NPmh%#o}%}Bpeu~@`kJh z$j{^trqT%Ykgbt0G?;JWrHCcFt7}p&A3L)->0C)=)pRbB6dz?&Z&SOE^=5|u*g`mY z*%%7mHxjw!Qss+MTW?h|M#`d{S)eQv{u!BpanocgAX)lD^U`5O%O{qP8j9o6!=YQ< z?)bM&u$hY!ooBFq_Gu|k%#WELT&I!Ge`}wSJy4K+BK*)kh+1>(ws#W|waRnRqNc_| zg)Ncs$0by06ixArIRzYRADo7@W5P=EF<|lLBZj4<1~d6<5UJ~|nfnXNvex3-cN^f9 z#%Y33L$$iUc6)ze8deqU74>(_y=;}beXU`v@1H5+aol9SdaV{3VA;@#EVX%NJ6%l0 z#6Q4^GllIim%Ixmb~fl701!i*WQ~`<3uEJ91~FrKYVV$Lf!u57IvI*ejxpt}^%lZe z5@5MV5q*He`stBV^OhasqHLITtyXXysIP7Pf+jzoV-NpZhX!F>rY@mcqQL8bJi;I` zO)IfAb*Sw1WfOUhjYus){Q+yZK}!l}$x&KP_uQ(d>MS!cd}Yxz+FR~`hAv#h-_~ql zeJ(Le4+mIhn)dh24v!C(RFDwPrplvJf+r(RKICG-hwewQ=Z7G+1MX%fJB;IhpIum% zj^=U2l2mAdq4@6VY59Bt4G`j;T z{0`ss0qNuwoj4o>&QU=#o+dr&yBkAP90)2~(ez&IG`t7MSMa2=Zfj_zh0fXfJ{`0D zxEdD2DA`jT&nnRZcl9A#EjS!MGv*%YrR@Ep%sH<%I@DFAxXH>dQ~g2yl4?GYW=tY+ z&@r<-vTiEfJ4*n|S8HQ!2|K3qP>T7jg^*LZDAx?H&x0L0K8j(Je0pZKymw8&EAJLg znX1RPk+T&!Yk@o5>u_!43)U^VihB<9IvWrLo`E}(6x&VjHdG;6rqR&tm%o)4ELI0wH 
zn`C#H?n1hAW`B@xT@33=zgtb2o%aQ)EXxcTNL1}k#}<>3B@e9B;JGFFTYk8ETUKV8c76)lfg*LCt9+6&DjYZ^#*Yl;hs6lA)!7%*0i|HeJTBRx)+th~W=(NpQ zF|^R0)F`F?nned(JyvYB6Aq7J^EBA9*J{0*G{o{#-=fWdcr0EEm>8V? zz-6Q3XkVKQxyD11ck}L;ce*aN(+0IMBq+(-U{d~j|MH*SN+wc!+6afc3f@Wy_eCot zG?J|Q?0|?ZQPJi{8Pu1qJq&b0+j3P4FgoiUfZACDXTccYk6hg~0q@RW;sdiwlomTA z(eW9=XfZJ$=nhtH_A& z=L?9X=^SE*%iI}K1H^C#MF zDJXf0g~1tYY)cgV+nlhySle(gdT7fwn>s%@PBeFTFZi zMx+(~BN}mHq3qdTlv(Mu3uiOi11x8iRzpt=BNCyNqRpxn-KYoaU-FddK*6ThSdAki z>x!VXxM+%9NN{Ps`yyvT%WUT6oj*x(wR9py%NXa6a2V{-=W$Q<#G9{_{oU+g1ZiJJF*yEo-_t*_Vv^Mgk=$x(5}@IloHhjo;*Cz8KPSU$cKE1% zF}`W7PpKof?E~gs)I97#m-96zvyw_sO@=*X{Q31?UIAs*yy>}c6SX|8^l#l5*V^L8 z(gg)SGA2PLFjL7-opx`gwURf^{SWU$PgnCLG)p|1`ZM`jMnw_Po`w-O^-?b?b-l}X z#}yO6@mo%Zs)K%VCSlIU(8QYglHQ$09TSbku?~AXxd@A?bto$gqWM8;N=Y!^y@zEY z1H>tT;DjwY#~BQ`jEx``LjKmqMs#Ze3m8noQeh|4D23ENs+=CJXHhe(%<#@zhkT1D zP;{fSZ&utSa-O(L&b$TELCG|&t9`-cNdgDTB|5y9=uiq^`}ICBa6Te8&5My4K;!Mc zw|B(PSPG;$g1&B3@mN*U13=}hrL{ZBsnHypZ(0KnG|KD#g!#?RmAi3W0(jqOmuH8h zNS1~uCo6#beaCBK%XGC!K}Doz!6vwkSxfW5Q;Y5wNc2E5Q?mNOy zFiY!BX8x_DfFJItRv=*qePIhcOzc;-HZ2elU=wRNjvNLKD--SnFX?v=8S~Sq_K!V7 zorb)GtO~ZFme+=cK0lR!Og6N?=$DNSF%Xb(hh@1>9ezhg8WxliZYFPkAm;F zdUEsFb-p9n9`(G$%>0M^oH|IqPkQz=sH}FgK!~P1AD>FZ4LeZgB9PbLTJsvTdzn}h z4WyDDN(bBnj$1_=*dMyBI(#$O(M}9=<`!0}f7y&8H@WS*BG5e6E{Xn%z*O3$RXc!~0twVTZoE_$c~NC{+V^#dk5>ovOL z`_eZ%kTJf1NCtyZ33Opd?lJkGgw@&ON1PP%?&y(eMo>tMleYp=QyL9je;0R(lJuVklTj`4Jmby87EJs{pzpKfFg5f?A-<&7! 
z<*Ce(&-{EK#C#0>823|2iecN*hwwS?+6NaoK>X5)-u)?u^2$C)aYf~Q9_}Ht+l(Xt zWoDd1cmzx-Q@9(R&Uo9}1E%4T?{TyR{umEfWpX&pv}%j~oQ@8Ud|tmVh;-T)F%ItL z!NtEW>oGAS^yD&^@*c*srw)iwu*@*lfdD+&11;O*aM)_u;Cx?+z$=>sWbP#sQ-)s+ zXT2{S(#NTTmoFfd%x5Gc47;GObW1JK%vWS4)Y`yiFezL-T9#0^!QY*1By*AnH9sy2 zY7S5D_$J*^u#ubX?e7@}I06;qBE8HC|BCLKKe85lMX3})%m|lF94SE{+y-xv;W6oY z)p2|8J+w6sFgWzN;&-+j-z2YjNTL57asyAukNclouw8@LhW(eVn%BK~BNCDw)jXRJ zh_}*;G-ejbITOls9t(P?nqM$hx38~3*Im>|?n+1W*z-z$|6MgdUH(~R&pW{z<y5_H<)N1FOtBP$Z?We=+*9#WD~!;h zOC1?*AKmzYXpv>)3Bd^JNBUO{nP>k#n@R>w)t0NeH1*z+s`YhHkRE%!l;2+Zwm|Ll zoUJ*5jW433yGfWiGIs>P>87)EiLFkIffIdspB0?GPhyMweM^cLHBcd zraX2tr)rp-I7qEx1!%DTlCowYd`PX82;7@94Rrq@Eay_2xkN(qIaMjmj@@!{@na+U zddMpi3eZ6J7at~~!0D3LUR{8SNqn>;7AMjM>g!$5B zGS_k5B(O!0&HTr}t)&i?TG#%tWPa-+mtH1fGKBX$YL1>8dh8#F^pgVLp#6rGm-tl(toP)CLpn+ z4TON4Om9MXli0*Opwn$&fw3wo1?XedV|vo;5UGn>)tbKc#cTO9C3(Be$8&AOk$a6z zdft3SVnq~x@cClnj)2LCz2%h{Is8Avxy8Gj)V0Y^5ttlN^cqAE2UN79U19^~a&ruE zvxJkyTL{T(eLnxxpGO?F%C*iduBvR6<`tIy=fJD=yc`&ug4fXVn!>*(Iz`i+v%vLF@VoW| z$f_=8NO8kNK@!$En@1~ejS~;;sohS6A;_p>E0z4i++(&9V0^$OzSsZ%W~k`lP<8mw z6wGh{mrDriolYw$^`(8a-_YhWAZc(0ny-Ci*Y9&GN@HkVf~za7rC!qNLZ)S&%(8kt z4bFd-X1&SOucJ6}vNK`JWAv8?mW57Q*E2eeZ>2|N(jvuo7u5$x;6bv1)r1rE^SaWd z#$D?OG@ppSa|qtOQG02QfGcjVjG5|eUO2wNv)~a>qW4|TS;@oN_GRD^0_z8i)htB1 zNfpd4;BmnA^A8zqOG+?s#hF~w7jLRS!)vOWkCZX)8)G65e^89KL97$#X{)>tGLeFy z5blaen4Ra6TYuXX+tD;4;d^<5PE>;eX5}q=t>;=<3ivvkx##|^0Jn7U`scs8UYi?u zEMI$=woa8kUnge7x!If0!7cA@+7is+WsVz<$*W5J7$7s ziX&5GhQYWHjochs{4y-)?2)oLRrWbRfI8qJ5M7FXL{4pY$)@vtHiUl zCEmS_4~o^)IHxyp#a5gq3K^Mexd%9@L7{u#9s-2-BsTb3PC7P3i^-Q@j2LG|3#f@R z#FBFaQ7xxmXkUo^jtKN4b`Y8uWdMB2;#?ktB03jfOX-Fys(EE?MnGwK(~9TV9@+t} z%|TA0mk$cA132aZkineSS}xEPYs)aC=}S!DmD>sH9i|_pc37ej6^`XlAoIyuw*EEf z-YQ>p zvcx3=eUMq;G|O#xXdq1(lT2bX=(16Hl#*pNz($CL*JYCl8R z0ZUIjr8%RRR?&m4qRZ^Z!L^xSBVK|%QD|iInURa!qW+bA1yCGO)-HqqK|;`=fx#gV z7zT#H0>KIH?l3_HXK)J|T!LG0CU~$waDux_aEIXTgg4oJ`)aFptN#C0b$|Ev>C=6? 
zPSD%4+=<~eN;fHI}3EZzM&$~t3o?r#h$<<9w zJQ!roy(Y!kHzVW6G40|L#hphaH$2BS!&ev%s3Y^uRA+vyDXa1k$YxAi#A-Mj_YS>u zcCAdpe^m;_F32?egzotIRhr#e7}m#mO?nX?(fA_-v^f<&oACPA7jo|RuEA36obQFE zI*H`iuxzk9aY{oWB37eG{rI9h7?dIdq_{y{+LDjN^4xQisy)R=ME&vkL!zd}Z%F-~ zs?Xo|>h>u(<$_&!vo5it=rvIYnJLVZGEkY`Tk)BQod@g*uJPezrBX8MwXrlK?J^|OKE}oc1w{e_awc|BFjEuGyqb3DmP~~_C#Uj%vk7^gt)58{qUs6bI zZgqPLAr$i2CvGrA+28CR9xY;Q_f&PP&u4}>c7m^X?aH zVN*a%4QG0k#Sjf_xo!@A&eeJCY(#R_7~od}s|D+`$-a5t-lI(84n4kLM2OC&v*T)t z71fYt(sLBG%_pbW657XyS2RWAOY`Fiv$_t!EV6SD0n{H<>rP=Hd@rIQvk6#rl+=g} zn)Pl&`|DTZ3=tTai3cVh=p*C0YioYp%crJt%;8+ShME5s8!kyI5BMEepK$*RrEDS` zye3*1pW~Z1d$d5Zr^{_xYs4tBYpq;TleE@{dr^dpF>y&^YJ+h)11CtC?N*1?qz00* z9Nb3CEJd57`X92F-bj+`z2l(ZVL~8en387J&L+v8qARek+ov5avmcDUzBAFhZ+{&EXYZO*vU&)$}?OIdtwC{DN z&CrX_PAzY$jbK0(^@%enK&HnUO3eo}g&>kxb*4IA$7gGKS)-92C1zp;p;|TXq+zOl zvd*89$DDRA4VhSyP0Pk~Q%1sY0}%WeF8 z;TObv)Y|T<85uvP%amQxFr~@@4`=xIo^yow^`+#X(-?=}y9T4ZMR!r|oI397S8_E% z`Sc8a>Hp5Gh{BUSE-Xsggqm_-6UF*qmLbJVAy8_ z$Oh1&N5A`tCvIlTC}`wZWqQzjW#iY-A$bZ|NvKVGYz&Fz? 
z(X`XF437>};k}wL3r$b9x4w0>j*?*xg3g9t{krx*f30|88B)V#r*Rm$%utA>a}8 zhu4NEDgzjYW1pxDLNy+ig5%zM6QbXXL^>3+_P zZr-orY(X|C`jL-UD((pUmE9`!I@s0^ui-c_bZV@NTaM;17V`DG9>#6v{M})+dFwwPA`4JTd zwv%gCUd_-$lIah8uqID`n*YSgpvc-|_daF8c{)>E2;)Xs%VJ$ZkKx67&db{wU2>?| zOGe3tqpi1SHa=G#Qvt`ZMLrlnT?M)yzn$==xph0<0$|!2EAvHuGsZ^#PZ(~|*I&{V z$MX2j;VodA|4vba&;={#fJ{awEyfwSfE;0FlzVBQ03j+f;o}DIQtTk;y4P`qQHz~z|_xSDc zAI#L4EtdMttaW!7&|zb2DOE9@zrV?()Oq8opuK4pZvT)&xkL+5D&AQ&Y)dKmsr#)W z+zX~W|EB7vbtjcOrz)N+r}V?jxOet(>oza5e|`Bq0r^m*Yz}2dKxv?(dkV@?6kG>D|7cgVH9_RBfwfq@?yHt{YyO>fo zNzK-NB-I}C4*kL!HLbQK{@v+E1*fHi2l7`rEGmcxY?c`K=1(nWh;zrt38U1e1DNr; zOvu{LdUM&@FR-ur4osdvnbQ&(Ij2EOw`URQs3;y@L~p=A^=9E$`N`;SnqJXT z!zW&|1ec_z+nxdkln2!g8Vt8b6YuSqq>WimN1v3TL%OlZG(zR@>Mp+#>R z_swHab|S70we+3n{D;lYl)|T=L_=md&dPXk@wVy{gT{xKMp-}Bhs_fG;tLpz(he1& z%YE&eL-QDHk|`Qp2Ggvu=C{+KQIn{-Fy5{FgjswtV!FD5>bMWxf)Nh~etF7Xg%szy z${QoE??$gyN>Bx#4%Xk|kGmTr&1nR+_9|E!>{s#{hAG=^Y|p87gyHvaD2DNDIg zVytAH9{yHL;`kh^A0!7Jj|Uc3UP~aP5^e;iUK)aOH6ZD<(WQjW;d-PHro-WTO+l%N z_5fgEzpc>j2xCB-kOI~jS{qRkn>zSDb~ubQjh_NHwA)HQy%_zRDKgKT?66-zrpm*q z!$as7xXUs7v?-P!l%_SxwOjid2TLLe53d_XTLd%GUJ`_`fVZ`_Hb~94U(c2ekOGMK z<-Tg;d@Bj^elGSqMb>{A+cnMtGKriotF+Cv*kQr=1~%CvtAR1g@x8UUhFY9EWOw7L z*MA|0A@k{qvNLr|Ckty4a#BRHF8XaWb7a59hwNBF(e__|P5t#bPgjTqGEuE&l=xXC zYU;gua&^u(;o89*P%N)q+aZ;J%i9?=T1KiQAS>0Bkxm z_O631DTk#jIv>PBhX&)aY(2}@`FKM38Q9PL8y5u}dv)E5*K+rU!9C^P_!A1s$=3t| zHWaU+w)VyMre_YnPkeDdv`&nAVG0o;azrse9y~6h9Ew}pPRy3NMp<;e%)gn=cD!#N zF%7f6=Khx)vb_J1Ll(dX=Kn`#SO5TO4*pep%PFtdvhqZurRhW+zD!8 zi|syVtD#`O2E=dP)^dN5IEU7y*sB#Bn-L^K<*JuxU8^M{Pl`oI{BG{g-9>t8O{2yO z|Lv3XlXTLamYd6(FSFm58{We!rm6+6W^x?kPw%#f7Z%igoZmJao_740u3g{15qbr&18e3fgC=5El!YNXOt;=%r3r3Dbu3p6l)!9qvn)!kVkNFyUnVew?xpyCS_h?39w3t5l@BfqyTieMCp( zASdIZXEeV0Pmq(2U%-;{bQWRO1#_nbGV}X(DR-CPu#Sbqk5cYj{b{S64%4hRo|S9d zo;7RExa`72(%#*=n%>$kMUX;@qywVyAA68?Jjo-mG^*-lB7ox*IVv zYL67NcUqL{g^sL{u5n)nU*o$gQWxH5ckwqZRL4DoYqZK1<4|q)(a7ZI? 
z<+yevc;zdZc}rBalhbJ1q{QV}8raJ0x2*kNs0#USRE7M1Q8f=q)ibuHPqGKJQO{RP z!s*DLsp8FR+Sbm&uoX#J{X-c({iQ1T+(9ws=-Ktq^-#yGmI!s0#rIc+i|h?kyM37< z1GDuyH_l$T<>kdWKdAKmTHMNA4tTs8)MD00g&M0HI3QDnN5haWO~2D03S5Zxyi`}0 z8veuwxJ+I|+`V-kvhgE}y0kZbek$j-XYI988_bGiDUKg}# z!rFRaw%1+mr#53$a8tBBa8mCUn*;x*TQq&(ExG*oaJe#hKTj>f$9W`;5Tl@+p295? zd{q}kS@)2RlcZZQSyis(zc>H5CEv+f(HPcvsc&$n^)=6%g#Gzv1sPN|I(+(A2f(TS zC<=XcZI5uQ2h|K!uQvddzMx+Kl^&CRgaF7&#;93TAJWra^(1dC0Xq^5&SCwAkta2l zre;lC7N?p_^-k7+whb&Kb*F|*FuvP(Q~j!ojuS7qmR6tr<5)f3z*9at{dVNQN^}=U zV9Omn1Ngm-b-S>QyeBNn+3vXwKOVxReUA{a;f|_bIrG(uZdh{RTS$O|ER_OHs87^N z*y=3-4_SE@gFBEzr02Y#WQ1!<=dmM;?_!3`owf;wl=c99+N4c^$Lu`5-|nL{xrc32 zEj4BLu>cOaxZX=HsRx^O_viA4l&5rZWO=?xAST7`2XS#G!_kI~V4b$)Qe)g9ojx*Z zD2Z9I(nC6#7hMUtxW(|TS!sIpAF}(5iwWv}9Y5T+{Ci3)$rsE!Z!I2JPjvjmyr)ye zoQDO}yw@{aq!I=M==-6WA-?RFzj8_0g!81?gnQclLCR8Sm&Lvuo3NPyn{bB`+h6WV z9cg-p__7I$BS^Bw8cDKN(@C<5TCQ;)km6x6@9qX9L49e31C2sQM%6+`5|HP+67yzi zxt70=RKob_^fD(|>|UBP_&UZ%JM>XK{*goud?Uj8HToKvBLdcSoU`Umwc0+$;2h*p zUc4EUON1p2=17Xs%l;9W*9zFV%yFh+_MK^r!P(`bij#b?XfvS^J|_O^9^E>2QR=@C zmG3`^3IPKD*F<$uSFrzgqS_JyHHfj!Xg9yvPN|nP7p}nL2tz+o-5$-@hwzkGd$oB| zFAWNb_$}PKFAa}3csW-e8-|lY;M*-u)%)9rNT0f4;3C}M>J2Yk^B!MmQxY(KZ_P`(&^;rH3)NWrThJEq>prPX3R`@<)*ihLR?I#EB{?m&c32fi?S zO9(pSAE$>=_(7oZm8vWa%bdVowDutWAhW2cZ(P0oTtk+67{n4c;iZdO)I?7^OjqT|G2qj4?a8R?@l+1-nA57NP`+qnycI}W?n@H3IR?TOzj0Eu4X=9jhecWFzX zdKSmRZXj$A=FPqt8KXQ}rDeO;kPO$vJ-35K98J~ou448Xf7l-9qbc%@xzZ&5B=96K zff|#o`goT8A-g~dFhz}E<@vf}&HA%a=V!2_HI#1=B;a^ZWwL?)2orEx>$q;Q^+udy;>z&&!2u6qglc%#z zMr-C&b#45=a+TGrx`FRdywxd(qrpXZm9=oyC<UKpGuGb*G;ME zwb{lLB(L0;!slY)m|Vs&frZ&LCcC3E=q8d}*l^T#mLBUBQ?sY}{zWJGX|NCSlR(yK zW9;r@y444nxODbrC|Pu+PylwrBVbYTJ~( z?$wQMW&_e^q)!AaCUg1t71An^OX6YGq=_3V=D7z%8AjCIHv}dps`(HwV(R$Y>pEXKrt5jSc#@ zTcPgZ00nR=nOdvCZLv8O0o;(kvA&Zt901})R^_2~7H~@d7Z?QkXSek~KM6Ez={V2x z;``22wB;>EKU-t60HvO&)TZiI_PEaRL{nRaCGjUpB<2&|Enn$4BT_MR83rx0O+RFq zE1Q~i<(q0+SsHVM2n=Kdv;^^EdU5?B1ielO^TK{@gWIK?#e(T9FKZc3w)-@KjN(Pe 
z7h-B1mH>*??DriN#E)1usB?>f&6HV?{d9g;a(d!dXn+CZQe!M`>>F?DsXX7PsHLXk=lH#OI zl-Av&i(`5OPo~Pu)`>=iC%D=NSd$h(XrqYYcup0RRXL&sY1zV*@BLDU)5 zOWlKBjv`KdGT^a;iB4w%R)bVFjRf=t!k@vNJllre89s$2BXw>zcR zH6fbh8{8_N?k6fEDwRlUO7;`il`uNRY=*hq%qNQu9lhy?`0oh|-Y|9!r%8X}Dr>%A zu7lSz6Z=b5t*4}Lyg7aES*~&1!Nf&ZB}GU#7MkhMA1(u=e@2l1raVV|*0p)lx|^4HL(5ly2-Jn^1F!7^Y+s* zPhR1TD2W(Vphev!-@w5G#iXW^dZWp#!~QRcOBTYxOVK<_o#d_f`HRsiVFzR}*IXj2 z2=6u#z>>3XJ14;fc&%l#&Ry^{B<7|*BN1oxMrfUgp%b>|c3-~Y=Du;0uzYscv+2{a zg8n$a4fBsojL6_be%*Pc<2w_ivI$N0&`2QqGg54G+IKL>=@$Pn8$Q#8rn=L~kbxADbf2!bei0CZ~=TV>wmP=i@P+ z#o{*VcC>zX4i@=OB7D_PUR82D<*LGMWj!J*Pd3EQ-zC74dSl&^1hQ@`N6s=wZ2ak5 z^sEeWDd2k1drgr`Nf^+1gI90MbkzrTzn`d?=E_G`3tYYJi1-9Y%$|@1YuYQTyNH8B z$h_{^Vaz*Uz3xX$#}{TM4?HcmS!#cLaWwGNC0;S{Iygh@fklk?&AiIfe6)gn`lhaX z${QYMG-}t`hIEZi-dApyc&))sC$+3sVY{nZlVMlaM_u`L^Zm_Y{6^qvQ&Hn1JRc4B zhm9f6@!!9ZFW3+oAqN}M=Oi1_2HVTB2I@Bu@7;ZOQidsARUx*UxLZ}LFL9BA@8kAm zis5XC%rSh462$XdPQfU67(FW{Vzb_F%3Q;_*W2z_yHLrlc#;j154>IM(+y!!K{+3M zN-)Z)p-1eB&AgXigl$W*+QN>Kf6fioSW8kkrJwud_QEVmYb93L;0kU(upcn9E6%uc zT$bHcvh=nDB^Wo~cdJ0C?^rW&!tiU$A)14M$=+LR0_!Q;6&{-0=o^HCWL@_?j(5HV4(lqg7yAHpLJ zk>KS9gG52Xfd5$q*~-7X8rT1ReCyQ^KOw3LO3>;d;)wC*^BrFY~ld1OdpH+5`w5Qyn(b~vl`SF;fQwB9i>N;Y_`X&9~hqih_JNke+$GXCPz0UpbLn${U^;skehLK+hc0Ry?Q L85t#&q_F=FB4uQ; literal 0 HcmV?d00001 diff --git a/examples/notebooks/intro/my_utils.py b/examples/notebooks/intro/my_utils.py new file mode 100644 index 000000000..9a6477dfc --- /dev/null +++ b/examples/notebooks/intro/my_utils.py @@ -0,0 +1,55 @@ +import os +import requests +from humanfriendly import format_size +import pandas as pd +import glob + + +## Reads parquet files in a folder into a pandas dataframe +def read_parquet_files_as_df (parquet_dir): + parquet_files = glob.glob(f'{parquet_dir}/*.parquet') + + # read each parquet file into a DataFrame and store in a list + dfs = [pd.read_parquet (f) for f in parquet_files] + + # Concatenate all DataFrames into a single DataFrame + data_df = pd.concat(dfs, ignore_index=True) + return data_df 
+ + +def download_file(url, local_file, chunk_size=1024*1024): + """ + Downloads a remote URL to a local file. + + Args: + url (str): The remote URL. + local_filename (str): The name of the local file to save the downloaded content. + chunk_size (int): The size in bytes of each chunk. Defaults to 1024. + + Returns: + None + + Example usage: + download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024) # Download in chunks of 1MB + """ + # Check if the local file already exists + if os.path.exists(local_file): + file_size = format_size(os.path.getsize(local_file)) + print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.") + return + + # Create the directory if it doesn't exist + os.makedirs(os.path.dirname(local_file), exist_ok=True) + + # Stream the file download + with requests.get(url, stream=True) as r: + r.raise_for_status() + with open(local_file, 'wb') as f: + for chunk in r.iter_content(chunk_size=chunk_size): + if chunk: # filter out keep-alive new chunks + f.write(chunk) + print() + file_size = format_size(os.path.getsize(local_file)) + print(f"{local_file} ({file_size}) downloaded successfully.") +## --- end: download_file ------ + From 0f083965da821eb68a7fae9de04bac86aa413f09 Mon Sep 17 00:00:00 2001 From: Pankaj Thorat Date: Wed, 16 Oct 2024 18:03:57 +0530 Subject: [PATCH 02/19] Update README.md Signed-off-by: Pankaj Thorat --- transforms/code/code_profiler/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/transforms/code/code_profiler/README.md b/transforms/code/code_profiler/README.md index 6eeed674f..53f9ddc75 100644 --- a/transforms/code/code_profiler/README.md +++ b/transforms/code/code_profiler/README.md @@ -1,4 +1,4 @@ -# Code Profiler Tranform +# Code Profiler Transform This module extracts the base syntactic concepts from the multi-language source codes and represent these concepts in an unified langauge-agnostic representation that can be further used for 
multi-lnaguage data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form. Our framework abstracts language-specific concepts by transforming them into a unified, language-agnostic representation called universal base syntactic representation (UBSR), referred to as a concept, which is consistently encoded within the proposed schema structure. The current version support the base syntactic concept for importing/including package/libraries, comments, functions. @@ -60,4 +60,4 @@ The high-level system design is as follows: For each new target language, the offline phase is utilized to create deterministic rules by harnessing the capabilities of LLMs and working with exemplar code samples from the target language. In this process, Workflow W1 facilitates the creation of rules around syntactic structures based on exemplar code samples, while Workflow W2 is used to establish semantic dimensions for profiling. Subsequently, we derive rules that connect syntactic constructs to the predefined semantic concepts. These rules are then stored in a rule database, ready to be employed during the online phase. -In the online phase, the system dynamically generates profiling outputs for any incoming code snippets. This is achieved by extracting concepts from the snippets using the rules in the database and storing these extractions in a tabular format. The structured tabular format allows for generating additional concept columns, which are then utilized to create comprehensive profiling reports. \ No newline at end of file +In the online phase, the system dynamically generates profiling outputs for any incoming code snippets. 
This is achieved by extracting concepts from the snippets using the rules in the database and storing these extractions in a tabular format. The structured tabular format allows for generating additional concept columns, which are then utilized to create comprehensive profiling reports. From 41e1d525b876868a4712dbe4e6af94d844914b6c Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Wed, 16 Oct 2024 22:56:31 -0700 Subject: [PATCH 03/19] DPK intro example v2 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/.gitignore | 10 + examples/notebooks/intro/README.md | 5 + .../notebooks/intro/dpk_intro_1_python.ipynb | 1919 ++++++----------- .../notebooks/intro/dpk_intro_1_ray.ipynb | 1392 ++++++------ 4 files changed, 1409 insertions(+), 1917 deletions(-) create mode 100644 examples/notebooks/intro/.gitignore diff --git a/examples/notebooks/intro/.gitignore b/examples/notebooks/intro/.gitignore new file mode 100644 index 000000000..89b9e565b --- /dev/null +++ b/examples/notebooks/intro/.gitignore @@ -0,0 +1,10 @@ +output*/ + +## File system artifacts +.directory +.DS_Store + + +## Python output +__pycache__ +.ipynb_checkpoints/ \ No newline at end of file diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 53d21433c..07b63f513 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -4,6 +4,11 @@ This is an example featuring some of the features of data prep kit. ## Running the code +The code can be run on either + +1. Google colab: very easy to run; no local setup needed. +2. On your local Python environment. 
Please follow the [instructions](../../../README.md#-getting-started) to setup + ## Intro This notebook will demonstrate processing PDFs diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 6f4cf757e..1049bf8d6 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -13,7 +13,7 @@ "\n", "Here is the workflow\n", "\n", - "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" ] }, { @@ -27,7 +27,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -45,7 +45,7 @@ "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", "\n", "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n" + "- [mars.pdf](https://github.com/sujee//blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { @@ -71,7 +71,7 @@ "base_uri": "https://localhost:8080/" }, "id": "1fe354b7", - "outputId": "0a38a7b5-238e-433a-c378-78444908aa8a" + "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" }, "outputs": [ { @@ -112,15 +112,15 @@ "base_uri": "https://localhost:8080/" }, "id": "3309799e", - "outputId": "9b44b764-d284-4da1-ad55-f08d5c9c0f89" + "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input'\n", - " !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" ] }, { @@ -138,7 +138,12 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "id": "1fcec577" + "colab": { + 
"base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" }, "outputs": [], "source": [ @@ -146,8 +151,7 @@ " ! pip install --default-timeout=100 \\\n", " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", - " deepsearch-toolkit\n", - " " + " deepsearch-toolkit\n" ] }, { @@ -195,7 +199,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "42a9edae-205f-4dce-cd4e-a159bd8f620b" + "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" }, "outputs": [ { @@ -222,23 +226,9 @@ "execution_count": 5, "id": "33345487", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "33345487", - "outputId": "79b40d76-b4dd-48ea-9638-461c78a637a1" + "id": "33345487" }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", - "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", - "MY_CONFIG.RAY_MEMORY_GB: 2\n" - ] - } - ], + "outputs": [], "source": [ "import os\n", "\n", @@ -248,36 +238,13 @@ "\n", "MY_CONFIG = MyConfig ()\n", "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", - "else:\n", - " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", - " \n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", "\n", "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", - "\n", - "## RAY CONFIGURATION\n", - "### For local runs, we can use more parallelism\n", - "### For google colab, be conservative\n", - "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - " MY_CONFIG.RAY_NUM_CPUS = 0.3\n", - " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", - "else: # local run\n", - " num_cpus_available = 
os.cpu_count()\n", - " # print (num_cpus_available)\n", - " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", - " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", - " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - "\n", - "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", - "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", - "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" ] }, { @@ -316,7 +283,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "5c305d54-1c91-455d-d0e2-b514b61a068b" + "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" }, "outputs": [ { @@ -338,8 +305,7 @@ "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", "\n", "## clear output folder\n", "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", @@ -381,14 +347,14 @@ "base_uri": "https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "90eb1f89-35d1-4b6f-ea34-7667680dd256" + "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" ] } ], @@ -418,49 +384,49 
@@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 625, + "height": 657, "referenced_widgets": [ - "8226b2522ce446f6bd3a36c4e227370c", - "7616f1b493e1461c9fd1319fae3bc10b", - "4f63bfad92b64e7bae18e720376d402d", - "6957a659451b46dab702c1c62fa9cdd2", - "2eea7bc810e54eaeb325136352b71e66", - "ebc626c0750c470db6789b26acf15f60", - "3077f04af3a9447ab98717bd3131cd8f", - "709685da1c6c4164bed658357a2191bf", - "0a1ed94698ca4e4291c553929e0ca66c", - "5dbc6889a9c243c5a922f8cc5f1a704c", - "d6e520e4da004c818031ccfcc3588e5d" + "97b603697cfa4b4ea4e6735b6768ca35", + "e87e8d3262c54cfaaa8768505edacda3", + "b78aa40816e44f7fbebcb24ca68818b3", + "7053c9606a414e978636a7e241909504", + "da0787b239764847a731083997780a85", + "553f3c16839a49d79591d0fc4862bed6", + "c0eb5bc8f6ee427ca42204b3c56f9a4e", + "9d184ed175f0403fb03c2e13dfd04e0a", + "724778729161445c98b187031ae4f67c", + "1cb3bbf7d724411cbe9831543a4aecc0", + "06f9b33494984e4885d5aad813d1d2bc" ] }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "e2c85b44-f605-4817-c120-2cdce79e3c84" + "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "18:40:02 INFO - pipeline id pipeline_id\n", - "18:40:02 INFO - code location None\n", - "18:40:02 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", - "18:40:02 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "18:40:02 INFO - orchestrator pdf2parquet started at 2024-09-18 18:40:02\n", - "18:40:02 INFO - 
Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "18:40:02 INFO - Initializing models\n" + "22:43:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "22:43:02 INFO - pipeline id pipeline_id\n", + "22:43:02 INFO - code location None\n", + "22:43:02 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "22:43:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "22:43:02 INFO - orchestrator pdf2parquet started at 2024-10-16 22:43:02\n", + "22:43:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "22:43:02 INFO - Initializing models\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "6454e0eb538145aebeed98e2ec662b22", + "model_id": "e92bbc86f5e34ee4ad7dd853a5136c01", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "Fetching 7 files: 0%| | 0/7 [00:001\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", " \n", " \n", @@ -622,12 +588,12 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 
2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", " \n", " \n", @@ -640,16 +606,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 0 11 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "0 0 11 07bc0c9a-f863-48e3-9aed-bd289af040bc pdf \n", + "1 0 11 e141f7a4-3e45-4f04-88d3-60e0a81b195b pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:06.831334 0.857239 earth.pdf " + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:07.205350 0.921915 earth.pdf " ] }, "execution_count": 10, @@ -700,7 +666,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "f70bfa9f-62f8-417d-d91a-30c1f024ccbd" + "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" }, "outputs": [ { @@ -852,7 +818,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "300e7688-692a-4039-c4a4-a86887d9138b" + "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" }, "outputs": [ { @@ -1026,7 +992,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "a787385b-214a-41b2-975d-0d3c5529c2c4" + "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" }, "outputs": [ { @@ -1067,26 +1033,26 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "cb338503-3dca-45bd-a60a-bd214843a97b" + "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 
'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator doc_chunk started at 2024-09-18 18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, 
files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator doc_chunk started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" ] }, { @@ -1094,8 +1060,8 @@ "output_type": "stream", "text": [ "āœ… Stage:2 completed successfully\n", - "CPU times: user 861 ms, sys: 140 ms, total: 1 s\n", - "Wall time: 1.21 s\n" + "CPU times: user 1.07 s, sys: 180 ms, total: 1.25 s\n", + "Wall time: 1.55 s\n" ] } ], @@ -1151,10 +1117,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 893 + "height": 897 }, "id": "d8138d43", - "outputId": "0d08e0a6-e743-44d9-b8f1-eec98b222a92" + "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" }, "outputs": [ { @@ -1164,7 +1130,7 @@ "Files processed : 2\n", "Chunks created : 8\n", "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 15)\n" + "Output data dimensions (rows x columns)= (8, 16)\n" ] }, { @@ -1192,17 +1158,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " \n", " \n", " \n", @@ -1212,17 +1179,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 
2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 1\n", @@ -1230,17 +1198,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " \n", " \n", " 2\n", @@ -1248,17 +1217,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " \n", " \n", " 3\n", @@ -1266,17 +1236,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " \n", " \n", " 
4\n", @@ -1284,17 +1255,18 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 5\n", @@ -1302,17 +1274,18 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " \n", " \n", " 6\n", @@ -1320,17 +1293,18 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " \n", " \n", " 7\n", @@ -1338,42 +1312,33 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " 
e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -1386,14 +1351,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1405,15 +1380,25 @@ "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "7 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox \n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... 
\n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" ] }, "execution_count": 15, @@ -1461,7 +1446,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "cf9bd956-7b31-42bc-ef77-9ebded8ba08e" + "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" }, "outputs": [ { @@ -1564,7 +1549,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "2b48675c-328d-4d24-d689-ad77231ef4b7" + "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" }, "outputs": [ { @@ -1624,7 +1609,9 @@ { "cell_type": "markdown", "id": "7ad1c60d", - "metadata": {}, + "metadata": { + "id": "7ad1c60d" + }, "source": [ "## Step-5: DOC ID generation of Chunks\n", "\n", @@ -1639,7 +1626,9 @@ { "cell_type": "markdown", "id": "1afaa0fd", - "metadata": {}, + "metadata": { + "id": "1afaa0fd" + }, "source": [ "### 5.1 - Set Input/output Folder" ] @@ -1648,7 +1637,13 @@ "cell_type": "code", "execution_count": 18, "id": "6ffd6f54", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6ffd6f54", + "outputId": "1784c80d-6309-4913-9f55-c018b978968f" + }, "outputs": [ { "name": "stdout", @@ -1676,7 +1671,9 @@ { "cell_type": "markdown", "id": "f78a51b7", - "metadata": {}, + "metadata": { + "id": "f78a51b7" + }, "source": [ "### 5.2 - Execute" ] @@ -1685,25 +1682,31 @@ "cell_type": "code", "execution_count": 19, "id": "5fc77557", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5fc77557", + "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" + }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - 
"18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator doc_id started at 2024-09-18 18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator doc_id started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" ] }, { @@ -1711,8 +1714,8 @@ "output_type": "stream", 
"text": [ "āœ… Stage:3 completed successfully\n", - "CPU times: user 19.2 ms, sys: 603 Ī¼s, total: 19.8 ms\n", - "Wall time: 16.2 ms\n" + "CPU times: user 10.1 ms, sys: 3 ms, total: 13.1 ms\n", + "Wall time: 11.3 ms\n" ] } ], @@ -1752,7 +1755,9 @@ { "cell_type": "markdown", "id": "a9a8c1fa", - "metadata": {}, + "metadata": { + "id": "a9a8c1fa" + }, "source": [ "### 5.3 - Inspect Generated output\n", "\n", @@ -1768,14 +1773,21 @@ "cell_type": "code", "execution_count": 20, "id": "da9adede", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 860 + }, + "id": "da9adede", + "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 15)\n", - "Output data dimensions (rows x columns)= (8, 17)\n" + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" ] }, { @@ -1803,17 +1815,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " \n", @@ -1825,18 +1838,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", " \n", " \n", @@ -1845,18 +1859,19 @@ " 1\n", " 0\n", " 11\n", - " 
25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", " \n", " \n", @@ -1865,18 +1880,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " \n", " \n", @@ -1885,18 +1901,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " \n", " \n", @@ -1905,18 +1922,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 
0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " \n", " \n", @@ -1925,18 +1943,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " \n", " \n", @@ -1945,18 +1964,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " \n", " \n", @@ -1965,18 +1985,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about 
Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " \n", " \n", @@ -1984,25 +2005,15 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2015,14 +2026,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2044,6 +2065,16 @@ "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 
\n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \n", "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", @@ -2101,7 +2132,7 @@ "base_uri": "https://localhost:8080/" }, "id": "4c7a1b94", - "outputId": "2a135853-c54f-4aa4-ffc4-83c2bc7a68ce" + "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" }, "outputs": [ { @@ -2142,27 +2173,27 @@ "base_uri": "https://localhost:8080/" }, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "b9b3de92-4304-4540-dfba-a4549fa157eb" + "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator ededup started at 2024-09-18 18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "18:40:09 INFO - Starting from the beginning\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done 
flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator ededup started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "22:43:09 INFO - Starting from the beginning\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" ] }, { @@ -2170,8 +2201,8 @@ "output_type": "stream", "text": [ "āœ… Stage:4 completed successfully\n", - "CPU times: user 15.4 ms, sys: 478 Ī¼s, total: 15.9 ms\n", - "Wall time: 12.9 ms\n" + "CPU times: user 12.6 ms, sys: 5.26 ms, total: 17.9 ms\n", + "Wall time: 14.6 ms\n" ] } ], @@ -2226,18 +2257,18 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 358 + "height": 815 }, "id": "d824ebf6", - "outputId": "14aa660f-6f1a-4f93-9b61-5f8f8adcf3fe" + "outputId": "68f55770-c750-4607-a205-ba183603019d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x 
columns)= (8, 17)\n", - "Output data dimensions (rows x columns)= (7, 18)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", "Input chunks before exact dedupe : 8\n", "Output chunks after exact dedupe : 7\n", "Duplicate chunks removed : 1\n" @@ -2268,17 +2299,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " removed\n", @@ -2291,18 +2323,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", @@ -2312,18 +2345,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " []\n", " \n", @@ -2333,18 +2367,19 @@ " 1\n", " 0\n", " 11\n", - " 
25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " []\n", " \n", @@ -2354,18 +2389,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " []\n", " \n", @@ -2375,18 +2411,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " []\n", " \n", @@ -2396,18 +2433,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 
2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " []\n", " \n", @@ -2417,18 +2455,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " []\n", " \n", @@ -2437,23 +2476,14 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 
earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2465,13 +2495,22 @@ "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", @@ -2491,6 +2530,15 @@ "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 
\n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \\\n", "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", @@ -2536,10 +2584,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 112 + "height": 269 }, "id": "82cc9bb0", - "outputId": "2aff0a5f-8cc7-408c-e1cf-62c0b14b18fb" + "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" }, "outputs": [ { @@ -2636,7 +2684,7 @@ "base_uri": "https://localhost:8080/" }, "id": "cc61dffa", - "outputId": "337b015f-3795-4c45-98a3-03ae817d4dca" + "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" }, "outputs": [ { @@ -2718,223 +2766,175 @@ "source": [ " ## Step-7: Fuzzy Dedup\n", "\n", - "Post exact deduplication, fuzzy deduplication is applied with the goal of removing **very similar** chunks\n", + "And fuzzy dedupe is only available in RAY version. So we will skip it here\n", + "\n", + "See this file [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", "\n", - "And fuzzy dedupe is only available in RAY version." + "Encode text for the vector storage." 
] }, { "cell_type": "markdown", - "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "id": "85aba685", "metadata": { - "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + "id": "85aba685" }, "source": [ - "### 7.1 - Set Input/output Folder" + "### 8.1 - Set Input/output Folder" ] }, { "cell_type": "code", "execution_count": 26, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", - "outputId": "4450ed63-3b09-42e4-8085-2951e700cf8f" + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'\n" + "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" ] } ], "source": [ - "## Input to this component is the output of doc_id generator component.\n", - "\n", - "STAGE = 5\n", + "STAGE = 6\n", "\n", "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_fuzzy_dedupe_dir\n", + "output_folder = output_embeddings_dir\n", + "\n", "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" ] }, { "cell_type": "markdown", - "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "id": "c97545f4", "metadata": { - "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + "id": "c97545f4" }, "source": [ - "### 7.2 - Execute" + "### 8.2 - Execute" ] }, { "cell_type": "code", "execution_count": 27, - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, - "id":
"3864ff77-e9a8-48f7-973b-c3b3aef1a94f", - "outputId": "2baa790d-6944-4d20-f0c1-fc2979eb1686" + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - Running locally\n", - "18:40:09 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_fuzzy_dedupe_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", - "18:40:09 INFO - actor creation delay 0\n", - "18:40:09 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:40:11,503\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - orchestrator started at 2024-09-18 18:40:12\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of files is 2, source profile {'max_file_size': 0.009611129760742188, 'min_file_size': 0.009521484375, 'total_file_size': 0.019132614135742188}\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.208082581870258, 'object_store': 4.104041289538145}\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files in 0.014 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files (50.0%) in 0.014 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - Completed processing 2 files in 0.047 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - Done submitting long buckets\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1192188)\u001b[0m 18:40:18 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - Done processing buckets in 0.011 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - creating document snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - Completed processing 2 files in 0.131 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - done flushing in 0.004 sec\n", - "18:40:37 INFO - Completed execution in 0.462 min, execution result 0\n" + "22:43:10 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "22:43:10 INFO - pipeline id pipeline_id\n", + "22:43:10 INFO - code location None\n", + "22:43:10 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", + "22:43:10 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:10 INFO - orchestrator text_encoder started at 2024-10-16 22:43:10\n", + "22:43:10 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", + "22:43:12 INFO - Completed 1 files (50.0%) in 0.004 min\n", + "22:43:12 INFO - Completed 2 files (100.0%) in 0.004 min\n", + "22:43:12 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:12 INFO - done flushing in 0.0 sec\n", + "22:43:12 INFO - Completed execution in 0.039 min, execution result 0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "✅ Stage:5 completed successfully\n", - "CPU times: user 457 ms, sys: 296 ms, total: 753 ms\n", - "Wall time: 29.2 s\n" + "✅ Stage:6 completed successfully\n", + "CPU times: user 671 ms, sys: 230 ms, total: 901 ms\n", + "Wall time: 2.8 s\n" ] } ], "source": [ "%%time\n", "\n", - "import os\n", - "import sys\n", - "\n", - "from data_processing.utils import
ParamsUtils\n", - "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "\n", - "# create parameters\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", "\n", "local_conf = {\n", " \"input_folder\": input_folder,\n", " \"output_folder\": output_folder,\n", "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", " # Data access. Only required parameters are specified\n", " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # Orchestration parameters\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # columns used\n", - " \"fdedup_doc_column\": \"contents\",\n", - " \"fdedup_id_column\": \"chunk_id\",\n", - " \"fdedup_cluster_column\": \"chunk_hash\",\n", - " # infrastructure\n", - " \"fdedup_bucket_cpu\": 0.3,\n", - " \"fdedup_doc_cpu\": 0.3,\n", - " \"fdedup_mhash_cpu\": 0.3,\n", - " \"fdedup_num_doc_actors\": 1,\n", - " \"fdedup_num_bucket_actors\": 1,\n", - " \"fdedup_num_minhash_actors\": 1,\n", - " \"fdedup_num_preprocessors\": 1,\n", - " # fuzzy parameters\n", - " \"fdedup_num_permutations\": 64,\n", - " \"fdedup_threshold\": 0.7, # (default 0.8)\n", - " \"fdedup_shingles_size\": 5,\n", - " \"fdedup_delimiters\": \" \"\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", "}\n", "\n", - "# Pass commandline params\n", "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "# create launcher\n", + "launcher = 
PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", "\n", "return_code = launcher.launch()\n", "\n", "if return_code == 0:\n", " print (f\"✅ Stage:{STAGE} completed successfully\")\n", "else:\n", - " raise Exception (\"❌ Ray job failed\")" + " raise Exception (\"❌ Job failed\")" ] }, { "cell_type": "markdown", - "id": "a6f8cd11", + "id": "b734852c", "metadata": { - "id": "a6f8cd11" + "id": "b734852c" }, "source": [ - "### 7.3 - Inspect Generated output" + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`." ] }, { "cell_type": "code", "execution_count": 28, - "id": "e899ad60", + "id": "7b1c1d09", "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 222 + "height": 760 }, - "id": "e899ad60", - "outputId": "17aaaea8-a106-4c9a-ceb3-6760d92f8b59" + "id": "7b1c1d09", + "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (7, 18)\n", - "Output data dimensions (rows x columns)= (6, 18)\n", - "Duplicate chunks removed by fuzzy-dedupe: 1\n" + "Input data dimensions (rows x columns)= (7, 19)\n", + "Output data dimensions (rows x columns)= (7, 20)\n" ] }, { @@ -2962,20 +2962,22 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", + " chunk_hash\n", " chunk_id\n", " removed\n", - " chunk_hash\n", + " embeddings\n", " \n", " \n", " \n", @@ -2985,186 +2987,255 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 
2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", + " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", + " Solar System\\nFor more details about the Solar...\n", + " $.main-text[3]\n", + " 1\n", + " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", + " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", + " [-0.051861435, 0.0035226212, 0.030617002, 0.04...\n", + " \n", + " \n", + " 1\n", + " mars.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", + " 2800\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " []\n", - " -1\n", + " [0.07728295, 0.024970993, -0.043180738, 0.0580...\n", " \n", " \n", - " 1\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\nĀ· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " []\n", - " -1\n", + " [0.10598018, 0.025460618, 0.023627337, 0.03905...\n", " \n", " \n", - " 2\n", + " 3\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " 
pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " []\n", - " -1\n", + " [0.0077404436, -0.02055944, 0.026426593, 0.011...\n", " \n", " \n", - " 3\n", + " 4\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " []\n", - " 5\n", + " [-0.062105548, -0.0053322907, 0.031277698, 0.0...\n", " \n", " \n", - " 4\n", + " 5\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " []\n", - " -1\n", + " [0.072435796, -0.058001805, -0.019771898, -0.0...\n", " 
\n", " \n", - " 5\n", + " 6\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\nĀ· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " []\n", - " -1\n", + " [0.091821924, 0.015197902, 0.07716932, 0.01711...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 earth.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", - "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "1 Basic facts about Mars:\\nĀ· Distance from the S... $.main-text[6] \n", - "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\nĀ· Distance from the S... 
$.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\nĀ· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", - "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", - "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 
0 \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + " removed \\\n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", "\n", - " removed chunk_hash \n", - "0 [] -1 \n", - "1 [] -1 \n", - "2 [] -1 \n", - "3 [] 5 \n", - "4 [] -1 \n", - "5 [] -1 " + " embeddings \n", + "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", + "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " ] }, "execution_count": 28, @@ -3179,645 +3250,37 @@ "\n", "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", "\n", "output_df.head(10)" ] }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, { "cell_type": "code", "execution_count": 29, - "id": "ab7ea52b", + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { "colab": { - "base_uri": "https://localhost:8080/", - "height": 81 + "base_uri": "https://localhost:8080/" }, - "id": "ab7ea52b", - "outputId": "8e57385f-c925-4ac7-9e0d-ebc64e92530a" - }, - "outputs": [ - { - "data": { - "text/html": [ - "