Crawler transform #797

Open
wants to merge 16 commits into
base: dev
Changes from 12 commits
133 changes: 133 additions & 0 deletions .github/workflows/test-universal-web2parquet.yml
@@ -0,0 +1,133 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/universal/web2parquet

on:
    workflow_dispatch:
    push:
        branches:
            - "dev"
            - "releases/**"
        tags:
            - "*"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/universal/web2parquet/**"
            - "data-processing-lib/**"
            - "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"
    pull_request:
        branches:
            - "dev"
            - "releases/**"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/universal/web2parquet/**"
            - "data-processing-lib/**"
            - "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"

# Taken from https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
concurrency:
    group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
    cancel-in-progress: true

jobs:
    check_if_push_image:
        # check whether the Docker images should be pushed to the remote repository
        # The images are pushed if it is a merge to dev branch or a new tag is created.
        # The latter being part of the release process.
        # The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
        runs-on: ubuntu-22.04
        outputs:
            publish_images: ${{ steps.version.outputs.publish_images }}
        steps:
            - id: version
              run: |
                  publish_images='false'
                  if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
    test-src:
        runs-on: ubuntu-22.04
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform source in transforms/universal/web2parquet
              run: |
                  if [ -e "transforms/universal/web2parquet/Makefile" ]; then
                      make -C transforms/universal/web2parquet DOCKER=docker test-src
                  else
                      echo "transforms/universal/web2parquet/Makefile not found - source testing disabled for this transform."
                  fi
    test-image:
        needs: [check_if_push_image]
        runs-on: ubuntu-22.04
        timeout-minutes: 120
        env:
            DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
            DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf /opt/ghc
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform image in transforms/universal/web2parquet
              run: |
                  if [ -e "transforms/universal/web2parquet/Makefile" ]; then
                      if [ -d "transforms/universal/web2parquet/spark" ]; then
                          make -C data-processing-lib/spark DOCKER=docker image
                      fi
                      make -C transforms/universal/web2parquet DOCKER=docker test-image
                  else
                      echo "transforms/universal/web2parquet/Makefile not found - testing disabled for this transform."
                  fi
            - name: Print space
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  docker images
            - name: Publish images
              if: needs.check_if_push_image.outputs.publish_images == 'true'
              run: |
                  if [ -e "transforms/universal/web2parquet/Makefile" ]; then
                      make -C transforms/universal/web2parquet publish
                  else
                      echo "transforms/universal/web2parquet/Makefile not found - publishing disabled for this transform."
                  fi
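
For readers reviewing the check_if_push_image job above, the two shell conditions can be restated as a small Python function. This is only an illustrative sketch, not part of the PR: the function name `should_publish` and its `upstream` parameter are hypothetical, and the bash glob `refs/tags/*` is modeled here as a prefix match.

```python
def should_publish(ref: str, event_name: str, repository: str,
                   upstream: str = "IBM/data-prep-kit") -> bool:
    """Restate the shell conditions of the check_if_push_image job (hypothetical helper)."""
    # Forks never publish: both shell conditions require the upstream repository.
    if repository != upstream:
        return False
    # First condition: a non-pull_request event on the dev branch.
    if ref == "refs/heads/dev" and event_name != "pull_request":
        return True
    # Second condition: any ref under refs/tags/ (bash glob modeled as a prefix match).
    return ref.startswith("refs/tags/")


if __name__ == "__main__":
    print(should_publish("refs/heads/dev", "push", "IBM/data-prep-kit"))
    print(should_publish("refs/heads/dev", "pull_request", "IBM/data-prep-kit"))
```

Read this way, the Publish images step runs only for pushes to dev or tag creation in the upstream repository, so pull requests and forks build and test the image without pushing it.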
2 changes: 1 addition & 1 deletion .make.defaults
@@ -475,7 +475,7 @@ endif
 .defaults.test-src:: venv
 	@# Help: Run pytest on the test directory inside the venv
 	source venv/bin/activate; \
-	export PYTHONPATH=../src; \
+	export PYTHONPATH=../src:../: ; \
 	cd test; $(PYTEST) .

 # This is small convenience and the image itself must already be created.
90 changes: 90 additions & 0 deletions transforms/.make.modules
@@ -0,0 +1,90 @@
# Define the root of the local git clone for the common rules to be able
# to know where they are running from.

# Set this, before including .make.defaults, to
# 1 if requirements reference the latest code in the data processing library
# in this repo (that is not yet published to pypi). This is the default setting.
# 0 if the transforms DPK dependencies are on wheels published to
# pypi (e.g. data-prep-toolkit=0.2.1)
#USE_REPO_LIB_SRC=1

# Include a library of common .transform.* targets which most
# transforms should be able to reuse. However, feel free
# to override/redefine the rules below.
include $(REPOROOT)/transforms/.make.transforms

######################################################################
## Default setting for TRANSFORM_RUNTIME uses folder name-- Old layout
TRANSFORM_RUNTIME=ray
TRANSFORM_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).$(TRANSFORM_RUNTIME).transform

venv:: .transforms.ray-venv
	source venv/bin/activate && $(PYTHON) -m pip install $(REPOROOT)/data-connector-lib

test:: .transforms.test-src test-image

clean:: .transforms.clean

## We need to think how we want to do this going forward
set-versions::

## We need to think how we want to do this going forward
build::

image::
	@if [ -e Dockerfile ]; then \
		$(MAKE) image-default ; \
	else \
		echo "Skipping image for $(shell pwd) since no Dockerfile is present"; \
	fi

publish::
	@if [ -e Dockerfile ]; then \
		$(MAKE) publish-default ; \
	else \
		echo "Skipping publish for $(shell pwd) since no Dockerfile is present"; \
	fi

publish-image::
	@if [ -e Dockerfile ]; then \
		$(MAKE) publish-image-default ; \
	else \
		echo "Skipping publish-image for $(shell pwd) since no Dockerfile is present"; \
	fi

test-image::
	@if [ -e Dockerfile ]; then \
		$(MAKE) test-image-default ; \
	else \
		echo "Skipping test-image for $(shell pwd) since no Dockerfile is present"; \
	fi

test-src:: .transforms.test-src

setup:: .transforms.setup

publish-default:: publish-image

publish-image-default:: .transforms.publish-image-ray

test-image-default:: image .transforms.test-image-help .defaults.test-image-pytest .transforms.clean

build-lib-wheel:
	make -C $(REPOROOT)/data-processing-lib build-pkg-dist

image-default:: build-lib-wheel
	@$(eval LIB_WHEEL_FILE := $(shell find $(REPOROOT)/data-processing-lib/dist/*.whl))
	rm -fr dist && mv $(REPOROOT)/data-processing-lib/dist .
	$(eval WHEEL_FILE_NAME := $(shell basename $(LIB_WHEEL_FILE)))
	$(DOCKER) build -t $(DOCKER_IMAGE_NAME) $(DOCKER_BUILD_EXTRA_ARGS) \
		--platform $(DOCKER_PLATFORM) \
		--build-arg EXTRA_INDEX_URL=$(EXTRA_INDEX_URL) \
		--build-arg BASE_IMAGE=$(RAY_BASE_IMAGE) \
		--build-arg BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ') \
		--build-arg WHEEL_FILE_NAME=$(WHEEL_FILE_NAME) \
		--build-arg TRANSFORM_NAME=$(TRANSFORM_NAME) \
		--build-arg GIT_COMMIT=$(shell git log -1 --format=%h) .
	$(DOCKER) tag $(DOCKER_LOCAL_IMAGE) $(DOCKER_REMOTE_IMAGE)
	rm -fr dist


23 changes: 23 additions & 0 deletions transforms/universal/web2parquet/Makefile
@@ -0,0 +1,23 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.modules

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################
# This defines the transform's version number as would be used
# when publishing the wheel. In general, only the micro version
# number should be advanced relative to the DPK_VERSION.
#
# If you change the version numbers, be sure to run "make set-versions" to
# update version numbers across the transform (e.g., pyproject.toml).
#TRANSFORM_VERSION=$(DPK_VERSION)


81 changes: 81 additions & 0 deletions transforms/universal/web2parquet/dpk_web2parquet/config.py
@@ -0,0 +1,81 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the “License”);
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an “AS IS” BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

from argparse import ArgumentParser, Namespace

from data_processing.transform import TransformConfiguration
from data_processing.utils import CLIArgumentProvider
from data_processing.utils import get_logger
from dpk_web2parquet.transform import Web2ParquetTransform

short_name = "web2parquet"
cli_prefix = f"{short_name}_"
urls_cli_param = f"{cli_prefix}urls"
depth_cli_param = f"{cli_prefix}depth"
downloads_cli_param = f"{cli_prefix}downloads"
folder_cli_param = f"{cli_prefix}folder"


logger = get_logger(__name__, "DEBUG")

class Web2ParquetTransformConfiguration(TransformConfiguration):

    """
    Provides support for configuring and using the associated Transform class, including
    configuration with CLI args.
    """

    def __init__(self):
        super().__init__(
            name=short_name,
            transform_class=Web2ParquetTransform
        )

    def add_input_params(self, parser: ArgumentParser) -> None:
        """
        Add Transform-specific arguments to the given parser.
        This will be included in a dictionary used to initialize the Web2ParquetTransform.
        By convention a common prefix should be used for all transform-specific CLI args
        (e.g., noop_, pii_, etc.)
        """
        parser.add_argument(f"--{depth_cli_param}", type=int, default=1,
            help="maximum depth relative to seed URL",
        )
        parser.add_argument(f"--{downloads_cli_param}", type=int, default=1,
            help="maximum number of downloaded URLs",
        )
        parser.add_argument(f"--{folder_cli_param}", type=str, default=None,
            help="Folder where to store downloaded files",
        )
        parser.add_argument(f"--{urls_cli_param}", type=str, default=None,
            help="List of Seed URLs for the crawler",
        )

    def apply_input_params(self, args: Namespace) -> bool:
        """
        Validate and apply the arguments that have been parsed
        :param args: user defined arguments.
        :return: True if validation passes, False otherwise
        """
        captured = CLIArgumentProvider.capture_parameters(args, cli_prefix, False)
        if captured.get("urls") is None:
            logger.error("Parameter web2parquet_urls must specify a seed URL")
            return False

        self.params = self.params | captured
        logger.info(f"web2parquet parameters are : {self.params}")
        return True
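
The prefix convention used by add_input_params and apply_input_params above can be tried out with the standard library alone. The sketch below is an assumption-laden stand-in, not the data_processing implementation: `capture_prefixed`, `build_parser`, and `validate` are hypothetical helpers that only imitate the behavior the code above relies on (select the arguments that start with the prefix, drop the prefix, and reject a missing seed URL).

```python
from argparse import ArgumentParser, Namespace

CLI_PREFIX = "web2parquet_"  # same prefix value as cli_prefix in config.py


def build_parser() -> ArgumentParser:
    """Register the four prefixed options with the defaults used in config.py."""
    parser = ArgumentParser()
    parser.add_argument(f"--{CLI_PREFIX}urls", type=str, default=None)
    parser.add_argument(f"--{CLI_PREFIX}depth", type=int, default=1)
    parser.add_argument(f"--{CLI_PREFIX}downloads", type=int, default=1)
    parser.add_argument(f"--{CLI_PREFIX}folder", type=str, default=None)
    return parser


def capture_prefixed(args: Namespace, prefix: str) -> dict:
    """Hypothetical stand-in: keep only prefixed arguments and strip the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in vars(args).items()
        if key.startswith(prefix)
    }


def validate(captured: dict) -> bool:
    """Mirror the single check in apply_input_params: a seed URL is required."""
    return captured.get("urls") is not None


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--web2parquet_urls", "https://thealliance.ai/", "--web2parquet_depth", "2"]
)
params = capture_prefixed(args, CLI_PREFIX)
print(params, validate(params))
```

Stripping the prefix is what lets the transform receive plain keys such as `urls` and `depth`, which match the keyword arguments used in local.py.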





26 changes: 26 additions & 0 deletions transforms/universal/web2parquet/dpk_web2parquet/local.py
@@ -0,0 +1,26 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the “License”);
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an “AS IS” BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################


from dpk_web2parquet.transform import Web2Parquet

# create parameters

if __name__ == "__main__":
    # Here we show how to run outside of the runtime
    # Create and configure the transform.
    transform = Web2Parquet(urls=['https://thealliance.ai/'],
                            depth=1,
                            downloads=1)
    table_list, metadata = transform.transform()
    #print(f"\noutput table: {table_list}")
    print(f"output metadata : {metadata}")