add dpk_connector to dpk (#637)
* add bluecrawl connector to dpk

Signed-off-by: Hiroya Matsubara <[email protected]>

* update test data

Signed-off-by: Hiroya Matsubara <[email protected]>

* rename

Signed-off-by: Hiroya Matsubara <[email protected]>

* correct build

Signed-off-by: Hiroya Matsubara <[email protected]>

* remove unnecessary file

Signed-off-by: Hiroya Matsubara <[email protected]>

* add documentation

Signed-off-by: Hiroya Matsubara <[email protected]>

* rename folder

Signed-off-by: Hiroya Matsubara <[email protected]>

* renamed library workflow files

Signed-off-by: David Wood <[email protected]>

---------

Signed-off-by: Hiroya Matsubara <[email protected]>
Signed-off-by: David Wood <[email protected]>
Co-authored-by: David Wood <[email protected]>
hmtbr and daw3rd authored Oct 4, 2024
1 parent 2a86cec commit 45bb6e3
Showing 23 changed files with 1,677 additions and 1 deletion.
32 changes: 32 additions & 0 deletions .github/workflows/test-connector-lib.yml
@@ -0,0 +1,32 @@
name: Test Data Connector lib

on:
  workflow_dispatch:
  push:
    branches:
      - "dev"
      - "releases/**"
    tags:
      - "*"
    paths:
      - "data-connector-lib/**"
      - "!data-connector-lib/**.md"
      - ".make.*"
  pull_request:
    branches:
      - "dev"
      - "releases/**"
    paths:
      - "data-connector-lib/**"
      - "!data-connector-lib/**.md"
      - ".make.*"

jobs:
  test-dpk-connector:
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Test dpk_connector
        run: |
          make -C data-connector-lib venv test
@@ -1,4 +1,4 @@
-name: Test DPK libs and (Optionally) Push base DPK images
+name: Test Data Processing libs and (Optionally) Push base DPK images

on:
workflow_dispatch:
49 changes: 49 additions & 0 deletions data-connector-lib/Makefile
@@ -0,0 +1,49 @@
# Use make help, to see the available rules
REPOROOT=..
include $(REPOROOT)/.make.defaults

clean::
	@# Help: Clean up the distribution build and the venv
	rm -rf dist venv
	rm -rf src/*egg-info

.check-env::
	@echo "Checks passed"

setup::

set-versions: .check-env
	$(MAKE) TOML_VERSION=$(DPK_LIB_VERSION) .defaults.update-toml

build:: build-dist

#build:: update-toml .defaults.build-dist
build-dist :: .defaults.build-dist

publish:: publish-dist

publish-dist :: .check-env .defaults.publish-dist

venv:: pyproject.toml
	@# Help: Create the virtual environment using pyproject.toml
	rm -r dist venv || true
	rm -rf src/*egg-info || true
	rm makeenv || true
	$(PYTHON) -m venv venv
	source venv/bin/activate; \
	pip install --upgrade pip; \
	pip install -e .; \
	pip install pytest pytest-mock pytest-datadir pytest-cov moto==5.0.5 markupsafe==2.0.1

image::
	@# Help: Placeholder does nothing for now.
	@echo "Image building for ray is in the works (coming soon)."

# Here we run each test directory of tests and each ray launched test separately, because
# it seems when running multiple ray launch tests in a single pytest run there is some sort of ray.init() duplication.
# pytest-forked was tried, but then we get SIGABRT in pytest when running the s3 tests, some of which are skipped.
# TODO: the following fails. Why? source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) .
.PHONY: test
test:: venv
	@# Help: Use the already-built virtual environment to run pytest on the test directory.
	source venv/bin/activate; $(PYTEST);
30 changes: 30 additions & 0 deletions data-connector-lib/README.md
@@ -0,0 +1,30 @@
# DPK Connector

DPK Connector is a scalable and compliant web crawler developed for data acquisition in LLM development. It is built on [Scrapy](https://scrapy.org/).
For more details read [the documentation](doc/overview.md).

## Virtual Environment

The project uses `pyproject.toml` and a Makefile for operations.
For development, first create the virtual environment:
```shell
make venv
```
and then either activate it
```shell
source venv/bin/activate
```
or set up your IDE to use the `venv` directory when developing in this project.

## Library Artifact Build and Publish

To test, build, and publish the library:
```shell
make test build publish
```

To bump the version number, edit the Makefile to change VERSION and rerun the above. This requires committing both the `Makefile` and the automatically updated `pyproject.toml` file.

## How to use

See [the overview](doc/overview.md).
47 changes: 47 additions & 0 deletions data-connector-lib/doc/overview.md
@@ -0,0 +1,47 @@
# DPK Connector Overview

The Data Prep Kit Connector (DPK Connector) is a Python library for scalable and compliant web crawling.

Features:
- Robots.txt compliance: The Connector honors allow/disallow lists and extended directives such as `Crawl-delay` in a website's robots.txt.
- Sitemap support: The Connector automatically parses sitemap URLs from the input and also tries to discover them from robots.txt.
- User agent and header customization: You can use your own user agent string and request headers.
- Domain and path focus: You can limit the domains and paths accessed by the library.
- MIME type filters: You can restrict which MIME types may be downloaded.
- Parallel processing: Requests to websites are processed in parallel.

## Example usage

```python
from dpk_connector import crawl, shutdown


def main():
    """
    An example of running a crawl.
    """

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        """
        Callback function called when a page has been downloaded.
        You have access to the request URL, response body and headers.
        """
        print(f"url: {url}, headers: {headers}, body: {body[:64]}")

    user_agent = "Mozilla/5.0 (X11; Linux i686; rv:125.0) Gecko/20100101 Firefox/125.0"

    # Start crawling
    crawl(
        ["https://crawler-test.com/"],
        on_downloaded,
        user_agent=user_agent,
        depth_limit=0,
    )  # blocking call

    # Shutdown all crawls
    shutdown()


if __name__ == "__main__":
    main()
```
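The callback above is also a natural place to add your own response filtering. The following is a hedged sketch, not part of the library: only the `on_downloaded` callback signature (`url`, `body`, `headers`) is taken from the example; the helper and the allowed-type set are illustrative assumptions showing how one might keep only HTML responses by inspecting the `Content-Type` header.

```python
# Sketch: callback-side MIME filtering. Only the on_downloaded signature is
# assumed from dpk_connector; content_type_of and ALLOWED_MIME_TYPES are
# hypothetical helpers for illustration.

ALLOWED_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def content_type_of(headers: dict) -> str:
    """Extract the bare MIME type from a Content-Type header value."""
    raw = headers.get("Content-Type", "")
    if isinstance(raw, bytes):
        raw = raw.decode("latin-1")
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return raw.split(";")[0].strip().lower()


def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    # Ignore anything that is not an HTML page.
    if content_type_of(headers) not in ALLOWED_MIME_TYPES:
        return
    print(f"keeping {url} ({len(body)} bytes)")
```

Note that the feature list also mentions built-in MIME type filters; callback-side filtering like this is simply an alternative that works with nothing beyond the documented callback.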
61 changes: 61 additions & 0 deletions data-connector-lib/pyproject.toml
@@ -0,0 +1,61 @@
[project]
name = "dpk_connector"
version = "0.2.2.dev0"
requires-python = ">=3.10"
keywords = [
"data",
"data acquisition",
"crawler",
"web crawler",
"llm",
"generative",
"ai",
"fine-tuning",
"llmapps",
]
description = "Scalable and Compliant Web Crawler"
license = { text = "Apache-2.0" }
readme = { file = "README.md", content-type = "text/markdown" }
authors = [{ name = "Hiroya Matsubara", email = "[email protected]" }]
dependencies = [
"scrapy>=2.11.2",
"pydantic>=2.8.1",
"tldextract>=5.1.2",
]

[project.urls]
Repository = "https://github.com/IBM/data-prep-kit"
Issues = "https://github.com/IBM/data-prep-kit/issues"
Documentation = "https://ibm.github.io/data-prep-kit/"

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"

[project.optional-dependencies]
dev = [
"twine",
"pytest>=7.3.2",
"pytest-dotenv>=0.5.2",
"pytest-env>=1.0.0",
"pre-commit>=3.3.2",
"pytest-cov>=4.1.0",
"pytest-mock>=3.10.0",
"pytest-datadir>=1.5.0",
"moto==5.0.5",
"markupsafe==2.0.1",
]

[options]
package_dir = ["src", "test"]

[options.packages.find]
where = ["src/dpk_connector"]

[tool.pytest.ini_options]
# Currently we use low coverage since we have to run tests separately (see makefile)
#addopts = "--cov --cov-report term-missing --cov-fail-under 25"
markers = ["unit: unit tests", "integration: integration tests"]

[tool.coverage.run]
include = ["src/*"]
13 changes: 13 additions & 0 deletions data-connector-lib/src/dpk_connector/__init__.py
@@ -0,0 +1,13 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

from dpk_connector.core.crawler import async_crawl, crawl, shutdown # noqa
11 changes: 11 additions & 0 deletions data-connector-lib/src/dpk_connector/core/__init__.py
@@ -0,0 +1,11 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
