* add bluecrawl connector to dpk
  Signed-off-by: Hiroya Matsubara <[email protected]>
* update test data
  Signed-off-by: Hiroya Matsubara <[email protected]>
* rename
  Signed-off-by: Hiroya Matsubara <[email protected]>
* correct build
  Signed-off-by: Hiroya Matsubara <[email protected]>
* remove unnecessary file
  Signed-off-by: Hiroya Matsubara <[email protected]>
* add documentation
  Signed-off-by: Hiroya Matsubara <[email protected]>
* rename folder
  Signed-off-by: Hiroya Matsubara <[email protected]>
* renamed library workflow files
  Signed-off-by: David Wood <[email protected]>

---------

Signed-off-by: Hiroya Matsubara <[email protected]>
Signed-off-by: David Wood <[email protected]>
Co-authored-by: David Wood <[email protected]>
Showing 23 changed files with 1,677 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
name: Test Data Connector lib | ||
|
||
on: | ||
workflow_dispatch: | ||
push: | ||
branches: | ||
- "dev" | ||
- "releases/**" | ||
tags: | ||
- "*" | ||
paths: | ||
- "data-connector-lib/**" | ||
- "!data-connector-lib/**.md" | ||
- ".make.*" | ||
pull_request: | ||
branches: | ||
- "dev" | ||
- "releases/**" | ||
paths: | ||
- "data-connector-lib/**" | ||
- "!data-connector-lib/**.md" | ||
- ".make.*" | ||
|
||
jobs: | ||
test-dpk-connector: | ||
runs-on: ubuntu-22.04 | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v4 | ||
- name: Test dpk_connector | ||
run: | | ||
make -C data-connector-lib venv test |
2 changes: 1 addition & 1 deletion
.github/workflows/test-lib.yml → .github/workflows/test-processing-lib.yml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Use make help, to see the available rules | ||
REPOROOT=.. | ||
include $(REPOROOT)/.make.defaults | ||
|
||
clean:: | ||
@# Help: Clean up the distribution build and the venv | ||
rm -rf dist venv | ||
rm -rf src/*egg-info | ||
|
||
.check-env:: | ||
@echo "Checks passed" | ||
|
||
setup:: | ||
|
||
set-versions: .check-env | ||
$(MAKE) TOML_VERSION=$(DPK_LIB_VERSION) .defaults.update-toml | ||
|
||
build:: build-dist | ||
|
||
#build:: update-toml .defaults.build-dist | ||
build-dist :: .defaults.build-dist | ||
|
||
publish:: publish-dist | ||
|
||
publish-dist :: .check-env .defaults.publish-dist | ||
|
||
venv:: pyproject.toml | ||
@# Help: Create the virtual environment using pyproject.toml | ||
rm -r dist venv || true | ||
rm -rf src/*egg-info || true | ||
rm makeenv || true | ||
$(PYTHON) -m venv venv | ||
source venv/bin/activate; \ | ||
pip install --upgrade pip; \ | ||
pip install -e .; \ | ||
pip install pytest pytest-mock pytest-datadir pytest-cov moto==5.0.5 markupsafe==2.0.1 | ||
|
||
image:: | ||
@# Help: Placeholder does nothing for now. | ||
@echo "Image building for ray is in the works (coming soon)." | ||
|
||
# Here we run each test directory of tests and each ray launched test separately, because | ||
# it seems when running multiple ray launch tests in a single pytest run there is some sort of ray.init() duplication. | ||
# pytest-forked was tried, but then we get SIGABRT in pytest when running the s3 tests, some of which are skipped. | ||
# TODO: the following fails. Why? source venv/bin/activate; export PYTHONPATH=../src; cd test; $(PYTEST) . | ||
.PHONY: test | ||
test:: venv | ||
@# Help: Use the already-built virtual environment to run pytest on the test directory. | ||
source venv/bin/activate; $(PYTEST); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# DPK Connector | ||
|
||
DPK Connector is a scalable and compliant web crawler developed for data acquisition towards LLM development. It is built on [Scrapy](https://scrapy.org/). | ||
For more details read [the documentation](doc/overview.md). | ||
|
||
## Virtual Environment | ||
|
||
The project uses `pyproject.toml` and a Makefile for operations. | ||
To do development you should establish the virtual environment | ||
```shell | ||
make venv | ||
``` | ||
and then either activate | ||
```shell | ||
source venv/bin/activate | ||
``` | ||
or set up your IDE to use the venv directory when developing in this project. | ||
|
||
## Library Artifact Build and Publish | ||
|
||
To test, build and publish the library | ||
```shell | ||
make test build publish | ||
``` | ||
|
||
To bump the version number, edit the Makefile to change VERSION and rerun the above. This will require committing both the `Makefile` and the automatically updated `pyproject.toml` file. | ||
|
||
## How to use | ||
|
||
See [the overview](doc/overview.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# DPK Connector Overview | ||
|
||
The Data Prep Kit Connector (DPK Connector) is a Python library for scalable and compliant web crawling. | ||
|
||
Features: | ||
- Robots.txt compliant: The Connector follows allow/disallow lists and some extended directives such as `Crawl-delay` in robots.txt of websites. | ||
- Sitemap support: The Connector automatically parses sitemap URLs given as input and also looks for sitemaps listed in robots.txt. | ||
- User agent and headers customization: You can use your own user agent string and request headers. | ||
- Domain and path focus: You can limit domains and paths accessed by the library. | ||
- MIME type filters: You can restrict which MIME types may be downloaded. | ||
- Parallel processing: Requests to websites are processed in parallel. | ||
|
||
## Example usage | ||
|
||
```python | ||
from dpk_connector import crawl, shutdown | ||
|
||
|
||
def main(): | ||
""" | ||
An example of running a crawl. | ||
""" | ||
|
||
def on_downloaded(url: str, body: bytes, headers: dict) -> None: | ||
""" | ||
Callback function called when a page has been downloaded. | ||
You have access to the request URL, response body and headers. | ||
""" | ||
print(f"url: {url}, headers: {headers}, body: {body[:64]}") | ||
|
||
user_agent = "Mozilla/5.0 (X11; Linux i686; rv:125.0) Gecko/20100101 Firefox/125.0" | ||
|
||
# Start crawling | ||
crawl( | ||
["https://crawler-test.com/"], | ||
on_downloaded, | ||
user_agent=user_agent, | ||
depth_limit=0, | ||
) # blocking call | ||
|
||
# Shutdown all crawls | ||
shutdown() | ||
|
||
|
||
if __name__ == "__main__": | ||
main() | ||
``` |
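
Reviewer note: the overview lists robots.txt compliance, including `Crawl-delay`, as a feature. To illustrate what that kind of check involves, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is not the Connector's implementation (the commit does not show how the Connector parses robots.txt). The robots.txt content, the `example-bot` agent name and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "example-bot"  # hypothetical user agent name

# Allow/disallow check for two URLs on the same (made-up) site.
print(parser.can_fetch(agent, "https://example.com/index.html"))
print(parser.can_fetch(agent, "https://example.com/private/page.html"))

# Crawl-delay requested for this agent, in seconds (None if not specified).
print(parser.crawl_delay(agent))
```

A compliant crawler would skip any URL for which `can_fetch` returns `False` and wait at least the reported delay between requests to the same site.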
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
[project] | ||
name = "dpk_connector" | ||
version = "0.2.2.dev0" | ||
requires-python = ">=3.10" | ||
keywords = [ | ||
"data", | ||
"data acquisition", | ||
"crawler", | ||
"web crawler", | ||
"llm", | ||
"generative", | ||
"ai", | ||
"fine-tuning", | ||
"llmapps", | ||
] | ||
description = "Scalable and Compliant Web Crawler" | ||
license = { text = "Apache-2.0" } | ||
readme = { file = "README.md", content-type = "text/markdown" } | ||
authors = [{ name = "Hiroya Matsubara", email = "[email protected]" }] | ||
dependencies = [ | ||
"scrapy>=2.11.2", | ||
"pydantic>=2.8.1", | ||
"tldextract>=5.1.2", | ||
] | ||
|
||
[project.urls] | ||
Repository = "https://github.com/IBM/data-prep-kit" | ||
Issues = "https://github.com/IBM/data-prep-kit/issues" | ||
Documentation = "https://ibm.github.io/data-prep-kit/" | ||
|
||
[build-system] | ||
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"] | ||
build-backend = "setuptools.build_meta" | ||
|
||
[project.optional-dependencies] | ||
dev = [ | ||
"twine", | ||
"pytest>=7.3.2", | ||
"pytest-dotenv>=0.5.2", | ||
"pytest-env>=1.0.0", | ||
"pre-commit>=3.3.2", | ||
"pytest-cov>=4.1.0", | ||
"pytest-mock>=3.10.0", | ||
"pytest-datadir>=1.5.0", | ||
"moto==5.0.5", | ||
"markupsafe==2.0.1", | ||
] | ||
|
||
[options] | ||
package_dir = ["src", "test"] | ||
|
||
[options.packages.find] | ||
where = ["src/dpk_connector"] | ||
|
||
[tool.pytest.ini_options] | ||
# Currently we use low coverage since we have to run tests separately (see makefile) | ||
#addopts = "--cov --cov-report term-missing --cov-fail-under 25" | ||
markers = ["unit: unit tests", "integration: integration tests"] | ||
|
||
[tool.coverage.run] | ||
include = ["src/*"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# (C) Copyright IBM Corp. 2024. | ||
# Licensed under the Apache License, Version 2.0 (the “License”); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an “AS IS” BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
################################################################################ | ||
|
||
from dpk_connector.core.crawler import async_crawl, crawl, shutdown # noqa |
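
Reviewer note: the package exports `crawl` and `shutdown` (plus `async_crawl`, whose signature is not shown in this commit view). As a rough sketch of how a caller might go beyond printing, the callback below writes each downloaded body to disk. It assumes only the callback signature `(url, body, headers)` and the `crawl(...)` arguments that appear in `doc/overview.md`. The helper names `make_saver` and `run_crawl`, the `downloads` directory and the hash-based file naming are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def make_saver(out_dir: Path) -> Callable[[str, bytes, dict], None]:
    """Return a callback that writes each downloaded body into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        # Hypothetical naming scheme: first 16 hex chars of the URL's SHA-256.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        (out_dir / f"{name}.bin").write_bytes(body)

    return on_downloaded


def run_crawl(seed_url: str, user_agent: str) -> None:
    """Crawl one seed URL and save the results (needs dpk_connector installed)."""
    # Imported lazily so the callback above can be exercised without the package.
    from dpk_connector import crawl, shutdown

    crawl(
        [seed_url],
        make_saver(Path("downloads")),
        user_agent=user_agent,
        depth_limit=0,
    )  # blocking call, as in the overview example
    shutdown()
```

Calling `run_crawl("https://crawler-test.com/", user_agent)` with a user agent string of your choice would mirror the overview example; the callback itself can be tested in isolation by invoking it with fake data.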
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# (C) Copyright IBM Corp. 2024. | ||
# Licensed under the Apache License, Version 2.0 (the “License”); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an “AS IS” BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
################################################################################ |