```
├── LICENSE
├── README.md
├── config/
│   ├── crawler_config.yaml
│   └── settings.py
├── infrastructure/
│   ├── docker/
│   └── terraform/
├── src/
│   ├── crawler/
│   └── tools/
└── pyproject.toml
```
This project implements a configurable web crawler built on Scrapy and Playwright. It supports several URL crawling patterns and stores the results in a PostgreSQL database.

Features:
- Multiple URL crawling strategies
- Playwright integration for JavaScript-rendered content (see the setup sketch after this list)
- PostgreSQL storage
- Configurable crawling patterns and depths
- Docker support
- Terraform infrastructure
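Scrapy and Playwright are typically wired together through the scrapy-playwright package. The sketch below shows that package's standard setup, not necessarily this project's exact wiring (the spider and URL are hypothetical; the real settings live somewhere like `config/settings.py` and `src/crawler/`):

```python
# Sketch of a typical scrapy-playwright setup; this project's actual
# wiring (src/crawler/, config/settings.py) may differ.

# --- In the Scrapy settings module ---
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# --- In a spider module ---
import scrapy

class ExampleSpider(scrapy.Spider):
    """Hypothetical spider: opts a request into Playwright rendering."""
    name = "example"

    def start_requests(self):
        # meta={"playwright": True} renders the page in a real browser,
        # so JavaScript-injected content is present in parse().
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```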
To set up the project:

- Install uv (if not already installed):

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Create a virtual environment with Python 3.11:

  ```bash
  uv venv --python=3.11
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  # Install the project and its dependencies from pyproject.toml
  uv pip install -e .
  ```
- Configure environment variables:

  ```bash
  cp env.example .env
  ```

  Then edit `.env` with your specific configuration.
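The contents of `env.example` are not reproduced here; as a rough sketch of what a `.env` might contain (every key except `LEVEL_DEEP_LOGGING` is a hypothetical placeholder, so use whatever keys `env.example` actually defines):

```bash
# Hypothetical .env sketch; the real keys come from env.example.
POSTGRES_HOST=localhost       # assumed name: database host
POSTGRES_PORT=5432            # assumed name: database port
POSTGRES_DB=crawler           # assumed name: database name
POSTGRES_USER=crawler         # assumed name: database user
POSTGRES_PASSWORD=change-me   # assumed name: database password
LEVEL_DEEP_LOGGING=info       # log level read by the logging setup
```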
The crawler behavior is configured through `config/crawler_config.yaml`. Each crawling target is defined by a category with the following structure:
```yaml
categories:
  - url_seed_root_id: 0
    name: "Category Name"
    description: "Category Description"
    urls:
      - url: "https://example.com"
        type: 1
        target_patterns:
          - ".*\\.pdf$"
        seed_pattern: null
        max_depth: 0
```
The `type` field controls how each URL is treated:

- Type 0: Direct target URL
- Type 1: Single page with target URLs
- Type 2: Pages with both seed and target URLs
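As an illustrative sketch, a config like this can be loaded and its patterns pre-compiled with PyYAML (the function name is hypothetical, not the project's actual API; the keys mirror the YAML above):

```python
# Illustrative sketch only: load crawler_config.yaml with PyYAML and
# compile the regex target patterns up front.
import re
import yaml

def load_categories(path="config/crawler_config.yaml"):
    with open(path, encoding="utf-8") as f:
        config = yaml.safe_load(f)
    categories = config["categories"]
    # Pre-compile target patterns so matching during the crawl is cheap.
    for category in categories:
        for url_entry in category["urls"]:
            url_entry["compiled_targets"] = [
                re.compile(pattern) for pattern in url_entry["target_patterns"]
            ]
    return categories

if __name__ == "__main__":
    for category in load_categories():
        print(category["url_seed_root_id"], category["name"])
```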
To run the spider and process all categories in the configuration:

```bash
python src/run_spider.py
```
You can process a specific category by providing its `url_seed_root_id`:

```bash
python src/run_spider.py --url_seed_root_id 0
```

This will only process the URLs from the category with the matching `url_seed_root_id` in the configuration file. This is useful when you want to:
- Test changes on a single category
- Debug specific crawling patterns
- Resume processing for a particular category
- Split processing across different instances
For example, if your config has:

```yaml
categories:
  - url_seed_root_id: 0
    name: "Torino"
    ...
  - url_seed_root_id: 1
    name: "Bologna"
    ...
```
Running `python src/run_spider.py --url_seed_root_id 0` will only process the "Torino" category.
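A sketch of how this selection could be implemented with argparse (hypothetical; `run_spider.py`'s actual structure may differ):

```python
# Hypothetical sketch of the category selection in run_spider.py;
# the real script's structure may differ.
import argparse
import yaml

def main():
    parser = argparse.ArgumentParser(description="Run the crawler")
    parser.add_argument("--url_seed_root_id", type=int, default=None,
                        help="Process only the category with this id")
    args = parser.parse_args()

    with open("config/crawler_config.yaml", encoding="utf-8") as f:
        categories = yaml.safe_load(f)["categories"]

    if args.url_seed_root_id is not None:
        # Keep only the matching category, e.g. "Torino" for id 0.
        categories = [c for c in categories
                      if c["url_seed_root_id"] == args.url_seed_root_id]

    for category in categories:
        print(f"Crawling category: {category['name']}")
        # ... hand the category's URLs to the Scrapy crawler here ...

if __name__ == "__main__":
    main()
```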
To run using Docker:

```bash
docker-compose -f infrastructure/docker/docker-compose.yml up
```
To clean the database:

```bash
python src/tools/clean_db.py
```
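As a rough sketch of what such a cleanup script might do (the DSN variable and table names below are assumptions, not the actual contents of `clean_db.py`):

```python
# Rough sketch only: truncate crawl tables in PostgreSQL. The DSN env
# var and table names are assumptions, not clean_db.py's actual code.
import os
import psycopg2

def clean_db():
    # Assumed env var; the real script may read its settings differently.
    dsn = os.environ["DATABASE_URL"]
    with psycopg2.connect(dsn) as conn:  # commits on successful exit
        with conn.cursor() as cur:
            # Hypothetical table names for crawl results and seed URLs.
            cur.execute(
                "TRUNCATE TABLE crawl_results, seed_urls "
                "RESTART IDENTITY CASCADE"
            )
    conn.close()

if __name__ == "__main__":
    clean_db()
```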
The project uses logfire for structured logging. The log level can be configured through the `LEVEL_DEEP_LOGGING` environment variable.
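A minimal sketch of how that variable might feed into the logging setup (`logfire.configure()` is the library's standard entry point; how this project actually consumes `LEVEL_DEEP_LOGGING` is an assumption here):

```python
# Sketch: wire LEVEL_DEEP_LOGGING into logfire. How the project actually
# consumes the variable is an assumption.
import os
import logfire

logfire.configure()  # standard logfire setup (reads its own env vars/token)

level = os.getenv("LEVEL_DEEP_LOGGING", "info").lower()

if level == "debug":
    logfire.debug("Deep logging enabled")
# Structured attributes are passed as keyword arguments.
logfire.info("Crawler starting", deep_logging_level=level)
```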
To contribute:

- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.