```
├── LICENSE
├── README.md
├── config/
│   ├── crawler_config.yaml
│   └── settings.py
├── infrastructure/
│   ├── docker/
│   └── terraform/
├── src/
│   ├── crawler/
│   └── tools/
└── pyproject.toml
```
This project implements a configurable web crawler built on Scrapy and Playwright. It supports several URL crawling patterns and stores the results in a PostgreSQL database.

Features:
- Multiple URL crawling strategies
- Playwright integration for JavaScript-rendered content (see the setup sketch after this list)
- PostgreSQL storage
- Configurable crawling patterns and depths
- Docker support
- Terraform infrastructure
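Scrapy and Playwright are typically wired together through the scrapy-playwright package. The sketch below shows that package's standard setup, not necessarily this project's exact wiring (the spider and URL are hypothetical; the real settings live somewhere like `config/settings.py` and `src/crawler/`):

```python
# Sketch of a typical scrapy-playwright setup; this project's actual
# wiring (src/crawler/, config/settings.py) may differ.

# --- In the Scrapy settings module ---
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# --- In a spider module ---
import scrapy

class ExampleSpider(scrapy.Spider):
    """Hypothetical spider: opts a request into Playwright rendering."""
    name = "example"

    def start_requests(self):
        # meta={"playwright": True} renders the page in a real browser,
        # so JavaScript-injected content is present in parse().
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```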
To set up the project:

- Install uv (if not already installed):

  ```bash
  # On macOS and Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Create a virtual environment with Python 3.11:

  ```bash
  uv venv --python=3.11
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  # Install the project and its dependencies from pyproject.toml
  uv pip install -e .
  ```
- Configure environment variables:

  ```bash
  cp env.example .env
  ```

  Then edit `.env` with your specific configuration.
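The contents of `env.example` are not reproduced here; as a rough sketch of what a `.env` might contain (every key except `LEVEL_DEEP_LOGGING` is a hypothetical placeholder, so use whatever keys `env.example` actually defines):

```bash
# Hypothetical .env sketch; the real keys come from env.example.
POSTGRES_HOST=localhost       # assumed name: database host
POSTGRES_PORT=5432            # assumed name: database port
POSTGRES_DB=crawler           # assumed name: database name
POSTGRES_USER=crawler         # assumed name: database user
POSTGRES_PASSWORD=change-me   # assumed name: database password
LEVEL_DEEP_LOGGING=info       # log level read by the logging setup
```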
The crawler behavior is configured through `config/crawler_config.yaml`. Each crawling target is defined by a category with the following structure:
```yaml
categories:
  - url_seed_root_id: 0
    name: "Category Name"
    description: "Category Description"
    urls:
      - url: "https://example.com"
        type: 1
        target_patterns:
          - ".*\\.pdf$"
        seed_pattern: null
        max_depth: 0
```
The `type` field controls how each URL is treated:

- Type 0: Direct target URL
- Type 1: Single page with target URLs
- Type 2: Pages with both seed and target URLs
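As an illustrative sketch, a config like this can be loaded and its patterns pre-compiled with PyYAML (the function name is hypothetical, not the project's actual API; the keys mirror the YAML above):

```python
# Illustrative sketch only: load crawler_config.yaml with PyYAML and
# compile the regex target patterns up front.
import re
import yaml

def load_categories(path="config/crawler_config.yaml"):
    with open(path, encoding="utf-8") as f:
        config = yaml.safe_load(f)
    categories = config["categories"]
    # Pre-compile target patterns so matching during the crawl is cheap.
    for category in categories:
        for url_entry in category["urls"]:
            url_entry["compiled_targets"] = [
                re.compile(pattern) for pattern in url_entry["target_patterns"]
            ]
    return categories

if __name__ == "__main__":
    for category in load_categories():
        print(category["url_seed_root_id"], category["name"])
```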
To run the spider and process all categories in the configuration:

```bash
python src/run_spider.py
```
You can process a specific category by providing its `url_seed_root_id`:

```bash
python src/run_spider.py --url_seed_root_id 0
```

This will only process the URLs from the category with the matching `url_seed_root_id` in the configuration file. This is useful when you want to:
- Test changes on a single category
- Debug specific crawling patterns
- Resume processing for a particular category
- Split processing across different instances
For example, if your config has:

```yaml
categories:
  - url_seed_root_id: 0
    name: "Torino"
    ...
  - url_seed_root_id: 1
    name: "Bologna"
    ...
```
Running `python src/run_spider.py --url_seed_root_id 0` will only process the "Torino" category.
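A sketch of how this selection could be implemented with argparse (hypothetical; `run_spider.py`'s actual structure may differ):

```python
# Hypothetical sketch of the category selection in run_spider.py;
# the real script's structure may differ.
import argparse
import yaml

def main():
    parser = argparse.ArgumentParser(description="Run the crawler")
    parser.add_argument("--url_seed_root_id", type=int, default=None,
                        help="Process only the category with this id")
    args = parser.parse_args()

    with open("config/crawler_config.yaml", encoding="utf-8") as f:
        categories = yaml.safe_load(f)["categories"]

    if args.url_seed_root_id is not None:
        # Keep only the matching category, e.g. "Torino" for id 0.
        categories = [c for c in categories
                      if c["url_seed_root_id"] == args.url_seed_root_id]

    for category in categories:
        print(f"Crawling category: {category['name']}")
        # ... hand the category's URLs to the Scrapy crawler here ...

if __name__ == "__main__":
    main()
```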
To run using Docker:

```bash
docker-compose -f infrastructure/docker/docker-compose.yml up
```
To clean the database:

```bash
python src/tools/clean_db.py
```
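As a rough sketch of what such a cleanup script might do (the DSN variable and table names below are assumptions, not the actual contents of `clean_db.py`):

```python
# Rough sketch only: truncate crawl tables in PostgreSQL. The DSN env
# var and table names are assumptions, not clean_db.py's actual code.
import os
import psycopg2

def clean_db():
    # Assumed env var; the real script may read its settings differently.
    dsn = os.environ["DATABASE_URL"]
    with psycopg2.connect(dsn) as conn:  # commits on successful exit
        with conn.cursor() as cur:
            # Hypothetical table names for crawl results and seed URLs.
            cur.execute(
                "TRUNCATE TABLE crawl_results, seed_urls "
                "RESTART IDENTITY CASCADE"
            )
    conn.close()

if __name__ == "__main__":
    clean_db()
```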
The project uses logfire for structured logging. The log level can be configured through the `LEVEL_DEEP_LOGGING` environment variable.
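A minimal sketch of how that variable might feed into the logging setup (`logfire.configure()` is the library's standard entry point; how this project actually consumes `LEVEL_DEEP_LOGGING` is an assumption here):

```python
# Sketch: wire LEVEL_DEEP_LOGGING into logfire. How the project actually
# consumes the variable is an assumption.
import os
import logfire

logfire.configure()  # standard logfire setup (reads its own env vars/token)

level = os.getenv("LEVEL_DEEP_LOGGING", "info").lower()

if level == "debug":
    logfire.debug("Deep logging enabled")
# Structured attributes are passed as keyword arguments.
logfire.info("Crawler starting", deep_logging_level=level)
```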
To contribute:

- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.