Serverless Web Scraper using AWS services such as Batch and CloudWatch
[Figure: Diagram of scalable serverless scraping structure]

A Python-based web scraper that uses the Scrapy framework with the Playwright plugin to scrape web pages. It is built to be deployed on AWS and is triggered by the Lambda function in the aws folder.
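To make the trigger path concrete, here is a minimal sketch of how a Lambda handler might submit a scraping job to AWS Batch. It is not the code in the aws folder. The job queue name, the job definition name, the event shape ({"url": ...}), and the helper build_job_request are all assumptions made for illustration. The only AWS call used is the boto3 Batch client's submit_job.

```python
import re

# Hypothetical resource names; replace with the queue and job definition
# that actually exist in your AWS account.
JOB_QUEUE = "scraper-job-queue"
JOB_DEFINITION = "scraper-job-definition"


def build_job_request(url):
    """Build keyword arguments for the Batch submit_job call (illustrative helper)."""
    # Batch job names accept letters, digits, hyphens and underscores
    # (up to 128 characters), so other characters are replaced and the
    # result is truncated.
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", url)[:100]
    return {
        "jobName": f"scrape-{safe}",
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
        # Assumes the container runs the scraper the same way as on the command line.
        "containerOverrides": {
            "command": ["python", "pyscraper/pyscraper.py", url],
        },
    }


def handler(event, context, batch_client=None):
    """Lambda entry point; assumes the event carries the target URL under "url"."""
    if batch_client is None:
        import boto3  # available by default in the AWS Lambda Python runtime

        batch_client = boto3.client("batch")
    response = batch_client.submit_job(**build_job_request(event["url"]))
    return {"jobId": response["jobId"]}
```

Accepting batch_client as an optional argument lets the handler be exercised locally with a stand-in object, without AWS credentials.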
Requirements:
- Python 3.10.11
- Pip
- Linux
- Git

Installation:
- git clone the repository
- cd into the repo
- activate the virtual environment (source pyscraper/venv/bin/activate)
- pip install -r requirements.txt

Running the scraper:
- cd into the repo
- python pyscraper/pyscraper.py [URL]

Running the tests:
- cd into the repo
- cd pyscraper
- playwright install
- pytest tests/testpyscrape.py -v --forked
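For the run step (python pyscraper/pyscraper.py [URL]), the following is a rough sketch of how a script of that shape could validate its URL argument and route the request through Playwright. It is not the contents of pyscraper.py. The names parse_url, run, and PageSpider are hypothetical. The settings keys and the "playwright" request meta flag are the ones documented by the scrapy-playwright plugin.

```python
import argparse
from urllib.parse import urlparse

# Settings documented by scrapy-playwright for sending requests through Playwright.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def parse_url(argv):
    """Parse the single URL argument and reject anything that is not http(s)."""
    parser = argparse.ArgumentParser(description="Scrape a single page.")
    parser.add_argument("url", help="page to scrape, e.g. https://example.com")
    args = parser.parse_args(argv)
    parts = urlparse(args.url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        parser.error(f"not a valid http(s) URL: {args.url}")  # exits with status 2
    return args.url


def run(url):
    """Crawl one URL with a throwaway spider (requires scrapy and scrapy-playwright)."""
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class PageSpider(scrapy.Spider):  # hypothetical spider, for illustration only
        name = "page"

        def start_requests(self):
            # meta={"playwright": True} asks scrapy-playwright to render the page.
            yield scrapy.Request(url, meta={"playwright": True})

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    process = CrawlerProcess(settings=PLAYWRIGHT_SETTINGS)
    process.crawl(PageSpider)
    process.start()  # blocks until the crawl finishes


# Example wiring for a command-line entry point:
#     import sys
#     run(parse_url(sys.argv[1:]))
```

Keeping argument parsing separate from the crawl means the validation can be checked without launching a browser; the crawl itself still needs the browsers installed via playwright install.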