Serverless Web Scraper

Serverless web scraper built on AWS services such as Batch and CloudWatch.

Diagram of scalable serverless scraping structure

Description

Python-based web scraper that uses the Scrapy framework with the Playwright plugin to scrape web pages. It is built to be deployed on AWS and is triggered by the Lambda function in the aws folder.
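
The trigger path described above can be sketched as a Lambda handler that submits an AWS Batch job for a URL. This is a minimal sketch under stated assumptions, not the code in the aws folder: the event shape ({"url": ...}), the JOB_QUEUE and JOB_DEFINITION environment variables, their default values, and the helper build_job_request are all hypothetical. Only the boto3 Batch submit_job call and its jobName, jobQueue, jobDefinition, and containerOverrides parameters are taken from the AWS SDK.

```python
import os
import re

# Hypothetical configuration: the real queue and job definition names are not
# given in this README, so these environment variables and defaults are assumed.
JOB_QUEUE = os.environ.get("JOB_QUEUE", "scraper-queue")
JOB_DEFINITION = os.environ.get("JOB_DEFINITION", "scraper-job")


def build_job_request(url, job_queue, job_definition):
    """Build the keyword arguments for Batch submit_job for one URL."""
    # Batch job names may only contain letters, digits, hyphens and
    # underscores, so every other character in the URL becomes a hyphen.
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", url)[:100]
    return {
        "jobName": f"scrape-{safe}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Assumes the container runs the scraper the same way as the
        # "Running" section: python pyscraper/pyscraper.py [URL]
        "containerOverrides": {
            "command": ["python", "pyscraper/pyscraper.py", url],
        },
    }


def handler(event, context, batch_client=None):
    """Lambda entry point: submit one Batch job for event["url"] (assumed shape)."""
    if batch_client is None:
        # Imported lazily so the request-building logic can be exercised
        # without boto3 installed.
        import boto3

        batch_client = boto3.client("batch")
    request = build_job_request(event["url"], JOB_QUEUE, JOB_DEFINITION)
    response = batch_client.submit_job(**request)
    return {"jobId": response["jobId"]}
```

Passing the client in as an optional argument keeps the handler testable with a stand-in object; in Lambda the default path creates a real Batch client.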

Getting Started

Dependencies

Python 3.10.11
Pip
Linux
Git

Installing

git clone the repository
cd into the repo
activate the virtual environment (source pyscraper/venv/bin/activate)
pip install -r requirements.txt

Running

cd repo
python pyscraper/pyscraper.py [URL]
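
To illustrate how a script invoked this way could take the URL and hand it to Scrapy with Playwright, here is a minimal sketch. It is not the contents of pyscraper/pyscraper.py: the names parse_url, crawl, main, and PageSpider are hypothetical, and the scraped fields are chosen only for illustration. The settings and the "playwright" request meta key are the ones the scrapy-playwright plugin documents, assuming that is the plugin in use.

```python
import argparse
from urllib.parse import urlparse

# Settings documented by scrapy-playwright to route requests through Playwright.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def parse_url(argv):
    """Read the single URL argument and reject anything that is not http(s)."""
    parser = argparse.ArgumentParser(description="Scrape a single page.")
    parser.add_argument("url", help="page to scrape, e.g. https://example.com")
    args = parser.parse_args(argv)
    parts = urlparse(args.url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        parser.error(f"expected an http(s) URL, got {args.url!r}")
    return args.url


def crawl(url):
    """Run one Playwright-rendered request through Scrapy (hypothetical spider)."""
    # Imported lazily so argument handling works without Scrapy installed.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class PageSpider(scrapy.Spider):
        name = "page"

        # Newer Scrapy releases prefer an async start() method instead.
        def start_requests(self):
            # meta={"playwright": True} asks scrapy-playwright to render the page.
            yield scrapy.Request(url, meta={"playwright": True})

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    process = CrawlerProcess(settings=PLAYWRIGHT_SETTINGS)
    process.crawl(PageSpider)
    process.start()  # blocks until the crawl finishes


def main(argv):
    crawl(parse_url(argv))
```

A script built this way would call main(sys.argv[1:]) at the bottom of the file, so that python pyscraper/pyscraper.py https://example.com scrapes that page.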

Testing

cd repo
cd pyscraper
playwright install
pytest tests/testpyscrape.py -v --forked

Repository: zhaoJoseph/AWS-Serverless-Scraper

Folders and files

Latest commit

History

Repository files navigation

Serverless Web Scraper

Description

Getting Started

Dependencies

Installing

Running

Testing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages