
Web Wanderer

Web Wanderer is a multi-threaded web crawler written in Python, utilizing concurrent.futures.ThreadPoolExecutor and Playwright to efficiently crawl and download web pages. This web crawler is designed to handle dynamically rendered websites, making it capable of extracting content from modern web applications.

Screenshot

How to Use

First, install the required dependencies (see Getting Started with Development below).

Then you can use it either as a CLI tool or as a library.

1. As a Command-Line Interface

python src/main.py https://python.langchain.com/en/latest/

2. As a Library

To start crawling, simply instantiate the MultithreadedCrawler class with the seed URL and optional parameters:

from crawlers import MultithreadedCrawler

crawler = MultithreadedCrawler("https://python.langchain.com/en/latest/")
crawler.start()

The MultithreadedCrawler class is initialized with the following parameters (a complete usage sketch follows the list):

  • seed_url (str): The URL from which the crawling process will begin.
  • output_dir (str): The directory where the downloaded pages will be stored. Defaults to web-wanderer/downloads/<base-url-of-seed>, i.e. a folder named after the seed URL's base.
  • num_threads (int): The number of threads the crawler should use. This determines the level of concurrency during the crawling process. Defaults to 8.
  • done_callback (Callable | None): A callback function that will be called after crawling is successfully done.
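
For illustration, here is a sketch that passes every parameter above. Whether these are accepted as keyword arguments, and the exact signature of the callback, are assumptions based on this list (the callback is assumed to take no arguments), so verify against the source:

from crawlers import MultithreadedCrawler

def on_crawl_done():
    # Assumed no-argument signature; the list above only says the
    # callback runs after crawling completes successfully.
    print("Crawling finished!")

crawler = MultithreadedCrawler(
    "https://python.langchain.com/en/latest/",
    output_dir="downloads/langchain-docs",  # hypothetical output path
    num_threads=4,
    done_callback=on_crawl_done,
)
crawler.start()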

Features

  • Multi-Threaded: Web Wanderer employs multi-threading using the ThreadPoolExecutor, which allows for concurrent fetching of web pages, making the crawling process faster and more efficient.

  • Dynamic Website Support: The integration of Playwright enables Web Wanderer to handle dynamically rendered websites, extracting content from modern web applications that rely on JavaScript for rendering.

  • Queue-Based URL Management: URLs to be crawled are managed using a shared queue, ensuring efficient and organized distribution of tasks among threads.

  • Done Callback: You can optionally set a callback function that will be executed after the crawling process completes successfully, allowing you to perform specific actions or analyze the results (see the sketch after this list).
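
To make these features concrete, here is a minimal standalone sketch of the pattern they describe: a shared queue of URLs, a pool of worker threads, a per-thread Playwright browser, and a "done" step once the pool drains. This is only one way to wire the pieces together, not Web Wanderer's actual internals:

import queue
from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright

NUM_THREADS = 4
url_queue: queue.Queue = queue.Queue()
url_queue.put("https://example.com/")

def worker() -> None:
    # Each thread gets its own Playwright instance and browser,
    # since Playwright's sync objects must not be shared across threads.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                # Queue drained. A real crawler would also enqueue
                # links discovered on each page before giving up.
                break
            page.goto(url)
            html = page.content()  # fully rendered HTML, ready to save
            print(f"fetched {url} ({len(html)} bytes)")
        browser.close()

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    for _ in range(NUM_THREADS):
        pool.submit(worker)

# The with-block waits for all workers to finish, so a done callback
# can simply run here.
print("crawl complete")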

Dependencies

Web Wanderer relies on the following libraries:

  • playwright: To handle dynamically rendered websites and interact with web pages.
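
The README doesn't show how Web Wanderer drives Playwright internally, but as a rough illustration of why it's needed: a plain HTTP GET returns the page before any JavaScript runs, whereas something like this captures the rendered DOM:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://python.langchain.com/en/latest/")
    # Wait for network activity to settle so JS-rendered content exists.
    page.wait_for_load_state("networkidle")
    html = page.content()  # the rendered DOM, not the raw response body
    browser.close()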

Getting Started with Development

Note: This project has only been tested with Python 3.11.4.

  1. Clone the repository:

git clone https://github.com/biraj21/web-wanderer.git
cd web-wanderer

  2. Install and set up pipenv.

  3. Activate the virtual environment:

pipenv shell

  4. Install the dependencies:

pipenv install

  5. Install a headless browser with Playwright:

playwright install

Planned things

  • Replace pipenv with Poetry, because pipenv has been a pain
  • asyncio crawler
  • trio crawler (because why not)
  • Allow choosing between HTML engine (requests/aiohttp) & JavaScript engine (Playwright)

I'll get to these when I find the time.

List created on 30th Nov, 2024

Happy web crawling with Web Wanderer! 🕸️🚀
