WebCrawler

This project scrapes information about books from the Flipkart website (title, description and ratings) and stores it in a JSON file. The stored information is then fed to a TF-IDF algorithm, which ranks the books and outputs those most relevant to a query. Steps to follow:

Install the Selenium library by running pip install -U selenium from your command prompt
Install the Beautiful Soup library by running pip install beautifulsoup4
Create an empty bookdetails.json file and an empty books.txt file in the folder where you have stored the file webcrawl.py
Run the file webcrawl.py - it opens multiple browser windows (I have used Chrome here - you could use any browser that Selenium supports) and starts scraping the information.
Once the script starts running, the links to all the books are stored in the file books.txt
Each link is then picked up in turn, and the information about that book is written to the JSON file
The previously empty bookdetails.json file now contains all the information about the books that have been scraped.
Run Ranking.py to run the tf-idf ranking algorithm
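
The ranking step can be illustrated with a small, self-contained sketch. This is not the code in Ranking.py; it is a minimal TF-IDF example using only the Python standard library. The record fields (title, description, rating), the sample books, and the helper names tokenize and rank_books are all assumptions made for illustration, and the real bookdetails.json layout may differ.

```python
import json
import math
import re
from collections import Counter

# Hypothetical sample records; the actual bookdetails.json schema may differ.
SAMPLE_JSON = """
[
  {"title": "Learning Python",
   "description": "A practical introduction to Python programming",
   "rating": 4.5},
  {"title": "Cooking Basics",
   "description": "Simple recipes for everyday cooking",
   "rating": 4.1},
  {"title": "Python Data Analysis",
   "description": "Data analysis with Python and pandas",
   "rating": 4.3}
]
"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_books(books, query):
    """Return (score, title) pairs sorted from most to least relevant."""
    # Treat each book's title plus description as one document.
    docs = [tokenize(b["title"] + " " + b["description"]) for b in books]
    n_docs = len(docs)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for book, tokens in zip(books, docs):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Term frequency normalised by document length,
                # multiplied by the inverse document frequency.
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, book["title"]))
    return sorted(scores, reverse=True)


books = json.loads(SAMPLE_JSON)
ranking = rank_books(books, "python programming")
for score, title in ranking:
    print(f"{score:.3f}  {title}")
```

With these sample records, a book that mentions both query terms scores higher than one that mentions only "python", and a book that mentions neither scores zero. A term that appears in every document gets an IDF of zero under this plain formulation, so a smoothed IDF may be preferable for small collections.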
