This project scrapes information about books (title, description and ratings) from the Flipkart website and stores it in a JSON file. The stored information is then fed to a TF-IDF algorithm, which ranks the books and outputs the ones most relevant to a query.

Steps to follow:
- Install Selenium by running pip install -U selenium from your command prompt
- Install Beautiful Soup by running pip install beautifulsoup4
- Create an empty bookdetails.json file and an empty books.txt file in the same folder as webcrawl.py
- Run webcrawl.py. It opens multiple browser windows (I have used Chrome here, but you could use another web browser of your choice) and starts scraping the information.
- Once the script starts running, the links to all the books are stored in books.txt
- Each link is then picked up, and the information about that book is stored in the JSON file
- The previously empty bookdetails.json file now holds all the scraped information about the books.
- Run Ranking.py to rank the books with the TF-IDF algorithm
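
To give a feel for what the ranking step does, here is a minimal sketch of TF-IDF ranking in plain Python. It is not the code in Ranking.py, which may work differently. The record fields (title, description, rating), the sample books, the function names and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical records, shaped like what bookdetails.json might contain.
# The field names and values are assumptions, not real scraped data.
SAMPLE_BOOKS = [
    {"title": "Learning Python",
     "description": "An introduction to Python programming",
     "rating": 4.5},
    {"title": "Cooking Basics",
     "description": "Simple recipes for everyday cooking",
     "rating": 4.1},
    {"title": "Python Cookbook",
     "description": "Recipes for solving Python programming problems",
     "rating": 4.6},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_books(books, query, top_n=3):
    """Return up to top_n (score, title) pairs, best match first."""
    # One token list per book, built from the title and description together.
    docs = [tokenize(b["title"] + " " + b["description"]) for b in books]
    n_docs = len(docs)

    # Document frequency: the number of books in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    query_terms = tokenize(query)
    scored = []
    for book, doc in zip(books, docs):
        if not doc:
            continue  # nothing to score for an empty record
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Smoothed IDF (one common variant, chosen here as an
                # assumption) so terms shared by all books still count.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(doc)) * idf
        if score > 0:
            scored.append((score, book["title"]))

    # Highest score first; books that match no query term are left out.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


for score, title in rank_books(SAMPLE_BOOKS, "python programming"):
    print(f"{score:.3f}  {title}")
```

In practice the list of books would come from loading bookdetails.json with json.load instead of the hard-coded sample. Books that share no words with the query are dropped from the result rather than shown with a score of zero.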