WebCrawler

This project scrapes information about books (title, description and rating) from the Flipkart website and stores it in a JSON file. A tf-idf algorithm is then run over the stored information to rank the books and output those most relevant to a query. Steps to follow:

Install Selenium by running pip install -U selenium from your command prompt.
Install Beautiful Soup by running pip install beautifulsoup4.
Create an empty bookdetails.json file and an empty books.txt file in the folder that contains webcrawl.py.
Run webcrawl.py. It opens multiple browser windows (Chrome is used here, but you could use another browser supported by Selenium) and starts scraping the information.
Once the script is running, the links to all the books are stored in books.txt.
Each link is then picked up in turn, and the information about that book is stored in a JSON file.
When the crawl finishes, the previously empty bookdetails.json file contains all the book information that was scraped.
Run Ranking.py to run the tf-idf ranking algorithm.
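
The ranking step can be illustrated with a minimal tf-idf sketch. This is not the code in Ranking.py; it only shows the general technique using the Python standard library. The record layout (title, description and rating keys), the function names tokenize and rank_books, and the sample books are all assumptions made for illustration. The idf term uses log(N / df) + 1, which is one common variant among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_books(books, query, top_n=5):
    """Return up to top_n (score, title) pairs, highest tf-idf score first."""
    # One document per book: title and description combined.
    docs = [tokenize(b["title"] + " " + b["description"]) for b in books]
    n_docs = len(docs)

    # Document frequency: the number of books that contain each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for book, tokens in zip(books, docs):
        if not tokens:
            continue
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in counts:
                tf = counts[term] / len(tokens)
                # The +1 keeps terms that occur in every book from scoring zero.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        if score > 0:
            scored.append((score, book["title"]))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical records standing in for the contents of bookdetails.json.
sample_books = [
    {"title": "Learning Python",
     "description": "A practical introduction to Python programming.",
     "rating": 4.5},
    {"title": "Garden Design",
     "description": "Planning and planting a small garden.",
     "rating": 4.1},
    {"title": "Data Structures",
     "description": "Algorithms and data structures with examples in Java.",
     "rating": 4.3},
]

for score, title in rank_books(sample_books, "python programming"):
    print(f"{score:.3f}  {title}")
```

If bookdetails.json holds a list of records with these keys (an assumption about its layout), the records could be loaded with json.load and passed to rank_books in place of sample_books. The rating field is not used in this sketch; it could be used to break ties between books with equal scores.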
