This project comprises a web crawler that collects book information from Flipkart. Given a query, the program retrieves the books most relevant to it, using tf-idf for information retrieval.

sharanya17410/webcrawler

WebCrawler

The project scrapes the title, description, and ratings of books from the Flipkart website and stores them in a JSON file. The stored information is then fed to a tf-idf algorithm, which ranks the books and outputs those most relevant to the query. Steps to follow:

  1. Install Selenium by running pip install -U selenium from your command prompt.
  2. Install Beautiful Soup by running pip install beautifulsoup4.
  3. Create an empty bookdetails.json file and an empty books.txt file in the folder that contains webcrawl.py.
  4. Run webcrawl.py. It opens multiple browser windows (Chrome is used here, but you could use any web browser of your choice) and starts scraping the information.
  5. Once the crawler starts running, the links to all the books are stored in books.txt.
  6. Each link is then visited, and the information about each book is stored in a JSON file.
  7. The previously empty bookdetails.json file now holds all the scraped book information.
  8. Run Ranking.py to run the tf-idf ranking algorithm.
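
To illustrate the ranking step, here is a minimal sketch of tf-idf scoring using only the Python standard library. It is not the code in Ranking.py. The record layout (a list of dictionaries with "title" and "description" keys), the function names tokenize and rank_books, the sample books, and the smoothed idf formula are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_books(books, query, top_k=3):
    """Return up to top_k (score, title) pairs, best match first.

    Assumes each book is a dict with "title" and "description" keys;
    the real layout of bookdetails.json may differ.
    """
    docs = [tokenize(b["title"] + " " + b["description"]) for b in books]
    n_docs = len(docs)

    # Document frequency: the number of books that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scored = []
    for book, tokens in zip(books, docs):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Length-normalised term frequency times a smoothed idf
                # (the smoothing is a choice made for this sketch).
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scored.append((score, book["title"]))
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical records; in the project these would be read from bookdetails.json.
books = [
    {"title": "Learning Python", "description": "An introduction to Python programming."},
    {"title": "Indian Cooking", "description": "Recipes from home kitchens across India."},
    {"title": "Python Cookbook", "description": "Recipes for writing idiomatic Python code."},
]

for score, title in rank_books(books, "python recipes"):
    print(f"{score:.3f}  {title}")
```

Books that match none of the query terms get a score of zero and are left out of the result, so a query with no matches returns an empty list.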
