Web Scraping a WebPage and using NLP to deduce tags/topics of the page

coolkp/Scrape-NLP

Scraping a Webpage and Deducing Its Topic Using NLP

Project Setup

pip install Scrapy beautifulsoup4
pip install spacy && python3 -m spacy download en_core_web_sm

Project Run

  • All links to be scraped are listed in ../brightedge/spiders/main_spider.py
  • The scraped data is cleaned and stored in "domainname".txt in the base of this directory
  • The generated list of topics is stored in "domainname_tags.txt" in the base of this directory
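
The output naming described above can be sketched as follows. This is only an illustration: the helper name `output_paths` is hypothetical, and it assumes that "domainname" means the URL's host with any leading "www." removed, which this README does not confirm.

```python
from urllib.parse import urlparse


def output_paths(url):
    """Return (text_file, tags_file) names for a crawled URL.

    Assumption: "domainname" is the URL host without a leading "www.".
    The actual spider may derive the name differently.
    """
    # hostname is lowercased and excludes any port; it is None for URLs without a scheme
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{host}.txt", f"{host}_tags.txt"


if __name__ == "__main__":
    text_file, tags_file = output_paths("https://www.example.com/products/toaster?id=1")
    print(text_file)
    print(tags_file)
```

Under that assumption, a page on www.example.com would produce example.com.txt for the cleaned text and example.com_tags.txt for the topics.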

Run

scrapy crawl scraper

Results

The tags are printed to the terminal as a list immediately after each webpage is crawled. They are also stored in a txt file in the base of the directory, and for every crawl they are held in the tags variable in main_spider.py.
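
This README does not describe how the tags are computed. As a rough sketch of one possible approach (not necessarily what main_spider.py does), the cleaned page text could be split into noun chunks with spaCy's en_core_web_sm model and the most frequent chunks kept as tags. The function names `extract_candidates` and `rank_tags` are hypothetical.

```python
from collections import Counter


def extract_candidates(text):
    """Return lowercased noun-chunk lemmas from text (assumed approach, needs spaCy)."""
    import spacy  # imported lazily so rank_tags can be used without spaCy installed

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    candidates = []
    for chunk in doc.noun_chunks:
        # Keep lemmas of the chunk's tokens, skipping stop words and punctuation
        words = [tok.lemma_.lower() for tok in chunk if not (tok.is_stop or tok.is_punct)]
        if words:
            candidates.append(" ".join(words))
    return candidates


def rank_tags(candidates, top_n=5):
    """Return up to top_n candidates ordered from most to least frequent."""
    return [tag for tag, _count in Counter(candidates).most_common(top_n)]


if __name__ == "__main__":
    sample = "The toaster has two slots. This toaster also ships with a bagel setting."
    print(rank_tags(extract_candidates(sample)))
```

Ranking by raw frequency is the simplest choice here; a real pipeline might instead weight terms found in titles or headings more heavily.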
