Scraping Webpage and Deducing its Topic using NLP

Project Setup

pip install Scrapy beautifulsoup4
pip install spacy && python3 -m spacy download en_core_web_sm

Project Run

All links to be scraped are listed in ../brightedge/spiders/main_spider.py.
The scraped data is cleaned and stored in "domainname".txt in the base of this directory.
The generated lists of topics are stored in "domainname_tags.txt" in the base of this directory.
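
The naming scheme above could be sketched roughly as follows. This is an illustration only, not the code in main_spider.py: the helper name output_paths is hypothetical, and it assumes that "domainname" means the first label of the host after dropping a leading "www.".

```python
from urllib.parse import urlparse


def output_paths(url: str) -> tuple[str, str]:
    """Return (text_file, tags_file) names for the page at `url`.

    Hypothetical helper: the real spider may derive the name differently.
    """
    host = urlparse(url).hostname or "unknown"
    # Drop a leading "www." so www.example.com and example.com share files.
    if host.startswith("www."):
        host = host[4:]
    # Assumed convention: the first label of the host is the "domainname".
    domain = host.split(".")[0]
    return f"{domain}.txt", f"{domain}_tags.txt"


if __name__ == "__main__":
    print(output_paths("https://www.example.com/some/page"))
```

Under this assumption, every page from the same site maps to the same pair of files, so a crawl of several pages from one domain would need to append rather than overwrite.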

Run

scrapy crawl scraper

Results

The topics are printed in the terminal as a list immediately after each webpage is crawled. They are also stored in a txt file in the base of the directory. For every crawl, they are stored in the tags variable in main_spider.py.
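
A minimal sketch of the print-and-store behaviour described above, under stated assumptions. The function name record_tags is hypothetical, and the on-disk layout (one tag per line, appended per page) is a guess; the actual file format written by main_spider.py may differ.

```python
from pathlib import Path


def record_tags(tags: list[str], tags_file: str) -> None:
    """Print the tags for one page and append them to `tags_file`.

    Assumed format: one tag per line. This is not confirmed by the project.
    """
    # Show the list in the terminal right after the page is processed.
    print(tags)
    if not tags:
        return  # Nothing to store for this page.
    # Append so that tags from earlier pages of the same crawl are kept.
    with Path(tags_file).open("a", encoding="utf-8") as fh:
        fh.write("\n".join(tags) + "\n")


if __name__ == "__main__":
    record_tags(["python", "web scraping"], "example_tags.txt")
```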

coolkp/Scrape-NLP
