pip install Scrapy beautifulsoup4
pip install spacy && python3 -m spacy download en_core_web_sm
- All links to be scraped are listed in ../brightedge/spiders/main_spider.py
- The scraped data is cleaned and stored in "domainname".txt in the base of this directory
- The generated lists of topics are stored in "domainname_tags.txt" in the base of this directory
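
The file naming described above could be sketched roughly as follows. This is only an illustration, not the code in main_spider.py: the helper name `output_paths` is hypothetical, and it assumes that "domainname" means the page's host name with any leading "www." removed, which the project may handle differently.

```python
from urllib.parse import urlparse


def output_paths(url):
    """Return (text_file, tags_file) names for the page at `url`.

    Hypothetical helper: assumes "domainname" is the host name
    without a leading "www." (an assumption, not confirmed by the project).
    """
    # .hostname is lowercased and excludes any port; it is None for bad URLs.
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{host}.txt", f"{host}_tags.txt"


if __name__ == "__main__":
    text_file, tags_file = output_paths("https://www.example.com/blog/post")
    print(text_file)  # cleaned page text would go here
    print(tags_file)  # generated topics would go here
```

Under these assumptions, every page crawled from the same site maps to the same pair of files in the base directory.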
scrapy crawl scraper
The generated topics are printed to the terminal as a list immediately after each webpage is crawled.
They are also stored in a txt file in the base of this directory.
For every crawl, they are also stored in the tags variable in main_spider.py.