Scrape and Store Dark Web Sites along with their Latency | Crawler for Dark Web | Search Engine Oriented
The crawler is currently implemented using BFS; in time, the implementation will be changed to A* Search (a preferential crawler).
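For reference, here is a minimal sketch of what BFS-ordered crawling looks like. This is not the repository's actual spider code; `fetch_onion_links`, the seed URL, and `max_pages` are hypothetical placeholders.

```python
# Minimal BFS crawling sketch (illustrative only, not this repo's spider code).
from collections import deque

def bfs_crawl(seed, fetch_onion_links, max_pages=100):
    """Visit pages level by level, closest to the seed first."""
    queue = deque([seed])   # FIFO frontier gives breadth-first order
    visited = {seed}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        # fetch_onion_links is a hypothetical helper returning the .onion links on a page
        for link in fetch_onion_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)  # an A* crawler would instead pop the best-scored link next
    return visited
```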
Features:
- fetch onion links
- recursive fetching
- store scraped data
- user-added URLs
- URL blacklisting (see the filtering sketch after this list)
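To illustrate the URL blacklisting feature, here is a small sketch of how blacklisted onion domains could be filtered out before being queued. The blacklist file name and helper functions are assumptions, not the project's actual code.

```python
# URL blacklisting sketch (file name and helpers are assumptions, not this repo's code).
from urllib.parse import urlparse

def load_blacklist(path="blacklist.txt"):
    """Read one blacklisted onion domain per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def is_allowed(url, blacklist):
    """Keep only URLs whose host is not blacklisted."""
    return urlparse(url).hostname not in blacklist
```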
Planned improvements:
- Increase crawl depth (see the settings sketch after this list)
- Add more starter links
- Create more spiders with a special focus on directories
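If the crawl depth is controlled through Scrapy's standard settings (an assumption; the repository may cap depth elsewhere), increasing it would look roughly like this in settings.py:

```python
# settings.py sketch, assuming depth is controlled via Scrapy's built-in settings
DEPTH_LIMIT = 5     # follow links up to 5 hops from the start URLs (0 means unlimited)
DEPTH_PRIORITY = 1  # positive value schedules shallower requests first (breadth-first tendency)
```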
Spiders:
- DRL (Link Dir Onion): a big directory of onion URLs
- UADD (User Added): URLs added by the user
  - Presently, links are appended to user_added_urls.txt under spider_data (see the spider sketch after this list)
  - Crawled in exactly the same fashion as DRL
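Below is a rough sketch of what a UADD-style spider could look like, reading its start URLs from user_added_urls.txt. The class name, parsing logic, and yielded fields are illustrative assumptions rather than the repository's actual spider.

```python
# Illustrative UADD-style spider (parsing logic and field names are assumptions).
import scrapy

class UserAddedSpider(scrapy.Spider):
    name = "UADD"

    def start_requests(self):
        # user-added links are appended to this file, one URL per line
        with open("spider_data/user_added_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # store the page, then recursively follow any .onion links found on it
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            if ".onion" in href:
                yield response.follow(href, callback=self.parse)
```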
- Py3
- Tor
pip install -r requirements.txt
Directions to execute (for the commit of Jul 10, 2020)
- This commit contains a pipeline that generates data in a CSV/JSON file
- You can run it without much effort

# start Tor so its SOCKS proxy listens on port 9150, then bridge it to an HTTP proxy on port 8181
pproxy -l http://:8181 -r socks5://127.0.0.1:9150 -vv
scrapy crawl name_of_spider  # e.g. DRL
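Scrapy has no built-in SOCKS5 support, which is presumably why pproxy is used to expose an HTTP proxy in front of Tor's SOCKS port. One common way to route requests through that HTTP proxy is Scrapy's built-in HttpProxyMiddleware via the request meta key; how this repository actually wires the proxy (settings, middleware, or per-request meta) is an assumption here.

```python
# Sketch: send requests through the pproxy HTTP endpoint on port 8181 (wiring is an assumption).
import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied_example"                             # hypothetical spider name
    start_urls = ["http://exampleonionaddress.onion/"]   # placeholder seed

    def start_requests(self):
        for url in self.start_urls:
            # the built-in HttpProxyMiddleware honors the 'proxy' meta key
            yield scrapy.Request(url, meta={"proxy": "http://127.0.0.1:8181"})

    def parse(self, response):
        yield {"url": response.url}
```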
- This commit contains a data pipeline that saves the scraped data to a MongoDB server (see the pipeline sketch below)
- You need to set up the MongoDB server connection credentials and URI in the settings.py / pipeline.py files
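Here is a minimal sketch of what such a MongoDB item pipeline typically looks like, modeled on the standard Scrapy + pymongo pattern. The MONGO_URI / MONGO_DATABASE setting names and the collection name are assumptions; the real values belong in this repository's settings.py / pipeline.py.

```python
# pipeline.py sketch: standard Scrapy + pymongo item pipeline (names are assumptions).
import pymongo

class MongoPipeline:
    collection_name = "onion_pages"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "darkweb"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # each scraped item becomes one MongoDB document
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

The pipeline would then be enabled through the ITEM_PIPELINES setting in settings.py.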
Idea hamster: Angad Sharma

Developer: 1UC1F3R616 (Kush Choudhary)

Developer's note: I had prior experience with web scraping, but I hadn't worked with web crawlers before, and scraping the deep web was also new to me as it required setting up a Tor proxy. This project developed my interest in web mining and encouraged me to take it up as a subject in my college curriculum.
Made with ❤️ by DSC VIT