Submission of Team MA-217164 in the Web Crawler competition organised by TechFest, IIT Bombay
We had to develop a web crawler that identifies the following key components:
- SSL certificate compliance – check that every hyperlink on the site uses https:// and verify that the site's SSL certificate is valid (a sketch of this check appears after this list).
- Cookie checker – scan the cookies set by the website and check for cookie consent verification links (see the sketch after this list).
- ADA compliance
  - Alt text present on all images.
  - Color contrast for the site as per w3.org guidelines.
  - Site markup checked for null tab index values (a markup-audit sketch covering alt text and tab index appears after this list).
- The program can be run individually for each type of check, and there is also a combined script that executes all tasks.
- We have used Streamlit to render the results in a web interface instead of printing them in the terminal. The SSL certificate details (if enabled), the cookies present, the verification attribute, information about null tab index, and the image tags without alt text for a website are displayed at a local URL.
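For the SSL compliance check, a minimal sketch along these lines can verify the certificate with the standard ssl and socket modules and collect plain-http hyperlinks with requests and BeautifulSoup (this is an illustration with github.com as the example host, not the exact submitted code):

```python
import ssl
import socket
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def get_certificate(hostname, port=443):
    """Open a TLS connection and return the peer certificate.

    ssl.create_default_context() validates the certificate, so an invalid
    or expired certificate raises an ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

def find_insecure_links(url):
    """Return every absolute hyperlink on the page that uses plain http://."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    return [href for href in hrefs if urlparse(href).scheme == "http"]

if __name__ == "__main__":
    print(get_certificate("github.com"))
    print(find_insecure_links("https://github.com"))
```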
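For the cookie checker, the cookies a site sets can be read from requests' cookie jar; the sketch below is an assumption about the approach rather than the submitted implementation:

```python
import requests

def scan_cookies(url):
    """Fetch the page and report the cookies set on the session."""
    session = requests.Session()
    session.get(url, timeout=10)
    return [
        {
            "name": cookie.name,
            "domain": cookie.domain,
            "secure": cookie.secure,    # True if only sent over https
            "expires": cookie.expires,  # None for session cookies
        }
        for cookie in session.cookies
    ]

if __name__ == "__main__":
    for entry in scan_cookies("https://github.com"):
        print(entry)
```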
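For the markup portion of the ADA checks, one way (assumed here, not taken verbatim from the repository) to flag images without alt text and elements with a null tab index is to parse the page with BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

def audit_markup(url):
    """Return <img> tags missing alt text and elements with a null tabindex."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    # <img> tags that have no alt attribute, or an empty one
    missing_alt = [img for img in soup.find_all("img") if not img.get("alt")]

    # elements that carry a tabindex attribute with no usable value
    null_tabindex = [
        tag for tag in soup.find_all(attrs={"tabindex": True})
        if not tag.get("tabindex", "").strip()
    ]
    return missing_alt, null_tabindex

if __name__ == "__main__":
    imgs, tabs = audit_markup("https://github.com")
    print(f"{len(imgs)} images without alt text, {len(tabs)} null tab index elements")
```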
The following Python libraries and modules are used:
- ssl
- socket
- prettytable
- streamlit
- beautifulsoup
- requests
- urllib
- Clone the repository
git clone https://github.com/Rajarshi1001/webCrawler.git
- Install the requirements
pip install -r requirements.txt
py -m pip install streamlit
Specify the URL using the --link option while executing the script.
The script displays the SSL details, verification and details of the cookies used by the website, img tags without alt text, and elements with a null tab index (e.g. https://github.com).
py -m streamlit run script.py -- --link https://github.com
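Everything after the standalone -- is passed through to the script by Streamlit, which is how --link reaches script.py. The snippet below only sketches that pattern; run_all_checks is a hypothetical placeholder, not a function from the repository:

```python
import argparse

import streamlit as st

# Streamlit forwards the arguments after "--" to the script's sys.argv,
# so a normal argparse parser can read them.
parser = argparse.ArgumentParser()
parser.add_argument("--link", required=True, help="URL of the site to crawl")
args = parser.parse_args()

st.title("Web Crawler Report")
st.write(f"Results for {args.link}")
# results = run_all_checks(args.link)  # hypothetical helper combining all checks
# st.table(results)                    # render the findings in the browser
```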
First change into the Scrapy project directory (the folder name appears twice):
cd .\webCralTF\webCralTF\
then run
scrapy crawl spidey
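For reference, a spider registered under the name spidey would look roughly like the sketch below; the actual spider inside webCralTF may crawl differently, and the start URL here is only an example:

```python
import scrapy

class SpideySpider(scrapy.Spider):
    name = "spidey"                      # the name used by `scrapy crawl spidey`
    start_urls = ["https://github.com"]  # example only

    def parse(self, response):
        # yield every hyperlink on the page and whether it uses https
        for href in response.css("a::attr(href)").getall():
            yield {"link": href, "https": href.startswith("https://")}
```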
Now head to the directory containing colContr.py and run
python colContr.py
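colContr.py presumably evaluates color contrast against the w3.org guidelines; the core of such a check is the WCAG contrast-ratio formula, sketched below (how the actual script extracts foreground and background colors from the page is not shown here):

```python
def relative_luminance(rgb):
    """Relative luminance per WCAG 2.x for an (R, G, B) tuple in the 0-255 range."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """Contrast ratio between two colors; WCAG AA asks for >= 4.5:1 for normal text."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

if __name__ == "__main__":
    print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0 for black on white
```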