A tool that helps a person classify a list of potential VPM pages into several possible categories. It is part of the IRIS project.
The classifier is written in Python, using the PyQt5 library.
It creates a GUI browser that shows the detected pages one at a time.
You can interact with the browser with the mouse, and you can also use the numeric keypad of the keyboard to select one of the categories.
Once you have chosen the right category for a page, the software moves to the next page.
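The review loop described above can be pictured with a small, GUI-free sketch. It is not the code of classify.py: the class name, the key-to-category mapping, and the example URLs are all hypothetical, and the real tool drives a PyQt5 browser rather than a plain list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewSession:
    """Hypothetical model of the loop: one page at a time, one key per category."""

    pages: list
    # Hypothetical mapping from numeric-keypad keys to category labels.
    key_to_category: dict
    position: int = 0
    decisions: dict = field(default_factory=dict)

    def current_page(self) -> Optional[str]:
        """Return the page being shown, or None once every page is classified."""
        if self.position < len(self.pages):
            return self.pages[self.position]
        return None

    def press(self, key: str) -> bool:
        """Record the category bound to `key` and advance; ignore unknown keys."""
        page = self.current_page()
        if page is None or key not in self.key_to_category:
            return False
        self.decisions[page] = self.key_to_category[key]
        self.position += 1
        return True


session = ReviewSession(
    pages=["https://example.com/a", "https://example.com/b"],
    key_to_category={"1": "category_one", "2": "category_two"},
)
session.press("1")  # classifies the first page and moves to the second
session.press("9")  # key not bound to a category: nothing changes
```

Keeping the decision log separate from the page list means a session can be resumed from `position` without touching earlier choices.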
The recommended setup is to
- Install Git
- Clone this repository with
git clone https://github.com/n3ssuno/iris-classifier.git
- Install Miniconda
- Create an environment with
conda create -n iris-vpm-pages-classifier python=3.9
conda activate iris-vpm-pages-classifier
pip install -r requirements.txt
- If you need to use the pre-classifier, you must also install a headless browser with the following command
playwright install chromium
Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed.
If needed, you can add iris_utils as a submodule with the following commands (this should already be in place after point 2 above):
git submodule add https://github.com/n3ssuno/iris-utils.git iris_utils
git commit -m "Add iris-utils submodule"
git push
- Install qt5-default on the WSL2 distro
- Install X410 on Windows (the free alternatives did not work for me) and select Allow Public Access from its menu
- Add the following line to the ~/.bashrc file of the WSL2 distro (before the block of code about Conda)
export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0
Do not, however, add export LIBGL_ALWAYS_INDIRECT=1, as advised in many online guides.
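To make the DISPLAY line less opaque, here is a Python sketch of what the awk one-liner computes: it takes the second field of the first line containing "nameserver " and appends ":0.0". Under WSL2 that address is typically the Windows host, where X410 is listening. The function name and the sample file content are illustrative assumptions, not part of this repository.

```python
def display_from_resolv_conf(text: str) -> str:
    """Mimic: awk '/nameserver / {print $2; exit}' followed by ':0.0'."""
    host = ""
    for line in text.splitlines():
        if "nameserver " in line:
            fields = line.split()
            # awk prints an empty string when the second field is missing.
            host = fields[1] if len(fields) > 1 else ""
            break  # awk's `exit`: only the first matching line counts
    return f"{host}:0.0"


# Illustrative content; the real file is /etc/resolv.conf inside the WSL2 distro.
sample = "# example resolv.conf\nnameserver 172.22.0.1\nnameserver 8.8.8.8\n"
display = display_from_resolv_conf(sample)
```

As in the shell version, a file without any nameserver entry yields the bare display ":0.0".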
Before you start to classify the pages by hand, you must run pre-classify.py to automatically classify some pages.
This script will create a file with five main categories: cases that are (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown.
The first two cases are classified automatically. For the third and fourth, a hint is provided and the person must decide whether the page is actually a VPM page. The last case is left to the person, without any hint.
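The handling of the five categories can be summarized in a short sketch. The enum, the function, and the hint texts are hypothetical names chosen for illustration; the actual labels and file layout written by pre-classify.py may differ.

```python
from enum import Enum


class PreClass(Enum):
    """The five outcomes (a) to (e) described above; member names are hypothetical."""

    LIKELY_TRUE_POSITIVE = "a"
    LIKELY_FALSE_POSITIVE = "b"
    MAYBE_POSITIVE = "c"
    MAYBE_NEGATIVE = "d"
    UNKNOWN = "e"


def next_step(pre_class: PreClass) -> dict:
    """Say whether a person must review the page and which hint, if any, is shown."""
    if pre_class is PreClass.LIKELY_TRUE_POSITIVE:
        return {"needs_human": False, "is_vpm": True, "hint": None}
    if pre_class is PreClass.LIKELY_FALSE_POSITIVE:
        return {"needs_human": False, "is_vpm": False, "hint": None}
    if pre_class is PreClass.MAYBE_POSITIVE:
        return {"needs_human": True, "is_vpm": None, "hint": "maybe positive"}
    if pre_class is PreClass.MAYBE_NEGATIVE:
        return {"needs_human": True, "is_vpm": None, "hint": "maybe negative"}
    # Unknown: the person decides without any hint.
    return {"needs_human": True, "is_vpm": None, "hint": None}


to_review = [c for c in PreClass if next_step(c)["needs_human"]]
```

Only the last three categories reach the person, which is what makes the pre-classification step worthwhile.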
To use it, you need software that is easy to install on GNU/Linux but hard to set up on MS-Windows. The advice is, therefore, to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; follow the WSL2 instructions in this document).
- Install Tesseract with
sudo apt install tesseract-ocr
- Install Poppler with
sudo apt install poppler-utils
To run the automatic classifier, please run
python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl
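The command above passes two input paths after -I and one output path after -o. A minimal argparse sketch that accepts the same shape is shown below; it is an assumption about how such a command line could be parsed, not the parser actually used by pre-classify.py, and the `dest` and `metavar` names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser accepting the same flags as the command above."""
    parser = argparse.ArgumentParser(description="Sketch of a pre-classifier CLI")
    # -I takes exactly two paths: the scraping results and the exclusion list.
    parser.add_argument("-I", dest="inputs", nargs=2, required=True,
                        metavar=("SCRAPING_RESULTS", "WEBSITES_TO_EXCLUDE"))
    # -o takes the path of the file to write.
    parser.add_argument("-o", dest="output", required=True, metavar="OUTPUT")
    return parser


args = build_parser().parse_args([
    "-I", "data/scraping_results.jsonl", "data/websites_to_exclude.txt",
    "-o", "data/pre_classified.jsonl",
])
scraping_results, websites_to_exclude = args.inputs
```

With `nargs=2`, argparse rejects a call that gives only one input file, so a missing exclusion list is caught before any work starts.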
Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run
python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json
If you want to split the data into sub-databases, so that more than one person can have their own data to classify, you can run
python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N
where N is the number of files that you want to generate.
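One simple way to picture the split into N files is a round-robin assignment of records to N groups. This is only an illustrative sketch: the function name is hypothetical, and write-database.py may partition the data differently.

```python
def split_round_robin(records: list, n: int) -> list:
    """Distribute records over n groups so that group sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parts = [[] for _ in range(n)]
    for index, record in enumerate(records):
        parts[index % n].append(record)
    return parts


# Seven placeholder records shared among three people.
parts = split_round_robin(list(range(7)), 3)
```

Round-robin keeps the workloads balanced without needing to know the total number of records in advance.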
Note: you cannot overwrite the database once it has been created (you can only update it, unless you use Flata's own commands directly). If you want to start over, you must delete the written files and re-run the script.
- Remember, each time, to activate the conda environment created in the setup phase with
conda activate iris-vpm-pages-classifier
- Run
python classify.py -i data/database.json
The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).