IRIS Virtual Patent Marking Pages Classifier

A tool that helps a human classify a list of potential Virtual Patent Marking (VPM) pages into several categories. It is part of the IRIS project.

The classifier is written in Python, using the PyQt5 library.

It opens a GUI browser that shows the detected pages one at a time.

You can interact with the browser with the mouse, and you can use the numeric keypad to select one of the categories.

Once you have chosen the right category for a page, the software moves to the next page.

Set up the classifier

The recommended procedure is as follows.

1. Install Git
2. Clone this repository with git clone https://github.com/n3ssuno/iris-classifier.git
3. Install Miniconda
4. Create an environment with
   - conda create -n iris-vpm-pages-classifier python=3.9
   - conda activate iris-vpm-pages-classifier
   - pip install -r requirements.txt
5. If you need to use the pre-classifier, you must also install a headless browser with the following command
   playwright install chromium
   Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed

If needed, you can add iris_utils as a submodule with the following commands (it should already be in place after you clone the repository):

git submodule add https://github.com/n3ssuno/iris-utils.git iris_utils
git commit -m "Add iris-utils submodule"
git push

GUI classifier on WSL2

Install qt5-default on the WSL2 distro
Install X410 on Windows (the free alternatives did not work for me) and select Allow Public Access from its menu
Add the following line to the ~/.bashrc file of the WSL2 distro (before the block of code about Conda)
export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0
Do not add export LIBGL_ALWAYS_INDIRECT=1, even though many online guides advise it.
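
To make clear what the export DISPLAY line computes, here is a minimal Python sketch of the same logic: it takes the first nameserver entry of a resolv.conf-style text and appends :0.0. The function name display_from_resolv_conf is hypothetical and is not part of this repository; the sketch is also slightly stricter than the awk pattern, because it only accepts lines whose first field is nameserver.

```python
from typing import Optional


def display_from_resolv_conf(text: str) -> Optional[str]:
    """Return '<first nameserver address>:0.0', or None if none is found.

    Hypothetical helper that mirrors the awk one-liner used in ~/.bashrc.
    """
    for line in text.splitlines():
        fields = line.split()
        # Only lines of the form "nameserver <address>" are considered.
        if len(fields) >= 2 and fields[0] == "nameserver":
            return f"{fields[1]}:0.0"
    return None
```

For example, passing the contents of /etc/resolv.conf to this function should give the same value that the shell line assigns to DISPLAY.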

Pre-processing

Before you start classifying the pages by hand, you must run pre-classify.py to classify some of the pages automatically. The script creates a file that sorts the pages into five main categories: (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown.

The first two categories are classified automatically. For the next two, a hint is provided and the human classifier must decide whether the page is actually a VPM page. The last category is left entirely to the human classifier, without any hint.
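
The routing of the five categories can be summarised in a short Python sketch. The enum, its member names, and the two helper functions are hypothetical labels chosen for illustration; the names and the file format actually used by pre-classify.py may differ.

```python
from enum import Enum


class PreClass(Enum):
    """Hypothetical labels for the five pre-classification categories."""
    LIKELY_TRUE_POSITIVE = "a"
    LIKELY_FALSE_POSITIVE = "b"
    MAYBE_POSITIVE = "c"
    MAYBE_NEGATIVE = "d"
    UNKNOWN = "e"


# Categories (a) and (b) are settled automatically.
AUTO_CLASSIFIED = {PreClass.LIKELY_TRUE_POSITIVE, PreClass.LIKELY_FALSE_POSITIVE}
# Categories (c) and (d) go to a human together with a hint.
HINTED = {PreClass.MAYBE_POSITIVE, PreClass.MAYBE_NEGATIVE}


def needs_human(category: PreClass) -> bool:
    """True when a person still has to decide on the page."""
    return category not in AUTO_CLASSIFIED


def has_hint(category: PreClass) -> bool:
    """True when the person is shown a suggested answer."""
    return category in HINTED
```

Under these assumptions, only the unknown category needs a human decision with no hint attached.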

The pre-classifier needs several programs that are easy to install on GNU/Linux but hard to set up on MS Windows. The advice is, therefore, to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; see the section "GUI classifier on WSL2" above).

Install Tesseract with
sudo apt install tesseract-ocr
Install Poppler with
sudo apt install poppler-utils
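
Before running the pre-classifier, it may help to confirm that both packages are reachable from the shell. The sketch below is not part of this repository: the function missing_tools is hypothetical, and it assumes that checking for the tesseract executable and for pdftoppm (one of the programs shipped in poppler-utils) is a good enough proxy for a working installation.

```python
import shutil
from typing import Iterable, List

# Executables assumed to indicate that the two packages are installed.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools() with no arguments returns an empty list when both programs are found, and otherwise lists the ones that still need to be installed.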

To run the automatic classifier, please run
python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl
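
The input and output files use the JSON Lines layout (one JSON object per line). To inspect them, a minimal reader can look like the sketch below. The helper parse_jsonl is hypothetical, and no assumption is made here about which fields the records contain.

```python
import json
from typing import Any, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Any]:
    """Parse an iterable of JSON Lines text, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records
```

A file object can be passed directly, for example with open("data/pre_classified.jsonl", encoding="utf-8") as handle: records = parse_jsonl(handle).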

Populate the database

Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run
python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json

If you want to split the data into sub-databases, so that several people can each have their own data to classify, you can run
python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N
where N is the number of files that you want to generate.
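
How write-database.py distributes the records among the N files is not described here. Purely as an illustration of the idea, the sketch below splits a list of records into N parts of nearly equal size in round-robin order; the function split_round_robin is hypothetical and may not match what the script does.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_round_robin(records: Sequence[T], n: int) -> List[List[T]]:
    """Split records into n lists whose sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Part i receives the records at positions i, i + n, i + 2n, ...
    return [list(records[i::n]) for i in range(n)]
```

With seven records and N equal to 3, this gives parts of sizes 3, 2, and 2.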

Note: once created, the database cannot be overwritten by the script; it can only be updated (unless you use the specific commands of Flata directly). If you want to start over, you must delete the written files and re-run the script.

Run the classifier

Each time, remember to first activate the conda environment created in the setup phase with conda activate iris-vpm-pages-classifier
Then run python classify.py -i data/database.json

Acknowledgements

The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).
