This project is part of the text analytics course at Heidelberg University. Its goal is to detect hate speech in social media posts using text analytics methods.
This repository contains all files of the project:
- The documentation is located in the docs folder; among other documents, it contains the project proposal and the project report.
- The project's source code is located in the src folder, the tests in the tests folder, and the code coverage report in the htmlcov folder.
- The assignments of the lecture are located in the assignments folder and are not directly connected to this project.
Team members:
- Christopher Klammt
- Felix Hausberger
- Nils Krehl
Setup and usage:

- Install Python 3.7.
- On Windows, install the Microsoft Build Tools for C++ (needed for the fastText installation).
- Install pipenv: `pip install pipenv`
- Install all dependencies defined in the Pipfile: `pipenv install --dev`
- Enter the pipenv virtual environment: `pipenv shell`
- Download and add the original datasets (Automated Hate Speech Detection and the Problem of Offensive Language; Hate speech dataset from a white supremacist forum); a short dataset-loading sketch follows this list. The resulting directory structure should look like the following:
- Run the program (on our machines this takes about 10 minutes): `pipenv run main`
- Run the tests and create the coverage report: `pipenv run test && pipenv run report`
- Leave the pipenv virtual environment: `exit`
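
As a quick sanity check that the datasets were added correctly, something like the sketch below can be run inside the pipenv shell. It is illustrative only: the file path and column names are assumptions based on the published Davidson et al. dataset (Automated Hate Speech Detection and the Problem of Offensive Language) and need to be adapted to the actual directory structure.

```python
# Minimal sketch: inspect the Davidson et al. hate speech dataset.
# The path "data/labeled_data.csv" and the column names are assumptions;
# adapt them to wherever the downloaded CSV actually lives in this repository.
import pandas as pd

df = pd.read_csv("data/labeled_data.csv")

print(df.shape)
# The published dataset labels each tweet with a "class" column:
# 0 = hate speech, 1 = offensive language, 2 = neither.
print(df["class"].value_counts())
print(df["tweet"].head())
```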
Normally, all needed dependencies are downloaded automatically. If this is not the case, try the following (a short verification sketch follows this list):

- `sudo pipenv run spacy download en` (Assignment 2)
- `sudo pipenv run nltk.downloader vader_lexicon`
- `sudo pipenv run nltk.downloader averaged_perceptron_tagger`
- Set up the git hook scripts: `pre-commit install`
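
To check whether the manually downloaded resources above are usable, a minimal smoke test like the following can be run inside the pipenv shell. It assumes the spaCy "en" shortcut model and the two NLTK resources named in the commands above; the spaCy model name may differ depending on the installed spaCy version.

```python
# Minimal sketch: verify that the manually downloaded NLP resources load.
import nltk
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nlp = spacy.load("en")  # installed via "spacy download en"; newer spaCy versions use "en_core_web_sm"
print([token.pos_ for token in nlp("This is a quick smoke test.")])

sia = SentimentIntensityAnalyzer()  # requires the vader_lexicon resource
print(sia.polarity_scores("I really dislike rainy days."))

print(nltk.pos_tag(["quick", "test"]))  # requires averaged_perceptron_tagger
```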
For running the assignments, further dependencies are needed:
- pdftotext (additional OS-level dependencies are needed) (Assignment 1); see the sketch below.
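
For illustration, the pdftotext package can be used roughly as follows. This is only a sketch of the package's API, not code from the assignments, and the file name is a placeholder.

```python
# Minimal sketch: extract text from a PDF with the pdftotext package.
import pdftotext

with open("report.pdf", "rb") as pdf_file:  # "report.pdf" is a placeholder
    pdf = pdftotext.PDF(pdf_file)

print(len(pdf))               # number of pages
print(pdf[0])                 # text of the first page
full_text = "\n\n".join(pdf)  # all pages concatenated
```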