This very basic data pipeline, built in Python, is part of the Timelark project. It reads unstructured text from text files, extracts named entities using spaCy, queries the Aleph API to enrich these entities, and saves the enriched data to an SQLite database, from where it can be visualized.
- Python 3.x
- spaCy and spaCy model (e.g., en_core_web_lg)
- `dataset` (SQLite wrapper)
- Aleph API access and API key (for example OCCRP's Aleph)
- Confection (for configuration management)
Clone this repository:
```shell
git clone https://github.com/jlstro/timelark-pipeline.git
cd timelark-pipeline
```
Create a virtual environment and install the required Python packages:
```shell
python3 -m venv venv
source venv/bin/activate
# On Windows: venv\Scripts\activate
python3 -m pip install spacy confection dataset
```
Download and install the spaCy model (e.g., "en_core_web_lg"):
```shell
python3 -m spacy download en_core_web_lg
```
- Create a configuration file named `config.cfg` in the root directory of the repository. Define the paths to your database, text files, and other configuration values as needed. Refer to the confection documentation for more information on writing the configuration.

Example `config.cfg`:
```ini
[paths]
db = "./db/data.db"
files = "./text_files"

[aleph]
host = "https://aleph.occrp.org"
collections = 25, 55, 90
```
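A configuration like this can be loaded with confection's `Config` class; a minimal sketch, with the example config inlined as a string purely for illustration (the pipeline would read it from disk with `Config().from_disk("config.cfg")`):

```python
from confection import Config

# The example configuration from above, inlined for illustration.
cfg_text = """
[paths]
db = "./db/data.db"
files = "./text_files"

[aleph]
host = "https://aleph.occrp.org"
collections = 25, 55, 90
"""

# Parse the config into a nested dict-like structure.
config = Config().from_str(cfg_text)
print(config["paths"]["files"])
print(config["aleph"]["host"])
```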
The pipeline script expects `.txt` files in the directory set under `files` in `config.cfg`. It reads each file, extracts the entities, enriches them, and stores them in the database.
Make sure you set your Aleph API key as an environment variable named `ALEPH_API_KEY`.
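Reading the key inside the pipeline might look like this (a sketch; the helper function name is hypothetical):

```python
import os

def get_api_key() -> str:
    """Return the Aleph API key from the environment, failing early if missing."""
    api_key = os.environ.get("ALEPH_API_KEY")
    if not api_key:
        raise RuntimeError("ALEPH_API_KEY is not set")
    return api_key
```

Failing early here is friendlier than letting the first Aleph request die with an authentication error halfway through a run.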
Run the main script to start the pipeline:
```shell
python3 main.py
```
The pipeline will read text files from the specified directory, extract entities, enrich them using the API, and save the enriched data to the SQLite database.
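For the enrichment step, here is a sketch of how a search request against Aleph could be constructed. The `/api/2/entities` endpoint and the `filter:collection_id` parameter follow the public Aleph API (requests are authenticated with an `Authorization: ApiKey <key>` header), but treat the details as assumptions and check the documentation of your Aleph instance:

```python
from urllib.parse import urlencode

def build_search_url(host: str, name: str, collections: list[int]) -> str:
    """Build an Aleph entity-search URL, restricted to the given collections.

    Assumes the standard /api/2/entities search endpoint; the actual
    request would also need an "Authorization: ApiKey <key>" header.
    """
    params = [("q", name)]
    params += [("filter:collection_id", str(c)) for c in collections]
    return f"{host}/api/2/entities?{urlencode(params)}"

url = build_search_url("https://aleph.occrp.org", "Jane Doe", [25, 55])
print(url)
```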
- Add support for events
- Add relationship extraction, for example using spacy-llm
- De-duplicate entities and add fuzzy matching
- Convert enriched entities into FollowTheMoney (FtM) entities
- Improve the extractor to handle other types of structured information from news articles, for example a person's death
- Add blacklist/whitelist support to define a clearer scope of which entities may be interesting for a given investigation