This very basic data pipeline, built in Python, is part of the Timelark project. It reads unstructured text from text files, extracts named entities using spaCy, queries the Aleph API to enrich these entities, and saves the enriched data to an SQLite database, from where it can be visualized.
- Python 3.x
- spaCy and spaCy model (e.g., en_core_web_lg)
- `dataset` (SQLite wrapper)
- Aleph API access and API key (for example OCCRP's Aleph)
- Confection (for configuration management)
Clone this repository:
```shell
git clone https://github.com/jlstro/timelark-pipeline.git
cd timelark-pipeline
```
Create a virtual environment and install the required Python packages:
```shell
python3 -m venv venv
source venv/bin/activate
# On Windows: venv\Scripts\activate
python3 -m pip install spacy confection dataset
```
Download and install the spaCy model (e.g., "en_core_web_lg"):
```shell
python3 -m spacy download en_core_web_lg
```
- Create a configuration file named `config.cfg` in the root directory of the repository. Define the paths to your database, text files, and other configuration values as needed. Refer to the confection documentation for more information on writing the configuration.

Example `config.cfg`:
```ini
[paths]
db = "./db/data.db"
files = "./text_files"

[aleph]
host = "https://aleph.occrp.org"
collections = 25, 55, 90
```
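A configuration like this can be loaded with confection's `Config` class; a minimal sketch, with the example config inlined as a string purely for illustration (the pipeline would read it from disk with `Config().from_disk("config.cfg")`):

```python
from confection import Config

# The example configuration from above, inlined for illustration.
cfg_text = """
[paths]
db = "./db/data.db"
files = "./text_files"

[aleph]
host = "https://aleph.occrp.org"
collections = 25, 55, 90
"""

# Parse the config into a nested dict-like structure.
config = Config().from_str(cfg_text)
print(config["paths"]["files"])
print(config["aleph"]["host"])
```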
The pipeline script expects `.txt` files in the directory set under `files` in `config.cfg`. It reads each file, extracts the entities, enriches them, and stores them in the database.
Make sure you set your Aleph API key as an environment variable named `ALEPH_API_KEY`.
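Reading the key inside the pipeline might look like this (a sketch; the helper function name is hypothetical):

```python
import os

def get_api_key() -> str:
    """Return the Aleph API key from the environment, failing early if missing."""
    api_key = os.environ.get("ALEPH_API_KEY")
    if not api_key:
        raise RuntimeError("ALEPH_API_KEY is not set")
    return api_key
```

Failing early here is friendlier than letting the first Aleph request die with an authentication error halfway through a run.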
Run the main script to start the pipeline:
```shell
python3 main.py
```
The pipeline will read text files from the specified directory, extract entities, enrich them using the API, and save the enriched data to the SQLite database.
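For the enrichment step, here is a sketch of how a search request against Aleph could be constructed. The `/api/2/entities` endpoint and the `filter:collection_id` parameter follow the public Aleph API (requests are authenticated with an `Authorization: ApiKey <key>` header), but treat the details as assumptions and check the documentation of your Aleph instance:

```python
from urllib.parse import urlencode

def build_search_url(host: str, name: str, collections: list[int]) -> str:
    """Build an Aleph entity-search URL, restricted to the given collections.

    Assumes the standard /api/2/entities search endpoint; the actual
    request would also need an "Authorization: ApiKey <key>" header.
    """
    params = [("q", name)]
    params += [("filter:collection_id", str(c)) for c in collections]
    return f"{host}/api/2/entities?{urlencode(params)}"

url = build_search_url("https://aleph.occrp.org", "Jane Doe", [25, 55])
print(url)
```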
- Add support for events
- Add relationship extraction, for example using spacy-llm
- De-duplicate entities and add fuzzy matching
- Convert enriched entities into FollowTheMoney (FtM) entities
- Improve the extractor to handle other types of structured information from news articles, for example a person's death
- Add blacklist/whitelist support to define a clearer scope of which entities may be interesting for a given investigation