Commit
Remove all data and reference preprocessing logic - it is moving to the ALLIUM PrePro repo
mariya committed Oct 14, 2024
1 parent 0fa0d3d commit 1f2ffde
Showing 15 changed files with 3 additions and 82,060 deletions.
Binary file removed .RData
Binary file not shown.
1 change: 0 additions & 1 deletion .Rprofile

This file was deleted.

9 changes: 0 additions & 9 deletions .gitignore
@@ -1,12 +1,3 @@
__pycache__
allium.egg-info
.DS_Store

# R data and history files
.RData
.Rhistory
.Rproj.user
*.Rproj

# Reference files
data/reference/*.gtf.gz
32 changes: 3 additions & 29 deletions README.md
@@ -13,21 +13,11 @@ Krali, O., Marincevic-Zuniga, Y., Arvidsson, G. et al. Multimodal classification
This repository contains:
- the ALLIUM models
- GEX and DNAm prediction clients
- GEX data preprocessing helpers
- metadata generation helpers (use only if changing reference genome versions)
- test data

## Pre-requisites
General:
- Python 3.8+
- Conda

For preprocessing GEX data or regenerating metadata:
- R 4.4.1 or later, and renv

You may need to install additional libraries depending on your operating system.

## Conda environment
[Conda](https://docs.conda.io) must be installed on your system.

You will need to activate the `allium` conda environment before running any subsequent commands.

Install: `conda env create -f environment.yml`
@@ -36,30 +26,14 @@ Activate: `conda activate allium`

Update (after changes to environment.yml): `conda env update --file environment.yml --prune`

## R Environment (for preprocessing data files only)
Start R from the project directory, then run: `renv::restore()`

## Prediction client
Run `python test_client.py` to run GEX and DNAm prediction on test datasets.

## Tests
Run `pytest`.

## Preprocessing GEX data
To prepare gene expression for prediction using ALLIUM, you will need a CSV file with raw gene transcript counts. The leftmost column should be HGNC gene symbols or Ensembl identifiers.

| | Sample_1 | Sample_2 | ... |
| --------| -------- | -------- | --- |
| ETV6 | 10 | 10 | ... |
| SARS1 | 20 | 10 | ... |
| DOC2B | 5 | 10 | ... |

This file will then need to undergo:
- conversion of gene identifiers to Ensembl ids used in the ALLIUM reference version
- batch identification and processing, if necessary
- normalization

TODO: Add example file to repository, and describe preprocessing script usage.
Preprocessing tools are available in the [ALLIUM PrePro](https://github.com/Molmed/allium_prepro) repository.

## Limitations
The models were trained using an older version of scikit-learn, due to some legacy dependency issues. This package, together with the Python version, should preferably be upgraded when retraining the model. Due to this, the current version of the prediction client does not work on Mac OS.