Remove all data and reference preprocessing logic - it is moving to the ALLIUM PrePro repo
Molmed · Oct 14, 2024 · 1f2ffde
1 parent 0fa0d3d
commit 1f2ffde
Showing 15 changed files with 3 additions and 82,060 deletions.
diff --git a/.RData b/.RData
diff --git a/.Rprofile b/.Rprofile
diff --git a/.gitignore b/.gitignore
@@ -1,12 +1,3 @@
 __pycache__
 allium.egg-info
 .DS_Store
-
-# R data and history files
-.RData
-.Rhistory
-.Rproj.user
-*.Rproj
-
-# Reference files
-data/reference/*.gtf.gz
diff --git a/README.md b/README.md
@@ -13,21 +13,11 @@ Krali, O., Marincevic-Zuniga, Y., Arvidsson, G. et al. Multimodal classification
 This repository contains:
 - the ALLIUM models
 - GEX and DNAm prediction clients
-- GEX data preprocessing helpers
-- metadata generation helpers (use only if changing reference genome versions)
 - test data
 
-## Pre-requisites
-General:
-- Python 3.8+
-- Conda
-
-For preprocessing GEX data or regenerating metadata:
-- R 4.4.1 or later, and renv
-
-You may need to install additional libraries depending on your operating system.
-
 ## Conda environment
+[Conda](https://docs.conda.io) must be installed on your system.
+
 You will need to activate the `allium` conda environment before running any subsequent commands.
 
 Install: `conda env create -f environment.yml`
@@ -36,30 +26,14 @@ Activate: `conda activate allium`
 
 Update (after changes to environment.yml): `conda env update --file environment.yml --prune`
 
-## R Environment (for preprocessing data files only)
-Start R from the project directory, then run: `renv::restore()`
-
 ## Prediction client
 Run `python test_client.py` to run GEX and DNAm prediction on test datasets.
 
 ## Tests
 Run `pytest`.
 
 ## Preprocessing GEX data
-To prepare gene expression for prediction using ALLIUM, you will need a CSV file with raw gene transcript counts. The leftmost column should be HGNC gene symbols or Ensembl identifiers.
-
-|         | Sample_1 | Sample_2 | ... |
-| --------| -------- | -------- | --- |
-| ETV6    | 10       | 10       | ... |
-| SARS1   | 20       | 10       | ... |
-| DOC2B   | 5        | 10       | ... |
-
-This file will then need to undergo:
-- conversion of gene identifiers to Ensembl ids used in the ALLIUM reference version
-- batch identification and processing, if necessary
-- normalization
-
-TODO: Add example file to repository, and describe preprocessing script usage.
+Preprocessing tools are available in the [ALLIUM PrePro](https://github.com/Molmed/allium_prepro) repository.
 
 ## Limitations
 The models were trained using an older version of scikit-learn, due to some legacy dependency issues. This package, together with the Python version, should preferably be upgraded when retraining the model. Due to this, the current version of the prediction client does not work on Mac OS.