v0.4.2 #68

Merged
merged 34 commits into from
Oct 22, 2024
34 commits
c313435
fix processing multiple queries in pos/neg sameby/diffby
alxndrkalinin Nov 14, 2023
5d9b2b3
merge df filter queries before applying
alxndrkalinin Nov 14, 2023
87be587
fix pval: use proportion of null above the _last_ entry of the statis…
alxndrkalinin Nov 21, 2023
1ac78f9
Merge pull request #48 from alxndrkalinin/multi_query
johnarevalo Nov 21, 2023
be2e343
pval: revert default to finding first occurrence of statistic value i…
alxndrkalinin Nov 22, 2023
900c807
fix typo in computing p vals
alxndrkalinin Nov 22, 2023
3955665
feat: add pairwise euclidean distance
alxndrkalinin Feb 13, 2024
45239f7
fix: return ids for rank lists needed for multilabel mode
alxndrkalinin Feb 13, 2024
1f6e4f0
fix processing multiple queries in pos/neg sameby/diffby
alxndrkalinin Nov 14, 2023
9bbc872
merge df filter queries before applying
alxndrkalinin Nov 14, 2023
e8031fc
feat: add pairwise euclidean distance
alxndrkalinin Feb 13, 2024
b518234
fix: return ids for rank lists needed for multilabel mode
alxndrkalinin Feb 13, 2024
8943e9c
Merge branch 'cytomining-main' into v0.4.0
alxndrkalinin Feb 23, 2024
a0dc0e1
merge v0.4.0
alxndrkalinin Feb 23, 2024
7d9c380
Merge branch 'cytomining:main' into v0.4.0
alxndrkalinin May 17, 2024
d5ad0a9
refactor processing multiple queries for filtering the df
alxndrkalinin May 17, 2024
94f004e
add tests for query filtering
alxndrkalinin May 17, 2024
f64449b
add python 3.11-12 to github actions
alxndrkalinin May 20, 2024
51ad560
add phenotypic activity example
alxndrkalinin May 21, 2024
4e0e96f
add phenotypic consistency to example
alxndrkalinin Jul 1, 2024
769c222
refactor(example): clean up, better descriptions and variable naming
alxndrkalinin Jul 2, 2024
0cb843f
Merge branch 'cytomining:main' into v0.4.2
alxndrkalinin Jul 2, 2024
69ca584
add citation, sys recs & dependencies to readme
alxndrkalinin Jul 3, 2024
b5c5308
fix typos
alxndrkalinin Jul 10, 2024
2673e55
bump version to 0.4.2
alxndrkalinin Sep 6, 2024
fac5985
allow distance fn selection; add euclidean, abs_cosine
alxndrkalinin Sep 17, 2024
4a07fac
chore: raise min python version to 3.9; update author list
alxndrkalinin Oct 9, 2024
55a5fa4
chore: remove python 3.8 from github actions
alxndrkalinin Oct 9, 2024
dd6ea96
add manhattan & chebyshev distances
alxndrkalinin Oct 16, 2024
c13e66e
move matching demo to example notebook, update readme
alxndrkalinin Oct 16, 2024
0eb53b3
Support python 3.8
johnarevalo Oct 22, 2024
c3fd0c6
Format using ruff
johnarevalo Oct 22, 2024
4206e57
Add ruff workflow. update README
johnarevalo Oct 22, 2024
7d47818
Remove flake. Add ruff format check. Format nb
johnarevalo Oct 22, 2024
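
The commits above add euclidean, abs_cosine, manhattan, and chebyshev options for comparing profiles. As a reading aid for reviewers, here is a minimal NumPy sketch of what such row-wise measures compute. It is not the code from this PR: the function name `pairwise_score` and its `metric` argument are hypothetical, and treating `abs_cosine` as the absolute value of cosine similarity is an assumption based only on the commit title.

```python
import numpy as np


def pairwise_score(x: np.ndarray, y: np.ndarray, metric: str = "euclidean") -> np.ndarray:
    """Compare matching rows of x and y (both shaped [n_pairs, n_features]).

    Hypothetical helper for illustration; not part of the copairs API.
    """
    diff = x - y
    if metric == "euclidean":
        # Square root of the summed squared differences per row.
        return np.sqrt((diff**2).sum(axis=1))
    if metric == "manhattan":
        # Sum of absolute differences per row.
        return np.abs(diff).sum(axis=1)
    if metric == "chebyshev":
        # Largest absolute difference per row.
        return np.abs(diff).max(axis=1)
    if metric == "abs_cosine":
        # Assumption: absolute value of the cosine similarity per row.
        norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
        return np.abs((x * y).sum(axis=1) / norms)
    raise ValueError(f"unknown metric: {metric}")


x = np.array([[1.0, 0.0], [1.0, 0.0]])
y = np.array([[4.0, 4.0], [-2.0, 0.0]])
for name in ("euclidean", "manhattan", "chebyshev", "abs_cosine"):
    print(name, pairwise_score(x, y, name))
```

Note that the first three measures are distances (smaller means more similar), while the cosine-based one in this sketch is a similarity (larger means more similar), so a caller that ranks profiles has to sort in the matching direction.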
15 changes: 3 additions & 12 deletions .github/workflows/python-package.yml
@@ -16,7 +16,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

     steps:
     - uses: actions/checkout@v3
@@ -26,17 +26,8 @@ jobs:
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip build
-        python -m pip install flake8 pytest
-        python -m build
-        pip install -e .
-    - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+        python -m pip install --upgrade pip
+        pip install -e .[test]
     - name: Test with pytest
       run: |
-        python -m pip install scikit-learn
         pytest
11 changes: 11 additions & 0 deletions .github/workflows/ruff.yml
@@ -0,0 +1,11 @@
+name: Ruff
+on: [push, pull_request]
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/ruff-action@v1
+      - uses: astral-sh/ruff-action@v1
+        with:
+          args: "format --check"
137 changes: 39 additions & 98 deletions README.md
@@ -1,119 +1,60 @@
 # copairs

-Find pairs and compute metrics between them.
+`copairs` is a Python package for finding groups of profiles based on metadata and calculating mean Average Precision to assess intra- vs inter-group similarities.

-## Installation
+## Getting started

-```bash
-pip install git+https://github.com/cytomining/[email protected]
-```

-## Usage
+### System requirements
+copairs supports Python 3.8+ and should work with all modern operating systems (tested with MacOS 13.5, Ubuntu 18.04, Windows 10).

-### Data
+### Dependencies
+copairs depends on widely used Python packages:
+* numpy
+* pandas
+* tqdm
+* statsmodels
+* [optional] plotly

-Say you have a dataset with 20 samples taken in 3 plates `p1, p2, p3`,
-each plate is composed of 5 wells `w1, w2, w3, w4, w5`, and each well
-has one or more labels (`t1, t2, t3, t4`) assigned.
+### Installation

-```python
-import pandas as pd
-import random
-
-random.seed(0)
-n_samples = 20
-dframe = pd.DataFrame({
-    'plate': [random.choice(['p1', 'p2', 'p3']) for _ in range(n_samples)],
-    'well': [random.choice(['w1', 'w2', 'w3', 'w4', 'w5']) for _ in range(n_samples)],
-    'label': [random.choice(['t1', 't2', 't3', 't4']) for _ in range(n_samples)]
-})
-dframe = dframe.drop_duplicates()
-dframe = dframe.sort_values(by=['plate', 'well', 'label'])
-dframe = dframe.reset_index(drop=True)
+To install copairs and dependencies, run:
+```bash
+pip install copairs
 ```

-| | plate | well | label |
-|---:|:--------|:-------|:--------|
-| 0 | p1 | w2 | t4 |
-| 1 | p1 | w3 | t2 |
-| 2 | p1 | w3 | t4 |
-| 3 | p1 | w4 | t1 |
-| 4 | p1 | w4 | t3 |
-| 5 | p2 | w1 | t1 |
-| 6 | p2 | w2 | t1 |
-| 7 | p2 | w3 | t1 |
-| 8 | p2 | w3 | t2 |
-| 9 | p2 | w3 | t3 |
-| 10 | p2 | w4 | t2 |
-| 11 | p2 | w5 | t1 |
-| 12 | p2 | w5 | t3 |
-| 13 | p3 | w1 | t3 |
-| 14 | p3 | w1 | t4 |
-| 15 | p3 | w4 | t2 |
-| 16 | p3 | w5 | t2 |
-| 17 | p3 | w5 | t4 |
-
-### Getting valid pairs
-
-To get pairs of samples that share the same `label` but come from different
-`plate`s at different `well` positions:
-
-```python
-from copairs import Matcher
-matcher = Matcher(dframe, ['plate', 'well', 'label'], seed=0)
-pairs_dict = matcher.get_all_pairs(sameby=['label'], diffby=['plate', 'well'])
+To also install dependencies for running examples, run:
+```bash
+pip install copairs[demo]
 ```

-`pairs_dict` is a `label_id: pairs` dictionary containing the list of valid
-pairs for every unique value of `labels`
+### Testing

-```
-{'t4': [(0, 17), (0, 14), (17, 2), (2, 14)],
- 't2': [(1, 16), (1, 10), (1, 15), (8, 16), (8, 15), (10, 16)],
- 't1': [(3, 11), (3, 5), (3, 6), (3, 7)],
- 't3': [(9, 4), (9, 13), (13, 4), (13, 12), (4, 12)]}
+To run tests, run:
+```bash
+pip install -e .[test]
+pytest
 ```

-### Getting valid pairs from a multilabel column
-
-For efficiency reasons, you may not want to have duplicated rows. You can
-group all the labels in a single row and use `MatcherMultilabel` to find the
-corresponding pairs:
+## Usage

-```python
-dframe_multi = dframe.groupby(['plate', 'well'])['label'].unique().reset_index()
-```
+We provide examples demonstrating how to use copairs for:
+- [grouping profiles based on their metadata](./examples/finding_pairs.ipynb)
+- [calculating mAP to assess phenotypic activity and consistency of perturbation using real data](./examples/mAP_demo.ipynb)

-| | plate | well | label |
-|---:|:--------|:-------|:-------------------|
-| 0 | p1 | w2 | ['t4'] |
-| 1 | p1 | w3 | ['t2', 't4'] |
-| 2 | p1 | w4 | ['t1', 't3'] |
-| 3 | p2 | w1 | ['t1'] |
-| 4 | p2 | w2 | ['t1'] |
-| 5 | p2 | w3 | ['t1', 't2', 't3'] |
-| 6 | p2 | w4 | ['t2'] |
-| 7 | p2 | w5 | ['t1', 't3'] |
-| 8 | p3 | w1 | ['t3', 't4'] |
-| 9 | p3 | w4 | ['t2'] |
-| 10 | p3 | w5 | ['t2', 't4'] |
-
-```python
-from copairs import MatcherMultilabel
-matcher_multi = MatcherMultilabel(dframe_multi,
-                                  columns=['plate', 'well', 'label'],
-                                  multilabel_col='label',
-                                  seed=0)
-pairs_multi = matcher_multi.get_all_pairs(sameby=['label'],
-                                          diffby=['plate', 'well'])
-```
+## Citation
+If you find this work useful for your research, please cite our [pre-print](https://doi.org/10.1101/2024.04.01.587631):

-`pairs_multi` is also a `label_id: pairs` dictionary with the same
-structure discussed before:
+Kalinin, A.A., Arevalo, J., Vulliard, L., Serrano, E., Tsang, H., Bornholdt, M., Rajwa, B., Carpenter, A.E., Way, G.P. and Singh, S., 2024. A versatile information retrieval framework for evaluating profile strength and similarity. bioRxiv, pp.2024-04. doi:10.1101/2024.04.01.587631

+BibTeX:
 ```
-{'t4': [(0, 10), (0, 8), (10, 1), (1, 8)],
- 't2': [(1, 10), (1, 6), (1, 9), (5, 10), (5, 9), (6, 10)],
- 't1': [(2, 7), (2, 3), (2, 4), (2, 5)],
- 't3': [(5, 2), (5, 8), (8, 2), (8, 7), (2, 7)]}
+@article{kalinin2024versatile,
+  title={A versatile information retrieval framework for evaluating profile strength and similarity},
+  author={Kalinin, Alexandr A and Arevalo, John and Vulliard, Loan and Serrano, Erik and Tsang, Hillary and Bornholdt, Michael and Rajwa, Bartek and Carpenter, Anne E and Way, Gregory P and Singh, Shantanu},
+  journal={bioRxiv},
+  pages={2024--04},
+  year={2024},
+  doi={10.1101/2024.04.01.587631}
+}
 ```
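
The updated README describes copairs as calculating mean Average Precision (mAP) to compare intra- and inter-group similarity. For readers new to the metric, the following is a minimal NumPy sketch of the textbook definition. It is an illustration only and does not reproduce the copairs API: `average_precision`, `mean_average_precision`, and the toy `queries` data are hypothetical names and values.

```python
import numpy as np


def average_precision(similarity, is_positive) -> float:
    """AP for one query profile.

    similarity:  similarity of the query to each candidate profile.
    is_positive: True where the candidate belongs to the query's group.
    """
    similarity = np.asarray(similarity, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    # Rank candidates from most to least similar.
    order = np.argsort(-similarity, kind="stable")
    relevance = is_positive[order].astype(float)
    n_pos = relevance.sum()
    if n_pos == 0:
        return float("nan")  # AP is undefined without positives
    # Precision at each rank, averaged over the ranks that hold a positive.
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float((precision_at_k * relevance).sum() / n_pos)


def mean_average_precision(queries) -> float:
    """Mean of per-query AP values; queries is a list of (similarity, is_positive)."""
    return float(np.mean([average_precision(s, p) for s, p in queries]))


queries = [
    ([0.9, 0.8, 0.7, 0.1], [True, False, True, False]),
    ([0.2, 0.9], [True, False]),
]
print(mean_average_precision(queries))
```

In this toy data the first query scores 5/6 (positives at ranks 1 and 3) and the second scores 1/2 (its only positive is ranked second), so the mean is 2/3.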