Metagenomics Exchange upload command and related bookkeeping code #350

Merged: 49 commits (Feb 29, 2024)
Commits (49 total; changes shown from 39 commits)
0a8a761
adds incremental ebi search dumping for analysisjobs
mberacochea Jan 29, 2024
4d51fe3
integrates indexable sample data into analyses ebi search dump
mberacochea Jan 29, 2024
240d8f4
adds ebi search dump for studies/projects (plus some tweaks to analyses)
mberacochea Jan 29, 2024
ed97ccb
resolve migration merge conflicts
SandyRogers Nov 10, 2023
ab829f4
adds support for chunking of analysis-runs ebi search dump
mberacochea Jan 29, 2024
fcb9300
bugfixes on analysis-run dumper
mberacochea Jan 29, 2024
6438c3b
Add metagenomics exchange support
KateSakharova Sep 11, 2023
64f4900
Added 2 tests for ME, population script WIP
mberacochea Jan 29, 2024
7484116
Move metagenomic exchange commands to separate Class. Tests fail in P…
KateSakharova Sep 28, 2023
fee19bd
My revised version:
mberacochea Sep 29, 2023
e7d9b17
Modified ME dj function to add new ME records, fixed tests for ME and…
KateSakharova Oct 20, 2023
2e0b019
Move test for mock API to test_mgx_api.py
KateSakharova Oct 23, 2023
c92c381
Add settings
KateSakharova Oct 23, 2023
bdac39c
change models and add patch and delete to ME
mberacochea Jan 29, 2024
65e64e0
Add delete and patch commands with tests to ME API
KateSakharova Oct 24, 2023
3c664e6
add population scripts and tests for dry-run
mberacochea Jan 29, 2024
f082415
add population scripts and tests for dry-run
KateSakharova Nov 17, 2023
bb1ce81
change api adress
KateSakharova Nov 20, 2023
0f309c4
change api adress
KateSakharova Nov 20, 2023
83b6b3a
WIP
KateSakharova Nov 21, 2023
97e226e
Minor refactor of the MGX indexation process WIP
mberacochea Jan 29, 2024
e229f5c
This is still a WIP - MGX
mberacochea Jan 18, 2024
a6e9eda
Fixed the migrations for EBI Search and MGX.
mberacochea Jan 18, 2024
2f83224
Fix some ME tests
KateSakharova Jan 29, 2024
5961f7d
WIP - migration tidy up
mberacochea Jan 29, 2024
02d3943
Merge branch 'feature/metagenomics_exchange' of github.com:EBI-Metage…
mberacochea Jan 29, 2024
60d2a37
Adjust field names to new convention (last_ebi_search_indexed)
mberacochea Jan 29, 2024
ad539e7
WIP testing
KateSakharova Jan 30, 2024
986f222
Tests for populate ME
KateSakharova Jan 31, 2024
916ce45
rm dev from tests and leave only DEV API
KateSakharova Jan 31, 2024
948caa8
fix pipeline mock
KateSakharova Jan 31, 2024
6a55c40
fixes
KateSakharova Feb 1, 2024
18d4641
add mock to patch test
KateSakharova Feb 1, 2024
bfb1b54
some fixes after review
KateSakharova Feb 6, 2024
7dbbb14
Rename EBI Seach last_indexed fields
mberacochea Feb 7, 2024
4a8f3b4
Add an extra migrate step to rename last_indexed -> last_ebi_search_i…
mberacochea Feb 7, 2024
633b3f7
Merge pull request #325 from EBI-Metagenomics/feature/metagenomics_ex…
mberacochea Feb 8, 2024
f3daefe
Fixes on the MGX based on feedback
mberacochea Feb 12, 2024
f0a72ff
Bump version
mberacochea Feb 12, 2024
b2d60bd
A few bug fixes for the metagenomics exchange command and api wrapper.
mberacochea Feb 16, 2024
e3e9cfb
Fix typos in method args
mberacochea Feb 16, 2024
afa3f18
Typo sequence_accesion -> sequence_accession
mberacochea Feb 16, 2024
c3ba441
Send the sequence_accession, not a Run object
mberacochea Feb 16, 2024
4b145c0
Add a bit of logging when posting a record to MGX
mberacochea Feb 16, 2024
6235f03
Capture the API error data response, if present in the response.
mberacochea Feb 16, 2024
fe6d2b8
Mock the Metagenomics Exchange API
mberacochea Feb 16, 2024
783f275
Fix mgx_api check_analysis invocation typo
mberacochea Feb 19, 2024
1b031df
Mark piece of code to revisit and improve
mberacochea Feb 19, 2024
c98171c
Merge pull request #351 from EBI-Metagenomics/bugfix/metagenomics-exc…
mberacochea Feb 19, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -41,4 +41,4 @@ dumps

/config/*.yml
/config/*.yaml
!/config/*local*
!/config/*local*
4 changes: 3 additions & 1 deletion ci/configuration.yaml
@@ -17,4 +17,6 @@ emg:
results_path: 'fixtures/'
celery_broker: 'redis://localhost:6379/0'
celery_backend: 'redis://localhost:6379/1'
results_production_dir: '/dummy/path/results'
results_production_dir: '/dummy/path/results'
# metagenomics exchange
me_api: 'https://wwwdev.ebi.ac.uk/ena/registry/metagenome/api'
4 changes: 3 additions & 1 deletion config/local-tests.yml
@@ -14,4 +14,6 @@ emg:
results_path: 'fixtures/'
celery_broker: 'redis://localhost:6379/0'
celery_backend: 'redis://localhost:6379/1'
results_production_dir: '/dummy/path/results'
results_production_dir: '/dummy/path/results'
# metagenomics exchange
me_api: 'https://wwwdev.ebi.ac.uk/ena/registry/metagenome/api'
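Both the CI and local-test configs gain the same `me_api` key under the `emg:` mapping. A minimal sketch of how such a key could be surfaced with a safe fallback to the dev endpoint; `load_me_api` and `DEFAULT_ME_API` are illustrative names, not the project's actual settings loader:

```python
# Hypothetical sketch: read the new `me_api` YAML key with a fallback.
DEFAULT_ME_API = "https://wwwdev.ebi.ac.uk/ena/registry/metagenome/api"


def load_me_api(config: dict) -> str:
    """Return the Metagenomics Exchange base URL from an `emg:` config
    mapping, falling back to the dev endpoint when the key is absent."""
    return config.get("emg", {}).get("me_api", DEFAULT_ME_API)


assert load_me_api({}) == DEFAULT_ME_API
assert load_me_api({"emg": {"me_api": "https://example.org/api"}}) == "https://example.org/api"
```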
4 changes: 2 additions & 2 deletions emgapi/management/commands/ebi_search_analysis_dump.py
@@ -211,6 +211,6 @@ def handle(self, *args, **options):
# Small buffer into the future so that the indexing time remains ahead of auto-now updated times.

for analysis in page:
analysis.last_indexed = nowish
analysis.last_ebi_search_indexed = nowish

AnalysisJob.objects.bulk_update(page, fields=["last_indexed"])
AnalysisJob.objects.bulk_update(page, fields=["last_ebi_search_indexed"])
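The rename leaves the "nowish" trick from the diff comment intact: the indexing timestamp is stamped slightly in the future so it stays ahead of `auto_now` fields written during the same update cycle. A standalone sketch of that idea; the one-minute buffer mirrors the `timedelta(minutes=1)` used elsewhere in this PR but is otherwise illustrative:

```python
from datetime import datetime, timedelta, timezone


def nowish(buffer_minutes: int = 1) -> datetime:
    """Timestamp slightly in the future, so a 'last indexed' field stays
    ahead of auto_now timestamps touched in the same save cycle."""
    return datetime.now(timezone.utc) + timedelta(minutes=buffer_minutes)


stamp = nowish()
assert stamp > datetime.now(timezone.utc)
```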
4 changes: 2 additions & 2 deletions emgapi/management/commands/ebi_search_study_dump.py
@@ -103,6 +103,6 @@ def handle(self, *args, **options):
# Small buffer into the future so that the indexing time remains ahead of auto-now updated times.

for study in studies:
study.last_indexed = nowish
study.last_ebi_search_indexed = nowish

Study.objects.bulk_update(studies, fields=["last_indexed"])
Study.objects.bulk_update(studies, fields=["last_ebi_search_indexed"])
196 changes: 196 additions & 0 deletions emgapi/management/commands/populate_metagenomics_exchange.py
@@ -0,0 +1,196 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Copyright 2017-2024 EMBL - European Bioinformatics Institute
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

from django.conf import settings
from django.core.management import BaseCommand
from django.utils import timezone
from django.core.paginator import Paginator
from datetime import timedelta

from emgapi.models import AnalysisJob
from emgapi.metagenomics_exchange import MetagenomicsExchangeAPI

logger = logging.getLogger(__name__)

RETRY_COUNT = 5


class Command(BaseCommand):
help = "Check and populate metagenomics exchange (ME)."

def add_arguments(self, parser):
super(Command, self).add_arguments(parser)
parser.add_argument(
"-s",
"--study",
required=False,
type=str,
help="Study accession list (rather than all)",
nargs="+",
)
parser.add_argument(
"-p",
"--pipeline",
help="Pipeline version (rather than all). Not applicable to Genomes.",
action="store",
dest="pipeline",
choices=[1.0, 2.0, 3.0, 4.0, 4.1, 5.0],
required=False,
type=float,
)
parser.add_argument(
"--dry-run",
action="store_true",
required=False,
help="Dry mode, no population of ME",
)
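The three options above can be exercised outside Django with a plain `argparse` sketch using the same flags and choices as the command; the accession and values below are made up for illustration:

```python
import argparse

# Mirror of the management command's options, as standalone argparse.
parser = argparse.ArgumentParser(
    description="Check and populate metagenomics exchange (ME)."
)
parser.add_argument("-s", "--study", nargs="+", type=str, required=False)
parser.add_argument(
    "-p", "--pipeline", type=float,
    choices=[1.0, 2.0, 3.0, 4.0, 4.1, 5.0], required=False,
)
parser.add_argument("--dry-run", action="store_true")

opts = parser.parse_args(["-s", "ERP000001", "-p", "5.0", "--dry-run"])
assert opts.study == ["ERP000001"]
assert opts.pipeline == 5.0
assert opts.dry_run is True
```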

def handle(self, *args, **options):
self.study_accession = options.get("study")
self.dry_run = options.get("dry_run")
self.pipeline_version = options.get("pipeline")

self.mgx_api = MetagenomicsExchangeAPI(base_url=settings.METAGENOMICS_EXCHANGE_API)

# never indexed or updated after indexed
analyses_to_index_and_update = AnalysisJob.objects_for_mgx_indexing.to_add()
# suppressed only
analyses_to_delete = AnalysisJob.objects_for_mgx_indexing.to_delete()

if self.study_accession:
analyses_to_index_and_update = analyses_to_index_and_update.filter(
study__secondary_accession__in=self.study_accession
)
analyses_to_delete = analyses_to_delete.filter(
study__secondary_accession__in=self.study_accession
)

if self.pipeline_version:
analyses_to_index_and_update = analyses_to_index_and_update.filter(
pipeline__release_version=self.pipeline_version
)
analyses_to_delete = analyses_to_delete.filter(
pipeline__release_version=self.pipeline_version
)

self.process_to_index_and_update_records(analyses_to_index_and_update)
self.process_to_delete_records(analyses_to_delete)

logging.info("Done")

def process_to_index_and_update_records(self, analyses_to_index_and_update):
logging.info(f"Indexing {len(analyses_to_index_and_update)} new analyses")

for page in Paginator(analyses_to_index_and_update, settings.METAGENOMICS_EXCHANGE_PAGINATOR_NUMBER):
jobs_to_update = []
for ajob in page:
metadata = self.mgx_api.generate_metadata(
mgya=ajob.accession,
run_accession=ajob.run,
status="public" if not ajob.is_private else "private",
)
registry_id, metadata_match = self.mgx_api.check_analysis(
source_id=ajob.accession, sequence_id=ajob.run, metadata=metadata
)
# The job is not registered
if not registry_id:
logging.info(f"Add new {ajob}")
if self.dry_run:
logging.info(f"Dry-mode run: no addition to real ME for {ajob}")
continue

response = self.mgx_api.add_analysis(
mgya=ajob.accession,
run_accession=ajob.run,
public=not ajob.is_private,
)
if response.ok:
logging.info(f"Successfully added {ajob}")
registry_id, metadata_match = self.mgx_api.check_analysis(
source_id=ajob.accession, sequence_id=ajob.run)
ajob.mgx_accession = registry_id
ajob.last_mgx_indexed = timezone.now() + timedelta(minutes=1)
jobs_to_update.append(ajob)
else:
logging.error(f"Error adding {ajob}: {response.message}")

# else we have to check if the metadata matches, if not we need to update it
else:
if not metadata_match:
logging.info(f"Patch existing {ajob}")
if self.dry_run:
logging.info(
f"Dry-mode run: no patch to real ME for {ajob}"
)
continue
if self.mgx_api.patch_analysis(
registry_id=registry_id, data=metadata
):
logging.info(f"Analysis {ajob} updated successfully")
# Just to be safe, update the MGX accession
ajob.mgx_accession = registry_id
ajob.last_mgx_indexed = timezone.now() + timedelta(minutes=1)
jobs_to_update.append(ajob)
else:
logging.error(f"Analysis {ajob} update failed")
else:
logging.debug(f"No edit for {ajob}, metadata is correct")

AnalysisJob.objects.bulk_update(
jobs_to_update, ["last_mgx_indexed", "mgx_accession"], batch_size=settings.METAGENOMICS_EXCHANGE_PAGINATOR_NUMBER
)
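The branching in the method above reduces to three outcomes per job. A pure-function distillation of that decision (the function name and registry accession are illustrative, not part of the PR):

```python
def next_action(registry_id, metadata_match: bool) -> str:
    """Distil the per-job branching: unregistered jobs are added,
    registered jobs with stale metadata are patched, matches are skipped."""
    if not registry_id:
        return "add"
    return "skip" if metadata_match else "patch"


assert next_action(None, False) == "add"
assert next_action("MGX0000001", False) == "patch"
assert next_action("MGX0000001", True) == "skip"
```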

def process_to_delete_records(self, analyses_to_delete):
"""
This function removes suppressed records from ME.
"""
logging.info(f"Processing {len(analyses_to_delete)} analyses to remove")

for page in Paginator(analyses_to_delete, settings.METAGENOMICS_EXCHANGE_PAGINATOR_NUMBER):
jobs_to_update = []

for ajob in page:
metadata = self.mgx_api.generate_metadata(
mgya=ajob.accession,
run_accession=ajob.run,
status="public" if not ajob.is_private else "private",
)
registry_id, _ = self.mgx_api.check_analysis(
source_id=ajob.accession, sequence_id=ajob.run, metadata=metadata
)
if registry_id:
logging.info(f"Deleting {ajob}")
if self.dry_run:
logging.info(f"Dry-mode run: no delete from real ME for {ajob}")
continue

if self.mgx_api.delete_analysis(registry_id):
logging.info(f"{ajob} successfully deleted")
ajob.last_mgx_indexed = timezone.now()
jobs_to_update.append(ajob)
else:
logging.info(f"{ajob} failed on delete")
else:
logging.info(
f"{ajob} doesn't exist in the registry, nothing to delete"
)

# BULK UPDATE #
AnalysisJob.objects.bulk_update(
jobs_to_update, ["last_mgx_indexed"], batch_size=settings.METAGENOMICS_EXCHANGE_PAGINATOR_NUMBER
)
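Since commit fe6d2b8 mocks the Metagenomics Exchange API in tests, a dry-run check for the delete path can be sketched with `unittest.mock`; the accessions and call shapes here are illustrative, not the repository's actual tests:

```python
from unittest.mock import MagicMock

# Illustrative dry-run assertion: only the read-only lookup may be called.
api = MagicMock()
api.check_analysis.return_value = ("MGX0000001", True)

dry_run = True
registry_id, _ = api.check_analysis(
    source_id="MGYA00000001", sequence_id="ERR0000001"
)
if registry_id and not dry_run:
    api.delete_analysis(registry_id)

api.check_analysis.assert_called_once()
api.delete_analysis.assert_not_called()
```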