Feature/ebi search dump migration #332
Conversation
This is a WIP, but the analysis part should be nearly done. I've moved the logic for "Runs" of the EBI Search Dump:

- Code that reads the analysis data from the DB and flat files -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/input/metagenomicsDB.py#L191. I've replaced the file-reading bits with calls to Mongo / MySQL, which makes the code independent of the filesystem.
- Code that generates the XML output file for **one** analysis: https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/output/run.py

We still need to port the Samples and Studies, and the analysis aggregation. The aggregation is required to merge the analyses into XML bundles that EBI Search ingests. AnalysesAggregation script -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/RunEntryAggregator.py

Another task: add an `indexed_at` field on Analyses, Samples and Studies. This will be useful for incremental dump generation, instead of having to re-generate all the files. Moving to this will also require modifying the AnalysesAggregation script, OR moving to EBI Search incremental dumps (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/Preparation+for+incremental+indexing).

Last one: EBI Search also supports JSON (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/JSON+data+format).
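To make the incremental idea concrete, here is a minimal sketch of the selection rule an incremental dump could use, assuming an `indexed_at` timestamp stored alongside a `last_updated` timestamp (the function name and fields are illustrative, not the actual emgapi code):

```python
from datetime import datetime, timedelta, timezone

def needs_reindex(last_updated, indexed_at):
    """An object needs (re-)indexing if it has never been dumped,
    or if it changed after the last dump picked it up."""
    return indexed_at is None or last_updated > indexed_at

now = datetime.now(timezone.utc)
print(needs_reindex(now, None))                         # never indexed -> True
print(needs_reindex(now, now - timedelta(days=1)))      # changed since last dump -> True
print(needs_reindex(now - timedelta(days=1), now))      # unchanged -> False
```

In the real dump this comparison would be a queryset filter, so only the changed objects are serialised.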
# Conflicts:
#	emgcli/__init__.py
#	pyproject.toml

# Conflicts:
#	emgapi/models.py
#	emgcli/__init__.py
#	pyproject.toml

# Conflicts:
#	emgcli/__init__.py
#	pyproject.toml
Great stuff @SandyRogers
```python
sample_metadata = {}
for sample_metadata_entry in analysis.sample.metadata.all():
    if (vn := sample_metadata_entry.var.var_name) in sample_annotations_to_index:
```
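To illustrate the walrus-operator pattern in the snippet above, here's a self-contained sketch using plain dicts in place of the real Django model objects (the entry data and annotation names are made up):

```python
# Annotations we want to carry into the search index (illustrative values)
sample_annotations_to_index = {"geographic location", "temperature"}

entries = [
    {"var_name": "geographic location", "value": "Atlantic Ocean"},
    {"var_name": "collection date", "value": "2020-01-01"},
]

sample_metadata = {}
for entry in entries:
    # The walrus operator binds var_name and tests membership in one expression
    if (vn := entry["var_name"]) in sample_annotations_to_index:
        sample_metadata[vn] = entry["value"]

print(sample_metadata)  # {'geographic location': 'Atlantic Ocean'}
```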
Quality code review
```python
if page.number > mp:
    logger.warning("Skipping remaining pages")
    break
logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}")
```
Typo?
logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}") | |
logger.info(f"Dumping {page.number}/{paginated_analyses.num_pages}") |
This one wasn't actually a typo; it's the shorthand f-string syntax (the `=` specifier), typically useful in logs to print variable names along with their values, i.e. it outputs:

Dumping page.number = 1/999
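A quick self-contained demonstration of the `=` specifier (variable names here are illustrative):

```python
page_number = 1
num_pages = 999

# The "=" inside the braces makes the f-string print the expression itself,
# including any surrounding spaces, followed by its repr.
msg = f"Dumping {page_number = }/{num_pages}"
print(msg)  # Dumping page_number = 1/999
```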
This PR: makes the `last_updated` fields auto-now, so that if an object changes, it'll be included in the next incremental EBI Search indexing.
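In Django terms, "auto-now" means passing `auto_now=True` to the field, which refreshes the timestamp on every `save()`. A hedged sketch of what such a model could look like (this is illustrative, not the actual emgapi model definition):

```python
from django.db import models

class Analysis(models.Model):
    # Refreshed automatically on every save(), so any change bumps the timestamp
    last_updated = models.DateTimeField(auto_now=True)
    # Set by the dump process; null means "never indexed yet"
    indexed_at = models.DateTimeField(null=True, blank=True)
```

An incremental dump could then select `Analysis` objects whose `last_updated` is newer than their `indexed_at`.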