Improve `preprocess_only` implementation #108

nictru · 2024-10-27T12:45:24Z

Description of feature

The goal is to create full-blown merged AnnData objects for each input sample if `preprocess_only` is active.


albertulll · 2024-10-28T10:09:10Z

Some people only want to run the preprocessing step, which was not easily possible initially.
Some code was added to allow for that, but the output files still need processing and aggregation.
FINALIZE merges together the output of the entire pipeline. We want a version for the preprocessing workflow.
Currently, FINALIZE is only run when `preprocess_only` is false. So preprocessing returns data in multiple categories for all samples. What we want is one file per sample with all the data related to it.
After L53, add another process which uses Python to merge the files. Collect the h5ad files from ch_h5ad.
The rest applies when `preprocess_only` is false, or there is no input or MultiQC; nothing more is important.
explore the current implementation
look over FINALIZE function
ran nf locally
analyse the output of the full pipeline or partial

    //
    // Group h5ad files by sample and merge them
    //
    // Key each h5ad file by sample_id, then group the files per sample
    ch_sample_h5ad_files = ch_h5ad
        .map { meta, h5ad_file -> [meta.sample_id, h5ad_file] }
        .groupTuple() // emits [sample_id, [h5ad_file, ...]]

    // Merge h5ad files per sample
    MERGE_H5AD_PER_SAMPLE(ch_sample_h5ad_files)

    //
    // Collect merged h5ad files
    //
    ch_merged_h5ad_per_sample = MERGE_H5AD_PER_SAMPLE.out

    //
    // Emit outputs
    //
    emit:
        merged_h5ad_per_sample = ch_merged_h5ad_per_sample
        multiqc_report = ch_multiqc_files
        versions = ch_versions
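
To make the grouping step above easier to reason about outside Nextflow, here is a minimal Python sketch of the same idea: pairs of `(meta, h5ad_file)` are collected into one list of files per `sample_id`. The function name `group_h5ad_by_sample` and the file names are hypothetical and only serve as an illustration; this is not code from the pipeline.

```python
from collections import defaultdict


def group_h5ad_by_sample(items):
    """Group (meta, h5ad_path) pairs into {sample_id: [h5ad_path, ...]}."""
    grouped = defaultdict(list)
    for meta, h5ad_path in items:
        grouped[meta["sample_id"]].append(h5ad_path)
    return dict(grouped)


# Hypothetical channel contents; the file names are made up for illustration.
items = [
    ({"sample_id": "s1"}, "s1_raw.h5ad"),
    ({"sample_id": "s2"}, "s2_raw.h5ad"),
    ({"sample_id": "s1"}, "s1_filtered.h5ad"),
]

grouped = group_h5ad_by_sample(items)
for sample_id, files in grouped.items():
    print(sample_id, files)
```

Each entry of the result corresponds to one `[sample_id, h5ad_files]` element that would be handed to the merging process.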

Merging process

process MERGE_H5AD_PER_SAMPLE {
    conda 'environment.yml'
    tag "$sample_id"

    input:
    // One tuple per sample: the sample ID and all of its h5ad files
    tuple val(sample_id), path(h5ad_files)

    output:
    tuple val(sample_id), path("merged_${sample_id}.h5ad")

    script:
    """
    python - << EOF
    import anndata as ad

    h5ad_files = [${h5ad_files.collect { '"' + it.name + '"' }.join(", ")}]

    # Read every file for this sample and combine them, keeping the union of variables
    adatas = [ad.read_h5ad(h5ad_file) for h5ad_file in h5ad_files]
    adata_merged = ad.concat(adatas, join='outer')
    adata_merged.write('merged_${sample_id}.h5ad')
    EOF
    """
}
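
The merge script passes `join='outer'`, so it may help to spell out what that choice means for the variables (genes) of the merged object. The sketch below is a pure-Python illustration of outer versus inner joining of variable names, not AnnData code; the helper `join_var_names` and the gene names are hypothetical.

```python
def join_var_names(var_name_lists, join="outer"):
    """Return the variable names kept when combining several objects.

    'outer' keeps the union of all names, 'inner' keeps only the names
    shared by every input. First-seen order is preserved.
    """
    if join not in ("outer", "inner"):
        raise ValueError("join must be 'outer' or 'inner'")
    if not var_name_lists:
        return []

    # Collect every name once, in the order it first appears.
    seen = []
    for names in var_name_lists:
        for name in names:
            if name not in seen:
                seen.append(name)

    if join == "outer":
        return seen

    # Inner join: keep only names present in all inputs.
    shared = set(var_name_lists[0]).intersection(*var_name_lists[1:])
    return [name for name in seen if name in shared]


# Hypothetical variable names from two files of the same sample.
file_a = ["GeneA", "GeneB"]
file_b = ["GeneB", "GeneC"]

outer_names = join_var_names([file_a, file_b], join="outer")
inner_names = join_var_names([file_a, file_b], join="inner")
print(outer_names)
print(inner_names)
```

With an outer join no variable is dropped, which fits the goal of one file per sample containing all related data; the cost is that entries missing from a given file have to be filled in.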

albertulll · 2024-10-28T15:57:49Z

Checkpoint 1st day:

briefly understood:
- scdownstream metromap,
- preprocessing workflow,
- finalize module,
- how to run nextflow locally,
- what the output is when `preprocess_only` is true or false
- NF workflows, processes, channels and templates
- What h5ad files are and how they are used in SC analysis
Goal: merge preprocessing h5ad files, per sample.
Create new logic for merging preprocessing output per sample
- Collect the preprocessing data by sample
- Add merging code
- Collect the merged h5ad
- Emit outputs
Add new logic to existing FINALIZE function.

albertulll · 2024-10-30T15:31:41Z

End of hack checkpoint:

Adapt adata extend (or create a new function) to process a list of paths for one sample and merge them, instead of merging one sample and its metadata.

albertulll · 2024-10-31T15:03:16Z

Preprocessing returns only the filtered AnnData file, and it writes out other files in the process.
What I did is return all AnnData files from preprocessing and group them by sample.
- Need to create another channel in preprocessing which collects the deltas of the main h5ad file and returns them (pkl format).
FINALIZE should be changed to take a list of lists instead of one input per delta file; this should accommodate its usage whether `preprocess_only` is true or false.
just look at the data, marius
? Why pkl?

nictru added the enhancement New feature or request label Oct 27, 2024

nictru added this to Hackathon October 2024 Oct 27, 2024

nictru moved this to Todo in Hackathon October 2024 Oct 27, 2024

albertulll self-assigned this Oct 28, 2024
