Improve preprocess_only implementation #108

Open
nictru opened this issue Oct 27, 2024 · 4 comments

@nictru
Collaborator

nictru commented Oct 27, 2024

Description of feature

The goal is to create full-blown merged AnnData objects for each input sample if preprocess_only is active.

@nictru nictru added the enhancement New feature or request label Oct 27, 2024
@albertulll

albertulll commented Oct 28, 2024

  • some people only want to run the preprocessing step, which was not easily possible initially

  • some code was added to allow for that, but the output files still need processing and aggregation

  • FINALIZE merges together the output of the entire pipeline. We want a version of it for the preprocessing workflow.

  • Currently, FINALIZE is only run when preprocess_only is false, so preprocessing returns data in multiple categories for all samples. What we want is one file per sample with all the data related to it.

  • After L53, add another process which uses Python to merge the files. Collect the h5ad files from ch_h5ad.

  • The rest covers the cases where preprocess_only is false or there is no input, plus MultiQC; nothing more there is important.

  • explore the current implementation

  • look over FINALIZE function

  • run Nextflow locally

  • analyse the output of the full pipeline or partial

    //
    // Group h5ad files by sample and merge them
    //
    // Key each h5ad file by its sample_id, then gather all files sharing a key.
    // groupTuple is used because groupBy is no longer available in DSL2.
    ch_sample_h5ad_files = ch_h5ad
        .map { meta, h5ad_file -> [meta.sample_id, h5ad_file] }
        .groupTuple() // emits [sample_id, [h5ad_file, ...]]

    // Merge h5ad files per sample
    MERGE_H5AD_PER_SAMPLE(ch_sample_h5ad_files)

    //
    // Collect merged h5ad files
    //
    ch_merged_h5ad_per_sample = MERGE_H5AD_PER_SAMPLE.out

    //
    // Emit outputs
    //
    emit:
        merged_h5ad_per_sample = ch_merged_h5ad_per_sample
        multiqc_report = ch_multiqc_files
        versions = ch_versions
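
To make the intended grouping concrete, here is a minimal plain-Python sketch of what the "group by sample" step is expected to produce. It only mimics the channel logic with lists and dicts; the file names and the `group_by_sample` helper are made up for illustration and are not part of the pipeline.

```python
from collections import defaultdict


def group_by_sample(items):
    """Group (meta, h5ad_path) pairs into {sample_id: [h5ad_path, ...]}."""
    grouped = defaultdict(list)
    for meta, h5ad_path in items:
        grouped[meta["sample_id"]].append(h5ad_path)
    return dict(grouped)


# Hypothetical channel contents: two files for sample A, one for sample B.
ch_h5ad = [
    ({"sample_id": "A"}, "A_filtered.h5ad"),
    ({"sample_id": "B"}, "B_filtered.h5ad"),
    ({"sample_id": "A"}, "A_doublets.h5ad"),
]

grouped = group_by_sample(ch_h5ad)
print(grouped)
# {'A': ['A_filtered.h5ad', 'A_doublets.h5ad'], 'B': ['B_filtered.h5ad']}
```

Each entry of the result corresponds to one task of the merging process: one sample ID together with all of its h5ad files.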

Merging process

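process MERGE_H5AD_PER_SAMPLE {
    conda 'environment.yml'
    tag "$sample_id"

    input:
    // One tuple per sample: the sample ID and all of its h5ad files
    tuple val(sample_id), path(h5ad_files)

    output:
    // DSL2 outputs are read via MERGE_H5AD_PER_SAMPLE.out, so no "into"
    tuple val(sample_id), path("merged_${sample_id}.h5ad")

    script:
    """
    python - << EOF
    import anndata as ad

    h5ad_files = [${h5ad_files.collect { '"' + it.name + '"' }.join(", ")}]

    adatas = [ad.read_h5ad(h5ad_file) for h5ad_file in h5ad_files]
    # ad.concat replaces the deprecated AnnData.concatenate; label and
    # index_unique keep its default "batch" column and unique obs names
    adata_merged = ad.concat(adatas, join='outer', label='batch', index_unique='-')
    adata_merged.write_h5ad('merged_${sample_id}.h5ad')
    EOF
    """
}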
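
As an alternative to the inline heredoc, the merging logic could live in a small Python helper that is easier to test on its own. The following is only a sketch under assumptions: the function names `merged_name` and `merge_sample` are hypothetical, and it assumes `anndata` is available in the process environment.

```python
def merged_name(sample_id):
    """Return the output file name used for a merged sample."""
    return f"merged_{sample_id}.h5ad"


def merge_sample(sample_id, h5ad_files):
    """Concatenate all h5ad files of one sample and write the result.

    Assumption: an outer join over variables is wanted, as in the
    process above. Returns the path of the written file.
    """
    # Imported here so the naming helper works without anndata installed.
    import anndata as ad

    adatas = [ad.read_h5ad(path) for path in h5ad_files]
    merged = ad.concat(adatas, join="outer", label="batch", index_unique="-")
    output = merged_name(sample_id)
    merged.write_h5ad(output)
    return output
```

A call such as `merge_sample("A", ["A_filtered.h5ad", "A_doublets.h5ad"])` would then write `merged_A.h5ad` into the working directory.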

@albertulll albertulll self-assigned this Oct 28, 2024
@albertulll

albertulll commented Oct 28, 2024

Checkpoint 1st day:

  • briefly understood:
    • scdownstream metromap,
    • preprocessing workflow,
    • finalize module,
    • how to run nextflow locally,
    • what the output is when preprocess_only is True or False
    • NF workflows, processes, channels and templates
    • What h5ad files are and how they are used in SC analysis
  • Goal: merge preprocessing h5ad files, per sample.
  • Create new logic for merging preprocessing output per sample
    • Collect the preprocessing data by sample
    • Add merging code
    • Collect the merged h5ad
    • Emit outputs
  • Add new logic to existing FINALIZE function.

@albertulll

albertulll commented Oct 30, 2024

End of hack checkpoint:

  • adapt adata extend (or create new function) to process a list of paths for one sample and merge them, instead of merging one sample and its metadata.

@albertulll

albertulll commented Oct 31, 2024

  • preprocessing returns only the filtered AnnData file; it writes out the other files in the process.
  • What I did is return all AnnData files from preprocessing and group them by sample.
    • need to create another channel in preprocess which collects the delta of the main h5ad file and returns it (pkl format)
  • finalize should be changed to take a list of lists instead of one input per delta file; this should accommodate its usage whether preprocess_only is True or False
  • just look at the data, marius
    ? Why pkl?
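
The "list of lists" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: a plain dict stands in for the AnnData object, each pkl file is assumed to hold a dict of new columns, and `apply_deltas` is a hypothetical name. The real delta format should be checked against the data first.

```python
import pickle
import tempfile
from pathlib import Path


def apply_deltas(base, delta_groups):
    """Apply groups of pickled deltas to a copy of `base`, in order.

    `delta_groups` is a list of lists of paths. Each file is assumed to
    contain a pickled dict whose entries are added to the base.
    """
    result = dict(base)
    for group in delta_groups:
        for path in group:
            with open(path, "rb") as handle:
                result.update(pickle.load(handle))
    return result


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up delta files, each in its own group.
    qc = Path(tmp) / "qc.pkl"
    doublets = Path(tmp) / "doublets.pkl"
    qc.write_bytes(pickle.dumps({"n_genes": [10, 20]}))
    doublets.write_bytes(pickle.dumps({"is_doublet": [False, True]}))

    merged = apply_deltas({"cell_id": ["c1", "c2"]}, [[qc], [doublets]])

print(sorted(merged))
# ['cell_id', 'is_doublet', 'n_genes']
```

With this shape, the same function handles one delta per category or many, which is what would let a single finalize step serve both the full pipeline and the preprocessing-only case.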

Status: Todo

2 participants