CORE-123: new file-pairing API for data uploader #1452

Merged
merged 23 commits into develop from da_AJ-2025_fileMatchingPOC
Nov 5, 2024

Conversation

davidangb
Contributor

@davidangb davidangb commented Oct 9, 2024

Introduces a new API at

POST /api/workspaces/{workspaceNamespace}/{workspaceName}/entities/{entityType}/paired-tsv

This API will:

  1. List the files in the workspace's bucket, filtered to a given bucket prefix
  2. Attempt to pair those files together based on Illumina paired-end file naming conventions as well as other well-known naming conventions supplied by Product
  3. Generate and download a TSV containing the results of those file pairings
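As a rough illustration of steps 2 and 3, here is a minimal sketch that pairs files using only the Illumina `_R1`/`_R2` read markers (for example `Sample_S1_L001_R1_001.fastq.gz`). It is not the code from this PR: the object name `PairingSketch`, the `Pairing` case class, the regular expression, and the TSV column names `read1` and `read2` are all assumptions made for the example, and the other naming conventions supplied by Product are not covered.

```scala
object PairingSketch {
  // Illumina paired-end names look like Sample_S1_L001_R1_001.fastq.gz,
  // where R1 and R2 mark the two reads of a pair.
  // Hypothetical pattern: it only recognizes .fastq/.fq files, optionally gzipped.
  private val ReadPattern = """^(.*)_R([12])(_\d{3})?\.(fastq|fq)(\.gz)?$""".r

  // Hypothetical result type: one row per pairing key.
  final case class Pairing(key: String, read1: Option[String], read2: Option[String])

  def pairFiles(paths: Seq[String]): Seq[Pairing] = {
    // Keep only paths that match the pattern; tag each with its key and read number.
    // The optional chunk suffix (e.g. "_001") is null when absent, hence Option(...).
    val tagged = paths.collect { case path @ ReadPattern(prefix, read, chunk, _, _) =>
      (prefix + Option(chunk).getOrElse(""), read.toInt, path)
    }
    // Group by key so that R1 and R2 of the same sample end up in the same row.
    tagged.groupBy(_._1).toSeq.sortBy(_._1).map { case (key, hits) =>
      Pairing(
        key,
        hits.collectFirst { case (_, 1, p) => p },
        hits.collectFirst { case (_, 2, p) => p }
      )
    }
  }

  // Render the pairings as tab-separated text. The column names are assumptions.
  def toTsv(entityType: String, pairings: Seq[Pairing]): String = {
    val header = s"entity:${entityType}_id\tread1\tread2"
    val rows = pairings.map { p =>
      Seq(p.key, p.read1.getOrElse(""), p.read2.getOrElse("")).mkString("\t")
    }
    (header +: rows).mkString("\n")
  }
}
```

Files that do not match the pattern are dropped in this sketch, and an unpaired read yields a row with one empty column; whether the real API reports such files differently is not stated here.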

The driving use case for this API is the "Data Uploader" in Terra UI, though we may find that scripters and notebook users also want to use the API.

I tested this locally against a bucket containing ~100,000 files, and the file-matching portion of the algorithm executes in under 2 seconds. The resulting TSV is about 30 MB, so the API is slow overall, but that size is unavoidable at this scale.

@davidangb davidangb changed the title from "POC: file-matching for data uploader" to "CORE-123: new file-pairing API for data uploader" Nov 1, 2024
@davidangb davidangb marked this pull request as ready for review November 1, 2024 20:29

@kevinmarete kevinmarete left a comment

This looks good to me. I just added a comment on implementing a paging solution to handle a large number of files.

@davidangb davidangb requested a review from dvoet November 4, 2024 19:34
val fileList: List[GcsObjectName] =
  googleServicesDao.listBucket(workspaceBucket, Option(matchingOptions.prefix), recursive)

logger.info(s"found ${fileList.length} files")
Contributor

do we want to place a limit on this based on the number of files returned?

Contributor Author

added in 9505b6e
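For readers without access to that commit, one way such a limit could look is sketched below. This is only an illustration under assumptions, not the contents of 9505b6e: the names `FileLimitSketch`, `TooManyFiles`, and `enforceLimit` are hypothetical, and so is the choice to reject the request rather than truncate the listing.

```scala
object FileLimitSketch {
  // Hypothetical error value describing a listing that exceeded the limit.
  final case class TooManyFiles(found: Int, limit: Int)

  // Return the files unchanged when the list is within the limit,
  // otherwise report how many were found and what the limit was.
  def enforceLimit[A](files: List[A], limit: Int): Either[TooManyFiles, List[A]] =
    // lengthCompare stops walking the list once it has seen more than `limit` elements.
    if (files.lengthCompare(limit) > 0) Left(TooManyFiles(files.length, limit))
    else Right(files)
}
```

A caller could map the `Left` case to an error response before any pairing work starts, which keeps the expensive TSV generation off the path for oversized prefixes.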

@davidangb davidangb requested a review from dvoet November 5, 2024 14:55
@davidangb davidangb merged commit 611acf8 into develop Nov 5, 2024
13 checks passed
@davidangb davidangb deleted the da_AJ-2025_fileMatchingPOC branch November 5, 2024 15:11
3 participants