Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SourceUnitEnumChunker for GitHub #3298

Merged
merged 8 commits into from
Sep 23, 2024
Merged

Implement SourceUnitEnumChunker for GitHub #3298

merged 8 commits into from
Sep 23, 2024

Conversation

mcastorina
Copy link
Collaborator

Description:

Implements SourceUnitEnumChunker for GitHub source. When merged, github scans will use the Enumerate and ChunkUnit methods instead of Chunks to scan.

Builds off of #3296 and #3292

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

@mcastorina mcastorina marked this pull request as ready for review September 19, 2024 21:28
@mcastorina mcastorina requested a review from a team as a code owner September 19, 2024 21:28
Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The title terrified me, but this ended up looking super clean, nice work! The one thing I'm worried about is ChunkUnit, which I commented on inline.

pkg/sources/github/github.go Outdated Show resolved Hide resolved
@@ -372,7 +373,11 @@ func (s *Source) enumerate(ctx context.Context, reporter sources.UnitReporter) e
// Report any values that were already configured.
for _, name := range s.filteredRepoCache.Keys() {
url, _ := s.filteredRepoCache.Get(name)
_ = dedupeReporter.UnitOk(ctx, RepoUnit{name: name, url: url})
url, err := s.ensureRepoInfoCache(ctx, url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The day we get rid of the fourteen different caches in this source I will whistle a jaunty tune

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14?! Ridiculous! We need to develop one universal cache that covers all the use cases!

pkg/sources/github/github.go Outdated Show resolved Hide resolved
@@ -603,7 +617,6 @@ func (s *Source) scan(ctx context.Context, reporter sources.ChunkReporter) error
reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo)
s.repos = reposToScan

scanErrs := sources.NewScanErrors()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

pkg/sources/github/github.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! But I have thought of another question - is running it in "actual separate jobs" mode (which we won't be doing yet), where the source scanning a repo is a distinct object from the source that enumerated, going to result in twice as many API calls to get repo information? We have users actively hitting rate limits now, so I think we're going to need to think up a solution for that. (I realize that I don't even know why we cache repo info in the first place.)

@mcastorina
Copy link
Collaborator Author

is running it in "actual separate jobs" mode going to result in twice as many API calls to get repo information?

Unfortunately, yes. It seems that we obtain repo metadata during enumeration, so we can opportunistically save it as chunk metadata. If we enumerate and scan in two separate steps, the current solution throws away the metadata during enumeration and fetches it again during the scan.

One idea is to formalize metadata rehydration, where we only fetch if we find a secret in the chunk. We're still duplicating some calls, but it is theoretically better than duplicating all calls.

@mcastorina mcastorina merged commit 2f3a410 into main Sep 23, 2024
12 checks passed
@mcastorina mcastorina deleted the github-units branch September 23, 2024 17:56
@rgmz
Copy link
Contributor

rgmz commented Nov 14, 2024

@mcastorina Am I missing something or is scan now effectively dead and/or duplicate code? In what circumstances would it be called over ChunkUnit?

I have similar confusion with the Git source (#3005 (comment)) where one path has error handling+logging but never gets called, and the other seems to always get called but silently swallows errors (at least for the OSS CLI).

@mcastorina
Copy link
Collaborator Author

We're sort of in a limbo for sources currently. I'd like to move all sources to implement SourceUnitEnumChunker instead of just a single Chunks method that does both enumeration and scanning, but right now we need both.

The main (unfortunate) reason, is that enterprise is still using the Chunks method, but we're working on transitioning to the new methods.

OSS does not need the Chunks method anymore, which is why it always uses the new methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants