-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement SourceUnitEnumChunker for GitHub #3298
Conversation
This change refactors the internal scan method to introduce a scanRepo method to perform the actual scan.
875cfb9
to
2516f72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title terrified me, but this ended up looking super clean, nice work! The one thing I'm worried about is ChunkUnit
, which I commented on inline.
@@ -372,7 +373,11 @@ func (s *Source) enumerate(ctx context.Context, reporter sources.UnitReporter) e | |||
// Report any values that were already configured. | |||
for _, name := range s.filteredRepoCache.Keys() { | |||
url, _ := s.filteredRepoCache.Get(name) | |||
_ = dedupeReporter.UnitOk(ctx, RepoUnit{name: name, url: url}) | |||
url, err := s.ensureRepoInfoCache(ctx, url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The day we get rid of the fourteen different caches in this source I will whistle a jaunty tune
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
14?! Ridiculous! We need to develop one universal cache that covers all the use cases!
@@ -603,7 +617,6 @@ func (s *Source) scan(ctx context.Context, reporter sources.ChunkReporter) error | |||
reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo) | |||
s.repos = reposToScan | |||
|
|||
scanErrs := sources.NewScanErrors() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! But I have thought of another question - is running it in "actual separate jobs" mode (which we won't be doing yet), where the source scanning a repo is a distinct object from the source that enumerated, going to result in twice as many API calls to get repo information? We have users actively hitting rate limits now, so I think we're going to need to think up a solution for that. (I realize that I don't even know why we cache repo info in the first place.)
Unfortunately, yes. It seems that we obtain repo metadata during enumeration, so we can opportunistically save it as chunk metadata. If we enumerate and scan in two separate steps, the current solution throws away the metadata during enumeration and fetches it again during the scan. One idea is to formalize metadata rehydration, where we only fetch if we find a secret in the chunk. We're still duplicating some calls, but it is theoretically better than duplicating all calls. |
@mcastorina Am I missing something or is I have similar confusion with the Git source (#3005 (comment)) where one path has error handling+logging but never gets called, and the other seems to always get called but silently swallows errors (at least for the OSS CLI). |
We're sort of in a limbo for sources currently. I'd like to move all sources to implement The main (unfortunate) reason, is that enterprise is still using the OSS does not need the |
Description:
Implements
SourceUnitEnumChunker
for GitHub source. When merged,github
scans will use theEnumerate
andChunkUnit
methods instead ofChunks
to scan.Builds off of #3296 and #3292
Checklist:
make test-community
)?make lint
this requires golangci-lint)?