Implement SourceUnitEnumChunker for GitHub #3298

mcastorina · 2024-09-14T02:16:55Z

Description:

Implements SourceUnitEnumChunker for the GitHub source. Once merged, GitHub scans will use the Enumerate and ChunkUnit methods instead of Chunks.

Builds off of #3296 and #3292

Checklist:

Tests passing (make test-community)?
Lint passing (make lint; this requires golangci-lint)?

This change refactors the internal scan method to introduce a scanRepo method to perform the actual scan.
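The split described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the Unit type, the callback-based signatures, and the scanAll and collect helpers are simplified assumptions; only the names Enumerate, ChunkUnit, and scanRepo come from the description.

```go
package main

import (
	"context"
	"fmt"
)

// Unit is a hypothetical, simplified stand-in for one unit of work (a repo).
type Unit struct {
	Name string
	URL  string
}

// Source is a toy source; the real GitHub source carries far more state.
type Source struct {
	repos []Unit
}

// Enumerate reports every unit without scanning it.
func (s *Source) Enumerate(ctx context.Context, report func(Unit) error) error {
	for _, u := range s.repos {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := report(u); err != nil {
			return err
		}
	}
	return nil
}

// ChunkUnit scans exactly one unit by delegating to scanRepo.
func (s *Source) ChunkUnit(ctx context.Context, u Unit, emit func(string) error) error {
	return s.scanRepo(ctx, u.URL, emit)
}

// scanRepo holds the per-repo work, so a whole-source scan and a
// single-unit scan can share one code path.
func (s *Source) scanRepo(ctx context.Context, url string, emit func(string) error) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	// Placeholder for cloning and chunking the repository.
	return emit("chunk from " + url)
}

// scanAll imitates a single-pass scan built from the two phases.
func scanAll(ctx context.Context, s *Source, emit func(string) error) error {
	return s.Enumerate(ctx, func(u Unit) error {
		return s.ChunkUnit(ctx, u, emit)
	})
}

// collect runs scanAll and gathers the emitted chunks into a slice.
func collect(s *Source) ([]string, error) {
	var chunks []string
	err := scanAll(context.Background(), s, func(c string) error {
		chunks = append(chunks, c)
		return nil
	})
	return chunks, err
}

func main() {
	s := &Source{repos: []Unit{
		{Name: "a", URL: "https://example.com/a.git"},
		{Name: "b", URL: "https://example.com/b.git"},
	}}
	chunks, err := collect(s)
	fmt.Println(chunks, err)
}
```

Keeping the per-repo logic in one method means the two entry points cannot drift apart in how they scan a repository.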

rosecodym

The title terrified me, but this ended up looking super clean, nice work! The one thing I'm worried about is ChunkUnit, which I commented on inline.

rosecodym · 2024-09-20T13:04:45Z

pkg/sources/github/github.go

@@ -372,7 +373,11 @@ func (s *Source) enumerate(ctx context.Context, reporter sources.UnitReporter) e
 	// Report any values that were already configured.
 	for _, name := range s.filteredRepoCache.Keys() {
 		url, _ := s.filteredRepoCache.Get(name)
-		_ = dedupeReporter.UnitOk(ctx, RepoUnit{name: name, url: url})
+		url, err := s.ensureRepoInfoCache(ctx, url)


The day we get rid of the fourteen different caches in this source I will whistle a jaunty tune

mcastorina · 2024-09-20

14?! Ridiculous! We need to develop one universal cache that covers all the use cases!

rosecodym · 2024-09-20T13:10:18Z

pkg/sources/github/github.go

@@ -603,7 +617,6 @@ func (s *Source) scan(ctx context.Context, reporter sources.ChunkReporter) error
 	reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo)
 	s.repos = reposToScan

-	scanErrs := sources.NewScanErrors()


rosecodym

This looks great! But I've thought of another question: when this runs in "actual separate jobs" mode (which we won't be doing yet), where the source scanning a repo is a distinct object from the source that enumerated it, will it make twice as many API calls to get repo information? We have users actively hitting rate limits now, so we'll need to come up with a solution for that. (I realize that I don't even know why we cache repo info in the first place.)

mcastorina · 2024-09-23T14:54:38Z

is running it in "actual separate jobs" mode going to result in twice as many API calls to get repo information?

Unfortunately, yes. It seems that we obtain repo metadata during enumeration, so we can opportunistically save it as chunk metadata. If we enumerate and scan in two separate steps, the current solution throws away the metadata during enumeration and fetches it again during the scan.

One idea is to formalize metadata rehydration, where we only fetch if we find a secret in the chunk. We're still duplicating some calls, but it is theoretically better than duplicating all calls.
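A rough sketch of that rehydration idea, under stated assumptions: RepoInfo, Chunk, MetadataFetcher, and Rehydrate are hypothetical names invented for illustration, the "API call" is simulated with a counter, and the secret check is a plain substring match rather than a real detector.

```go
package main

import (
	"fmt"
	"strings"
)

// RepoInfo is hypothetical repo metadata attached to findings.
type RepoInfo struct {
	Visibility string
}

// Chunk carries its repo URL; Info stays nil until it is rehydrated.
type Chunk struct {
	RepoURL string
	Data    string
	Info    *RepoInfo
}

// MetadataFetcher counts simulated API calls and caches results per repo URL.
type MetadataFetcher struct {
	Calls int
	cache map[string]*RepoInfo
}

// NewMetadataFetcher returns a fetcher with an empty cache.
func NewMetadataFetcher() *MetadataFetcher {
	return &MetadataFetcher{cache: make(map[string]*RepoInfo)}
}

// Fetch returns cached metadata, or simulates one API call on a miss.
func (f *MetadataFetcher) Fetch(url string) *RepoInfo {
	if info, ok := f.cache[url]; ok {
		return info
	}
	f.Calls++
	info := &RepoInfo{Visibility: "public"} // placeholder value, not real data
	f.cache[url] = info
	return info
}

// Rehydrate attaches metadata only to chunks in which hasSecret finds something.
func Rehydrate(chunks []Chunk, f *MetadataFetcher, hasSecret func(Chunk) bool) {
	for i := range chunks {
		if hasSecret(chunks[i]) {
			chunks[i].Info = f.Fetch(chunks[i].RepoURL)
		}
	}
}

func main() {
	chunks := []Chunk{
		{RepoURL: "https://example.com/a.git", Data: "nothing here"},
		{RepoURL: "https://example.com/b.git", Data: "token=abc123"},
		{RepoURL: "https://example.com/b.git", Data: "token=def456"},
	}
	f := NewMetadataFetcher()
	Rehydrate(chunks, f, func(c Chunk) bool {
		return strings.Contains(c.Data, "token=")
	})
	fmt.Println("API calls:", f.Calls) // prints "API calls: 1"
}
```

In this toy run, two repos are scanned but only one metadata fetch happens, because only one repo produced a finding and the second finding hits the cache.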

rgmz · 2024-11-14T22:53:06Z

@mcastorina Am I missing something or is scan now effectively dead and/or duplicate code? In what circumstances would it be called over ChunkUnit?

I have similar confusion with the Git source (#3005 (comment)) where one path has error handling+logging but never gets called, and the other seems to always get called but silently swallows errors (at least for the OSS CLI).

mcastorina · 2024-11-14T23:38:39Z

Sources are in a bit of limbo at the moment. I'd like every source to implement SourceUnitEnumChunker rather than a single Chunks method that does both enumeration and scanning, but right now we need both.

The main (unfortunate) reason is that enterprise still uses the Chunks method; we're working on transitioning it to the new methods.

OSS does not need the Chunks method anymore, which is why it always uses the new methods.
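To make the "we need both" situation concrete, here is a minimal sketch of a caller that prefers the unit-based methods when a source implements them and falls back to Chunks otherwise. The interfaces are heavily simplified assumptions (no contexts, reporters, or errors), and the run helper is hypothetical; it is not a description of how either the OSS or the enterprise engine actually chooses a path.

```go
package main

import "fmt"

// Chunker is a simplified stand-in for the single-method shape.
type Chunker interface {
	Chunks(emit func(string))
}

// UnitEnumChunker is a simplified stand-in for the two-method shape.
type UnitEnumChunker interface {
	Enumerate() []string
	ChunkUnit(unit string, emit func(string))
}

// legacySource only implements Chunks.
type legacySource struct{}

func (legacySource) Chunks(emit func(string)) { emit("legacy:all") }

// unitSource implements both shapes, as a source must during the transition.
type unitSource struct{ units []string }

func (u unitSource) Chunks(emit func(string)) {
	for _, unit := range u.units {
		u.ChunkUnit(unit, emit)
	}
}

func (u unitSource) Enumerate() []string { return u.units }

func (u unitSource) ChunkUnit(unit string, emit func(string)) {
	emit("unit:" + unit)
}

// run uses the unit path when the source supports it, else Chunks.
func run(src Chunker) (path string, out []string) {
	emit := func(c string) { out = append(out, c) }
	if uc, ok := src.(UnitEnumChunker); ok {
		for _, unit := range uc.Enumerate() {
			uc.ChunkUnit(unit, emit)
		}
		return "units", out
	}
	src.Chunks(emit)
	return "chunks", out
}

func main() {
	fmt.Println(run(unitSource{units: []string{"a", "b"}}))
	fmt.Println(run(legacySource{}))
}
```

With a caller like this, the Chunks method on a source that also implements the unit methods is never reached, which is the "effectively dead code" situation raised in the question above.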

mcastorina mentioned this pull request Sep 16, 2024

Instrument GitHub source with a ChunkReporter #3296

Merged

2 tasks

mcastorina added 4 commits September 19, 2024 14:22

Implement SourceUnitEnumChunker for GitHub

6663caf

Export unit fields so the values are captured in the report

80ffe9c

Add comment for scanRepo

bea1df1

Break out ensureRepoInfoCache into a method

2516f72

mcastorina force-pushed the github-units branch from 875cfb9 to 2516f72 Compare September 19, 2024 21:22

mcastorina marked this pull request as ready for review September 19, 2024 21:28

mcastorina requested a review from a team as a code owner September 19, 2024 21:28

rosecodym reviewed Sep 20, 2024

mcastorina added 4 commits September 20, 2024 09:43

Update comments and check errors

bee72fa

Ensure that the repoInfoCache contains the repo during ChunkUnit

c539c7b

Add integration test for ChunkUnit

63c7e42

Move s.scanOptions initialization to Init()

80b6219

rosecodym approved these changes Sep 23, 2024

mcastorina merged commit 2f3a410 into main Sep 23, 2024
12 checks passed

mcastorina deleted the github-units branch September 23, 2024 17:56
