ENG-13633: Initial discovery and classification implementation #51
Merged
Changes from 5 commits
Commits
Show all changes
15 commits
977618a
Initial discovery and classification CLI implementation
ccampo133 b5051e8
Simplfy classification
ccampo133 69c1731
Refactor
ccampo133 b6f658d
Fix OPA tests
ccampo133 d44ab9b
Cleanup
ccampo133 ef635fd
OPA fmt + lint
ccampo133 f959d27
Merge branch 'main' into ENG-13633
ccampo133 11a1d59
Repository refactoring
ccampo133 05827ac
Documentation
ccampo133 b5409e7
More refactoring
ccampo133 cef628e
More refactoring
ccampo133 f3a42ca
Reduce public API surface
ccampo133 db43f9b
PR comments
ccampo133 87db7a6
Add custom label support
ccampo133 c4628c8
Fix relative path bugs.
ccampo133 File filter
Filter by extension
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
name: Build CLI Docker Image | ||
|
||
on: | ||
push: | ||
branches: [ "main" ] | ||
pull_request: | ||
branches: [ "main" ] | ||
|
||
jobs: | ||
build: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v4 | ||
|
||
- name: Build Docker image | ||
run: docker build . --file Dockerfile |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
name: Run OPA Tests | ||
|
||
on: | ||
push: | ||
branches: [ "main" ] | ||
pull_request: | ||
branches: [ "main" ] | ||
|
||
jobs: | ||
test: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Check out repository code | ||
uses: actions/checkout@v3 | ||
|
||
- name: Setup OPA | ||
uses: open-policy-agent/setup-opa@v2 | ||
with: | ||
version: latest | ||
|
||
- name: Run OPA Tests | ||
run: opa test classification/rego/*.rego -v |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -253,3 +253,6 @@ __debug_bin | |
|
||
# Backup files | ||
*~ | ||
|
||
# Other | ||
/.run/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
FROM golang:1.22 AS build | ||
|
||
# Set destination for COPY. | ||
WORKDIR /app | ||
|
||
# Download dependencies. | ||
COPY go.mod go.sum ./ | ||
RUN go mod download | ||
|
||
# Copy the source code. | ||
COPY . . | ||
|
||
# Build. | ||
RUN CGO_ENABLED=0 go build -ldflags="-X main.version=$(git rev-parse HEAD)" -o dmap cmd/*.go | ||
|
||
FROM gcr.io/distroless/static-debian12:nonroot | ||
|
||
COPY --from=build /app/dmap /dmap | ||
|
||
ENTRYPOINT ["/dmap"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
// Package classification provides types and functions to facilitate data | ||
// classification. The Classifier interface takes arbitrary data as input and | ||
// returns the classifications for that data as output. The package contains | ||
// at least one implementation, LabelClassifier, which uses OPA and Rego to | ||
// perform the actual classification logic; other implementations may be | ||
// added in the future. | ||
package classification | ||
|
||
import ( | ||
"context" | ||
"fmt" | ||
"maps" | ||
|
||
"github.com/cyralinc/dmap/discovery/repository" | ||
) | ||
|
||
// Classifier is an interface that represents a data classifier. A classifier | ||
// takes a set of data attributes and classifies them into a set of labels. | ||
type Classifier interface { | ||
// Classify takes the given input, which is essentially a "row of data", | ||
// and returns the data classifications for that input. The input is a map | ||
// of attribute names (i.e. columns) to their values. The returned Result | ||
// is a map of attribute names to the set of labels that each attribute | ||
// was classified as. | ||
Classify(ctx context.Context, input map[string]any) (Result, error) | ||
} | ||
|
||
// ClassifiedTable represents a database table that has been classified. The | ||
// classifications are stored in the Classifications field, which is a map of | ||
// attribute names (i.e. columns) to the set of labels that each attribute was | ||
// classified as. | ||
type ClassifiedTable struct { | ||
Repo string `json:"repo"` | ||
Database string `json:"database"` | ||
Schema string `json:"schema"` | ||
Table string `json:"table"` | ||
|
||
Classifications Result `json:"classifications"` | ||
} | ||
|
||
// Result represents the classifications for a set of data attributes. The key | ||
// is the attribute (i.e. column) name and the value is the set of labels | ||
// that attribute was classified as. | ||
type Result map[string]LabelSet | ||
|
||
|
||
// Merge combines the given other Result into this Result (the receiver). If | ||
// an attribute from other is already present in this Result, the existing | ||
// labels for that attribute are merged with the labels from other; otherwise, | ||
// the labels from other for that attribute are added to this Result. Calling | ||
// Merge on a nil Result is a no-op. | ||
func (c Result) Merge(other Result) { | ||
if c == nil { | ||
return | ||
} | ||
for attr, labelSet := range other { | ||
if _, ok := c[attr]; !ok { | ||
c[attr] = make(LabelSet) | ||
} | ||
maps.Copy(c[attr], labelSet) | ||
} | ||
} | ||
|
||
// ClassifySamples uses the provided classifier to classify the sample data | ||
// passed via the "samples" parameter. It is a helper function that loops | ||
// through each repository.Sample, passes each sampled row (a map of attribute | ||
// names to values) to Classifier.Classify, and merges the per-row results. | ||
// See the documentation for Classifier and its Classify method for more | ||
// details. The returned slice contains one ClassifiedTable per sample that | ||
// produced at least one classification; samples with no classifications are | ||
// omitted. | ||
func ClassifySamples( | ||
ctx context.Context, | ||
samples []repository.Sample, | ||
classifier Classifier, | ||
) ([]ClassifiedTable, error) { | ||
tables := make([]ClassifiedTable, 0, len(samples)) | ||
for _, sample := range samples { | ||
// Classify each sampled row and combine the results. | ||
result := make(Result) | ||
for _, sampleResult := range sample.Results { | ||
res, err := classifier.Classify(ctx, sampleResult) | ||
if err != nil { | ||
return nil, fmt.Errorf("error classifying sample: %w", err) | ||
} | ||
result.Merge(res) | ||
} | ||
if len(result) > 0 { | ||
table := ClassifiedTable{ | ||
Repo: sample.Metadata.Repo, | ||
Database: sample.Metadata.Database, | ||
Schema: sample.Metadata.Schema, | ||
Table: sample.Metadata.Table, | ||
Classifications: result, | ||
} | ||
tables = append(tables, table) | ||
} | ||
} | ||
return tables, nil | ||
} |
What's the benefit of building within the Dockerfile?
I thought that it simplified the build process: we only need to run a single command, we don't need any relative path copying, and the build is somewhat self-contained. However, if we plan to release a binary in addition to the Docker image (we probably should), then we should change this to pull the binary from that build step and include it in the image.