ENG-13633: Initial discovery and classification implementation #51

Merged
merged 15 commits into from
Apr 8, 2024
16 changes: 16 additions & 0 deletions .github/workflows/docker.yaml
@@ -0,0 +1,16 @@
name: Build CLI Docker Image

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Docker image
        run: docker build . --file Dockerfile
22 changes: 22 additions & 0 deletions .github/workflows/opa.yaml
@@ -0,0 +1,22 @@
name: Run OPA Tests

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3

      - name: Setup OPA
        uses: open-policy-agent/setup-opa@v2
        with:
          version: latest

      - name: Run OPA Tests
        run: opa test classification/rego/*.rego -v
3 changes: 3 additions & 0 deletions .gitignore
@@ -253,3 +253,6 @@ __debug_bin

# Backup files
*~

# Other
/.run/
20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
FROM golang:1.22 as build

# Set destination for COPY.
WORKDIR /app

# Download dependencies.
COPY go.mod go.sum ./
RUN go mod download

# Copy the source code.
COPY . .

# Build.
RUN CGO_ENABLED=0 go build -ldflags="-X main.version=$(git rev-parse HEAD)" -o dmap cmd/*.go

What's the benefit of building within the Dockerfile?

Contributor Author

I thought it simplified the build process: we only need to run a single command, there is no relative path copying, and the build is somewhat self-contained. However, if we plan to release a binary in addition to the Docker image (we probably should), we should change this to pull the binary from that build step and include it in the image.


FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=build /app/dmap /dmap

ENTRYPOINT ["/dmap"]
96 changes: 96 additions & 0 deletions classification/classification.go
@@ -0,0 +1,96 @@
// Package classification provides various types and functions to facilitate
// data classification. The type Classifier provides an interface which takes
// arbitrary data as input and returns a classified version of that data as
// output. The package contains at least one implementation, LabelClassifier,
// which uses OPA and Rego to perform the actual classification logic; other
// implementations may be added in the future.
package classification

import (
	"context"
	"fmt"
	"maps"

	"github.com/cyralinc/dmap/discovery/repository"
)

// Classifier is an interface that represents a data classifier. A classifier
// takes a set of data attributes and classifies them into a set of labels.
type Classifier interface {
	// Classify takes the given input, which is essentially a "row of data",
	// and returns the data classifications for that input. The input is a map
	// of attribute names (i.e. columns) to their values. The returned Result
	// is a map of attribute names to the set of labels each attribute was
	// classified as.
	Classify(ctx context.Context, input map[string]any) (Result, error)
}

// ClassifiedTable represents a database table that has been classified. The
// classifications are stored in the Classifications field, which is a map of
// attribute names (i.e. columns) to the set of labels each attribute was
// classified as.
type ClassifiedTable struct {
	Repo            string `json:"repo"`
	Database        string `json:"database"`
	Schema          string `json:"schema"`
	Table           string `json:"table"`
	Classifications Result `json:"classifications"`
}

// Result represents the classifications for a set of data attributes. The key
// is the attribute (i.e. column) name and the value is the set of labels
// that attribute was classified as.
type Result map[string]LabelSet

// Merge combines the given other Result into this Result (the receiver). If
// an attribute from other is already present in this Result, the existing
// labels for that attribute are merged with the labels from other; otherwise
// the labels from other for the attribute are simply added to this Result.
// Merge is a no-op if the receiver is nil.
func (c Result) Merge(other Result) {
	if c == nil {
		return
	}
	for attr, labelSet := range other {
		if _, ok := c[attr]; !ok {
			c[attr] = make(LabelSet)
		}
		maps.Copy(c[attr], labelSet)
	}
}

// ClassifySamples uses the provided classifier to classify the sample data
// passed via the "samples" parameter. It is mostly a helper function which
// loops through each repository.Sample, retrieves the attribute names and
// values of that sample, passes them to Classifier.Classify, and then
// aggregates the results. Please see the documentation for Classifier and its
// Classify method for more details. The returned slice contains one
// ClassifiedTable for each sample that produced at least one classification;
// samples with no classifications are omitted.
func ClassifySamples(
	ctx context.Context,
	samples []repository.Sample,
	classifier Classifier,
) ([]ClassifiedTable, error) {
	tables := make([]ClassifiedTable, 0, len(samples))
	for _, sample := range samples {
		// Classify each sampled row and combine the results.
		result := make(Result)
		for _, sampleResult := range sample.Results {
			res, err := classifier.Classify(ctx, sampleResult)
			if err != nil {
				return nil, fmt.Errorf("error classifying sample: %w", err)
			}
			result.Merge(res)
		}
		if len(result) > 0 {
			table := ClassifiedTable{
				Repo:            sample.Metadata.Repo,
				Database:        sample.Metadata.Database,
				Schema:          sample.Metadata.Schema,
				Table:           sample.Metadata.Table,
				Classifications: result,
			}
			tables = append(tables, table)
		}
	}
	return tables, nil
}