-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENG-13633: Initial discovery and classification implementation #51
Conversation
3f0d8fe
to
72dd28a
Compare
5118b3e
to
27aea6b
Compare
54baeac
to
9b629dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A big PR, thanks for the effort @ccampo133 - I've taken a quick initial pass, skipping over repo specific code that I know is from an existing implementation.
Dockerfile
Outdated
COPY . . | ||
|
||
# Build. | ||
RUN CGO_ENABLED=0 go build -ldflags="-X main.version=$(git rev-parse HEAD)" -o dmap cmd/*.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit of building within the Dockerfile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that it simplified the build process, since we just need to run a single command and don't need to do any sort of relative path copying, and also makes it somewhat self contained. However if we plan to release a binary in addition to the Docker image (we probably should), then we should also change this up to pull the binary from that build step and include it in the image.
classification/label_classifier.go
Outdated
if err != nil { | ||
return nil, fmt.Errorf("error evaluating query for label %s: %w", lbl.Name, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should log the error and move on - if there's a mistake in one of the rego classifiers, we can still use the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I tried to find a way of not doing this stuff at runtime and instead at compile time, but came up blank. If you have any ideas, LMK. We will need runtime parsing of classifier code to support custom labels in any case, but I'd prefer not having the possibility of us releasing a binary with potentially broken classifiers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well at compile time you could do a "test" run with some input and check for the output format. That still doesn't rule out the possibility (in theory) though that the rego code will give output in some other format for some other inputs.
c9dbf4b
to
46cf074
Compare
03e79f6
to
36885f3
Compare
@VictorGFM @yoursnerdly after a bunch of churn, I believe this is good to go, at least for the initial implementation. I put a lot of effort into minimizing the public API surface through a bunch of refactoring, but the code is largely the same. The main difference being representing attributes as a path array (e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccampo133 The package organization and type definitions look really good! I left a few comments below for your consideration. I'll let the approval to @yoursnerdly since he got the chance to take a look at the implementation in more detail.
@@ -0,0 +1,93 @@ | |||
package sql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about moving the Repository
implementations to a separate package? Maybe a package within sql
named repository
, seems easier to understand the repo implementations from the type definitions and utilities if they're in separated packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original design actually. If you think it makes it clearer, I am happy to do that. I thought just having a single package was easier from an API consumer perspective, but if you don't think so, let's change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually what ends up happening is you get a circular dependency, which is difficult to avoid. For example, if we have the following package layout:
sql/
scanner.go
sample.go
repository/
mysql.go
The scanner depends on the repository package, but the repository package will depend on the sql package to use the Sample
type, and thus you get the circular dependency. What ends up happening is that the only thing that can live in the sql
package by itself is the Scanner
type. Everything else needs to go in the repository
package, which is annoying and sort of defeats the purpose.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the problem with circular dependency now. In that case I think it's fine to keep the way it is on the sql
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good @ccampo133, thanks for the huge PR. Please look at my comments (mostly nitpicks).
sql/scanner.go
Outdated
// "databases", therefore a single repository instance will always scan the | ||
// entire database. | ||
if s.Config.RepoConfig.Database != "" || s.Config.RepoType == RepoTypeOracle { | ||
samples, err = s.sampleDb(ctx, s.Config.RepoConfig.Database) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oracle does support multiple databases (in a very confusing way) since version 12c, there is a root CDB and then multiple PDBs within that. I think we need to support those scenarios as well - perhaps in a future PR.
For now, we can tell the UI to expect no database name for Oracle (just schema, table, column) but if we do later support PDB etc, the attribute path may have 3 or 4 entries depending upon on the version etc. In the latter case, the first entry should be interpreted as the database name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - we should add support for this in a future PR then. I will need to research how Oracle works to add support for it, or delegate it to somebody more experienced with Oracle.
854ac64
to
87676f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Chris, this looks good to me. Just a couple of nitpicks.
} | ||
rule, err := parseRego(string(b)) | ||
rule, err := readLabelRule(ruleFname, ruleFs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work if the path begins with ..
. Based on the documentation, it looks like it should but just checking.
We should document somewhere that the relative paths in the yaml file are relative to the directory containing the yaml file itself (and not the current directory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - it works now but there were actually some bugs around this. I added some test cases to cover them. The part about the path being relative to the file is documented in the embedded labels.yaml file as a header comment. I will also ensure this is documented in the public README, when it is updated.
// successfully loaded. | ||
var errs classification.InvalidLabelsError | ||
if errors.As(err, &errs) { | ||
log.WithError(errs).Warnf("%s: some labels were not loaded", errMsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should return error if len(lbls) == 0 since there is no point scanning the db if there are no labels.
golangci-lint seems to be running into some internal errors - will need to look into that. |
Description of the change
Initial implementation of Dmap's discovery and classification feature.
Includes a
dmap
CLI which can be used as follows:The
repo-scan
sub-command performs the data discovery and classification. Currently it just prints the output to stdout, in JSON form, example:Note that some of the details like command name, parameters, etc. are subject to change until the first stable version is released.
Additionally, most of the code that powers the CLI has been added as public packages to the main module, enhancing the API of the existing Dmap library. Users can use these packages to implement their own discovery and classification tooling if desired. There are two new top-level packages added to the public API:
classification
- provides an API to perform data classification on arbitrary string data.sql
- provides an API to introspect, sample, and scan (which is introspect + sample + classify) SQL data repositories.A new
RepoScanner
interface was also added to thescan
package.Type of change
Checklists
Development
Code review
Testing
Unit and manual testing.