ENG-13633: Initial discovery and classification implementation #51

ccampo133 · 2024-03-27T16:14:13Z

Description of the change

Initial implementation of Dmap's discovery and classification feature.

Includes a dmap CLI which can be used as follows:

$ dmap --help
Usage: dmap <command> [flags]

Assess your data security posture in AWS.

Flags:
  -h, --help                 Show context-sensitive help.
      --log-level="info"     Set the logging level (trace|debug|info|warn|error|fatal)
      --log-format="text"    Set the logging format (text|json)
      --version              Print version information and quit

Commands:
  repo-scan    Perform data discovery and classification on a data repository.

Run "dmap <command> --help" for more information on a command.

The repo-scan sub-command performs the data discovery and classification. Currently it just prints the results to stdout as JSON, for example:

$ dmap repo-scan --type postgres --database postgres --host ... --port ...  --user ... --password ...
{
    "labels": [
        {
            "name": "ADDRESS",
            "description": "Address",
            "tags": [
                "PII"
            ]
        },
        ...
    ],
    "classifications": [
        {
            "attributePath": [
                "postgres",
                "public",
                "doctors",
                "address2"
            ],
            "labels": [
                "ADDRESS"
            ]
        },
        ...
    ]
}
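A minimal sketch of how a consumer might decode this output in Go. The JSON field names come from the example above; the type names (Label, Classification, ScanResults) and the parseResults helper are hypothetical and not necessarily those used in the Dmap packages. The sample constant is the example with the elided entries left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Label mirrors one entry of the "labels" array in the example output.
type Label struct {
	Name        string   `json:"name"`
	Description string   `json:"description"`
	Tags        []string `json:"tags"`
}

// Classification mirrors one entry of the "classifications" array.
type Classification struct {
	AttributePath []string `json:"attributePath"`
	Labels        []string `json:"labels"`
}

// ScanResults is a hypothetical name for the top-level JSON document.
type ScanResults struct {
	Labels          []Label          `json:"labels"`
	Classifications []Classification `json:"classifications"`
}

// parseResults decodes repo-scan output captured from stdout.
func parseResults(data []byte) (ScanResults, error) {
	var res ScanResults
	err := json.Unmarshal(data, &res)
	return res, err
}

// sampleOutput is the example from the description, minus the elided entries.
const sampleOutput = `{
  "labels": [
    {"name": "ADDRESS", "description": "Address", "tags": ["PII"]}
  ],
  "classifications": [
    {"attributePath": ["postgres", "public", "doctors", "address2"], "labels": ["ADDRESS"]}
  ]
}`

func main() {
	res, err := parseResults([]byte(sampleOutput))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// Print each classified attribute as a dotted path with its labels.
	for _, c := range res.Classifications {
		fmt.Printf("%s -> %s\n", strings.Join(c.AttributePath, "."), strings.Join(c.Labels, ","))
	}
}
```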

Note that some of the details like command name, parameters, etc. are subject to change until the first stable version is released.

Additionally, most of the code that powers the CLI has been added as public packages to the main module, enhancing the API of the existing Dmap library. Users can use these packages to implement their own discovery and classification tooling if desired. There are two new top-level packages added to the public API:

classification - provides an API to perform data classification on arbitrary string data.
sql - provides an API to introspect, sample, and scan (which is introspect + sample + classify) SQL data repositories.

A new RepoScanner interface was also added to the scan package.

Type of change

Bug fix (non-breaking change that fixes an issue).
New feature (non-breaking change that adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).

Checklists

Development

Lint rules pass locally.
The code changed/added as part of this pull request has been covered with tests.
All tests related to the changed code pass.

Code review

This pull request has a descriptive title and information useful to a reviewer. There may be a screenshot or screencast attached.
Jira issue referenced in commit message and/or PR title.

Testing

Unit and manual testing.

Refactor

yoursnerdly

A big PR, thanks for the effort @ccampo133 - I've taken a quick initial pass, skipping over repo-specific code that I know is from an existing implementation.

yoursnerdly · 2024-03-29T21:43:28Z

Dockerfile

+COPY . .
+
+# Build.
+RUN CGO_ENABLED=0 go build -ldflags="-X main.version=$(git rev-parse HEAD)" -o dmap cmd/*.go


What's the benefit of building within the Dockerfile?

I thought that it simplified the build process, since we just need to run a single command and don't need to do any sort of relative path copying, and it also makes the build somewhat self-contained. However, if we plan to release a binary in addition to the Docker image (we probably should), then we should also change this to pull the binary from that build step and include it in the image.
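For context on the -ldflags="-X main.version=..." part of the build command: the Go linker's -X flag overwrites a package-level string variable at link time. A minimal sketch of the receiving side, assuming a variable named version in package main (as the flag implies); the "dev" default and the versionString helper are assumptions for illustration only.

```go
package main

import "fmt"

// version is overwritten at link time by -ldflags="-X main.version=<value>".
// The "dev" default is an assumption for builds that do not pass the flag.
var version = "dev"

// versionString formats the value for display (hypothetical helper).
func versionString() string {
	return "dmap version " + version
}

func main() {
	fmt.Println(versionString())
}
```

Building this file without the flag leaves the default in place; building with the flag from the Dockerfile would embed the commit hash instead.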

classification/classification.go

classification/rego/ip_address.rego

classification/rego/labels.yaml

discovery/config/util.go

yoursnerdly · 2024-03-30T00:08:45Z

classification/label_classifier.go

+		if err != nil {
+			return nil, fmt.Errorf("error evaluating query for label %s: %w", lbl.Name, err)
+		}


Maybe you should log the error and move on - if there's a mistake in one of the rego classifiers, we can still use the others.

Good call. I tried to find a way of not doing this stuff at runtime and instead at compile time, but came up blank. If you have any ideas, LMK. We will need runtime parsing of classifier code to support custom labels in any case, but I'd prefer not having the possibility of us releasing a binary with potentially broken classifiers.

Well, at compile time you could do a "test" run with some input and check the output format. In theory, though, that still doesn't rule out the possibility that the rego code will give output in some other format for some other inputs.
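A rough sketch of the "log the error and move on" suggestion. The classifier type below is a hypothetical stand-in for a label's compiled Rego query, and the standard log package stands in for whatever logger the project uses; this is not the code from label_classifier.go.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// classifier is a hypothetical stand-in for one label's compiled Rego query.
type classifier struct {
	name string
	eval func(input map[string]any) (bool, error)
}

// classify evaluates every classifier against the input. A classifier that
// fails is logged and skipped so the remaining ones can still be used.
func classify(classifiers []classifier, input map[string]any) []string {
	var matched []string
	for _, c := range classifiers {
		ok, err := c.eval(input)
		if err != nil {
			log.Printf("error evaluating query for label %s: %v", c.name, err)
			continue
		}
		if ok {
			matched = append(matched, c.name)
		}
	}
	return matched
}

// sampleClassifiers returns one deliberately broken and one working classifier.
func sampleClassifiers() []classifier {
	return []classifier{
		{name: "BROKEN", eval: func(map[string]any) (bool, error) {
			return false, errors.New("bad rule")
		}},
		{name: "ADDRESS", eval: func(in map[string]any) (bool, error) {
			_, ok := in["address2"]
			return ok, nil
		}},
	}
}

func main() {
	input := map[string]any{"address2": "1 Main St"}
	fmt.Println(classify(sampleClassifiers(), input))
}
```

The trade-off is the one raised in the thread: a broken classifier no longer aborts the run, but it also fails quietly unless someone reads the logs.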

classification/label.go

discovery/repository/metadata.go

discovery/repository/repository.go

testutil/mock/scanner.go

ccampo133 · 2024-04-04T16:16:51Z

@VictorGFM @yoursnerdly after a bunch of churn, I believe this is good to go, at least for the initial implementation. I put a lot of effort into minimizing the public API surface through a bunch of refactoring, but the code is largely the same. The main differences are that attributes are now represented as a path array (e.g. [db, schema, table, column]) and that the labels are reported along with the classifications.

VictorGFM

@ccampo133 The package organization and type definitions look really good! I left a few comments below for your consideration. I'll leave the approval to @yoursnerdly since he got the chance to take a look at the implementation in more detail.

scan/scanner.go

classification/label_classifier.go

VictorGFM · 2024-04-04T19:05:32Z

sql/mysql.go

@@ -0,0 +1,93 @@
+package sql


What do you think about moving the Repository implementations to a separate package? Maybe a package within sql named repository. It seems easier to tell the repo implementations apart from the type definitions and utilities if they're in separate packages.

That was the original design actually. If you think it makes it clearer, I am happy to do that. I thought just having a single package was easier from an API consumer perspective, but if you don't think so, let's change it.

Actually what ends up happening is you get a circular dependency, which is difficult to avoid. For example, if we have the following package layout:

sql/
    scanner.go
    sample.go
    repository/
        mysql.go

The scanner depends on the repository package, but the repository package will depend on the sql package to use the Sample type, and thus you get the circular dependency. What ends up happening is that the only thing that can live in the sql package by itself is the Scanner type. Everything else needs to go in the repository package, which is annoying and sort of defeats the purpose.

WDYT?

Oh, I see the problem with the circular dependency now. In that case I think it's fine to keep it the way it is, in the sql package.

yoursnerdly

This is looking really good @ccampo133, thanks for the huge PR. Please look at my comments (mostly nitpicks).

classification/label.go

classification/label_classifier.go

sql/scanner.go

yoursnerdly · 2024-04-04T21:58:11Z

sql/scanner.go

+	// "databases", therefore a single repository instance will always scan the
+	// entire database.
+	if s.Config.RepoConfig.Database != "" || s.Config.RepoType == RepoTypeOracle {
+		samples, err = s.sampleDb(ctx, s.Config.RepoConfig.Database)


Oracle does support multiple databases (in a very confusing way) since version 12c: there is a root CDB and then multiple PDBs within it. I think we need to support those scenarios as well - perhaps in a future PR.

For now, we can tell the UI to expect no database name for Oracle (just schema, table, column), but if we later support PDBs etc., the attribute path may have 3 or 4 entries depending on the version. In the latter case, the first entry should be interpreted as the database name.

Thanks - we should add support for this in a future PR then. I will need to research how Oracle works to add support for it, or delegate it to somebody more experienced with Oracle.
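One way a consumer could handle the 3-or-4-entry attribute path described above, as a sketch only. The attribute struct and parseAttributePath are hypothetical names; the rule itself (4 entries means the first is the database name, 3 entries means no database) is taken from the comment.

```go
package main

import "fmt"

// attribute is a hypothetical decomposition of an attribute path.
// Database is empty when the path carries no database name.
type attribute struct {
	Database, Schema, Table, Column string
}

// parseAttributePath interprets a path of 3 entries as schema, table, column
// and a path of 4 entries as database, schema, table, column.
func parseAttributePath(path []string) (attribute, error) {
	switch len(path) {
	case 3:
		return attribute{Schema: path[0], Table: path[1], Column: path[2]}, nil
	case 4:
		return attribute{Database: path[0], Schema: path[1], Table: path[2], Column: path[3]}, nil
	default:
		return attribute{}, fmt.Errorf("unexpected attribute path length %d", len(path))
	}
}

func main() {
	a, err := parseAttributePath([]string{"postgres", "public", "doctors", "address2"})
	fmt.Println(a, err)
}
```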

discovery/config/config.go

classification/label_classifier.go

yoursnerdly

Thanks Chris, this looks good to me. Just a couple of nitpicks.

yoursnerdly · 2024-04-05T22:13:21Z

classification/label.go

 		}
-		rule, err := parseRego(string(b))
+		rule, err := readLabelRule(ruleFname, ruleFs)


Does this work if the path begins with ".."? Based on the documentation, it looks like it should, but just checking.

We should document somewhere that the relative paths in the yaml file are relative to the directory containing the yaml file itself (and not the current directory).

Thanks - it works now but there were actually some bugs around this. I added some test cases to cover them. The part about the path being relative to the file is documented in the embedded labels.yaml file as a header comment. I will also ensure this is documented in the public README, when it is updated.
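To illustrate the "relative to the directory containing the yaml file" rule, here is a small sketch using slash-separated paths from the standard path package. resolveRulePath is a hypothetical helper, not the readLabelRule function from the diff, and the second example path in main is made up.

```go
package main

import (
	"fmt"
	"path"
)

// resolveRulePath resolves a rule file name found in a labels YAML file
// relative to the directory containing that YAML file, not the current
// working directory. Absolute names are only cleaned.
func resolveRulePath(yamlPath, ruleFname string) string {
	if path.IsAbs(ruleFname) {
		return path.Clean(ruleFname)
	}
	// path.Join also cleans the result, so a leading ".." is collapsed.
	return path.Join(path.Dir(yamlPath), ruleFname)
}

func main() {
	fmt.Println(resolveRulePath("classification/rego/labels.yaml", "ip_address.rego"))
	fmt.Println(resolveRulePath("conf/labels.yaml", "../rules/custom.rego"))
}
```

If the rule files are read through an io/fs file system, the cleaned result matters: fs.ValidPath rejects names that contain ".." elements, so the path needs to be joined and cleaned before opening.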

yoursnerdly · 2024-04-05T22:19:00Z

sql/scanner.go

+		// successfully loaded.
+		var errs classification.InvalidLabelsError
+		if errors.As(err, &errs) {
+			log.WithError(errs).Warnf("%s: some labels were not loaded", errMsg)


Maybe we should return an error if len(lbls) == 0, since there is no point scanning the database if there are no labels.
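A sketch of what that check could look like next to the existing warning. invalidLabelsError is a local stand-in for classification.InvalidLabelsError, and checkLabels and errNoLabels are hypothetical names; labels are reduced to plain strings to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// invalidLabelsError is a local stand-in for classification.InvalidLabelsError.
type invalidLabelsError struct{ names []string }

func (e invalidLabelsError) Error() string {
	return fmt.Sprintf("invalid labels: %v", e.names)
}

// errNoLabels is a hypothetical sentinel for the "nothing to classify" case.
var errNoLabels = errors.New("no labels loaded; nothing to classify")

// checkLabels decides whether a scan should proceed after loading labels.
// Partial failures are logged as warnings; any other error is returned as is;
// an empty label set is an error because the scan could not classify anything.
func checkLabels(lbls []string, err error) error {
	if err != nil {
		var invalid invalidLabelsError
		if !errors.As(err, &invalid) {
			return err
		}
		log.Printf("some labels were not loaded: %v", invalid)
	}
	if len(lbls) == 0 {
		return errNoLabels
	}
	return nil
}

func main() {
	partial := invalidLabelsError{names: []string{"BAD_LABEL"}}
	fmt.Println(checkLabels([]string{"ADDRESS"}, partial))
	fmt.Println(checkLabels(nil, partial))
}
```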

yoursnerdly · 2024-04-05T22:20:54Z

golangci-lint seems to be running into some internal errors - will need to look into that.

ccampo133 changed the title ~~Initial discovery and classification CLI implementation~~ ENG-13633: Initial discovery and classification CLI implementation Mar 27, 2024

ccampo133 force-pushed the ENG-13633 branch 5 times, most recently from 3f0d8fe to 72dd28a Compare March 27, 2024 16:30

Initial discovery and classification CLI implementation

977618a

ccampo133 force-pushed the ENG-13633 branch from 72dd28a to 977618a Compare March 27, 2024 16:44

Simplfy classification

b5051e8

ccampo133 force-pushed the ENG-13633 branch 5 times, most recently from 5118b3e to 27aea6b Compare March 29, 2024 16:17

ccampo133 changed the title ~~ENG-13633: Initial discovery and classification CLI implementation~~ ENG-13633: Initial discovery and classification implementation Mar 29, 2024

Refactor

69c1731

Refactor

ccampo133 force-pushed the ENG-13633 branch 4 times, most recently from 54baeac to 9b629dd Compare March 29, 2024 17:33

Fix OPA tests

b6f658d

ccampo133 force-pushed the ENG-13633 branch from 9b629dd to b6f658d Compare March 29, 2024 17:55

ccampo133 requested review from VictorGFM and yoursnerdly March 29, 2024 18:23

ccampo133 marked this pull request as ready for review March 29, 2024 18:24

Cleanup

d44ab9b

ccampo133 force-pushed the ENG-13633 branch from 67d6c7a to d44ab9b Compare March 29, 2024 18:58

yoursnerdly reviewed Mar 30, 2024

OPA fmt + lint

ef635fd

ccampo133 force-pushed the ENG-13633 branch from 282c36a to a480a98 Compare April 1, 2024 16:34

ccampo133 force-pushed the ENG-13633 branch 7 times, most recently from c9dbf4b to 46cf074 Compare April 3, 2024 23:52

More refactoring

cef628e

ccampo133 force-pushed the ENG-13633 branch from 46cf074 to cef628e Compare April 4, 2024 00:12

ccampo133 commented Apr 4, 2024

ccampo133 force-pushed the ENG-13633 branch 2 times, most recently from 03e79f6 to 36885f3 Compare April 4, 2024 16:13

Reduce public API surface

f3a42ca

ccampo133 force-pushed the ENG-13633 branch from 36885f3 to f3a42ca Compare April 4, 2024 16:14

ccampo133 requested a review from yoursnerdly April 4, 2024 16:15

VictorGFM reviewed Apr 4, 2024

yoursnerdly reviewed Apr 4, 2024

PR comments

db43f9b

ccampo133 force-pushed the ENG-13633 branch 2 times, most recently from 854ac64 to 87676f2 Compare April 5, 2024 21:32

Add custom label support

87db7a6

ccampo133 force-pushed the ENG-13633 branch from 87676f2 to 87db7a6 Compare April 5, 2024 21:36

ccampo133 requested review from VictorGFM and yoursnerdly April 5, 2024 21:41

yoursnerdly approved these changes Apr 5, 2024

Fix relative path bugs.

c4628c8

ccampo133 merged commit f1d5fc0 into main Apr 8, 2024
3 checks passed

ccampo133 deleted the ENG-13633 branch April 8, 2024 16:28
