Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add license check transform #257

Merged

Conversation

shivdeep-singh-ibm
Copy link
Collaborator

Why are these changes needed?

Add license check filter

Related issue number (if any).

@daw3rd daw3rd added the enhancement New feature or request label Jun 10, 2024
@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_license_check_filter branch 2 times, most recently from 378a170 to b539226 Compare June 11, 2024 18:02
.make.versions Outdated Show resolved Hide resolved
transforms/code/license_check/python/Dockerfile Outdated Show resolved Hide resolved
.make.versions Outdated Show resolved Hide resolved
transforms/code/license_check/ray/Dockerfile Outdated Show resolved Hide resolved
@Bytes-Explorer
Copy link
Collaborator

@shivdeep-singh-ibm This module (folder) should be called license filtering

daw3rd

This comment was marked as resolved.

@revit13
Copy link
Collaborator

revit13 commented Jun 19, 2024

@shivdeep-singh-ibm it seems that the workflow- rules in transforms/code/license_check/Makefile needs to be changed (they refer to kfp_ray/v1 directory which does not exist). This causes the CI fail: Please see the Makefile in other code transforms to fix that.

make[4]: Leaving directory '/home/runner/work/data-prep-kit/data-prep-kit/transforms/code/ingest_2_parquet'
Using recursive workflow-build rule in license_check/
make[4]: Entering directory '/home/runner/work/data-prep-kit/data-prep-kit/transforms/code/license_check'
make -C kfp_ray/v1 workflow-build
make[5]: *** kfp_ray/v1: No such file or directory.  Stop.

@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_license_check_filter branch 3 times, most recently from 41bcc4f to d3d6b41 Compare June 26, 2024 07:53
transforms/code/license_check/python/pyproject.toml Outdated Show resolved Hide resolved
transforms/code/license_check/ray/pyproject.toml Outdated Show resolved Hide resolved
transforms/code/license_check/python/Makefile Outdated Show resolved Hide resolved
transforms/code/license_check/python/Makefile Outdated Show resolved Hide resolved
transforms/code/license_check/python/README.md Outdated Show resolved Hide resolved
transforms/code/license_check/python/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@Param-S Param-S left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor doc/readme comment. otherwise changes good to me.

@@ -0,0 +1,13 @@
# License Check

The License Check transform checks if the '`license` of input data is in approved/denied list as
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence looks incomplete. The transformer verifies and marks the sample accordingly in the output right? should we convey that in this description.

@daw3rd
Copy link
Member

daw3rd commented Jun 28, 2024

@shivdeep-singh-ibm suggest you merge dev into this branch given some recent issues with versioning. See #355

`--lc_license_column_name` - set the name of the column holds license to process
`--lc_allow_no_license` - allow entries with no associated license (default: false)
`--lc_licenses_file` - S3 or local path to allowed/denied licenses JSON file
`--lc_deny_licenses` - allow all licences except those in licenses_file (default: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you have your own DataAccessFactory, aren't there some lc_data params that would allow reading the license file from S3, for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A hidden feature, to see the --help text for a given transform

make .transforms.test-image-help

This should show the -lc_data_* options

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no additional data options used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, the code below from line 134 of this file will add the --lc_data_* parameters (specifically the last line below).

       # Create the DataAccessFactor to use CLI args
        self.daf = DataAccessFactory(CLI_PREFIX, False)
        # Add the DataAccessFactory parameters to the transform's configuration parameters.
        self.daf.add_input_params(parser)

transforms/code/license_check/python/pyproject.toml Outdated Show resolved Hide resolved
transforms/code/license_check/ray/pyproject.toml Outdated Show resolved Hide resolved
@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_license_check_filter branch 2 times, most recently from 3ffc512 to c43993e Compare July 15, 2024 10:34
Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addtion to my specific comments. a more global comment is to rename this to license_select which then matches proglang_select naming.

transforms/code/license_check/ray/Makefile Outdated Show resolved Hide resolved
@@ -0,0 +1,11 @@
# License Check

The License Check transform checks if the `license` of input data is in approved/denied list. It is implemented as per the set of [transform project conventions](../../README.md#transform-project-conventions) the following runtimes are available:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific and say it selects only the specified licenses and filters out the other licenses..

task_image = "quay.io/dataprep1/data-prep-kit/license_check-ray:0.4.0.dev6"

# components
base_kfp_image = "quay.io/dataprep1/data-prep-kit/kfp-data-processing:0.2.0.dev6"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you need to run make set-versions as these are old. Probably should be 0.2.1.dev0 everywhere


# distribution versions is the same as image version.
set-versions:
$(MAKE) TRANSFORM_PYTHON_VERSION=${LICENSE_CHECK_PYTHON_VERSION} .transforms.set-versions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now back to requiring TOML_VERSION. For example, from noop

$(MAKE) TRANSFORM_PYTHON_VERSION=$(NOOP_PYTHON_VERSION) TOML_VERSION=$(NOOP_PYTHON_VERSION) .transforms.set-versions ```

set-versions:
$(MAKE) TRANSFORM_PYTHON_VERSION=${LICENSE_CHECK_PYTHON_VERSION} .transforms.set-versions

build-dist:: set-versions .defaults.build-dist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove set-versions here. The convention now is that this generally only to be called from the top of the repo outside of the makefiles (unless you get out of sync, like you are in the the _wf.py file and maybe others)


## Summary

This filter scans the license column of an input dataset and appends the `license_status` column to the dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific and say it selects only the specified licenses and filters out the other licenses. (same comment as on the top level readme)

Then

ls output
To see results of the transform.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set-versions:
$(MAKE) TRANSFORM_PYTHON_VERSION=${LICENSE_CHECK_RAY_VERSION} .transforms.set-versions

build-dist:: set-versions .defaults.build-dist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove set-versions here

@daw3rd
Copy link
Member

daw3rd commented Jul 15, 2024

Also, not sure how far behind dev this is. It's probably a good idea to merge with dev

@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_license_check_filter branch 4 times, most recently from fed1461 to e4d6efa Compare July 26, 2024 04:19
EXEC_SCRIPT_NAME: str = "license_select_transform_ray.py"
PREFIX: str = ""

task_image = "quay.io/dataprep1/data-prep-kit/license_select-ray:0.4.0.dev6"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are now using latest tag on images

RUN pip install --no-cache-dir -e .

# Added to enable running test
RUN cp src/license_select_transform_python.py .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can/should be removed.


# Set environment
ENV PYTHONPATH /home/dpk
ENV PATH $PATH:/home/dpk/.local/bin/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why is this added?


TRANSFORM_NAME=license_select
# $(REPOROOT)/.make.versions file contains the versions
DOCKER_IMAGE_VERSION=${LICENSE_SELECT_PYTHON_VERSION}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOCKER_IMAGE_VERSION can be removed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shivdeep-singh-ibm can you please check this input & address it


TRANSFORM_NAME=license_select
# $(REPOROOT)/.make.versions file contains the versions
DOCKER_IMAGE_VERSION=${LICENSE_SELECT_RAY_VERSION}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOCKER_IMAGE_VERSION can be removed.

# copy test
COPY test/ test/
COPY test-data/ test-data/
ENV PATH $PATH:/home/dpk/.local/bin/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you are setting PATH?

@@ -0,0 +1,39 @@
FROM docker.io/rayproject/ray:2.9.3-py310
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first lines of this file need to be updated. From code2parquet, for example

ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310

FROM ${BASE_IMAGE}

RUN pip install --upgrade --no-cache-dir pip 

# install pytest
RUN pip install --no-cache-dir pytest

RUN cd python-transform && pip install --no-cache-dir -e .

COPY --chown=ray:users python-transform lc_python
RUN cd lc_python && pip install .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this is a duplication of lines 12 and 13 above (the previous 2 statements)?

task_image = "quay.io/dataprep1/data-prep-kit/license_select-ray:0.4.0.dev6"

# components
base_kfp_image = "quay.io/dataprep1/data-prep-kit/kfp-data-processing:0.2.0.dev6"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this should be 0.2.1.dev0. you might try running make set-versions from ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops this should also be latest

}
}
```
license_column_name - The name of the column with licenses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a list of bullets. as is, it is wrapping into a single paragraph.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@Param-S Param-S left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Shivdeep Singh <[email protected]>
@shivdeep-singh-ibm shivdeep-singh-ibm merged commit 7a8a949 into IBM:dev Sep 30, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants