preprocess_molgan_database #139

Open. Wants to merge 1 commit into base: main
29 changes: 29 additions & 0 deletions utils/preprocess-molgan-database-plugin/.bumpversion.cfg
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
[bumpversion]
current_version = 0.1.0
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\-(?P<release>[a-z]+)(?P<dev>\d+))?
serialize =
    {major}.{minor}.{patch}-{release}{dev}
    {major}.{minor}.{patch}

[bumpversion:part:release]
optional_value = _
first_value = dev
values =
    dev
    _

[bumpversion:part:dev]

[bumpversion:file:pyproject.toml]
search = version = "{current_version}"
replace = version = "{new_version}"

[bumpversion:file:VERSION]

[bumpversion:file:README.md]

[bumpversion:file:plugin.json]

[bumpversion:file:src/polus/mm/utils/preprocess_molgan_database/__init__.py]
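
The `parse` pattern above can be checked outside bump2version. The following is a minimal sketch, assuming the pattern is applied to the whole version string with Python's `re` module; `parse_version` is a hypothetical helper and is not part of this PR.

```python
import re

# Same pattern as the `parse` option in .bumpversion.cfg, split over two lines.
VERSION_RE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(\-(?P<release>[a-z]+)(?P<dev>\d+))?"
)


def parse_version(text: str) -> dict:
    """Return the named parts of a version string (hypothetical helper)."""
    # fullmatch is an assumption made here so that trailing junk is rejected.
    match = VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised version: {text!r}")
    return match.groupdict()
```

With this pattern, a plain release such as `0.1.0` leaves `release` and `dev` unset, while a string such as `0.1.0-dev1` fills both, which corresponds to the two `serialize` forms.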
4 changes: 4 additions & 0 deletions utils/preprocess-molgan-database-plugin/.dockerignore
@@ -0,0 +1,4 @@
.venv
out
tests
__pycache__
1 change: 1 addition & 0 deletions utils/preprocess-molgan-database-plugin/.gitignore
@@ -0,0 +1 @@
poetry.lock
5 changes: 5 additions & 0 deletions utils/preprocess-molgan-database-plugin/CHANGELOG.md
@@ -0,0 +1,5 @@
# CHANGELOG

## 0.1.0

Initial release.
35 changes: 35 additions & 0 deletions utils/preprocess-molgan-database-plugin/Dockerfile
@@ -0,0 +1,35 @@
# docker build -f Dockerfile -t polusai/molgan-tool:0.1.0 .
FROM condaforge/mambaforge
# NOT mambaforge-pypy3 (rdkit is incompatible with pypy)

# RDKIT logging
ENV RDKIT_ERROR_LOGGING="OFF"

RUN apt-get update && apt-get install -y wget git

# Clone MolGAN
RUN git clone https://github.com/ndonyapour/MolGAN.git

# Build and install python bindings
# MolGAN was originally implemented with TensorFlow v1. TensorFlow v2 still supports v1
# functionality, but the current patch does not port the v1 API to the v2 API: it calls the
# legacy v1 API from the v2 package via "tf.compat.v1", so the code is essentially still v1.
# A true upgrade to v2 would require rewriting most of MolGAN, including model creation,
# data processing, and training.

RUN mamba install -y -c conda-forge rdkit "tensorflow<2.13" numpy scikit-learn xorg-libxrender

# Make sure rdkit is activated
RUN python -c "import rdkit"

# Train a Model
WORKDIR /MolGAN

# Download the gdb9 database
RUN bash data/download_dataset.sh data/gdb9.sdf data/NP_score.pkl.gz data/SA_score.pkl.gz

# Download the pretrained model
RUN wget -nv --no-clobber https://huggingface.co/ndonyapour/MolGAN/resolve/main/MolGAN_model.tar.gz && tar xvzf MolGAN_model.tar.gz
RUN mv MolGAN_model trained_models
RUN wget -nv --no-clobber https://huggingface.co/ndonyapour/MolGAN/resolve/main/data.pkl -O data/data.pkl
ADD Dockerfile .
13 changes: 13 additions & 0 deletions utils/preprocess-molgan-database-plugin/README.md
@@ -0,0 +1,13 @@
# preprocess_molgan_database (0.1.0)

MolGAN tool for generating small molecules

## Options

This plugin takes 2 input arguments and 1 output argument:

| Name | Description | I/O | Type | Default |
|------------------|-------------|--------|--------|------------|
| input_sdf_path | Path to the input file. Accepted formats: sdf | Input | File | system.sdf |
| output_data_path | Path to the output data file. Accepted formats: pkl. Example file: https://github.com/bioexcel/biobb_ml/raw/master/biobb_ml/test/reference/classification/ref_output_model_support_vector_machine.pkl | Input | string | system.pkl |
| output_data_path | Path to the output data file | Output | File | |
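
As a rough illustration of how the two inputs map to a command line, here is a minimal sketch. The script path and the `--input_sdf_path` / `--output_data_path` flags are taken from the CWL file in this PR; `build_command` is a hypothetical helper, and the command is only expected to work inside the plugin's container.

```python
# Script path as given by `baseCommand` in the plugin's CWL file.
SCRIPT = "/MolGAN/utils/sparse_molecular_dataset.py"


def build_command(input_sdf_path, output_data_path="system.pkl"):
    """Build the argument list for the preprocessing script (hypothetical helper)."""
    return [
        "python",
        SCRIPT,
        "--input_sdf_path",
        str(input_sdf_path),
        "--output_data_path",
        str(output_data_path),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since the script only exists in the image.
    print(" ".join(build_command("gdb9_5.sdf")))
```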
1 change: 1 addition & 0 deletions utils/preprocess-molgan-database-plugin/VERSION
@@ -0,0 +1 @@
0.1.0
44 changes: 44 additions & 0 deletions utils/preprocess-molgan-database-plugin/ict.yml
@@ -0,0 +1,44 @@
specVersion: "0.1.0"
name: preprocess_molgan_database
version: 0.1.0
container: preprocess-molgan-database-plugin
entrypoint:
title: preprocess_molgan_database
description: MolGAN tool for generating small molecules
author: Data Scientist
contact: [email protected]
repository:
documentation:
citation:

inputs:
- name: input_sdf_path
  required: true
  description: Path to the input file, Type File, File type input, Accepted formats sdf
  type: File
  defaultValue: system.sdf
  format:
    uri: edam:format_3814
- name: output_data_path
  required: true
  description: Path to the output data file, Type string, File type output, Accepted formats pkl, Example file https://github.com/bioexcel/biobb_ml/raw/master/biobb_ml/test/reference/classification/ref_output_model_support_vector_machine.pkl
  type: string
  defaultValue: system.pkl
  format:
    uri: edam:format_3653
outputs:
- name: output_data_path
  required: true
  description: Path to the output data file
  type: File
  format:
    uri: edam:format_3653
ui:
- key: inputs.input_sdf_path
  title: "input_sdf_path: "
  description: "Path to the input file, Type File, File type input, Accepted formats sdf"
  type: File
- key: inputs.output_data_path
  title: "output_data_path: "
  description: "Path to the output data file, Type string, File type output, Accepted formats pkl, Example file https://github.com/bioexcel/biobb_ml/raw/master/biobb_ml/test/reference/classification/ref_output_model_support_vector_machine.pkl"
  type: string
@@ -0,0 +1,55 @@
#!/usr/bin/env cwl-runner
cwlVersion: v1.0

class: CommandLineTool

label: MolGAN tool for generating small molecules

baseCommand: ["python", "/MolGAN/utils/sparse_molecular_dataset.py"]

hints:
  DockerRequirement:
    dockerPull: polusai/molgan-tool@sha256:e008e74170be12dcf50a936a417b8c330ccdebf7fe17abaa8fa2689dac210725

inputs:
  input_sdf_path:
    label: Path to the input file
    doc: |-
      Path to the input file
      Type: File
      File type: input
      Accepted formats: sdf
    type: File
    format: edam:format_3814 # sdf
    inputBinding:
      prefix: --input_sdf_path

  output_data_path:
    label: Path to the output data file
    doc: |-
      Path to the output data file
      Type: string
      File type: output
      Accepted formats: pkl
      Example file: https://github.com/bioexcel/biobb_ml/raw/master/biobb_ml/test/reference/classification/ref_output_model_support_vector_machine.pkl
    type: string
    format: edam:format_3653
    inputBinding:
      prefix: --output_data_path
    default: system.pkl

outputs:
  output_data_path:
    label: Path to the output data file
    doc: |-
      Path to the output data file
    type: File
    outputBinding:
      glob: $(inputs.output_data_path)
    format: edam:format_3653 # pkl

$namespaces:
  edam: https://edamontology.org/

$schemas:
- https://raw.githubusercontent.com/edamontology/edamontology/master/EDAM_dev.owl
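
A job file for this tool needs one entry per input. The sketch below builds such a job object in Python and writes it as JSON, which cwltool also accepts because JSON is a subset of YAML. `make_job_inputs` and `write_job_file` are hypothetical helpers; the `format` IRI is an assumption obtained by expanding the `edam` prefix with the `$namespaces` entry above.

```python
import json
from pathlib import Path

# Assumption: the edam prefix expanded with the $namespaces value from the CWL file.
EDAM_SDF = "https://edamontology.org/format_3814"


def make_job_inputs(sdf_path, output_name="system.pkl"):
    """Return a CWL job object for the tool's two inputs (hypothetical helper)."""
    return {
        "input_sdf_path": {
            "class": "File",
            "path": str(sdf_path),
            "format": EDAM_SDF,
        },
        # A plain string: the tool globs for this name to collect its output.
        "output_data_path": output_name,
    }


def write_job_file(job, destination):
    """Serialise the job object to a JSON file and return its path."""
    destination = Path(destination)
    destination.write_text(json.dumps(job, indent=2))
    return destination
```

The resulting file could then be passed as the second argument to `cwltool`, after the `.cwl` file.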
29 changes: 29 additions & 0 deletions utils/preprocess-molgan-database-plugin/pyproject.toml
@@ -0,0 +1,29 @@
[tool.poetry]
name = "polus-mm-utils-preprocess-molgan-database"
version = "0.1.0"
description = "MolGAN tool for generating small molecules"
authors = ["Data Scientist <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9,<3.12"
cwl-utils = "0.33"
cwltool = "3.1.20240404144621"

[tool.poetry.group.dev.dependencies]
bump2version = "^1.0.1"
pytest = "^7.4"
pytest-sugar = "^0.9.6"
pre-commit = "^3.2.1"
black = "^23.3.0"
mypy = "^1.1.1"
ruff = "^0.0.270"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.pytest.ini_options]
pythonpath = [
  "."
]
1 change: 1 addition & 0 deletions utils/preprocess-molgan-database-plugin/tests/__init__.py
@@ -0,0 +1 @@
"""Tests for preprocess_molgan_database."""
39 changes: 39 additions & 0 deletions utils/preprocess-molgan-database-plugin/tests/gdb9_5.sdf
@@ -0,0 +1,39 @@
gdb_1
RDKit 3D

1 0 0 0 0 0 0 0 0 0999 V2000
-0.0127 1.0858 0.0080 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
gdb_2
RDKit 3D

1 0 0 0 0 0 0 0 0 0999 V2000
-0.0404 1.0241 0.0626 N 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
gdb_3
RDKit 3D

1 0 0 0 0 0 0 0 0 0999 V2000
-0.0344 0.9775 0.0076 O 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
gdb_4
RDKit 3D

2 1 0 0 0 0 0 0 0 0999 V2000
0.5995 0.0000 1.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5995 0.0000 1.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0
M END
$$$$
gdb_5
RDKit 3D

2 1 0 0 0 0 0 0 0 0999 V2000
-0.0133 1.1325 0.0083 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0023 -0.0192 0.0019 N 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0
M END
$$$$
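
Each record in an SDF file ends with a `$$$$` line, so the test fixture above can be sanity-checked without RDKit. This is a minimal sketch using only the standard library; `split_sdf_records` and `record_titles` are hypothetical helpers and are not part of MolGAN or this PR.

```python
def split_sdf_records(text):
    """Split SDF text into records, using the `$$$$` delimiter line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def record_titles(text):
    """Return the first line (the molecule title) of every record."""
    return [rec[0].strip() if rec else "" for rec in split_sdf_records(text)]
```

Applied to the contents of `gdb9_5.sdf`, `record_titles` would be expected to list one title per `$$$$` delimiter.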
@@ -0,0 +1,25 @@
"""Tests for preprocess_molgan_database."""
import sys
from pathlib import Path

current_dir = Path(__file__).resolve().parent
target_dir = current_dir.parent.parent.parent / "cwl_utils"
sys.path.append(str(target_dir))

from cwl_utilities import call_cwltool # noqa: E402
from cwl_utilities import create_input_yaml # noqa: E402
from cwl_utilities import parse_cwl_arguments # noqa: E402


def test_preprocess_molgan_database() -> None:
    """Test preprocess_molgan_database."""
    cwl_file = Path("preprocess_molgan_database_0.1.0.cwl")
    input_to_props = parse_cwl_arguments(cwl_file)
    file_path_str = "gdb9_5.sdf"
    file_path = str(Path(__file__).resolve().parent / Path(file_path_str))
    input_to_props["input_sdf_path"]["path"] = file_path
    input_yaml_path = Path("preprocess_molgan_database_0.1.0.yml")

    create_input_yaml(input_to_props, input_yaml_path)
    call_cwltool(cwl_file, input_yaml_path)
    assert Path("system.pkl").exists()