The fragmentation process is optimised by using a Postgres database of previously fragmented data. This allows delta changes to be applied to an existing database rather than having to completely re-fragment the input files, speeding up the loading of extracts into the Neo4j database used by Fragnet Search.
For Kubernetes execution of the fragmentation process, refer to our fragmentor-ansible repository, which contains a playbook that launches the fragmentor processes as Pods within a suitably configured Kubernetes cluster.
Summary of Contents:
- Ansible playbooks to populate the database: standardisation, fragmentation and inchi creation.
- A playbook to extract datasets of single and combinations of vendors from the database for import into the Fragnet Search Neo4j database.
- A playbook to combine different extracted datasets (from, say, different databases) to create new Fragnet search Neo4j databases.
- Playbooks are also provided to create, start, stop, and backup the database.
The libraries currently supported are as follows:
- Xchem: dsip, probing library, spotfinder
- Molport
- Chemspace: bb
- Enamine: ro5
- Enamine: Ruben's extract (see Important Issues below)
- sdf (generic)
Further datasets are planned.
During the database reconstruction the Fragmentor was used in a slightly different way, highlighting two manual actions that must be performed in the process. These are discussed in detail in sc-2829: https://app.shortcut.com/informaticsmatters/story/2829/fragmentor-process-bug-fixes-and-necessary-changes
In short:
- Xchem standardisation python script: if you process the new Enamine library (or other XCHEM libraries), the xchem standardisation python script is used and the ID prefix defaults to "XCHEM", so you have to update the mol_source records to have the correct ID - the ID in the graph must be unique. For example:
BEGIN WORK;
UPDATE mol_source SET code = replace(code, 'XCHEM:', 'XCHEM-DISP') WHERE code LIKE 'XCHEM:%' AND source_id = 3;
SELECT * FROM mol_source WHERE source_id = 3 LIMIT 5;
SELECT count(*) FROM mol_source WHERE code LIKE 'XCHEM:%' AND source_id = 3;
COMMIT WORK;
- Combination of libraries with different source_ids for the same vendor: as Enamine was so large, it was split into 5 parts with the files run through the process individually. When these files are combined at the end (using site-combine), the supplier-nodes file has duplicate records that will cause problems in the graph. The simplest fix is to modify the supplier-nodes.csv file after it has been generated so that only the first of the Enamine supplier-node records ('REAL') is kept.
Notes
- Standardisation and fragmentation code is based on the Fragalysis repository.
- Nextflow scripts are used to control a cluster for the standardisation, fragmentation, inchi calculation and combine plays.
- The scripts contain parameter-controlled chunking of input files at various stages to control and optimise throughput to the SQL database. This can be tuned to the hardware/cluster.
- Processing normally starts and ends with an AWS S3 repository, which is assumed to contain the SMILES data from vendors to be imported into the process and is also the destination for the Neo4j-compatible extract files, from where they can be picked up by Fragnet Search.
The images are built by the Travis CI/CD process. Review the .travis.yml file for details. A convenient docker-compose file can be used to build the images for local experimentation.
For the production configuration, it is assumed that the user has access to a cluster and that the following instructions would be run on the head node.
It is always worth running the playbooks from within your own Python environment, to avoid disturbing the system Python: -
$ python3 -m venv ~/.venv/fragmentor
$ source ~/.venv/fragmentor/bin/activate
$ pip install --upgrade pip
And, depending on your distribution, you might need SELinux bindings...
$ pip install selinux
Install requirements: -
$ pip install -r requirements.txt
$ ansible-galaxy install -r requirements.yaml
Some ansible playbook postgres tasks require the postgres client (psql) to be installed on the head machine. Details can be found at https://www.postgresql.org/download/linux/redhat/
You will also need credentials for S3, so you will need to set up the following parameters. Regardless of whether you're actually using an AWS bucket or not, we use the standard AWS variables: -
$ export AWS_ACCESS_KEY_ID=<Access Key ID>
$ export AWS_SECRET_ACCESS_KEY=<Secret Access Key>
$ export AWS_DEFAULT_REGION=eu-central-1
If you're using an alternative provider's S3 store, just put the relevant details into the appropriate AWS variable. AWS_REGION can be blank (''), and for some providers it needs to be.
If you are using a non-AWS S3 bucket you will need to provide the S3 endpoint; leave this environment variable undefined if you are using AWS: -
$ export AWS_ENDPOINT_URL=<Non-AWS S3 Service Endpoint URL>
The playbooks will use psql commands to interact with the database, so you must ensure that the control host has access to psql. For Ubuntu this can be satisfied by installing an appropriate client: -
$ sudo apt install postgresql-client-common
$ sudo apt-get install postgresql-client -y
Now navigate to the project's ansible directory: -
$ cd ansible
Test connection to the hosts using ping. You will need to ensure that the user's ~/.ssh/id_rsa is set correctly (or use ssh-agent) so that Ansible can ssh to the servers.
If the following works you should be able to run the project playbooks: -
$ ansible -m ping all
If using the OpenStack cloud provider you will have to provide suitable production and backup servers - our playbooks do not create these machines.
If you want to deploy to OpenStack consult with us on server preparation. Some basic installation notes can be found in this repository's README-UBUNTU-DB-SERVER.md file.
If you're using AWS a production server can be instantiated automatically.
You will need AWS credentials to allow general EC2 configuration, an AWS VPC and subnet, and a set of parameters (in parameters.yaml) to define the server flavour: -
---
db_server_state: present
aws_db_instance_type: t3a.2xlarge
db_volume_size_g: 10
database_cloud_provider: aws
db_shared_buffers_g: 4
db_max_parallel_workers: 8
aws_vpc_subnet_id: <CLUSTER_PUBLIC_SUBNET_ID>
aws_vpc_id: <CLUSTER_VPC_ID>
Now create the server: -
$ ansible-playbook site-db-server.yaml -e @parameters.yaml
Adjust your parameters so that they include the address of the database server. You'll need this with other plays. The server's IP address is printed by the above play: -
TASK [db-server : Display DB server address (Private IP)] *****************
Thursday 22 October 2020 18:54:00 +0000 (0:00:00.048) 0:00:24.557 **
ok: [localhost] => {
"server_result.instances[0].private_ip": "10.0.0.192"
}
In this case you'd add the following to the parameter file: -
database_login_host: 10.0.0.192
Using the AWS console, wait for the database server instance to become ready (initialised) before trying to create the database.
You may need to adjust your ansible/inventory to provide the address of your database server, as the default we have provided is unlikely to be of any use to you. You may also need to alter ansible_user and ansible_ssh_private_key_file to match your installation. You will also need to make matching changes to the inventory.yaml file.
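As an illustration only (the host alias and user shown here are assumptions, not values from this repository), a minimal YAML inventory entry for the database server might look like this: -
---
all:
  hosts:
    dbserver:
      # Address printed by the site-db-server play (or your own server)
      ansible_host: 10.0.0.192
      # Adjust to the account and key Ansible should use to ssh in
      ansible_user: centos
      ansible_ssh_private_key_file: ~/.ssh/id_rsa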
From this point on you can create playbook parameters to match your installation in ansible/parameters.yaml. For example: -
deployment: production
database_cloud_provider: openstack
database_login_host: 130.246.81.18
configure_hardware: yes
The deployment parameter must be one of development or production, selecting either a separate local database instance or the production database instance.
The parameters.yaml file will be excluded by the project's .gitignore file.
The database is created in a docker container. Configuring the production database server (a destructive action) is done via an ansible playbook. This playbook also pre-loads the vendor_name table in the database with the vendor file types currently supported by the process.
$ ansible-playbook site-db-server-configure.yaml -e @parameters.yaml
You only really need to run the site-db-server-configure play once. It configures the server with Docker, runs the designated database image and then formats the initial DB.
Note that if there are automatic update jobs running that periodically stop servers (to install new software for example), it may be prudent to disable them and use alternative arrangements to prevent the database being brought down when in use.
Note that site-db-server-configure is primarily aimed at a production database instance. For development (local) a postgres container could be launched with a command something like: -
$ docker run --shm-size=1g --name psql \
-v /home/user/project/postgresql/data:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=1234 \
-p 5432:5432 \
-d postgres:alpine \
-c 'logging_collector=true'
In this case, an extract subdirectory must also be created within the data directory for the extract play to run successfully. This must have the correct permissions so that Ansible can write to it.
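For example, a minimal ad-hoc playbook could create it (a sketch only - the path must match the volume mapping used in your docker run command, and the mode shown is deliberately permissive and purely illustrative): -
---
- hosts: localhost
  tasks:
    - name: Ensure the extract directory exists and is writable
      ansible.builtin.file:
        path: /home/user/project/postgresql/data/extract
        state: directory
        mode: "0777"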
Example: navigate to the ansible directory
$ ansible-playbook site-db-server-configure_create-database.yaml \
-e deployment=development
For a production deployment, the ansible playbook site-standardise is configured to import vendor data files from a tree structure defined as either:
raw/<vendor>/<library>/<version>
or
raw/<vendor>/<version>
Examples:
- Data files for version v1 of the dsip library for vendor xchem should be placed in directory raw/xchem/dsip/v1
- Data files for version 2020-10 for vendor Molport (only one library) should be placed in directory raw/molport/2020-10
The ansible playbook site-extract is configured to export datasets to a tree structure defined as:
For Neo4j extracts for individual vendor libraries: -
extract/<vendor_library>/<version>
For Neo4j extracts for combinations of vendor libraries: -
combination/<first vendor_library>/<date>
Examples:
- The Neo4j extract for version v1 of the dsip library for vendor xchem is exported to directory xchem_dsip/v1
- The Neo4j extract for version 2020-10 for vendor Molport (only one library) is exported to directory molport/2020-10
- The Neo4j extract for a combination of the above libraries run on 1/1/2020 (assuming xchem is listed first in the request) is exported to directory combination/xchem_dsip/2020-01-01
The sequence diagram below shows the basic steps in the fragmentation process, including a fragmentation database called FairMolecules. The advantage of the database approach is that each time a new dataset of molecules is provided by the vendor, the relatively lightweight standardisation step must still be performed, but only new molecules have to go through the hardware-intensive fragmentation step.
- The process is run by an operator. The operator configures the process and places the input files in the correct location on AWS S3.
- The Controller is the head node where the fragmentor repository is installed. All Ansible playbooks are run from the head node.
- The Cluster is a Cluster-group used by nextflow for the standardisation, fragmentation and inchi key generation steps. Ansible handles these machines.
- FairMolecules is the postgres database containing standardisation and fragmentation data as well as indexes used by the extraction playbook. Extracted Neo4j datasets are uploaded from here back to S3 to complete the process.
- Once a new library has been added to the database, processing to extract and combine datasets is done via the extract and combine plays. See below for more details.
The process consists of three steps, described below:
Standardisation "normalises" the customer SMILES into a format suitable for on-going processing. It relies on files located in an S3 bucket as described above in Configuring the S3 Directory Structure.
Using Ansible variables: -
- unpacker identifies the task file (in roles/standardise/tasks) that will be used to unpack the raw data files. A value of decompress-gz-all will use the logic defined in the task file unpack-raw-decompress-gz-all.yaml.
- standinputfile identifies the names of the decompressed files to be standardised, and is typically a regular expression like s*.cxsmiles.
- standardiser identifies the Python module in the project frag/standardise/scripts directory that will interpret the raw data and produce the standardised file (based on the files collected by the standinputfile filename filter). It is responsible for parsing the decompressed input file, so the script and the file must be compatible.
These values are provided to the standardise play using a parameter file in the roles/standardise/vars directory. Its name is based on the vendor/library name; for example, the xchem_dsip library uses xchem_dsip-variables.yaml.
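As a minimal sketch of what such a file contains (this is an abridged form of the xchem_spot example shown in full later in this document, in the new-vendor steps): -
# roles/standardise/vars/xchem_spot-variables.yaml (abridged)
unpacker: decompress-gz-all
standardiser: frag.standardise.scripts.dsip.standardise_xchem_compounds
standinputfile: spotfinder.smi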
It's easier to run all our playbooks using a YAML-based parameter file, where all the main control variables can be set. A typical parameter file (parameters.yaml) might look like this: -
---
database_login_host: 130.246.214.154
deployment: production
runpath: /data/fragmentor/run-01
clean_start: yes
vendor: xchem_dsip
version: v1
extracts:
- lib:
vendor: xchem_dsip
version: v1
regenerate_index: yes
hardware:
production:
parallel_jobs: 20
postgres_jobs: 400
There is a template file, ansible/run-parameters.template, with all the above settings and more explanation that you can copy.
Armed with a parameter file, like the one shown above, the standardisation step is run as follows: -
$ ansible-playbook site-standardise.yaml -e @parameters.yaml
Armed with a parameter file, the fragmentation step is run as follows: -
$ ansible-playbook site-fragment.yaml -e @parameters.yaml
A fragmentation step would normally be processed directly after a standardisation step, but as this step is driven by the database it is possible to run multiple standardisation steps for a vendor/library followed by a single fragmentation step.
Optional parameter add_backup (default: no): if set to yes, the database is automatically backed up after the fragmentation play.
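For example, in your parameter file (or passed with -e on the command line): -
# Back up the database automatically once fragmentation has finished
add_backup: yes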
The Create Inchi step will identify any molecules that have not had inchi keys generated yet and process them. It is not vendor/library specific and can be run as follows:
$ ansible-playbook site-inchi.yaml -e @parameters.yaml
As this step is driven by the database it is possible to run multiple standardisation/fragmentation steps for different vendor/libraries followed by a single create inchi step.
Optional parameter add_backup (default: yes): if set to yes, the database is automatically backed up after the inchi play. Note that this defaults to yes because at this stage the data import for a new library would be complete.
The Extract Neo4j Dataset playbook will create a dataset exportable to Neo4j containing either a single vendor or a combination of vendors from information contained in the database. The export is based on parameters provided in a parameter file containing vendor(s) and version(s) in the following example format:
---
database_login_host: 130.246.214.154
deployment: production
runpath: /data/fragmentor/run-01
clean_start: yes
hardware:
production:
parallel_jobs: 20
extracts:
- lib:
vendor: enamine_ro5
version: jun2018
regenerate_index: no
- lib:
vendor: molport
version: 2020-02
regenerate_index: yes
The first time a library version is extracted, regenerate_index should be set to yes so that the index of edges for the latest library version can be regenerated. For subsequent runs (e.g. to extract combinations) it should be set to no for speed - for the larger vendors index generation can take a significant amount of time.
A template (extract-parameters.template) is provided for this file.
The command is:
$ ansible-playbook site-extract.yaml -e @parameters.yaml
Note that for the larger extracts to complete there needs to be sufficient temporary space on the postgres pgdata directory for the database queries to complete. For the complete extract including Enamine and Molport, for example, around 900GB of temporary space is required.
This playbook can be used to combine existing Neo4j datasets to produce one
new Neo4j dataset. For example, if datasets were produced using more than one
database, this playbook can combine them allowing a combination of
public/proprietary data. Datasets can be downloaded from up to two AWS S3
repositories or directly from disk. The export can either be to disk or saved
up to AWS S3. The combination is based on parameters provided in a parameter
file containing extract(s) and version(s) in the following example format, which writes results to the local filesystem (disk) rather than the default (s3):
---
deployment: production
hardware:
production:
parallel_jobs: 100
runpath: /share/fragmentor
combine:
- lib:
path: xchem_dsip
data_source: disk
- lib:
path: extract/xchem_spot/v1
data_source: s3
bucket: "{{ bucket_in_1 }}"
s3_access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
s3_secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
s3_region: "{{ lookup('env', 'AWS_DEFAULT_REGION') }}"
s3_url: "{{ lookup('env', 'AWS_ENDPOINT_URL') }}"
- lib:
path: extract/xchem_probe/v1
data_source: s3
bucket: "{{ bucket_in_2 }}"
s3_access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
s3_secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
s3_region: "{{ lookup('env', 'AWS_DEFAULT_REGION') }}"
s3_url: "{{ lookup('env', 'AWS_ENDPOINT_URL') }}"
# The path_out parameter defines the subdirectory that will be used
# for the output in "combinations".
path_out: xchem_combi_20200715
data_source_out: disk
There is a template file with all the above settings and more explanation in the file ansible/combine-parameters.template that you can copy.
Notes
The AWS access keys are provided as external parameters to the script in a similar way to the other playbooks. In the repository configuration, they are all set to AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but this mapping can be changed in the file roles/combine/defaults/main.yaml.
For data sourced from s3, the path is the full path to the dataset in the bucket rather than the vendor. This allows existing combinations to also be used as input sources.
For data sourced from disk, files are expected (by default) to be in a directory called extract (configurable) in the runpath. If data_source_out is set to disk, then the combined extract will be available in the combine directory on the runpath.
The command is:
$ ansible-playbook site-combine.yaml -e @parameters.yaml
The backup playbook can also be used in isolation from the other playbooks.
It uses pg_dumpall to create a zipped copy of the complete database.
The play will automatically clean up older backups - the number of backups retained defaults to 2, but can be adjusted per database (the backup_count parameter in group_vars/all.yaml).
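For example, to retain four backups you could override the default in your parameter file (the value here is purely illustrative): -
# Number of database backups to retain (default 2)
backup_count: 4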
Example:
$ ansible-playbook site-backup.yaml -e @parameters.yaml
A backed up database can be restored with the restore playbook.
It defaults to the latest saved backup. Other backups can be chosen by setting
the optional restore_file parameter to the name of the file in the hourly
folder.
Examples:
$ ansible-playbook site-restore.yaml -e @parameters.yaml
$ ansible-playbook site-restore.yaml \
-e @parameters.yaml \
-e restore_file=backup-2020-07-23T16:12:56Z-dumpall.sql.gz
Similarly there are playbooks to stop and start the database server. The postgres startup configuration items are set up in the start-database playbook.
$ ansible-playbook site-db-server-configure_stop-database.yaml \
-e @parameters.yaml
$ ansible-playbook site-db-server-configure_start-database.yaml \
-e @parameters.yaml
As an alternative to the pg_dump-based backup and restore plays, a simple backup play has also been created to copy the database volume to the backup volume on the DB server. It stops the database, copies the files and then restarts the database: -
Example: navigate to the ansible directory
$ ansible-playbook site-backup-copy.yaml -e @parameters.yaml
The file all.yaml contains the following parameters used to control the different steps of the process.
Hardware sizing is used to calculate the size of chunks in standardisation/fragmentation. It is a map of variables keyed on deployment, and should reflect the approximate number of CPUs available on the machine/cluster and how many parallel postgres jobs can safely be run. The second parameter is currently only used when building the index in the site-extract play.
Fragmentation will typically require cores that have access to at least 200MB of memory.
For example:
hardware:
development:
# Parallel_jobs is used in calculating chunk sizes for processing -
# the normal assumption is that twice as many jobs are created as cores
# are used. If this is less than cluster_cores, then not all the cluster
# will be used. It is usually safe to set the parallel_jobs and
# cluster_cores to the same value.
parallel_jobs: 8
# Cluster cores is used in the nextflow_config.
# It defines the size of the cluster to use.
cluster_cores: 8
# Sort memory defines the minimum memory required for deduplication/sorting.
sort_memory: 8 MB
# Number of connections to postgres - note that this should be less
# than max_worker_processes in start-database.yaml.
postgres_jobs: 6
production:
# 100 parallel jobs/cores should be sufficient to process all but the
# largest libraries (full enamine/molport load).
# For role combine, ideally this might be increased when combining
# large databases - otherwise there might be a timeout
parallel_jobs: 100
cluster_cores: 100
sort_memory: 8GB
postgres_jobs: 18
Vendor defaults.
A map of defaults based on the vendor and library. In general it should not be necessary to change these unless there is a significant change to the size/composition of vendor supplied libraries.
For example, xchem_dsip is set up as follows:
vendors:
xchem_dsip:
# Used for sizing timeouts and processing parameters
approx_vendor_molecules: 800
# Total time (in minutes) across all CPUs -
# Used for sizing timeouts and processing parameters
est_total_fragmentation_time: 10
# Minimum heavy atom count for extraction/fragmentation processing
fragminhac: 0
# Maximum heavy atom count for extraction/fragmentation processing
fraghac: 36
# Maximum frag cycles for fragmentation processing
fragmaxfrags: 12
# Limit for partial fragmentation processing (not operational)
fraglimit: 0
# Extract Playbook: Chunk of molecules to be processed before insert to index
# This is a sensitive value - settings for each vendor should be tuned.
# So the values below are set based on the number of edges per mol_source value
# and validated by testing.
indexchunksize: 100
# Total time (in minutes) to build index (will be divided by number of postgres_jobs)
index_build_time: 10
The indexchunksize is used in the build_index playbook to balance deduplication against record insertion; it indicates how many source molecules are processed in memory before a database commit. The most efficient value appears to depend on how "clean" the molecules are. For Molport and Enamine a value around 300 has been found to be efficient, whereas for Chemspace it was possible to increase it to 2000.
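For example, based on the figures above, the molport entry in group_vars/all.yaml might use a larger chunk size (all other molport settings are omitted here for brevity): -
vendors:
  molport:
    # Roughly how many source molecules are processed in memory
    # before each database commit during index building
    indexchunksize: 300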
If you are required to request server-side encryption for S3 bucket put operations you can set the variable bucket_requires_encryption: yes (or bucket_out_requires_encryption: yes during the combination play).
The numbers given below were achieved with the maximum fragmentation cycles parameter set to 12.
| Vendor/Lib | Version | Molecules | Nodes | Edges |
| --- | --- | --- | --- | --- |
| Xchem_dsip | v1 | 768 | 5099 | 14421 |
| Molport | 2020-02 | 7118865 | 104407052 | 582264651 |
| Chemspace_bb | December2019 | 17257752 | 27265866 | 111716670 |
| Enamine_ro5 | Jun2018 | 39765321 | 178240230 | 1130306251 |
| Xchem_spot | v1 | 96 | 576 | 1388 |
| Xchem_probe | v1 | 239 | 857 | 2396 |
| Full Combination | 20200623 | --- | 307499949 | 1754938701 |
Note that the full combination de-duplicates nodes and edges found in more than one library.
The combined datasets resulted in a postgres database size of approximately 1.5TB. Disk space must, however, allow for future increases and for temporary workspace for queries so a minimum of 3TB is recommended for the main and backup directories.
The diagram below shows the FairMolecules database schema:
Tables beginning with "i_" (not shown) are used in the loading process and tables beginning with "o_" are used in the extract process.
The following steps summarize the changes required to add a new vendor library to the fragmentor ansible process:
- The new vendor/library details must be added to the vendor_name table in the database. The load script in the site-configure role (configure/files/p10_load_config_data.sql) should be updated to reflect this. For example:
insert into vendor_name (vendor_name, currency, supplier_node_name, supplier_node_label) values ('xchem_spot', NULL, 'Xchem','V_XSPOT');
- Folder changes must be made to add the new library in the data directory of the repository (subsequently matched on the S3 directory), matching the structure given above. Note that the input file should be compressed with gzip and, if using an existing format, the columns/header line must match the existing format. For example:
xchem/spot/v1/spotfinder.smi.gz
- Group vars. The overall configuration in ansible/group_vars/all.yaml requires a configuration for each vendor/library. For example:
xchem_spot:
approx_vendor_molecules: 100
est_total_fragmentation_time: 10
fragminhac: 0
fraghac: 36
fragmaxfrags: 12
fraglimit: 0
indexchunksize: 100
index_build_time: 10
- Add a new vendor/library specific configuration file to the standardise role to identify the input file format, python script, upload table and copy columns that will be used for standardisation. Note that the name should begin with the vendor/library identifier. For example standardise/vars/xchem_spot-variables.yaml contains:
# The ansible unpack tasks - can be vendor specific
unpacker: decompress-gz-all
# Python script used to standardise the molecules
standardiser: frag.standardise.scripts.dsip.standardise_xchem_compounds
# Input file template (unzipped) expected by standardiser.
# If there are multiple files this can be a glob.
standinputfile: spotfinder.smi
# Upload table template to match fields in standardised compounds.
standardise_copy_table: i_mols_dsip
# Fields to upload to table (used in copy statement).
standardise_copy_columns: osmiles,isosmiles,nonisosmiles,hac,cmpd_id
- Add a new vendor/library specific yaml task file to unpack the raw data if special processing is required (see chemspace for an example). If this is an existing format then the standard unpacker is referred to (as above).
- If the new vendor input file is the same as a current format, then an existing standardisation python script can be used. If the layout varies then more customisation is required. Please see below for more details on this.
- Add a new create script for creating the i_mols table (ansible/standardise/files/vendor/f40_create_stand_database.sql) and a new load script (ansible/standardise/files/vendor/f40_load_standardised_data.sql) to upload this data into the database. If a new python script has been written, then please see the customisation section below. Otherwise, if this is an existing format then the script for that format can be copied. For example, spot matches dsip:
-- Create Fragmentation Database SQL Statements for Library xchem_spot
-- Purpose: Creates fragmentation company specific tables. This is the same as the xchem_dsip library.
DROP TABLE IF EXISTS i_mols_dsip;
-- Create i_mols
CREATE TABLE i_mols_dsip (
osmiles TEXT,
isosmiles TEXT,
nonisosmiles TEXT,
hac SMALLINT,
cmpd_id TEXT,
isomol_id bigint,
nonisomol_id bigint
);
- The site-fragment playbook should not require any specific changes apart from the fragmentation configuration parameters in all.yaml.
- The site-inchi playbook should not require any specific changes.
- The site-extract playbook should not require any specific changes.
- The site-combine playbook should not require any specific changes.
- It is recommended to backup the "production" database using the site-backup play.
- Update the "production" database to add the new vendor_name records.
- Upload the input files to AWS S3 in the correct file location so they can be found by the fragmentor playbooks (see above).
- Pull the updated repository from the head node.
Some vendors (for example Molport and Chemspace) contain additional information in the input file, usually in the form of pricing information. Pricing information can be provided in different formats, so a flexible database layout has been provided. However, additional steps are required:
- A specific standardisation module will need to be written to be able to read the additional columns. The standardisation python scripts are located in frag/standardise/scripts/(vendor). They are all written in similar formats with the same input parameters as these are used in the ansible playbook.
- Any additional columns added are passed through as new tab-separated fields in the staging file standardised-compounds.tab.
- Note also that adding a new python module requires the git repo to be tagged so that the nextflow processes can find the repository and download it to the cluster. The nextflow_container_tag parameter in ansible/group_vars/all.yaml has to be updated with this tag (see the sketch after this list).
- A dedicated i_mols_(vendor) table will have to be created in f40_create_stand_database.sql to allow for the new columns in standardised-compounds.tab. Similarly, the columns will have to be added to the parameters and f40_load_standardised_data.sql adapted. If the new fields do contain pricing information, then an SQL insert statement will have to be coded to add this data to the price table. The f40_load_standardised_data.sql scripts for Chemspace and Molport are two examples of how this might be accomplished.
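As a sketch of the tagging step mentioned above (the tag value here is purely illustrative - use the tag you created for your change): -
# ansible/group_vars/all.yaml
# Git tag of this repository that the nextflow processes should download
nextflow_container_tag: "1.0.0"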
If an existing Vendor Library supports your fragmentation needs you can provide (over-ride) the built-in parameters by supplying your own. You will need to: -
- Site your data appropriately
- Provide parameters that over-ride the built-in values for your source file-type
- Verify that the database is sufficiently configured to deal with your fragmentation needs
The following example illustrates how you might utilise SDF fragment processing for an SDF file of your choice.
Place your SDF files in your AWS bucket using the path raw/vendor/sdf/<version>/<standinputfile>.gz. The version can be any string, e.g. v1 - it's just a string (with no '_').
When running playbooks to process your data, for example standardising your custom SDF file, you may need to over-ride the built-in defaults for the file type being processed. You may have an especially small file or an especially large one. If so, you may want to adjust the parameters that determine the number of sub-processes that will run and the timeouts that will be used.
The built-in group_vars/all.yaml has documentation for the parameters in its vendors/xchem_dsip variable. You will need to provide a vendors map and some additional variables.
A set of parameters for SDF might be defined like this: -
---
vendors:
sdf:
approx_vendor_molecules: 50
est_total_fragmentation_time: 10
fragminhac: 0
fraghac: 36
fragmaxfrags: 12
fraglimit: 0
indexchunksize: 100
index_build_time: 10
# Input file template (unzipped) expected by standardiser.
# It can be a name of a file or, if there are multiple files, it can be a glob.
standinputfile: test.sdf
# Compound Prefix used in creating the compound-id's.
# For the other vendors this is hardcoded as the company name.
standard_compound_id_prefix: SDF
# Field in SDF file to be used for compound id.
standard_compound_id_field: mr_id
By default the fragmentation process uses approx_vendor_molecules and est_total_fragmentation_time (minutes) to estimate the time that should be allowed for the nextflow fragmentation process to run. The process is run asynchronously in Ansible, so the playbook needs to estimate the time to allow it to run, polling the process exit code regularly, so that the playbook does not end prematurely.
Estimating the fragmentation time is critical to the process as the playbook is not written to run forever - it expects the process to complete at some point.
Internally the estimated time is calculated from the following equation: -
estimate (minutes) =
safety_factor x
(est_total_fragmentation_time / parallel_jobs) x
(actual_vendor_molecules / approx_vendor_molecules)
Where safety_factor has a default of 3.
So, to get a good estimate you have to have some idea of how long it would take to fragment all the molecules (if you had just one core); that's your est_total_fragmentation_time.
If you have 500 cores (parallel_jobs), an estimated total time of two days (2880 minutes) and an estimate of 1 million molecules, but the fragmentor finds 800,000, the estimated run time (for the nextflow fragmentation step) will be: -
3 x (2880 / 500) x (800000 / 1000000)
(about 14 minutes)
This is a very simple example.
If all else fails, and you simply want to specify the fragmentation time yourself (for the nextflow process) you can override the calculation with the following: -
nextflow_timeout_minutes: 240
force_nextflow_timeout_minutes: yes
The nextflow_timeout_minutes will be used as a lower limit, regardless of whether you force it or not. If the calculation results in a value less than nextflow_timeout_minutes then nextflow_timeout_minutes will be used.
Create the environment:
$ conda env create -f environment.yml
Activate the environment:
$ conda activate fragmentor
Remove the environment:
$ conda env remove --name fragmentor
# Building the "player" image
The player image, produced by Dockerfile-player, is used by the fragmentor-ansible repository to execute the playbooks in this repository using an image of this repository's content in Kubernetes.
To build the player image you will need to build and push the image from this dockerfile: -
$ docker build . -f Dockerfile-player \
-t informaticsmatters/fragmentor-player:latest \
--push
Or, to build an Intel image on Apple/ARM silicon using buildx: -
$ docker buildx build . -f Dockerfile-player \
--platform linux/amd64 \
-t informaticsmatters/fragmentor-player:latest \
--push