Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BasicResultCollection.from_server crashes if an entry is missing a CMILES #299

Open
amcisaac opened this issue Oct 3, 2024 · 9 comments

Comments

@amcisaac
Copy link

amcisaac commented Oct 3, 2024

When downloading a BasicResultCollection using BasicResultCollection.from_server(), it crashes if it encounters an entry that is missing a CMILES, with the error:

Traceback (most recent call last):
  File "/dfs6/pub/amcisaac/sage-2.2.1-sdata/03_generate-initial-ff/create-msm-ff.py", line 284, in <module>
    main()
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dfs6/pub/amcisaac/sage-2.2.1-sdata/03_generate-initial-ff/create-msm-ff.py", line 133, in main
    hess_dataset = BasicResultCollection.from_server(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/openff/qcsubmit/results/results.py", line 365, in from_server
    return cls.from_datasets(
           ^^^^^^^^^^^^^^^^^^
  File "/data/homezvol3/amcisaac/miniconda3/envs/sep-2024-env/lib/python3.11/site-packages/openff/qcsubmit/results/results.py", line 324, in from_datasets
    cmiles = entry.attributes[
             ^^^^^^^^^^^^^^^^^
KeyError: 'canonical_isomeric_explicit_hydrogen_mapped_smiles'

Reproducing example:

from qcportal import PortalClient
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")
test_hess_dataset = BasicResultCollection.from_server(
    client=client,
    datasets=['OpenFF Gen 2 Opt Set 1 Roche'],
    spec_name="spec_1",
)

It looks like it's ultimately coming from BasicResultCollection.from_dataset():

            for entry_name, spec_name, record in dataset.iterate_records(
                specification_names=spec_name, status=RecordStatusEnum.complete
            ):
                entry = dataset.get_entry(entry_name)
                molecule = entry.molecule

                cmiles = (
                    molecule.identifiers.canonical_isomeric_explicit_hydrogen_mapped_smiles
                )
                if not cmiles:
                    cmiles = molecule.extras.get(
                        "canonical_isomeric_explicit_hydrogen_mapped_smiles"
                    )
                if not cmiles:
                    cmiles = entry.attributes[
                        "canonical_isomeric_explicit_hydrogen_mapped_smiles"
                    ]
                if not cmiles:
                    print(f"MISSING CMILES! entry = {entry_name}")
                    continue

the error is raised by the second if statement, so the third if statement (which should apply) is never reached.

Replacing the statement with

                if not cmiles:
                    cmiles = entry.attributes.get(
                        "canonical_isomeric_explicit_hydrogen_mapped_smiles"
                    )

per @ntBre 's suggestion produces what I believe to be the intended outcome (e.g. it goes to the third statement, and prints "MISSING CMILES!", and skips the entry.

However, a related issue is that I believe a number of our older datasets don't have CMILES for any entries, the two I checked were OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy and OpenFF Gen 2 Opt Set 1 Roche. It may be ideal to allow datasets to be loaded without CMILES.

@j-wags
Copy link
Member

j-wags commented Oct 4, 2024

Hm, the CMILES are there, maybe the older datasets just have them in a place that QCSubmit isn't looking:

from qcportal import PortalClient

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")
dataset = client.get_dataset("optimization", "OpenFF Gen 2 Opt Set 1 Roche")
for entry_name, spec_name, record in dataset.iterate_records():
    entry = dataset.get_entry(entry_name)
    print(entry.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"])

returns

[H:7][C:1]1([C:2]([C:3]1([C:4]([H:11])([H:12])[O:6][H:14])[O:5][H:13])([H:9])[H:10])[H:8]
[H:7][C:1]1([C:2]([C:3]1([C:4]([H:11])([H:12])[O:6][H:14])[O:5][H:13])([H:9])[H:10])[H:8]
[H:7][C:1]1([C:2]([C:3]1([C:4]([H:11])([H:12])[O:6][H:14])[O:5][H:13])([H:9])[H:10])[H:8]
[H:15][c:1]1[c:2]([c:6]2[c:7](=[O:14])[c:3]([c:5]([n:13]([c:8]2[n:12][c:4]1[H:18])[C:11]3([C:9]([C:10]3([H:22])[H:23])([H:20])[H:21])[H:24])[H:19])[H:17])[H:16]
[H:13][c:1]1[c:2]([c:4]([c:6]2[c:5]([c:3]1[H:15])[C:7](=[O:12])[N:11]([C:10]([C:9]([C:8]2([H:17])[H:18])([H:19])[H:20])([H:21])[H:22])[H:23])[H:16])[H:14]

I'm seeing the same with dataset = client.get_dataset("optimization", "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy") - The entries DO have CMILES, QCSubmit is just not looking in the right place (?) to find them.

@ntBre would you be up to try fixing the CMILES lookups in these datasets in a PR? It would also be nice to fix the conditional cascade so that entries with missing CMILES really do get skipped, instead or erroring out (but that's of secondard importance)

@ntBre
Copy link
Contributor

ntBre commented Oct 4, 2024

Very weird, it looks like qcsubmit is also accessing entry.attributes, but yes, happy to take a stab at fixing both of these!

@amcisaac
Copy link
Author

amcisaac commented Oct 4, 2024

Hm, unfortunately it looks like that's because you specified "optimization" for the dataset type. For the single points, they aren't there:

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")
dataset = client.get_dataset("singlepoint", "OpenFF Gen 2 Opt Set 1 Roche")
for entry_name, spec_name, record in dataset.iterate_records():
    entry = dataset.get_entry(entry_name)
    print(entry.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"])

throws

KeyError                                  Traceback (most recent call last)
Cell In[95], line 5
      3 for entry_name, spec_name, record in dataset.iterate_records():
      4     entry = dataset.get_entry(entry_name)
----> 5     print(entry.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"])

KeyError: 'canonical_isomeric_explicit_hydrogen_mapped_smiles'

@amcisaac
Copy link
Author

amcisaac commented Oct 4, 2024

It does look like

entry.attributes.get("canonical_isomeric_explicit_hydrogen_mapped_smiles")

returns None if there is no CMILES and the CMILES if there is a CMILES, so makes sense as a workaround. Alternatively, a try/except block should work

@ntBre
Copy link
Contributor

ntBre commented Oct 4, 2024

Good catch Lexie. As you said, get returning None is a helpful intermediate step, but that crashes a few lines later when cmiles=None gets passed to the BasicResult constructor:

from openff.qcsubmit.results import BasicResult
BasicResult(record_id=0, cmiles=None, inchi_key="inchikey")

# ValidationError: 1 validation error for BasicResult
# cmiles
#   none is not an allowed value (type=type_error.none.not_allowed)

The inchi_key is also constructed from the Molecule constructed from the cmiles, so we can't really initialize a BasicResult at all if we can't locate some kind of SMILES in the record. Similarly, the BasicResult.cmiles field is used later in BasicResultCollection.to_records, for example, so a dummy SMILES that passes validation wouldn't help either.

The only SMILES I've found in the record so far is in the entry.name field, but they aren't canonical or isomeric or ..., and they are also suffixed by a dash and number, I guess the conformer number for entries with the same SMILES in a single dataset.

from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")
ds = client.get_dataset("singlepoint", 'OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy')
entries = [e for e in ds.iterate_entries()]
entries[0].name

# 'CC(=O)N(C)[C@@H](c1cccnc1)C(=O)NC-0'

In short, I can definitely fix the crash by fixing the conditional cascade as @j-wags said, but I don't think there's another field to pull a valid CMILES from on these records, so the final result collection will still be empty for these datasets.

@j-wags
Copy link
Member

j-wags commented Oct 4, 2024

Ahhhh, you're right @amcisaac and @ntBre - My mistake. I didn't realize that the same dataset could be accessed as an optimization and a singlepoint at the same time, where only the optimization view can access the CMILES. That's a bit of a mess :-(

Ok, so the "error vs warning" thing would still be good to fix. And maybe if a dataset is 100% CMILES-less and is a single point, we could print a message to the user like "maybe you should try loading this as an optimization dataset".

Unrelated, but touching on a different point from the top post:

It may be ideal to allow datasets to be loaded without CMILES.

We've deliberately disallowed this. The CMILES contains essential info about the cheminformatics representation of the molecule that isn't stored in the QC representation (per-atom formal charges and per-bond bond orders). It's an interesting conceptual question why we need those at all, but lots of discussion years ago landed at "yes, for the purposes of making a force field, we absolutely need that information in our QC records". Tools like xyz2mol can do a decent job of guessing the chemical graph from a pure QC representation, but places where it makes errors would contaminate our dataset and so we decided to always store the chemical graph explicitly in CMILES.

@ntBre
Copy link
Contributor

ntBre commented Oct 4, 2024

One other observation, I think to_basic_result_collection works because it starts with the OptimizationResults and gets the molecule data from the record.final_molecule_id field, making the optimization record effectively the parent of the singlepoint record. Unfortunately I don't think there's an easy way to reverse this relationship, ie to retrieve the parent optimization step (which contains a valid CMILES) from the single-point record.

@amcisaac
Copy link
Author

amcisaac commented Oct 4, 2024

We've deliberately disallowed this.

That makes sense. I had been thinking there must be a way to generate the CMILES, but it sounds like it needs to be done from a topology-like representation before being converted to a QC representation.

Unfortunately I don't think there's an easy way to reverse this relationship

I think it would be possible to retrieve the CMILES from the OptimizationRecord, especially if they have the same molecule ID, but the only way I can think to do it requires converting both datasets to records in order to compare molecule ID. The to_records call would be expensive though. I'm not familiar enough with the QCPortal interface to know if there's a way to do it directly through there, but would be happy to look into it if we decide we want to pursue this direction. It sounds like this may be more trouble than it's worth.

@ntBre
Copy link
Contributor

ntBre commented Oct 4, 2024

With Lexie's help, I found a way to get the CMILES from the parent optimization dataset by

  1. Locating the parent optimization record with the same final_molecule_id
  2. Querying the server for datasets containing that optimization record
  3. Retrieving one of these datasets (there were three in this example)
  4. Searching through the dataset for the matching record
  5. Retrieving the DatasetEntry (distinct from the OptimizationRecord)
  6. Pulling the CMILES from the entry attributes
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")
ds = client.get_dataset(
    "singlepoint", "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy"
)
target = next(ds.iterate_entries())

# prints None
print(
    target.molecule.identifiers.canonical_isomeric_explicit_hydrogen_mapped_smiles
)

opt_record = next(
    client.query_optimizations(final_molecule_id=target.molecule.id)
)

# list of datasets containing this record, 3 in this case so take the first
opt_ds_dict = client.query_dataset_records(opt_record.id)[0]

opt_ds = client.get_dataset(
    opt_ds_dict["dataset_type"], opt_ds_dict["dataset_name"]
)

entry = None
for entry_name, _spec, record in opt_ds.iterate_records():
    if record.final_molecule_id == target.molecule.id:
        entry = opt_ds.get_entry(entry_name)
        break

print(
    entry.attributes.get("canonical_isomeric_explicit_hydrogen_mapped_smiles")
)

This feels pretty hacky to me (and likely expensive as Lexie said), so I kinda doubt we'd want to include this as default behavior in the case of a missing CMILES, but it is possible. It looks a little nicer encapsulated as a function, but it's still questionable.

def get_opt_entry(
    client: PortalClient, target: SinglepointDatasetEntry
) -> OptimizationDatasetEntry:
    opt_record = next(
        client.query_optimizations(final_molecule_id=target.molecule.id)
    )
    opt_ds_dict = client.query_dataset_records(opt_record.id)[0]
    opt_ds = client.get_dataset(
        opt_ds_dict["dataset_type"], opt_ds_dict["dataset_name"]
    )
    for entry_name, _spec, record in opt_ds.iterate_records():
        if record.final_molecule_id == target.molecule.id:
            return opt_ds.get_entry(entry_name)

    return None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants