BasicResultCollection.from_server crashes if an entry is missing a CMILES #299
Comments
Hm, the CMILES are there, maybe the older datasets just have them in a place that QCSubmit isn't looking:

    from qcportal import PortalClient
    client = PortalClient(address="https://api.qcarchive.molssi.org:443/")
    dataset = client.get_dataset("optimization", "OpenFF Gen 2 Opt Set 1 Roche")
    for entry_name, spec_name, record in dataset.iterate_records():
        entry = dataset.get_entry(entry_name)
        print(entry.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"])

returns
I'm seeing the same with
@ntBre would you be up to try fixing the CMILES lookups in these datasets in a PR? It would also be nice to fix the conditional cascade so that entries with missing CMILES really do get skipped, instead of erroring out (but that's of secondary importance) |
Very weird, it looks like qcsubmit is also accessing entry.attributes, but yes, happy to take a stab at fixing both of these! |
Hm, unfortunately it looks like that's because you specified
throws
|
It does look like
returns |
Good catch Lexie. As you said,

    from openff.qcsubmit.results import BasicResult
    BasicResult(record_id=0, cmiles=None, inchi_key="inchikey")
    # ValidationError: 1 validation error for BasicResult
    # cmiles
    #   none is not an allowed value (type=type_error.none.not_allowed)

The only SMILES I've found in the record so far is in the

    from qcportal import PortalClient
    client = PortalClient("https://api.qcarchive.molssi.org:443")
    ds = client.get_dataset("singlepoint", 'OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy')
    entries = [e for e in ds.iterate_entries()]
    entries[0].name
    # 'CC(=O)N(C)[C@@H](c1cccnc1)C(=O)NC-0'

In short, I can definitely fix the crash by fixing the conditional cascade as @j-wags said, but I don't think there's another field to pull a valid CMILES from on these records, so the final result collection will still be empty for these datasets. |
Ahhhh, you're right @amcisaac and @ntBre - My mistake. I didn't realize that the same dataset could be accessed as an
Ok, so the "error vs warning" thing would still be good to fix. And maybe if a dataset is 100% CMILES-less and is a single point, we could print a message to the user like "maybe you should try loading this as an optimization dataset".
Unrelated, but touching on a different point from the top post:
We've deliberately disallowed this. The CMILES contains essential info about the cheminformatics representation of the molecule that isn't stored in the QC representation (per-atom formal charges and per-bond bond orders). It's an interesting conceptual question why we need those at all, but lots of discussion years ago landed at "yes, for the purposes of making a force field, we absolutely need that information in our QC records". Tools like xyz2mol can do a decent job of guessing the chemical graph from a pure QC representation, but places where it makes errors would contaminate our dataset and so we decided to always store the chemical graph explicitly in CMILES. |
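The "maybe you should try loading this as an optimization dataset" hint suggested in this comment could look roughly like the sketch below. This is only an illustration under stated assumptions: `warn_if_all_cmiles_missing` is a hypothetical helper, not part of QCSubmit or qcportal, and it takes plain strings and a list of already-extracted CMILES values rather than real dataset objects.

```python
import warnings
from typing import Iterable, Optional


def warn_if_all_cmiles_missing(
    dataset_type: str,
    dataset_name: str,
    cmiles_values: Iterable[Optional[str]],
) -> bool:
    """Warn when every entry lacks a CMILES; return True if that is the case.

    Hypothetical helper for illustration only. `cmiles_values` holds one
    value per entry, with None (or "") where no CMILES was found.
    """
    values = list(cmiles_values)
    all_missing = len(values) > 0 and not any(values)
    if all_missing:
        message = f"No entry in '{dataset_name}' has a CMILES."
        if dataset_type == "singlepoint":
            # Wording of the hint follows the suggestion in the thread.
            message += (
                " Maybe you should try loading this as an optimization dataset."
            )
        warnings.warn(message, UserWarning, stacklevel=2)
    return all_missing
```

Emitting a warning rather than raising keeps the "error vs warning" behavior discussed above: the caller still gets a (possibly empty) collection, plus a pointer to what to try next.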
One other observation, I think |
That makes sense. I had been thinking there must be a way to generate the CMILES, but it sounds like it needs to be done from a topology-like representation before being converted to a QC representation.
I think it would be possible to retrieve the CMILES from the |
With Lexie's help, I found a way to get the CMILES from the parent optimization dataset by

    from qcportal import PortalClient
    client = PortalClient("https://api.qcarchive.molssi.org:443")
    ds = client.get_dataset(
        "singlepoint", "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy"
    )
    target = next(ds.iterate_entries())
    # prints None
    print(
        target.molecule.identifiers.canonical_isomeric_explicit_hydrogen_mapped_smiles
    )
    opt_record = next(
        client.query_optimizations(final_molecule_id=target.molecule.id)
    )
    # list of datasets containing this record, 3 in this case so take the first
    opt_ds_dict = client.query_dataset_records(opt_record.id)[0]
    opt_ds = client.get_dataset(
        opt_ds_dict["dataset_type"], opt_ds_dict["dataset_name"]
    )
    entry = None
    for entry_name, _spec, record in opt_ds.iterate_records():
        if record.final_molecule_id == target.molecule.id:
            entry = opt_ds.get_entry(entry_name)
            break
    print(
        entry.attributes.get("canonical_isomeric_explicit_hydrogen_mapped_smiles")
    )

This feels pretty hacky to me (and likely expensive as Lexie said), so I kinda doubt we'd want to include this as default behavior in the case of a missing CMILES, but it is possible. It looks a little nicer encapsulated as a function, but it's still questionable.

    def get_opt_entry(
        client: PortalClient, target: SinglepointDatasetEntry
    ) -> OptimizationDatasetEntry:
        opt_record = next(
            client.query_optimizations(final_molecule_id=target.molecule.id)
        )
        opt_ds_dict = client.query_dataset_records(opt_record.id)[0]
        opt_ds = client.get_dataset(
            opt_ds_dict["dataset_type"], opt_ds_dict["dataset_name"]
        )
        for entry_name, _spec, record in opt_ds.iterate_records():
            if record.final_molecule_id == target.molecule.id:
                return opt_ds.get_entry(entry_name)
        return None |
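If this lookup ever had to run for many singlepoint entries, one way to soften the cost might be to scan the parent optimization dataset once and build a lookup table, instead of iterating over all records for every target. The sketch below is an assumption-laden illustration, not QCSubmit or qcportal code: `index_entries_by_final_molecule` is a hypothetical helper, and it only assumes that the iterable yields `(entry_name, spec_name, record)` tuples whose records expose `final_molecule_id`, as in the snippet above.

```python
from typing import Any, Dict, Iterable, Tuple


def index_entries_by_final_molecule(
    records: Iterable[Tuple[str, str, Any]],
) -> Dict[int, str]:
    """Map final_molecule_id -> entry name in a single pass.

    Hypothetical helper for illustration only. `records` is assumed to
    yield (entry_name, spec_name, record) tuples.
    """
    index: Dict[int, str] = {}
    for entry_name, _spec, record in records:
        mol_id = getattr(record, "final_molecule_id", None)
        if mol_id is not None:
            # Keep the first entry seen for a given final molecule id.
            index.setdefault(mol_id, entry_name)
    return index
```

With such an index built once per optimization dataset, each singlepoint entry would need only a dictionary lookup on `target.molecule.id` followed by a single `get_entry` call. Whether that is worth the extra bookkeeping would depend on how many entries are missing a CMILES.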
When downloading a BasicResultCollection using BasicResultCollection.from_server(), it crashes if it encounters an entry that is missing a CMILES, with the error:
Reproducing example:
It looks like it's ultimately coming from BasicResultCollection.from_dataset():
the error is raised by the second if statement, so the third if statement (which should apply) is never reached.
Replacing the statement with
per @ntBre's suggestion produces what I believe to be the intended outcome (i.e. it goes to the third statement, prints "MISSING CMILES!", and skips the entry).
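To make the intended "skip instead of crash" behavior concrete, here is a minimal, self-contained sketch of a lookup cascade that never raises on a missing field. It is not the QCSubmit implementation: `FakeMolecule`, `FakeEntry`, `find_cmiles`, and `collect_cmiles` are hypothetical stand-ins, plain dictionaries are used in place of the real qcportal objects, and the order in which the locations are checked is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"


@dataclass
class FakeMolecule:
    """Hypothetical stand-in for a QC molecule; not the qcportal class."""
    extras: Optional[dict] = None
    identifiers: Optional[dict] = None


@dataclass
class FakeEntry:
    """Hypothetical stand-in for a dataset entry; not the qcportal class."""
    name: str
    molecule: FakeMolecule
    attributes: dict = field(default_factory=dict)


def find_cmiles(entry: FakeEntry) -> Optional[str]:
    """Try each candidate location in turn; return None if none has a CMILES."""
    candidates = (
        entry.attributes,
        entry.molecule.extras,
        entry.molecule.identifiers,
    )
    for mapping in candidates:
        # `mapping or {}` guards against a location that is None.
        cmiles = (mapping or {}).get(CMILES_KEY)
        if cmiles:
            return cmiles
    return None


def collect_cmiles(
    entries: Iterable[FakeEntry],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (found, skipped_names), skipping entries that have no CMILES."""
    found: Dict[str, str] = {}
    skipped: List[str] = []
    for entry in entries:
        cmiles = find_cmiles(entry)
        if cmiles is None:
            print(f"MISSING CMILES! entry = {entry.name}")
            skipped.append(entry.name)
            continue
        found[entry.name] = cmiles
    return found, skipped
```

The point of the sketch is only that each fallback is guarded, so an entry with no CMILES anywhere reaches the final "skip" branch instead of raising partway through the cascade.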
However, a related issue is that I believe a number of our older datasets don't have CMILES for any entries, the two I checked were OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy and OpenFF Gen 2 Opt Set 1 Roche. It may be ideal to allow datasets to be loaded without CMILES.