Added CREST encode and parser for conformer search directory parsing.
coltonbh committed Aug 13, 2024
1 parent 498fa72 commit 9300ac2
Showing 12 changed files with 511 additions and 15 deletions.
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -12,9 +12,12 @@
"natoms",
"nocuda",
"pathconf",
"psutil",
"qcel",
"qcio",
"qcparse",
"rotamer",
"rotamers",
"spinmult",
"tcin",
"tcout",
8 changes: 8 additions & 0 deletions CONTRIBUTING.md
@@ -29,3 +29,11 @@ See the `terachem.py` file for an overview.
- The `ParsedDataCollector` object only allows setting a particular data attribute once. If a second attempt is made it raises an `AttributeError`. This provides a sanity check that multiple parsers aren't trying to write to the same field and overwriting each other.
3. `parse` looks up the parsers for the `program` in the `parser_registry`. Parsers are registered by wrapping them with the `@parser` decorator found in `qcparse.parsers.utils`. The `@parser` decorator registers a parser with the registry under the program name of the module in which it is found, verifying that the `filetype` for which it is registered is supported by the `program` by checking `SupportedFileTypes` in the parser's module. It also registers whether a parser `must_succeed` which means an exception will be raised if this value is not found when attempting to parse a file. In order for parsers to properly register they must be imported, so make sure they are hoisted into the `qcparse.parsers.__init__` file.
4. `parse` executes all parsers for the given `filetype` and converts the `ParsedDataCollector` object passed to all the parsers into a final `SinglePointResults` object.
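
Reviewer note (illustrative, not part of the commit): the registration and set-once behavior described in the steps above can be sketched roughly as follows. This is a minimal stand-in and not qcparse's implementation; `SetOnceCollector`, `REGISTRY`, the simplified `parser` signature, and the toy `parse_energy` function are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List


class SetOnceCollector:
    """Hypothetical stand-in: refuses to overwrite an attribute once it is set."""

    def __setattr__(self, name: str, value: object) -> None:
        if name in self.__dict__:
            raise AttributeError(f"'{name}' was already set; refusing to overwrite.")
        super().__setattr__(name, value)


# Hypothetical registry: filetype -> list of registered parser functions
REGISTRY: Dict[str, List[Callable[[str, SetOnceCollector], None]]] = {}


def parser(filetype: str = "stdout"):
    """Decorator that records the wrapped function under the given filetype."""

    def decorator(func):
        REGISTRY.setdefault(filetype, []).append(func)
        return func

    return decorator


@parser()
def parse_energy(text: str, collector: SetOnceCollector) -> None:
    # Toy parser: expects a line such as "ENERGY: -76.4"
    for line in text.splitlines():
        if line.startswith("ENERGY:"):
            collector.energy = float(line.split(":", 1)[1])
            return


def parse(text: str, filetype: str = "stdout") -> SetOnceCollector:
    """Run every parser registered for the filetype against the same collector."""
    collector = SetOnceCollector()
    for func in REGISTRY.get(filetype, []):
        func(text, collector)
    return collector
```

Because registration happens when the decorator runs at import time, a parser module that is never imported never reaches the registry, which is why the text asks for parsers to be hoisted into the package `__init__`.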

## Publish the package

With all code merged to `master` and the latest code pulled down to your local machine, run:

```sh
python scripts/release.py x.x.x
```
12 changes: 5 additions & 7 deletions docs/dev-decisions.md
@@ -6,12 +6,10 @@

## UPDATED DESIGN DECISION:

-- I don't see a strong reason for making this package a standalone package that parses everything required for a `SinglePointOutput` object including input data, provenance data, xyz files, etc... While the original idea was to have a cli tool to run on TeraChem files, now that I've built my own data structures (`qcio`) and driver program (`qcop`), there's no reason to parse anything but `SinglePointResults` values because we should just be driving the programs with `qcop` and already have access to the input data. The code is far easier to maintain as only a results parser. The only downside would be walking into someone else's old data and wanting to slurp it all in, but perhaps there's no reason to build for that use case now... Just go with SIMPLE and keep the code maintainable.
+- I don't see a strong reason for making this package a standalone package that parses everything required for a `ProgramOutput` object including input data, provenance data, xyz files, etc... While the original idea was to have a cli tool to run on TeraChem files, now that I've built my own data structures (`qcio`) and driver program (`qcop`), there's no reason to parse anything but `SinglePointResults` values because we should just be driving the programs with `qcop` and already have access to the input data. The code is far easier to maintain as only a results parser. The only downside would be walking into someone else's old data and wanting to slurp it all in, but perhaps there's no reason to build for that use case now... Just go with SIMPLE and keep the code maintainable.

-## Publishing Checklist
+## Future Features

-- Update `CHANGELOG.md`
-- Bump version in `pyproject.toml`
-- Tag commit with a version and GitHub Actions will publish it to pypi if tag is on `master` branch.
-- `git push --tags`
-- `git push`
+- At some point it could be good to have a `parse_dir` function that parses the entire output directory of a program and returns the corresponding `Results` object. The `parse` function would still be used on individual files/output data; however, the `parse_dir` function would be the top-level function for collecting all results from a directory and turning them into structured data. Useful for:
+  - Parsing all CREST outputs, e.g., `crest_conformers.xyz` and `crest_rotamers.xyz` into a `ConformerSearchResults` object.
+  - Parsing data from other TeraChem output files besides just the `stdout`, e.g., converting the `c0` binary files into a `Wavefunction` object.
22 changes: 18 additions & 4 deletions poetry.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -11,7 +11,8 @@ homepage = "https://github.com/coltonbh/qcparse"
[tool.poetry.dependencies]
python = "^3.8"
pydantic = ">=2.0.0"
-qcio = ">=0.10.0"
+qcio = "^0.11.8"
+tomli-w = "^1.0.0"

[tool.poetry.group.dev.dependencies]
mypy = "^1.1.1"
83 changes: 83 additions & 0 deletions qcparse/encoders/crest.py
@@ -0,0 +1,83 @@
import copy
import os
from typing import Any, Dict

import tomli_w
from qcio import CalcType, ProgramInput

from qcparse.exceptions import EncoderError
from qcparse.models import NativeInput

SUPPORTED_CALCTYPES = {CalcType.conformer_search}


def encode(inp_obj: ProgramInput) -> NativeInput:
    """Translate a ProgramInput into CREST input files.

    Args:
        inp_obj: The qcio ProgramInput object for a computation.

    Returns:
        NativeInput with .input_file being a crest.toml file and .geometry_file the
            Structure's xyz file.
    """
    validate_input(inp_obj)
    struct_filename = "structure.xyz"

    return NativeInput(
        input_file=tomli_w.dumps(_to_toml_dict(inp_obj, struct_filename)),
        geometry_file=inp_obj.structure.to_xyz(),
        geometry_filename=struct_filename,
    )


def validate_input(inp_obj: ProgramInput):
    """Validate the input for CREST.

    Args:
        inp_obj: The qcio ProgramInput object for a computation.

    Raises:
        EncoderError: If the input is invalid.
    """
    # These values come from other parts of the ProgramInput and should not be set
    # in the keywords.
    non_allowed_keywords = ["charge", "uhf", "runtype"]
    for keyword in non_allowed_keywords:
        if keyword in inp_obj.keywords:
            raise EncoderError(
                f"{keyword} should not be set in keywords for CREST. It is already set "
                "on the Structure or ProgramInput elsewhere.",
            )


def _to_toml_dict(inp_obj: ProgramInput, struct_filename: str) -> Dict[str, Any]:
    """Convert a ProgramInput object to a dictionary in the CREST format of TOML.

    This function makes it easier to test for the correct TOML structure.
    """
    # Start with existing keywords
    toml_dict = copy.deepcopy(inp_obj.keywords)

    # Top level keywords
    # Logical cores was 10% faster than physical cores, so not using psutil
    toml_dict.setdefault("threads", os.cpu_count())
    toml_dict["input"] = struct_filename

    # TODO: May need to deal with non-covalent mode at some point
    toml_dict["runtype"] = "imtd-gc"

    # Calculation level keywords
    calculation = toml_dict.pop("calculation", {})
    calculation_level = calculation.pop("level", [])
    if len(calculation_level) == 0:
        calculation_level.append({})
    for level_dict in calculation_level:
        level_dict["method"] = inp_obj.model.method
        level_dict["charge"] = inp_obj.structure.charge
        level_dict["uhf"] = inp_obj.structure.multiplicity - 1

    calculation["level"] = calculation_level
    toml_dict["calculation"] = calculation

    return toml_dict
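
Reviewer note (illustrative, not part of the commit): to see the dictionary shape that `_to_toml_dict` builds without installing `qcio` or `tomli_w`, the same merging logic can be sketched with plain values. `to_toml_dict_sketch`, its parameters, the fixed `threads=4` default, and the sample keywords (`alpb`, `gfn2`) are assumptions made for this example only.

```python
import copy
from typing import Any, Dict


def to_toml_dict_sketch(
    keywords: Dict[str, Any],
    method: str,
    charge: int,
    multiplicity: int,
    struct_filename: str = "structure.xyz",
    threads: int = 4,  # the commit uses os.cpu_count(); fixed here for a stable result
) -> Dict[str, Any]:
    """Simplified stand-in for _to_toml_dict that takes plain values."""
    toml_dict = copy.deepcopy(keywords)  # never mutate the caller's keywords
    toml_dict.setdefault("threads", threads)
    toml_dict["input"] = struct_filename
    toml_dict["runtype"] = "imtd-gc"

    calculation = toml_dict.pop("calculation", {})
    levels = calculation.pop("level", [])
    if not levels:
        levels.append({})  # always emit at least one calculation level
    for level in levels:
        level["method"] = method
        level["charge"] = charge
        level["uhf"] = multiplicity - 1  # number of unpaired electrons

    calculation["level"] = levels
    toml_dict["calculation"] = calculation
    return toml_dict


user_keywords = {"calculation": {"level": [{"alpb": "water"}]}}
example = to_toml_dict_sketch(user_keywords, method="gfn2", charge=0, multiplicity=1)
```

User-supplied level settings survive the merge, while `method`, `charge`, and `uhf` are always taken from the structured input, which matches the reason `validate_input` rejects those keys at the top level of the keywords.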
3 changes: 2 additions & 1 deletion qcparse/main.py
@@ -94,7 +94,8 @@ def encode(inp_data: ProgramInput, program: str) -> NativeInput:
        A NativeInput object with the encoded input.

    Raises:
-        EncoderError: If the calctype is not supported by the program's encoder.
+        EncoderError: If the calctype is not supported by the program's encoder or the
+            input is invalid.
    """
    # Check that calctype is supported by the encoder
    encoder = import_module(f"qcparse.encoders.{program}")
2 changes: 1 addition & 1 deletion qcparse/models.py
@@ -163,7 +163,7 @@ class NativeInput(BaseModel):
    """Native input file data. Writing these files to disk should produce a valid input.

    Attributes:
-        input: input file for the program
+        input_file: input file for the program
        geometry: xyz file or other geometry file required for the calculation
        geometry_filename: filename of the geometry file referenced in the input
    """
77 changes: 77 additions & 0 deletions qcparse/parsers/crest.py
@@ -1,3 +1,8 @@
from pathlib import Path
from typing import List, Optional, Union

from qcio import ConformerSearchResults, Structure

from .utils import regex_search


@@ -9,3 +14,75 @@ def parse_version_string(string: str) -> str:
    regex = r"Version (\d+\.\d+\.\d+),"
    match = regex_search(regex, string)
    return match.group(1)


def parse_structures(
    filename: Union[Path, str],
    charge: Optional[int] = None,
    multiplicity: Optional[int] = None,
) -> List[Structure]:
    """Parse Structures from a CREST multi-structure xyz file.

    CREST places an energy value in the comments line of each structure. This function
    collects all Structures from the file into Structure objects; the energy from each
    comment line is kept in the Structure's extras.

    Args:
        filename: The path to the multi-structure xyz file.
        charge: The charge of the structures.
        multiplicity: The multiplicity of the structures.

    Returns:
        A list of Structure objects.
    """
    try:
        structures = Structure.open(filename, charge=charge, multiplicity=multiplicity)
        if not isinstance(structures, list):  # single structure
            structures = [structures]
    except FileNotFoundError:
        structures = []  # No structures created
    return structures


def parse_conformer_search_dir(
    directory: Union[Path, str],
    *,
    charge: Optional[int] = None,
    multiplicity: Optional[int] = None,
    collect_rotamers: bool = True,
) -> ConformerSearchResults:
    """Parse the output directory of a CREST conformer search calculation.

    Args:
        directory: Path to the directory containing the CREST output files.
        charge: The charge of the structures.
        multiplicity: The multiplicity of the structures.
        collect_rotamers: Whether to parse rotamers as well as conformers.

    Returns:
        The parsed conformers, rotamers, and their energies as a ConformerSearchResults
            object.
    """
    directory = Path(directory)
    conformers = parse_structures(
        directory / "crest_conformers.xyz", charge=charge, multiplicity=multiplicity
    )

    # CREST places the energy as the only value in the comment line
    conf_energies = [conf.extras[Structure._xyz_comment_key][0] for conf in conformers]

    rotamers = []
    if collect_rotamers:
        rotamers = parse_structures(
            directory / "crest_rotamers.xyz", charge=charge, multiplicity=multiplicity
        )

    # CREST places the energy as the only value in the comment line
    rotamer_energies = [rot.extras[Structure._xyz_comment_key][0] for rot in rotamers]

    return ConformerSearchResults(
        conformers=conformers,
        conformer_energies=conf_energies,
        rotamers=rotamers,
        rotamer_energies=rotamer_energies,
    )
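
Reviewer note (illustrative, not part of the commit): the parser above relies on `Structure.open` from `qcio` to split the multi-structure xyz file and keep each comment line. The file layout it depends on (atom count, comment line holding the energy, then one row per atom, repeated per structure) can be sketched with the standard library alone. `read_multi_xyz` and the sample coordinates and energies below are made up for the example and are not qcio's implementation.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def read_multi_xyz(text: str) -> List[Tuple[float, List[Atom]]]:
    """Split a multi-structure xyz string into (energy, atoms) pairs."""
    lines = text.splitlines()
    frames: List[Tuple[float, List[Atom]]] = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # tolerate blank lines between frames
            i += 1
            continue
        natoms = int(lines[i])
        energy = float(lines[i + 1].split()[0])  # first token of the comment line
        atoms: List[Atom] = []
        for row in lines[i + 2 : i + 2 + natoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append((energy, atoms))
        i += 2 + natoms
    return frames


# Made-up two-frame sample in the layout described above
SAMPLE = """2
   -1.10000000
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
   -1.05000000
H 0.0 0.0 0.0
H 0.0 0.0 0.80
"""

frames = read_multi_xyz(SAMPLE)
energies = [energy for energy, _ in frames]
```

Keeping the energies in a list parallel to the structures mirrors how `conformer_energies` and `rotamer_energies` are passed alongside `conformers` and `rotamers` in the commit.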