Added CREST encoder and parser for conformer search directory parsing.

coltonbh · Aug 13, 2024 · 9300ac2
1 parent 498fa72
Showing 12 changed files with 511 additions and 15 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -12,9 +12,12 @@
         "natoms",
         "nocuda",
         "pathconf",
+        "psutil",
         "qcel",
         "qcio",
         "qcparse",
+        "rotamer",
+        "rotamers",
         "spinmult",
         "tcin",
         "tcout",

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -29,3 +29,11 @@ See the `terachem.py` file for an overview.
    - The `ParsedDataCollector` object only allows setting a particular data attribute once. If a second attempt is made it raises an `AttributeError`. This provides a sanity check that multiple parsers aren't trying to write to the same field and overwriting each other.
 3. `parse` looks up the parsers for the `program` in the `parser_registry`. Parsers are registered by wrapping them with the `@parser` decorator found in `qcparse.parsers.utils`. The `@parser` decorator registers a parser with the registry under the program name of the module in which it is found, verifying that the `filetype` for which it is registered is supported by the `program` by checking `SupportedFileTypes` in the parser's module. It also registers whether a parser `must_succeed` which means an exception will be raised if this value is not found when attempting to parse a file. In order for parsers to properly register they must be imported, so make sure they are hoisted into the `qcparse.parsers.__init__` file.
 4. `parse` executes all parsers for the given `filetype` and converts the `ParsedDataCollector` object passed to all the parsers into a final `SinglePointResults` object.
+
+## Publish the package
+
+With all code merged to `master` and the latest code pulled down to your local machine, run:
+
+```sh
+python scripts/release.py x.x.x
+```
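
The set-once collector and decorator-based registry described in the `CONTRIBUTING.md` context above can be illustrated with a minimal, self-contained sketch. This is not the `qcparse` implementation: `SetOnceCollector`, `REGISTRY`, and the explicit `program` argument to `parser` are hypothetical names chosen for illustration (the real decorator is described as inferring the program from the parser's module and checking `SupportedFileTypes`, which this sketch omits).

```python
from typing import Any, Callable, Dict, List


class SetOnceCollector:
    """Hypothetical stand-in for ParsedDataCollector: each attribute may be set once."""

    def __setattr__(self, name: str, value: Any) -> None:
        if name in self.__dict__:
            raise AttributeError(f"'{name}' was already set; refusing to overwrite it.")
        super().__setattr__(name, value)


# Hypothetical registry layout: program name -> filetype -> list of parser functions.
REGISTRY: Dict[str, Dict[str, List[Callable]]] = {}


def parser(program: str, filetype: str, must_succeed: bool = True):
    """Register the decorated function as a parser (illustrative signature only)."""

    def decorator(func: Callable) -> Callable:
        func.must_succeed = must_succeed
        REGISTRY.setdefault(program, {}).setdefault(filetype, []).append(func)
        return func

    return decorator


@parser("demo", "stdout")
def parse_energy(text: str, data: SetOnceCollector) -> None:
    # Each parser writes exactly one field onto the shared collector.
    data.energy = float(text.split("=")[1])


# Run every parser registered for the (program, filetype) pair on the same collector.
collector = SetOnceCollector()
for func in REGISTRY["demo"]["stdout"]:
    func("energy = -1.5", collector)
```

Because the collector refuses a second write, two parsers that target the same field fail loudly with an `AttributeError` instead of silently overwriting each other.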
diff --git a/docs/dev-decisions.md b/docs/dev-decisions.md
@@ -6,12 +6,10 @@
 
 ## UPDATED DESIGN DECISION:
 
-- I don't see a strong reason for making this package a standalone package that parses everything required for a `SinglePointOutput` object including input data, provenance data, xyz files, etc... While the original idea was to have a cli tool to run on TeraChem files, now that I've build my own data structures (`qcio`) and driver program (`qcop`), there's no reason to parse anything but `SinglePointResults` values because we should just be driving the programs with `qcop` and already have access to the input data. The code is far easier to maintain as only a results parser. The only downside would be walking in to someone else's old data and wanting to slurp it all in, but perhaps there's no reason to build for that use case now... Just go with SIMPLE and keep the code maintainable.
+- I don't see a strong reason for making this package a standalone package that parses everything required for a `ProgramOutput` object including input data, provenance data, xyz files, etc... While the original idea was to have a cli tool to run on TeraChem files, now that I've built my own data structures (`qcio`) and driver program (`qcop`), there's no reason to parse anything but `SinglePointResults` values because we should just be driving the programs with `qcop` and already have access to the input data. The code is far easier to maintain as only a results parser. The only downside would be walking into someone else's old data and wanting to slurp it all in, but perhaps there's no reason to build for that use case now... Just go with SIMPLE and keep the code maintainable.
 
-## Publishing Checklist
+## Future Features
 
-- Update `CHANGELOG.md`
-- Bump version in `pyproject.toml`
-- Tag commit with a version and GitHub Actions will publish it to pypi if tag is on `master` branch.
-- `git push --tags`
-- `git push`
+- At some point it could be good to have a `parse_dir` function that parses the entire output directory of a program and returns the corresponding `Results` object. The `parse` function would still be used on individual files/output data; however, the `parse_dir` function would be the top-level function for collecting all results from a directory and turning them into structured data. Useful for:
+  - Parsing all CREST outputs, e.g., `crest_conformers.xyz` and `crest_rotamers.xyz` into a `ConformerSearchResults` object.
+  - Parsing data from other TeraChem output files besides just the `stdout`, e.g., converting the `c0` binary files into a `Wavefunction` object.
diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -11,7 +11,8 @@ homepage = "https://github.com/coltonbh/qcparse"
 [tool.poetry.dependencies]
 python = "^3.8"
 pydantic = ">=2.0.0"
-qcio = ">=0.10.0"
+qcio = "^0.11.8"
+tomli-w = "^1.0.0"
 
 [tool.poetry.group.dev.dependencies]
 mypy = "^1.1.1"

diff --git a/qcparse/encoders/crest.py b/qcparse/encoders/crest.py
@@ -0,0 +1,83 @@
+import copy
+import os
+from typing import Any, Dict
+
+import tomli_w
+from qcio import CalcType, ProgramInput
+
+from qcparse.exceptions import EncoderError
+from qcparse.models import NativeInput
+
+SUPPORTED_CALCTYPES = {CalcType.conformer_search}
+
+
+def encode(inp_obj: ProgramInput) -> NativeInput:
+    """Translate a ProgramInput into CREST input files.
+
+    Args:
+        inp_obj: The qcio ProgramInput object for a computation.
+
+    Returns:
+        NativeInput with .input_file holding the crest.toml contents and .geometry_file
+            the Structure's xyz file.
+    """
+    validate_input(inp_obj)
+    struct_filename = "structure.xyz"
+
+    return NativeInput(
+        input_file=tomli_w.dumps(_to_toml_dict(inp_obj, struct_filename)),
+        geometry_file=inp_obj.structure.to_xyz(),
+        geometry_filename=struct_filename,
+    )
+
+
+def validate_input(inp_obj: ProgramInput):
+    """Validate the input for CREST.
+
+    Args:
+        inp_obj: The qcio ProgramInput object for a computation.
+
+    Raises:
+        EncoderError: If the input is invalid.
+    """
+    # These values come from other parts of the ProgramInput and should not be set
+    # in the keywords.
+    non_allowed_keywords = ["charge", "uhf", "runtype"]
+    for keyword in non_allowed_keywords:
+        if keyword in inp_obj.keywords:
+            raise EncoderError(
+                f"{keyword} should not be set in keywords for CREST. It is already set "
+                "on the Structure or ProgramInput elsewhere.",
+            )
+
+
+def _to_toml_dict(inp_obj: ProgramInput, struct_filename: str) -> Dict[str, Any]:
+    """Convert a ProgramInput object to a dictionary in the CREST format of TOML.
+
+    This function makes it easier to test for the correct TOML structure.
+    """
+    # Start with existing keywords
+    toml_dict = copy.deepcopy(inp_obj.keywords)
+
+    # Top level keywords
+    # Logical cores was 10% faster than physical cores, so not using psutil
+    toml_dict.setdefault("threads", os.cpu_count())
+    toml_dict["input"] = struct_filename
+
+    # TODO: May need to deal with non-covalent mode at some point
+    toml_dict["runtype"] = "imtd-gc"
+
+    # Calculation level keywords
+    calculation = toml_dict.pop("calculation", {})
+    calculation_level = calculation.pop("level", [])
+    if len(calculation_level) == 0:
+        calculation_level.append({})
+    for level_dict in calculation_level:
+        level_dict["method"] = inp_obj.model.method
+        level_dict["charge"] = inp_obj.structure.charge
+        level_dict["uhf"] = inp_obj.structure.multiplicity - 1
+
+    calculation["level"] = calculation_level
+    toml_dict["calculation"] = calculation
+
+    return toml_dict
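
To make the resulting TOML layout concrete, here is a small worked example that mirrors the logic of `_to_toml_dict` using plain values instead of a `ProgramInput`. It is a sketch under stated assumptions: `crest_toml_dict` is a hypothetical helper, `example_option` is a placeholder keyword rather than a real CREST setting, the `"gfn2"` method value is only illustrative, and the thread count is fixed at 4 so the result is deterministic (the encoder defaults to `os.cpu_count()`).

```python
import copy
from typing import Any, Dict


def crest_toml_dict(
    keywords: Dict[str, Any], method: str, charge: int, multiplicity: int
) -> Dict[str, Any]:
    """Mirror the layout built by _to_toml_dict using plain values (illustrative)."""
    toml_dict = copy.deepcopy(keywords)  # never mutate the caller's keywords
    toml_dict.setdefault("threads", 4)  # fixed here; the encoder uses os.cpu_count()
    toml_dict["input"] = "structure.xyz"
    toml_dict["runtype"] = "imtd-gc"

    # Every calculation level receives method, charge, and uhf from the input.
    calculation = toml_dict.pop("calculation", {})
    levels = calculation.pop("level", []) or [{}]
    for level in levels:
        level.update(method=method, charge=charge, uhf=multiplicity - 1)
    calculation["level"] = levels
    toml_dict["calculation"] = calculation
    return toml_dict


user_keywords = {"calculation": {"level": [{"example_option": True}]}}
result = crest_toml_dict(user_keywords, method="gfn2", charge=0, multiplicity=3)
# result == {
#     "threads": 4,
#     "input": "structure.xyz",
#     "runtype": "imtd-gc",
#     "calculation": {
#         "level": [
#             {"example_option": True, "method": "gfn2", "charge": 0, "uhf": 2}
#         ]
#     },
# }
```

Note that `uhf` is `multiplicity - 1`, so the triplet in this example yields `uhf = 2`. User-supplied level options survive, while `method`, `charge`, and `uhf` are always taken from the model and structure, which is why `validate_input` rejects the top-level `charge`, `uhf`, and `runtype` keywords.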
diff --git a/qcparse/main.py b/qcparse/main.py
@@ -94,7 +94,8 @@ def encode(inp_data: ProgramInput, program: str) -> NativeInput:
         A NativeInput object with the encoded input.
 
     Raises:
-        EncoderError: If the calctype is not supported by the program's encoder.
+        EncoderError: If the calctype is not supported by the program's encoder or the
+            input is invalid.
     """
     # Check that calctype is supported by the encoder
     encoder = import_module(f"qcparse.encoders.{program}")

diff --git a/qcparse/models.py b/qcparse/models.py
@@ -163,7 +163,7 @@ class NativeInput(BaseModel):
     """Native input file data. Writing these files to disk should produce a valid input.
 
     Attributes:
-        input: input file for the program
+        input_file: input file for the program
         geometry: xyz file or other geometry file required for the calculation
         geometry_filename: filename of the geometry file referenced in the input
     """

diff --git a/qcparse/parsers/crest.py b/qcparse/parsers/crest.py
@@ -1,3 +1,8 @@
+from pathlib import Path
+from typing import List, Optional, Union
+
+from qcio import ConformerSearchResults, Structure
+
 from .utils import regex_search
 
 
@@ -9,3 +14,75 @@ def parse_version_string(string: str) -> str:
     regex = r"Version (\d+\.\d+\.\d+),"
     match = regex_search(regex, string)
     return match.group(1)
+
+
+def parse_structures(
+    filename: Union[Path, str],
+    charge: Optional[int] = None,
+    multiplicity: Optional[int] = None,
+) -> List[Structure]:
+    """Parse Structures from a CREST multi-structure xyz file.
+
+    CREST places an energy value in the comment line of each structure. This function
+    collects all Structures from the file; each structure's comment-line energy is
+    kept in its extras.
+
+    Args:
+        filename: The path to the multi-structure xyz file.
+        charge: The charge of the structures.
+        multiplicity: The multiplicity of the structures.
+
+    Returns:
+        A list of Structure objects.
+    """
+    try:
+        structures = Structure.open(filename, charge=charge, multiplicity=multiplicity)
+        if not isinstance(structures, list):  # single structure
+            structures = [structures]
+    except FileNotFoundError:
+        structures = []  # No structures created
+    return structures
+
+
+def parse_conformer_search_dir(
+    directory: Union[Path, str],
+    *,
+    charge: Optional[int] = None,
+    multiplicity: Optional[int] = None,
+    collect_rotamers: bool = True,
+) -> ConformerSearchResults:
+    """Parse the output directory of a CREST conformer search calculation.
+
+    Args:
+        directory: Path to the directory containing the CREST output files.
+        charge: The charge of the structures.
+        multiplicity: The multiplicity of the structures.
+        collect_rotamers: Whether to parse rotamers as well as conformers.
+
+    Returns:
+        The parsed conformers, rotamers, and their energies as a ConformerSearchResults
+        object.
+    """
+    directory = Path(directory)
+    conformers = parse_structures(
+        directory / "crest_conformers.xyz", charge=charge, multiplicity=multiplicity
+    )
+
+    # CREST places the energy as the only value in the comment line
+    conf_energies = [conf.extras[Structure._xyz_comment_key][0] for conf in conformers]
+
+    rotamers = []
+    if collect_rotamers:
+        rotamers = parse_structures(
+            directory / "crest_rotamers.xyz", charge=charge, multiplicity=multiplicity
+        )
+
+    # CREST places the energy as the only value in the comment line
+    rotamer_energies = [rot.extras[Structure._xyz_comment_key][0] for rot in rotamers]
+
+    return ConformerSearchResults(
+        conformers=conformers,
+        conformer_energies=conf_energies,
+        rotamers=rotamers,
+        rotamer_energies=rotamer_energies,
+    )
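
The parser above relies on the energy being the only value on each frame's comment line. The following stdlib-only sketch shows that file layout and how the energies could be read directly. `read_comment_energies` is a hypothetical helper, not part of `qcparse` or `qcio`, and the coordinates and energies are made-up illustrative values.

```python
from typing import List

# Two frames of a multi-structure xyz file. Each frame is: atom count, comment line
# (holding only the energy, per the parser's comment), then one line per atom.
XYZ_TEXT = """3
-5.07054
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
3
-5.06991
O 0.000 0.000 0.000
H 0.800 0.550 0.000
H -0.800 0.550 0.000
"""


def read_comment_energies(text: str) -> List[float]:
    """Return the energy from each frame's comment line (hypothetical helper)."""
    lines = text.splitlines()
    energies = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # tolerate blank lines between frames
            i += 1
            continue
        natoms = int(lines[i])
        energies.append(float(lines[i + 1].split()[0]))
        i += natoms + 2  # skip count line, comment line, and atom lines
    return energies


energies = read_comment_energies(XYZ_TEXT)
```

In the commit itself this work is delegated to `Structure.open`, and the energies are read from `extras` afterwards. The sketch only makes the assumed file layout explicit.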