Update docs, remove unused option, small cleanups
nweires committed Jan 4, 2024
1 parent 427edca commit d5abc8b
Showing 4 changed files with 40 additions and 49 deletions.
4 changes: 0 additions & 4 deletions buildstockbatch/aws/aws.py
@@ -1509,10 +1509,6 @@ def validate_project(project_file):
     def docker_image(self):
         return "nrel/buildstockbatch"
 
-    @property
-    def weather_dir(self):
-        return self._weather_dir
-
     @property
     def container_repo(self):
         repo_name = self.docker_image
31 changes: 12 additions & 19 deletions buildstockbatch/gcp/gcp.py
@@ -5,11 +5,16 @@
 ~~~~~~~~~~~~~~~
 This class contains the object & methods that allow for usage of the library with GCP Batch.
 
-This implementation tries to match the structure of `../aws/aws.py` in the 'nrel/aws_batch' branch
-as much as possible in order to make it easier to refactor these two (or three, with Eagle) to share
-code later. Also, because that branch has not yet been merged, this will also _not_ do any
-refactoring right now to share code with that (to reduce merging complexity later). Instead, code
-that's likely to be refactored out will be commented with 'todo: aws-shared'.
+Architecture overview:
+  - Build a Docker image that includes OpenStudio and BuildStock Batch.
+  - Push the Docker image to GCP Artifact Registry.
+  - Run sampling, and split the generated buildings into batches.
+  - Collect all the required input files (including downloading weather files)
+    and upload them to Cloud Storage.
+  - Run a job on GCP Batch where each task runs one batch of simulations.
+    Uses the Docker image to run OpenStudio on Compute Engine VMs.
+  - Run a Cloud Run job for post-processing steps. Also uses the Docker image.
+  - Output files are written to a bucket in Cloud Storage.
 
 :author: Robert LaThanh, Natalie Weires
 :copyright: (c) 2023 by The Alliance for Sustainable Energy
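
Note: the architecture overview added above maps directly onto entry points that
appear later in this diff. A minimal driver sketch, assuming a hypothetical
GcpBatch class name, import path, and constructor; only the method names below
are visible in this commit:

    # Sketch only: the class name, import path, and constructor are assumptions;
    # the method names come from main() later in this diff.
    from buildstockbatch.gcp.gcp import GcpBatch  # assumed import path

    def run_pipeline(project_file: str) -> None:
        batch = GcpBatch(project_file)  # assumed constructor signature
        if batch.check_for_existing_jobs():
            return  # don't clobber a job that is already running
        batch.build_image()      # Docker image with OpenStudio + BuildStock Batch
        batch.push_image()       # push the image to GCP Artifact Registry
        batch.run_batch()        # sample, upload inputs, run GCP Batch tasks
        batch.process_results()  # post-processing job on Cloud Run
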
@@ -249,11 +254,6 @@ def validate_project(project_file):
     def docker_image(self):
         return "nrel/buildstockbatch"
 
-    # todo: aws-shared (see file comment)
-    @property
-    def weather_dir(self):
-        return self._weather_dir
-
     @property
     def results_dir(self):
         return f"{self.gcs_bucket}/{self.gcs_prefix}/results"
@@ -642,7 +642,8 @@ def start_batch_job(self, batch_info):
         )
         logger.info(
             "Simulation output browser (Cloud Console): "
-            f"https://console.cloud.google.com/storage/browser/{self.gcs_bucket}/{self.gcs_prefix}/results/simulation_output"
+            f"https://console.cloud.google.com/storage/browser/{self.gcs_bucket}/{self.gcs_prefix}"
+            "/results/simulation_output"
         )
         logger.info(f"View GCP Batch job at {job_url}")
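
Note: the wrapped URL relies on Python's implicit concatenation of adjacent
string literals, so the logged link is unchanged. A quick self-contained check
(sample bucket/prefix values, for illustration only):

    bucket, prefix = "mybucket", "national01_run01"  # illustrative values only
    url = (
        f"https://console.cloud.google.com/storage/browser/{bucket}/{prefix}"
        "/results/simulation_output"
    )
    assert url == (
        "https://console.cloud.google.com/storage/browser/"
        "mybucket/national01_run01/results/simulation_output"
    )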

@@ -1098,11 +1099,6 @@ def main():
         help="Only do postprocessing, useful for when the simulations are already done",
         action="store_true",
     )
-    group.add_argument(
-        "--crawl",
-        help="Only do the crawling in Athena. When simulations and postprocessing are done.",
-        action="store_true",
-    )
     parser.add_argument(
         "-v",
         "--verbose",
@@ -1134,8 +1130,6 @@ def main():
         batch.build_image()
         batch.push_image()
         batch.process_results()
-    elif args.crawl:
-        batch.process_results(skip_combine=True, use_dask_cluster=False)
     else:
         if batch.check_for_existing_jobs():
             return
@@ -1144,7 +1138,6 @@ def main():
         batch.push_image()
         batch.run_batch()
         batch.process_results()
-        # process_results is async, so don't do a clean (which would clean before it's done)
 
 
 if __name__ == "__main__":
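
Note: with --crawl removed, the remaining mutually exclusive CLI options reduce
to roughly the sketch below. Only --postprocessonly's help text and -v/--verbose
appear in this diff; --clean and --show_jobs are inferred from the docs changes
below, and the parser/group wiring is an assumption:

    import argparse

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()  # group type is assumed
    group.add_argument(
        "--postprocessonly",  # flag name assumed from its help text above
        help="Only do postprocessing, useful for when the simulations are already done",
        action="store_true",
    )
    group.add_argument("--clean", action="store_true")      # referenced in docs below
    group.add_argument("--show_jobs", action="store_true")  # referenced in docs below
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args()
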
28 changes: 15 additions & 13 deletions docs/project_defn.rst
@@ -260,34 +260,35 @@ on the `AWS Batch <https://aws.amazon.com/batch/>`_ service.

 GCP Configuration
 ~~~~~~~~~~~~~~~~~
-The top-level ``gcp`` key is used to specify options for running the batch job
-on the `GCP Batch <https://cloud.google.com/batch>`_ service.
+The top-level ``gcp`` key is used to specify options for running the batch job on GCP,
+using `GCP Batch <https://cloud.google.com/batch>`_ and `Cloud Run <https://cloud.google.com/run>`_.
 
 .. note::
 
    When BuildStockBatch is run on GCP, it will only save results to GCP Cloud Storage (using the
    ``gcs`` configuration below); i.e., it currently cannot save to AWS S3 and Athena. Likewise,
    buildstock run locally, on Eagle, or on AWS cannot save to GCP.
 
 * ``job_identifier``: A unique string that starts with an alphabetical character,
   is up to 48 characters long, and only has letters, numbers or hyphens.
-  This is used to name the GCP Batch job to be created and
-  differentiate it from other jobs.
+  This is used to name the GCP Batch and Cloud Run jobs to be created and
+  differentiate them from other jobs.
 * ``project``: The GCP Project ID in which the batch will be run and of the Artifact Registry
   (where Docker images are stored).
-* ``service_account``: The service account email address to use when running jobs on GCP.
-  Default: the Compute Engine default service account of the GCP project.
+* ``service_account``: Optional. The service account email address to use when running jobs on GCP.
+  Default: the Compute Engine default service account of the GCP project.
 * ``gcs``: Configuration for project data storage on GCP Cloud Storage.
 
   * ``bucket``: The Cloud Storage bucket this project will use for simulation output and
     processed data storage.
-  * ``prefix``: The Cloud Storage prefix at which the data will be stored.
+  * ``prefix``: The Cloud Storage prefix at which the data will be stored within the bucket.
 
-* ``region``: The GCP region in which the batch will be run and of the Artifact Registry.
+* ``region``: The GCP region in which the job will be run and the region of the Artifact Registry.
 * ``batch_array_size``: Number of tasks to divide the simulations into. Max: 10000.
 * ``parallelism``: Optional. Maximum number of tasks that can run in parallel. If not specified,
   uses `GCP's default behavior`_ (the lesser of ``batch_array_size`` and `job limits`_).
+  Parallelism is also limited by Compute Engine quotas and limits (including vCPU quota).
-* ``artifact_registry``: Configuration for Docker image storage in GCP Artifact Registry
+* ``artifact_registry``: Configuration for Docker image storage in GCP Artifact Registry.
 
   * ``repository``: The name of the GCP Artifact Repository in which Docker images are stored.
     This will be combined with the ``project`` and ``region`` to build the full URL to the
@@ -300,10 +301,11 @@ on the `GCP Batch <https://cloud.google.com/batch>`_ service.
 * ``machine_type``: GCP Compute Engine machine type to use. If omitted, GCP Batch will
   choose a machine type based on the requested vCPUs and memory. If set, the machine type
   should have at least as many resources as requested for each simulation above. If it is
-  large enough, multiple simulations will be run in parallel on the same machine.
-* ``use_spot``: true or false. Defaults to false if missing. This tells the project
-  to use `Spot VMs <https://cloud.google.com/spot-vms>`_ for data
-  simulations, which can reduce costs by up to 91%.
+  large enough, multiple simulations will be run in parallel on the same machine. Usually safe
+  to leave unset.
+* ``use_spot``: true or false. This tells the project whether to use
+  `Spot VMs <https://cloud.google.com/spot-vms>`_ for data simulations, which can reduce
+  costs by up to 91%. Default: false
 * ``postprocessing_environment``: Optional. Specifies the Cloud Run computing environment for
   postprocessing.

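
Note: batch_array_size fixes how many GCP Batch tasks exist, while parallelism
only limits how many of them run at once. A hedged illustration of the split
(the helper below is hypothetical, not code from buildstockbatch):

    import math

    def split_into_batches(building_ids, batch_array_size):
        """Divide sampled buildings into at most batch_array_size tasks."""
        n_per_task = max(1, math.ceil(len(building_ids) / batch_array_size))
        return [
            building_ids[i : i + n_per_task]
            for i in range(0, len(building_ids), n_per_task)
        ]

    batches = split_into_batches(list(range(25_000)), batch_array_size=10_000)
    assert len(batches) <= 10_000  # one GCP Batch task per batch, max 10000
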
26 changes: 13 additions & 13 deletions docs/run_sims.rst
@@ -129,7 +129,7 @@ especially over a slower internet connection as it is downloading and building a
 GCP Specific Project configuration
 ..................................
 
-For the project to run on GCP, you will need to add a section to your config
+For the project to run on GCP, you will need to add a `gcp` section to your config
 file, something like this:
 
 .. code-block:: yaml
@@ -138,34 +138,34 @@ file, something like this:
       job_identifier: national01
       project: myorg_project
       region: us-central1
-      artifact_registry: buildstockbatch
+      artifact_registry:
+        repository: buildstockbatch
       gcs:
         bucket: mybucket
         prefix: national01_run01
       use_spot: true
       batch_array_size: 10000
 
-See :ref:`gcp-config` for details.
+See :ref:`gcp-config` for details and other optional settings.
 
-You can optionally override the ``job_identifier`` from the command line (``buildstock_gcp project.yml job_identifier``).
-Note that each job you run must have a unique ID (unless you delete a previous job with the ``--clean`` option), so
-this option makes it easier to quickly assign a new ID with each run. It also makes it easy to clean a previous job.
+You can optionally override the ``job_identifier`` from the command line
+(``buildstock_gcp project.yml [job_identifier]``). Note that each job you run must have a unique ID
+(unless you delete a previous job with the ``--clean`` option), so this option makes it easier to
+quickly assign a new ID with each run without updating the config file.
 
 
 List existing jobs
 ..................
 
-Run ``buildstock_gcp --show_jobs your_project_file.yml`` to see the existing
+Run ``buildstock_gcp your_project_file.yml [job_identifier] --show_jobs`` to see the existing
 jobs matching the project specified. This can show you whether a previously-started job
 has completed, is still running, or has already been cleaned up.
 
 
 Cleaning up after yourself
 ..........................
 
-TODO: Review and update this after implementing cleanup option.
-
-When the simulation and postprocessing is all complete, run ``buildstock_gcp
---clean your_project_file.yml [job_identifier]``. This will clean up all the GCP resources that
-were created to run the specified project. If the project is still running, it
-will be cancelled. Your output files will still be available in GCS.
+When the simulations and postprocessing are complete, run ``buildstock_gcp
+your_project_file.yml [job_identifier] --clean``. This will clean up all the GCP resources that
+were created to run the specified project, other than files in Cloud Storage. If the project is
+still running, it will be cancelled. Your output files will still be available in GCS.
