Add final edits (#250)
* get rid of 2 files

* Add first day of work

* Add final edits
akgold authored May 3, 2024
1 parent 21a6fe4 commit 2a3f84a
Showing 17 changed files with 84 additions and 89 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -20,3 +20,4 @@ lek_versions/*
*docx*
*.tex
chapters.zip
+* 2*
10 changes: 5 additions & 5 deletions chapters/intro.qmd
@@ -28,7 +28,7 @@ the background, and there's no \index{Google Sheets}Google Sheet,
\index{csv}CSV file, or half-baked database query in sight.

But that's a myth. If you're a data scientist putting your work in front
-of someone else's eyes, you are *in production*. And, I believe, if
+of someone else's eyes, you are in production. And, I believe, if
you're in production, this book is for you.

You may sensibly ask who I am to make such a proclamation.
@@ -45,7 +45,7 @@ science products more robust with open-source tooling and
\index{Posit}Posit's Professional Products.

I've seen organizations at every level of data science maturity. For
some organizations, "in production" means a report that gets rendered
some organizations, in production means a report that gets rendered
and emailed around. For others, it means hosting a live app or dashboard
that people visit. For the most sophisticated, it means serving live
predictions to another service from a machine learning model via an
@@ -297,8 +297,8 @@ pretend we care deeply about the relationship between penguin bill
length and mass, and we're going to build up an entire data science
environment dedicated to exploring that relationship.

-The front end of this environment will be a website built in
-\index{Quarto}Quarto. It will include an app for fetching penguin mass
+The front end of this environment will be a website built with the
+\index{Quarto}Quarto publishing system. It will include an app for fetching penguin mass
predictions from a machine learning model based on bill length and other
features. The website will also have pages dedicated to exploratory data
analysis and model building.
@@ -310,7 +310,7 @@ working. It will also host the machine learning model as an
\index{API}API and the \index{Shiny}Shiny app for the website.

The whole thing will get auto-deployed from a \index{Git}Git repo using
-\index{GitHub Actions}GitHub ActionsGitHub Actions.
+\index{GitHub Actions}GitHub Actions.

From an architectural perspective, it'll look something like this:

9 changes: 4 additions & 5 deletions chapters/sec1/1-0-sec-intro.qmd
@@ -58,7 +58,7 @@ experience.

But, there are best practices you can follow to make it easier to
deliver value once you've discovered something interesting. In the
-chapters in this part, we'll explore what data science and data
+chapters in this part of the book, we'll explore what data science and data
scientists can learn from DevOps to make your apps and environments as
robust as possible.

@@ -111,16 +111,15 @@ promotion system.

### Docker for Data Science

-\index{Docker}Docker is an increasingly popular tool in the software
-development and data science world that allows for the easy capture and
-sharing of the environment around code. \index{Docker}Docker is
+\index{Docker}Docker is a software development tool that makes it easy to capture and
+share the environment around code. It is
increasingly popular in data science contexts, so [Chapter @sec-docker]
is a basic introduction to what \index{Docker}Docker is and how to use
it.

## Labs in this part

-Each chapter in this part has a lab so you can get hands-on experience
+Each chapter in this part of the book has a lab so you can get hands-on experience
implementing DevOps best practices in your data science projects.

You'll create a website in the labs to explore the Palmer Penguins
11 changes: 5 additions & 6 deletions chapters/sec1/1-1-env-as-code.qmd
@@ -102,7 +102,7 @@ As a data scientist, you can and should be responsible for the package
layer, and getting this layer right is where the biggest reproducibility
bang for your buck lies. If you find yourself managing the system or
hardware layer, [@sec-cloud] through [@sec-ssl] will teach you how to
-manage those layers.
+do that.

## The package layer

@@ -136,7 +136,7 @@ come back, it's likely that future you or your colleague won't have the
correct versions and your code will break.

What would've been better is if you'd had an environment as code
-strategy that created a portable environment for each **project** on
+strategy that created a portable environment for **each project** on
your system.

A successful package environment as code setup has two key attributes:
@@ -228,8 +228,7 @@ You'll still have those same packages to use.

### Step 3: Collaborate or deploy

-When you go to share your project, you don't want to share your actual
-package libraries. Package installs are specific to the
+When you share your project, you want to share only the lockfile, not the actual package libraries. Package installs are specific to the
\index{operating system}operating system and the language version you're
using, so you want your target system to install the package
specifically for that system.
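
As an illustration, here's a minimal sketch of that hand-off using Python's standard tooling (`{renv}` users would run `renv::snapshot()` and `renv::restore()` instead); the file names are just the conventional defaults:

```bash
# On your machine: share the lockfile, never the package library itself.
pip freeze > requirements.txt

# On the target system: rebuild the library for that OS and language version.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
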
@@ -292,7 +291,7 @@ Sometimes, IT/Admins want to save space further by sharing package
caches across users. This is usually a mistake. Sharing package caches
leads to headaches over user file permissions to write to the package
cache versus read. Storage space is cheap, way cheaper than your time.
-If you have to do it, both \index{renv}`{renv}` and \index{venv}`venv`
+If you have to do it, both \index{renv}`{renv}` and \index{venv}`{venv}`
include settings to allow you to relocate the package cache to a shared
location on the server.

@@ -310,7 +309,7 @@ location on the server.
one?
4. Draw a mental map of the relationships between the following:
package repository, package library, package, project-level-library,
-`.libPaths()` (R) or `sys.path`(Python), lockfile.
+`.libPaths()` (R) or `sys.path` (Python), lockfile.
5. Why is it a bad idea to share package libraries? What's the best way
to collaborate with a colleague using an environment as code? What
commands will you run in R or Python to save a package environment
8 changes: 4 additions & 4 deletions chapters/sec1/1-2-proj-arch.qmd
@@ -85,9 +85,9 @@ Basically, all data science projects fall into the following categories:

3. *A report.* Reports are code you're turning into an output you care
about -- like a paper, book, presentation, or website. Reports
-result from rendering an \index{R Markdown} doc,
+result from rendering an \index{R Markdown}R Markdown doc,
\index{Quarto}Quarto doc, or \index{Jupyter
-Notebook} for people to consume on their computer, in print, or in a
+Notebook}Jupyter Notebook for people to consume on their computer, in print, or in a
presentation. These docs may be completely static (this book is a
\index{Quarto}Quarto doc) or have some interactive
elements.[^1-2-proj-arch-2]
@@ -275,7 +275,7 @@ to pull only once the user interactions clarify what you need.

It may be adequate to work on only a sample of the data for many tasks,
especially machine learning ones. In some cases, like classification of
-highly imbalanced classes, it may be *better* to work on a sample rather
+highly imbalanced classes, it may be **better** to work on a sample rather
than the whole dataset.

Sampling tends to work well when you're trying to compute statistical
@@ -375,7 +375,7 @@ a Google Sheet as a permanent home for data, but it can be a good
intermediate step while you're still figuring out the right solution for
your pipeline.

-The primary weakness of a \index{Google Sheets} -- that it's editable by
+The primary weakness of a \index{Google Sheets}Google Sheet -- that it's editable by
someone who logs in -- can also be an asset if that's something you
need.

10 changes: 5 additions & 5 deletions chapters/sec1/1-3-data-access.qmd
@@ -329,18 +329,18 @@ Below are common \index{API}API patterns that are good to know about:
verbs, such as a `GET` and a `POST` at the same endpoint, for
getting and setting the data that the endpoint stores.

-## \index{environment variable}Environment variable to secure data connections {#env-vars}
+## \index{environment variable}Environment Variables to Secure Data Connections {#env-vars}

When you take an app to production, authenticating to your data source
while keeping your secrets secure is crucial.

The most important thing you can do to secure your credentials is to
-avoid ever putting credentials in your code. **Your username and
-password or** \index{API}API key should never appear in your code.
+avoid ever putting credentials in your code. Your username and
+password or \index{API}API key **should never appear in your code**.

The simplest way to provide credentials without the values appearing in
your code is with an \index{environment variable}environment variable.
-\index{environment variable}Environment variable are set before your
+\index{environment variable}Environment variables are set before your
code starts -- sometimes from completely outside Python or R.
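
For instance, here's a minimal sketch in Python (in R, `Sys.getenv()` plays the same role), assuming a hypothetical `DB_PASSWORD` variable was exported in the shell or a dotfile before launch:

```python
import os

# The secret lives in the process environment, never in the script itself.
db_password = os.environ["DB_PASSWORD"]
```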

::: callout-note
@@ -600,7 +600,7 @@ button.
::: callout-tip
I recommend setting an `api_url` value at the top of your app.

-By default, be $\text{"http://127.0.0.1:8080/predict"}$. If you've
+By default, it will be $\text{"http://127.0.0.1:8080/predict"}$. If you've
changed the port from $8080$ or used a different name for your
prediction endpoint, you should adjust accordingly.
:::
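
A hypothetical sketch of that pattern in Python; whether your endpoint expects a `GET` with query parameters or a `POST` with a JSON body depends on how you defined it, so adjust accordingly:

```python
import requests

# Keep the URL in one place so it's easy to change at deployment time.
api_url = "http://127.0.0.1:8080/predict"

# Illustrative payload; the real feature names come from your model.
r = requests.post(api_url, json={"bill_length_mm": 40.2})
print(r.json())
```
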
12 changes: 6 additions & 6 deletions chapters/sec1/1-4-monitor-log.qmd
@@ -24,7 +24,7 @@ that reveal what's happening inside your project and aggregating and
consuming them. As a data scientist, you need to take on the task of
emitting helpful logs and metrics for your code. In most cases, you'll
integrate with tooling that your organization already has for log
-aggregation.
+aggregation and monitoring.

In this chapter we'll get into how to make your code observable. You'll
learn how to use tooling in R and Python to see what's happening inside
Expand Down Expand Up @@ -57,7 +57,7 @@ observe.

Moreover, you're already probably familiar with tools for *literate
programming* like \index{Jupyter Notebook}Jupyter Notebooks,
-\index{R Markdown} Documents, and \index{Quarto}Quarto Documents.
+\index{R Markdown}R Markdown Documents, and \index{Quarto}Quarto Documents.

One of my spicier opinions is that *all* jobs should be in a literate
programming format. When used well, these tools intersperse code,
@@ -99,7 +99,7 @@ purpose-built tooling for logging allows you to apply consistent formats
within logs, emit logs in useful formats, and provide visibility into
the severity of issues.

-There are great logging packages in both Python and R. Python's logging
+There are great logging packages in both Python and R. Python's `{logging}`
package is standard and included. There is no standard logging package
in R, but I recommend `{log4r}`.
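
As a quick sketch of what purpose-built logging buys you, here's Python's `{logging}` writing leveled, consistently formatted messages to a file; the file name and messages are illustrative, and `{log4r}` code looks much the same:

```python
import logging

# One line of setup gives every message a timestamp, a severity, and a home.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)

logging.info("App started")
logging.warning("Query took longer than expected")
logging.error("Could not reach the database")
```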

@@ -242,7 +242,7 @@ with `{log4r}`.[^1-4-monitor-log-1]
that's current and have the older ones numbered. So today's log
would be `my-log.log`, yesterday's would be `my-log.log.1`, the day
before `my-log.log.2`, etc. This second pattern works particularly
-well if you're using `logrotate` with `log4r`, because then `log4r`
+well if you're using `logrotate` with `{log4r}`, because then `{log4r}`
doesn't need to know anything about the log rotation. It's just
always writing to `my-log.log`.
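
If you'd rather handle rotation inside Python than hand it to `logrotate`, here's a sketch with the standard library's `TimedRotatingFileHandler` (it appends a date suffix to old files rather than `.1`, `.2`, but the active file is still `my-log.log`):

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Rotate at midnight and keep 30 days; the current log stays my-log.log.
handler = TimedRotatingFileHandler("my-log.log", when="midnight", backupCount=30)
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("This always goes to my-log.log")
```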

@@ -300,15 +300,15 @@ official Prometheus client in Python and the `{openmetrics}` package in
R makes registering metrics from a Plumber \index{API}API or
\index{Shiny}Shiny app easy.

-There's a great Get Started with Grafana and Prometheus doc on the
+There's a great [*Get Started with Grafana and Prometheus*](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/) doc on the
Grafana Labs website if you want to try it out.
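
To make the metrics half concrete, here's a minimal sketch with the official Python client; the metric name and port are illustrative:

```python
from prometheus_client import Counter, start_http_server

# Register a counter and expose a /metrics endpoint for Prometheus to scrape.
predictions_served = Counter("predictions_total", "Predictions served")
start_http_server(8000)  # metrics now live at http://localhost:8000/metrics

def predict(features):
    predictions_served.inc()  # one tick per request
    ...                       # call your model here
```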

## Comprehension questions

1. What is the difference between monitoring and logging? What are the
two halves of the monitoring and logging process?
2. Logging is generally good, but what are some things you should be
-careful *not to log*?
+careful not to log?
3. At what level would you log each of the following events:
1. Someone clicks on a particular tab in your \index{Shiny}Shiny
app.
10 changes: 5 additions & 5 deletions chapters/sec1/1-5-deployments.qmd
@@ -120,7 +120,7 @@ One way to help alleviate concerns about using real data is to create a

Working with your IT/Admin team to get these things isn't always easy.
They might not want to give you real data in dev. One point to emphasize
-is that creating this environment makes things more secure. It gives you
+is that creating this environment makes things **more secure**. It gives you
a place to do development without fear that you might damage production
data or services.

@@ -133,7 +133,7 @@ control* is the tool to make your code promotion process real.

Version control is software that allows you to keep the prod version of
your code safe, gives contributors a copy to work on, and hosts tools to
-manage merging changes back together. These days, \index{Git}Git is the
+manage merging changes back together. These days, \index{Git}*Git* is the
industry standard for version control.

\index{Git}Git is an open-source system for tracking changes to computer
@@ -152,7 +152,7 @@ from this book and learn about \index{Git}Git.

People who say learning \index{Git}Git is easy are either lying or have
forgotten. I am sorry our industry is standardized on a tool with such
-terrible ergonomics. It's worth your time to learn.
+terrible ergonomics. It is, unfortunately, worth your time to learn.

Whether you're an R or Python user, I'd recommend starting with a
resource designed to teach \index{Git}Git to a data science user. My
@@ -303,7 +303,7 @@ pipelines built into \index{Git}Git providers are very popular.
While there are a number of CI/CD pipeline tools, including Jenkins, Travis, Azure DevOps,
and GitLab, \index{GitHub Actions}GitHub Actions immediately rocketed to number one when it was released a few years ago. At this point, many organizations are quickly moving their CI/CD into GitHub Actions if they haven't already done so.

-\index{GitHub Actions}GitHub Actions are defined `.yml` files that go in the
+\index{GitHub Actions}GitHub Actions are defined in `.yml` files that go in the
`.github/workflows` directory of a project. \index{GitHub}GitHub knows
to inspect that directory and kick off any prescribed actions when there are changes to the repo. Let's talk about some of the basics of understanding and using \index{GitHub Actions}GitHub Actions.

@@ -339,7 +339,7 @@ Windows, and MacOS. You can also add custom runners. Depending on the
level of reproducibility you're aiming for, you might want to lock the
runner to a particular version of the operating system rather than just running `latest`.

-Once the job is kicked off and the runner live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running.
+Once the job is kicked off and the runner is live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running.

In \index{GitHub Actions}GitHub Actions, the `jobs` section defines the set of `steps` that comprise the action. Most steps use the `uses` command to run an action that someone else wrote. Some actions accept variables with the `with` command. In order to ensure that your Actions can remain flexible and your secrets secret, \index{GitHub Actions}GitHub Actions allows you to pull a value from the \index{GitHub}GitHub GUI and use it in a step with the `${{ <variable > }}` syntax.
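
Pulling those pieces together, here's a minimal, hypothetical workflow sketch; the file path and action versions are illustrative, and `API_KEY` stands in for whatever secret you've stored in the \index{GitHub}GitHub GUI:

```yaml
# .github/workflows/publish.yml -- a sketch, not a drop-in workflow
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # get the repo's code onto the runner
      - uses: actions/setup-python@v5    # build up the bare environment
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ./deploy.sh                 # hypothetical deploy step
        env:
          API_KEY: ${{ secrets.API_KEY }}
```
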

8 changes: 4 additions & 4 deletions chapters/sec1/1-6-docker.qmd
@@ -98,9 +98,9 @@ Some organizations run private registries, usually using *registry as a
service* offerings from cloud providers.[^1-6-docker-2]

[^1-6-docker-2]: The big three \index{container}container registries are
-AWS Elastic \index{container}container Registry (ECR), Azure
-\index{container}container Registry, and Google
-\index{container}container Registry.
+AWS Elastic \index{container}Container Registry (ECR), Azure
+\index{container}Container Registry, and Google
+\index{container}Container Registry.

Images are built from *Dockerfiles* -- the code that defines the image.
Dockerfiles are usually stored in a \index{Git}Git repository. Building
@@ -368,7 +368,7 @@ docker run --rm -d \

1. This line is necessary because the model lives at `/data/model` on our **host** machine.
But the \index{API}API inside the \index{container}container is looking
-for `/data/model` **inside the** \index{container}container. We need to make sure that the directory exists and has the model in it.
+for `/data/model` **inside the container**. We need to make sure that the directory exists and has the model in it.
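
For reference, the flag under discussion looks something like the sketch below; the paths match the lab, but the image name is hypothetical:

```bash
# Mount the host's /data/model into the container at the path the API expects.
docker run --rm -d \
  -v /data/model:/data/model \
  my-api-image
```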

### Lab Extensions
