Add final edits (#250)
* get rid of 2 files

* Add first day of work

* Add final edits
akgold authored May 3, 2024
1 parent 21a6fe4 commit 2a3f84a
Showing 17 changed files with 84 additions and 89 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -20,3 +20,4 @@ lek_versions/*
*docx*
*.tex
chapters.zip
+* 2*
10 changes: 5 additions & 5 deletions chapters/intro.qmd
@@ -28,7 +28,7 @@ the background, and there's no \index{Google Sheets}Google Sheet,
\index{csv}CSV file, or half-baked database query in sight.

But that's a myth. If you're a data scientist putting your work in front
-of someone else's eyes, you are *in production*. And, I believe, if
+of someone else's eyes, you are in production. And, I believe, if
you're in production, this book is for you.

You may sensibly ask who I am to make such a proclamation.
@@ -45,7 +45,7 @@ science products more robust with open-source tooling and
\index{Posit}Posit's Professional Products.

I've seen organizations at every level of data science maturity. For
some organizations, "in production" means a report that gets rendered
some organizations, in production means a report that gets rendered
and emailed around. For others, it means hosting a live app or dashboard
that people visit. For the most sophisticated, it means serving live
predictions to another service from a machine learning model via an
@@ -297,8 +297,8 @@ pretend we care deeply about the relationship between penguin bill
length and mass, and we're going to build up an entire data science
environment dedicated to exploring that relationship.

-The front end of this environment will be a website built in
-\index{Quarto}Quarto. It will include an app for fetching penguin mass
+The front end of this environment will be a website built with the
+\index{Quarto}Quarto publishing system. It will include an app for fetching penguin mass
predictions from a machine learning model based on bill length and other
features. The website will also have pages dedicated to exploratory data
analysis and model building.
@@ -310,7 +310,7 @@ working. It will also host the machine learning model as an
\index{API}API and the \index{Shiny}Shiny app for the website.

The whole thing will get auto-deployed from a \index{Git}Git repo using
-\index{GitHub Actions}GitHub ActionsGitHub Actions.
+\index{GitHub Actions}GitHub Actions.

From an architectural perspective, it'll look something like this:

9 changes: 4 additions & 5 deletions chapters/sec1/1-0-sec-intro.qmd
@@ -58,7 +58,7 @@ experience.

But, there are best practices you can follow to make it easier to
deliver value once you've discovered something interesting. In the
-chapters in this part, we'll explore what data science and data
+chapters in this part of the book, we'll explore what data science and data
scientists can learn from DevOps to make your apps and environments as
robust as possible.

@@ -111,16 +111,15 @@ promotion system.

### Docker for Data Science

-\index{Docker}Docker is an increasingly popular tool in the software
-development and data science world that allows for the easy capture and
-sharing of the environment around code. \index{Docker}Docker is
+\index{Docker}Docker is a software development tool that makes it easy to capture and
+share the environment around code. It is
increasingly popular in data science contexts, so [Chapter @sec-docker]
is a basic introduction to what \index{Docker}Docker is and how to use
it.

## Labs in this part

-Each chapter in this part has a lab so you can get hands-on experience
+Each chapter in this part of the book has a lab so you can get hands-on experience
implementing DevOps best practices in your data science projects.

You'll create a website in the labs to explore the Palmer Penguins
11 changes: 5 additions & 6 deletions chapters/sec1/1-1-env-as-code.qmd
@@ -102,7 +102,7 @@ As a data scientist, you can and should be responsible for the package
layer, and getting this layer right is where the biggest reproducibility
bang for your buck lies. If you find yourself managing the system or
hardware layer, [@sec-cloud] through [@sec-ssl] will teach you how to
-manage those layers.
+do that.

## The package layer

@@ -136,7 +136,7 @@ come back, it's likely that future you or your colleague won't have the
correct versions and your code will break.

What would've been better is if you'd had an environment as code
-strategy that created a portable environment for each **project** on
+strategy that created a portable environment for **each project** on
your system.

A successful package environment as code setup has two key attributes:
@@ -228,8 +228,7 @@ You'll still have those same packages to use.

### Step 3: Collaborate or deploy

-When you go to share your project, you don't want to share your actual
-package libraries. Package installs are specific to the
+When you share your project, you want to share only the lockfile, not the actual package libraries. Package installs are specific to the
\index{operating system}operating system and the language version you're
using, so you want your target system to install the package
specifically for that system.
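
As an illustration, here's a minimal sketch of that hand-off using Python's standard tooling (`{renv}` users would run `renv::snapshot()` and `renv::restore()` instead); the file names are just the conventional defaults:

```bash
# On your machine: share the lockfile, never the package library itself.
pip freeze > requirements.txt

# On the target system: rebuild the library for that OS and language version.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
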
@@ -292,7 +291,7 @@ Sometimes, IT/Admins want to save space further by sharing package
caches across users. This is usually a mistake. Sharing package caches
leads to headaches over user file permissions to write to the package
cache versus read. Storage space is cheap, way cheaper than your time.
-If you have to do it, both \index{renv}`{renv}` and \index{venv}`venv`
+If you have to do it, both \index{renv}`{renv}` and \index{venv}`{venv}`
include settings to allow you to relocate the package cache to a shared
location on the server.

@@ -310,7 +309,7 @@ location on the server.
one?
4. Draw a mental map of the relationships between the following:
package repository, package library, package, project-level-library,
-`.libPaths()` (R) or `sys.path`(Python), lockfile.
+`.libPaths()` (R) or `sys.path` (Python), lockfile.
5. Why is it a bad idea to share package libraries? What's the best way
to collaborate with a colleague using an environment as code? What
commands will you run in R or Python to save a package environment
8 changes: 4 additions & 4 deletions chapters/sec1/1-2-proj-arch.qmd
@@ -85,9 +85,9 @@ Basically, all data science projects fall into the following categories:

3. *A report.* Reports are code you're turning into an output you care
about -- like a paper, book, presentation, or website. Reports
-result from rendering an \index{R Markdown} doc,
+result from rendering an \index{R Markdown}R Markdown doc,
\index{Quarto}Quarto doc, or \index{Jupyter
-Notebook} for people to consume on their computer, in print, or in a
+Notebook}Jupyter Notebook for people to consume on their computer, in print, or in a
presentation. These docs may be completely static (this book is a
\index{Quarto}Quarto doc) or have some interactive
elements.[^1-2-proj-arch-2]
@@ -275,7 +275,7 @@ to pull only once the user interactions clarify what you need.

It may be adequate to work on only a sample of the data for many tasks,
especially machine learning ones. In some cases, like classification of
-highly imbalanced classes, it may be *better* to work on a sample rather
+highly imbalanced classes, it may be **better** to work on a sample rather
than the whole dataset.

Sampling tends to work well when you're trying to compute statistical
@@ -375,7 +375,7 @@ a Google Sheet as a permanent home for data, but it can be a good
intermediate step while you're still figuring out the right solution for
your pipeline.

-The primary weakness of a \index{Google Sheets} -- that it's editable by
+The primary weakness of a \index{Google Sheets}Google Sheet -- that it's editable by
someone who logs in -- can also be an asset if that's something you
need.

10 changes: 5 additions & 5 deletions chapters/sec1/1-3-data-access.qmd
@@ -329,18 +329,18 @@ Below are common \index{API}API patterns that are good to know about:
verbs, such as a `GET` and a `POST` at the same endpoint, for
getting and setting the data that the endpoint stores.

-## \index{environment variable}Environment variable to secure data connections {#env-vars}
+## \index{environment variable}Environment Variables to Secure Data Connections {#env-vars}

When you take an app to production, authenticating to your data source
while keeping your secrets secure is crucial.

The most important thing you can do to secure your credentials is to
-avoid ever putting credentials in your code. **Your username and
-password or** \index{API}API key should never appear in your code.
+avoid ever putting credentials in your code. Your username and
+password or \index{API}API key **should never appear in your code**.

The simplest way to provide credentials without the values appearing in
your code is with an \index{environment variable}environment variable.
-\index{environment variable}Environment variable are set before your
+\index{environment variable}Environment variables are set before your
code starts -- sometimes from completely outside Python or R.
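
For instance, here's a minimal sketch in Python (in R, `Sys.getenv()` plays the same role), assuming a hypothetical `DB_PASSWORD` variable was exported in the shell or a dotfile before launch:

```python
import os

# The secret lives in the process environment, never in the script itself.
db_password = os.environ["DB_PASSWORD"]
```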

::: callout-note
@@ -600,7 +600,7 @@ button.
::: callout-tip
I recommend setting an `api_url` value at the top of your app.

-By default, be $\text{"http://127.0.0.1:8080/predict"}$. If you've
+By default, it will be $\text{"http://127.0.0.1:8080/predict"}$. If you've
changed the port from $8080$ or used a different name for your
prediction endpoint, you should adjust accordingly.
:::
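
A hypothetical sketch of that pattern in Python; whether your endpoint expects a `GET` with query parameters or a `POST` with a JSON body depends on how you defined it, so adjust accordingly:

```python
import requests

# Keep the URL in one place so it's easy to change at deployment time.
api_url = "http://127.0.0.1:8080/predict"

# Illustrative payload; the real feature names come from your model.
r = requests.post(api_url, json={"bill_length_mm": 40.2})
print(r.json())
```
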
12 changes: 6 additions & 6 deletions chapters/sec1/1-4-monitor-log.qmd
@@ -24,7 +24,7 @@ that reveal what's happening inside your project and aggregating and
consuming them. As a data scientist, you need to take on the task of
emitting helpful logs and metrics for your code. In most cases, you'll
integrate with tooling that your organization already has for log
-aggregation.
+aggregation and monitoring.

In this chapter we'll get into how to make your code observable. You'll
learn how to use tooling in R and Python to see what's happening inside
Expand Down Expand Up @@ -57,7 +57,7 @@ observe.

Moreover, you're already probably familiar with tools for *literate
programming* like \index{Jupyter Notebook}Jupyter Notebooks,
-\index{R Markdown} Documents, and \index{Quarto}Quarto Documents.
+\index{R Markdown}R Markdown Documents, and \index{Quarto}Quarto Documents.

One of my spicier opinions is that *all* jobs should be in a literate
programming format. When used well, these tools intersperse code,
@@ -99,7 +99,7 @@ purpose-built tooling for logging allows you to apply consistent formats
within logs, emit logs in useful formats, and provide visibility into
the severity of issues.

-There are great logging packages in both Python and R. Python's logging
+There are great logging packages in both Python and R. Python's `{logging}`
package is standard and included. There is no standard logging package
in R, but I recommend `{log4r}`.
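
As a quick sketch of what purpose-built logging buys you, here's Python's `{logging}` writing leveled, consistently formatted messages to a file; the file name and messages are illustrative, and `{log4r}` code looks much the same:

```python
import logging

# One line of setup gives every message a timestamp, a severity, and a home.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)

logging.info("App started")
logging.warning("Query took longer than expected")
logging.error("Could not reach the database")
```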

@@ -242,7 +242,7 @@ with `{log4r}`.[^1-4-monitor-log-1]
that's current and have the older ones numbered. So today's log
would be `my-log.log`, yesterday's would be `my-log.log.1`, the day
before `my-log.log.2`, etc. This second pattern works particularly
-well if you're using `logrotate` with `log4r`, because then `log4r`
+well if you're using `logrotate` with `{log4r}`, because then `{log4r}`
doesn't need to know anything about the log rotation. It's just
always writing to `my-log.log`.
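
If you'd rather handle rotation inside Python than hand it to `logrotate`, here's a sketch with the standard library's `TimedRotatingFileHandler` (it appends a date suffix to old files rather than `.1`, `.2`, but the active file is still `my-log.log`):

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Rotate at midnight and keep 30 days; the current log stays my-log.log.
handler = TimedRotatingFileHandler("my-log.log", when="midnight", backupCount=30)
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("This always goes to my-log.log")
```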

@@ -300,15 +300,15 @@ official Prometheus client in Python and the `{openmetrics}` package in
R makes registering metrics from a Plumber \index{API}API or
\index{Shiny}Shiny app easy.

-There's a great Get Started with Grafana and Prometheus doc on the
+There's a great [*Get Started with Grafana and Prometheus*](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/) doc on the
Grafana Labs website if you want to try it out.
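
To make the metrics half concrete, here's a minimal sketch with the official Python client; the metric name and port are illustrative:

```python
from prometheus_client import Counter, start_http_server

# Register a counter and expose a /metrics endpoint for Prometheus to scrape.
predictions_served = Counter("predictions_total", "Predictions served")
start_http_server(8000)  # metrics now live at http://localhost:8000/metrics

def predict(features):
    predictions_served.inc()  # one tick per request
    ...                       # call your model here
```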

## Comprehension questions

1. What is the difference between monitoring and logging? What are the
two halves of the monitoring and logging process?
2. Logging is generally good, but what are some things you should be
-careful *not to log*?
+careful not to log?
3. At what level would you log each of the following events:
1. Someone clicks on a particular tab in your \index{Shiny}Shiny
app.
10 changes: 5 additions & 5 deletions chapters/sec1/1-5-deployments.qmd
@@ -120,7 +120,7 @@ One way to help alleviate concerns about using real data is to create a

Working with your IT/Admin team to get these things isn't always easy.
They might not want to give you real data in dev. One point to emphasize
-is that creating this environment makes things more secure. It gives you
+is that creating this environment makes things **more secure**. It gives you
a place to do development without fear that you might damage production
data or services.

@@ -133,7 +133,7 @@ control* is the tool to make your code promotion process real.

Version control is software that allows you to keep the prod version of
your code safe, gives contributors a copy to work on, and hosts tools to
-manage merging changes back together. These days, \index{Git}Git is the
+manage merging changes back together. These days, \index{Git}*Git* is the
industry standard for version control.

\index{Git}Git is an open-source system for tracking changes to computer
@@ -152,7 +152,7 @@ from this book and learn about \index{Git}Git.

People who say learning \index{Git}Git is easy are either lying or have
forgotten. I am sorry our industry is standardized on a tool with such
-terrible ergonomics. It's worth your time to learn.
+terrible ergonomics. It is, unfortunately, worth your time to learn.

Whether you're an R or Python user, I'd recommend starting with a
resource designed to teach \index{Git}Git to a data science user. My
@@ -303,7 +303,7 @@ pipelines built into \index{Git}Git providers are very popular.
While there are a number of CI/CD pipeline tools, including Jenkins, Travis, Azure DevOps,
and GitLab, \index{GitHub Actions}GitHub Actions immediately rocketed to number one when it was released a few years ago. At this point, many organizations are quickly moving their CI/CD into GitHub Actions if they haven't already done so.

-\index{GitHub Actions}GitHub Actions are defined `.yml` files that go in the
+\index{GitHub Actions}GitHub Actions are defined in `.yml` files that go in the
`.github/workflows` directory of a project. \index{GitHub}GitHub knows
to inspect that directory and kick off any prescribed actions when there are changes to the repo. Let's talk about some of the basics of understanding and using \index{GitHub Actions}GitHub Actions.

@@ -339,7 +339,7 @@ Windows, and MacOS. You can also add custom runners. Depending on the
level of reproducibility you're aiming for, you might want to lock the
runner to a particular version of the operating system rather than just running `latest`.

-Once the job is kicked off and the runner live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running.
+Once the job is kicked off and the runner is live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running.

In \index{GitHub Actions}GitHub Actions, the `jobs` section defines the set of `steps` that comprise the action. Most steps use the `uses` command to run an action that someone else wrote. Some actions accept variables with the `with` command. In order to ensure that your Actions can remain flexible and your secrets secret, \index{GitHub Actions}GitHub Actions allows you to pull a value from the \index{GitHub}GitHub GUI and use it in a step with the `${{ <variable > }}` syntax.
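
Pulling those pieces together, here's a minimal, hypothetical workflow sketch; the file path and action versions are illustrative, and `API_KEY` stands in for whatever secret you've stored in the \index{GitHub}GitHub GUI:

```yaml
# .github/workflows/publish.yml -- a sketch, not a drop-in workflow
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # get the repo's code onto the runner
      - uses: actions/setup-python@v5    # build up the bare environment
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ./deploy.sh                 # hypothetical deploy step
        env:
          API_KEY: ${{ secrets.API_KEY }}
```
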

8 changes: 4 additions & 4 deletions chapters/sec1/1-6-docker.qmd
@@ -98,9 +98,9 @@ Some organizations run private registries, usually using *registry as a
service* offerings from cloud providers.[^1-6-docker-2]

[^1-6-docker-2]: The big three \index{container}container registries are
-AWS Elastic \index{container}container Registry (ECR), Azure
-\index{container}container Registry, and Google
-\index{container}container Registry.
+AWS Elastic \index{container}Container Registry (ECR), Azure
+\index{container}Container Registry, and Google
+\index{container}Container Registry.

Images are built from *Dockerfiles* -- the code that defines the image.
Dockerfiles are usually stored in a \index{Git}Git repository. Building
@@ -368,7 +368,7 @@ docker run --rm -d \

1. This line is necessary because the model lives at `/data/model` on our **host** machine.
But the \index{API}API inside the \index{container}container is looking
-for `/data/model` **inside the** \index{container}container. We need to make sure that the directory exists and has the model in it.
+for `/data/model` **inside the container**. We need to make sure that the directory exists and has the model in it.
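
For reference, the flag under discussion looks something like the sketch below; the paths match the lab, but the image name is hypothetical:

```bash
# Mount the host's /data/model into the container at the path the API expects.
docker run --rm -d \
  -v /data/model:/data/model \
  my-api-image
```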

### Lab Extensions
