diff --git a/.gitignore b/.gitignore index b10c9faa..30940f76 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,4 @@ lek_versions/* *docx* *.tex chapters.zip +* 2* diff --git a/chapters/intro.qmd b/chapters/intro.qmd index cc34724a..10bac39b 100644 --- a/chapters/intro.qmd +++ b/chapters/intro.qmd @@ -28,7 +28,7 @@ the background, and there's no \index{Google Sheets}Google Sheet, \index{csv}CSV file, or half-baked database query in sight. But that's a myth. If you're a data scientist putting your work in front -of someone else's eyes, you are *in production*. And, I believe, if +of someone else's eyes, you are in production. And, I believe, if you're in production, this book is for you. You may sensibly ask who I am to make such a proclamation. @@ -45,7 +45,7 @@ science products more robust with open-source tooling and \index{Posit}Posit's Professional Products. I've seen organizations at every level of data science maturity. For -some organizations, "in production" means a report that gets rendered +some organizations, in production means a report that gets rendered and emailed around. For others, it means hosting a live app or dashboard that people visit. For the most sophisticated, it means serving live predictions to another service from a machine learning model via an @@ -297,8 +297,8 @@ pretend we care deeply about the relationship between penguin bill length and mass, and we're going to build up an entire data science environment dedicated to exploring that relationship. -The front end of this environment will be a website built in -\index{Quarto}Quarto. It will include an app for fetching penguin mass +The front end of this environment will be a website built with the +\index{Quarto}Quarto publishing system. It will include an app for fetching penguin mass predictions from a machine learning model based on bill length and other features. The website will also have pages dedicated to exploratory data analysis and model building. @@ -310,7 +310,7 @@ working. It will also host the machine learning model as an \index{API}API and the \index{Shiny}Shiny app for the website. The whole thing will get auto-deployed from a \index{Git}Git repo using -\index{GitHub Actions}GitHub ActionsGitHub Actions. +\index{GitHub Actions}GitHub Actions. From an architectural perspective, it'll look something like this: diff --git a/chapters/sec1/1-0-sec-intro.qmd b/chapters/sec1/1-0-sec-intro.qmd index daeb8e73..c8c5fc4c 100644 --- a/chapters/sec1/1-0-sec-intro.qmd +++ b/chapters/sec1/1-0-sec-intro.qmd @@ -58,7 +58,7 @@ experience. But, there are best practices you can follow to make it easier to deliver value once you've discovered something interesting. In the -chapters in this part, we'll explore what data science and data +chapters in this part of the book, we'll explore what data science and data scientists can learn from DevOps to make your apps and environments as robust as possible. @@ -111,16 +111,15 @@ promotion system. ### Docker for Data Science -\index{Docker}Docker is an increasingly popular tool in the software -development and data science world that allows for the easy capture and -sharing of the environment around code. \index{Docker}Docker is +\index{Docker}Docker is a software development tool that makes it easy to capture and +share the environment around code. It is increasingly popular in data science contexts, so [Chapter @sec-docker] is a basic introduction to what \index{Docker}Docker is and how to use it. 
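To make that concrete right away, here's a minimal sketch of the kind of Dockerfile that chapter works up to. Everything here -- the base image, the file names, and the start command -- is an illustrative placeholder rather than the lab's actual setup:

``` {.dockerfile filename="Dockerfile"}
# Start from an image that already includes the language runtime
FROM python:3.11-slim

WORKDIR /app

# Restore the package environment from the lockfile first, so this
# layer is cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the project code and declare how to start it
COPY . .
CMD ["python", "app.py"]
```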
## Labs in this part -Each chapter in this part has a lab so you can get hands-on experience +Each chapter in this part of the book has a lab so you can get hands-on experience implementing DevOps best practices in your data science projects. You'll create a website in the labs to explore the Palmer Penguins diff --git a/chapters/sec1/1-1-env-as-code.qmd b/chapters/sec1/1-1-env-as-code.qmd index 6058ba3f..e7b2af18 100644 --- a/chapters/sec1/1-1-env-as-code.qmd +++ b/chapters/sec1/1-1-env-as-code.qmd @@ -102,7 +102,7 @@ As a data scientist, you can and should be responsible for the package layer, and getting this layer right is where the biggest reproducibility bang for your buck lies. If you find yourself managing the system or hardware layer, [@sec-cloud] through [@sec-ssl] will teach you how to -manage those layers. +do that. ## The package layer @@ -136,7 +136,7 @@ come back, it's likely that future you or your colleague won't have the correct versions and your code will break. What would've been better is if you'd had an environment as code -strategy that created a portable environment for each **project** on +strategy that created a portable environment for **each project** on your system. A successful package environment as code setup has two key attributes: @@ -228,8 +228,7 @@ You'll still have those same packages to use. ### Step 3: Collaborate or deploy -When you go to share your project, you don't want to share your actual -package libraries. Package installs are specific to the +When you share your project, you want to share only the lockfile, not the actual package libraries. Package installs are specific to the \index{operating system}operating system and the language version you're using, so you want your target system to install the package specifically for that system. @@ -292,7 +291,7 @@ Sometimes, IT/Admins want to save space further by sharing package caches across users. This is usually a mistake. Sharing package caches leads to headaches over user file permissions to write to the package cache versus read. Storage space is cheap, way cheaper than your time. -If you have to do it, both \index{renv}`{renv}` and \index{venv}`venv` +If you have to do it, both \index{renv}`{renv}` and \index{venv}`{venv}` include settings to allow you to relocate the package cache to a shared location on the server. @@ -310,7 +309,7 @@ location on the server. one? 4. Draw a mental map of the relationships between the following: package repository, package library, package, project-level-library, - `.libPaths()` (R) or `sys.path`(Python), lockfile. + `.libPaths()` (R) or `sys.path` (Python), lockfile. 5. Why is it a bad idea to share package libraries? What's the best way to collaborate with a colleague using an environment as code? What commands will you run in R or Python to save a package environment diff --git a/chapters/sec1/1-2-proj-arch.qmd b/chapters/sec1/1-2-proj-arch.qmd index e7c942e9..8f77a04d 100644 --- a/chapters/sec1/1-2-proj-arch.qmd +++ b/chapters/sec1/1-2-proj-arch.qmd @@ -85,9 +85,9 @@ Basically, all data science projects fall into the following categories: 3. *A report.* Reports are code you're turning into an output you care about -- like a paper, book, presentation, or website. 
Reports - result from rendering an \index{R Markdown} doc, + result from rendering an \index{R Markdown}R Markdown doc, \index{Quarto}Quarto doc, or \index{Jupyter - Notebook} for people to consume on their computer, in print, or in a + Notebook}Jupyter Notebook for people to consume on their computer, in print, or in a presentation. These docs may be completely static (this book is a \index{Quarto}Quarto doc) or have some interactive elements.[^1-2-proj-arch-2] @@ -275,7 +275,7 @@ to pull only once the user interactions clarify what you need. It may be adequate to work on only a sample of the data for many tasks, especially machine learning ones. In some cases, like classification of -highly imbalanced classes, it may be *better* to work on a sample rather +highly imbalanced classes, it may be **better** to work on a sample rather than the whole dataset. Sampling tends to work well when you're trying to compute statistical @@ -375,7 +375,7 @@ a Google Sheet as a permanent home for data, but it can be a good intermediate step while you're still figuring out the right solution for your pipeline. -The primary weakness of a \index{Google Sheets} -- that it's editable by +The primary weakness of a \index{Google Sheets}Google Sheet -- that it's editable by someone who logs in -- can also be an asset if that's something you need. diff --git a/chapters/sec1/1-3-data-access.qmd b/chapters/sec1/1-3-data-access.qmd index e1d59a4c..5c4ed9de 100644 --- a/chapters/sec1/1-3-data-access.qmd +++ b/chapters/sec1/1-3-data-access.qmd @@ -329,18 +329,18 @@ Below are common \index{API}API patterns that are good to know about: verbs, such as a `GET` and a `POST` at the same endpoint, for getting and setting the data that the endpoint stores. -## \index{environment variable}Environment variable to secure data connections {#env-vars} +## \index{environment variable}Environment Variables to Secure Data Connections {#env-vars} When you take an app to production, authenticating to your data source while keeping your secrets secure is crucial. The most important thing you can do to secure your credentials is to -avoid ever putting credentials in your code. **Your username and -password or** \index{API}API key should never appear in your code. +avoid ever putting credentials in your code. Your username and +password or \index{API}API key **should never appear in your code**. The simplest way to provide credentials without the values appearing in your code is with an \index{environment variable}environment variable. -\index{environment variable}Environment variable are set before your +\index{environment variable}Environment variables are set before your code starts -- sometimes from completely outside Python or R. ::: callout-note @@ -600,7 +600,7 @@ button. ::: callout-tip I recommend setting an `api_url` value at the top of your app. -By default, be $\text{"http://127.0.0.1:8080/predict"}$. If you've +By default, it will be $\text{"http://127.0.0.1:8080/predict"}$. If you've changed the port from $8080$ or used a different name for your prediction endpoint, you should adjust accordingly. ::: diff --git a/chapters/sec1/1-4-monitor-log.qmd b/chapters/sec1/1-4-monitor-log.qmd index d714b437..2f357216 100644 --- a/chapters/sec1/1-4-monitor-log.qmd +++ b/chapters/sec1/1-4-monitor-log.qmd @@ -24,7 +24,7 @@ that reveal what's happening inside your project and aggregating and consuming them. As a data scientist, you need to take on the task of emitting helpful logs and metrics for your code. 
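For a taste of what emitting a log looks like, here's a minimal sketch using Python's standard `{logging}` package (recommended later in this chapter); the filename and messages are placeholders:

``` {.python}
import logging

# Write timestamped, leveled lines to a log file
logging.basicConfig(
    filename="job.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("Starting data pull")
logging.warning("Found %d rows with missing values", 12)
```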
In most cases, you'll integrate with tooling that your organization already has for log -aggregation. +aggregation and monitoring. In this chapter we'll get into how to make your code observable. You'll learn how to use tooling in R and Python to see what's happening inside @@ -57,7 +57,7 @@ observe. Moreover, you're already probably familiar with tools for *literate programming* like \index{Jupyter Notebook}Jupyter Notebooks, -\index{R Markdown} Documents, and \index{Quarto}Quarto Documents. +\index{R Markdown}R Markdown Documents, and \index{Quarto}Quarto Documents. One of my spicier opinions is that *all* jobs should be in a literate programming format. When used well, these tools intersperse code, @@ -99,7 +99,7 @@ purpose-built tooling for logging allows you to apply consistent formats within logs, emit logs in useful formats, and provide visibility into the severity of issues. -There are great logging packages in both Python and R. Python's logging +There are great logging packages in both Python and R. Python's `{logging}` package is standard and included. There is no standard logging package in R, but I recommend `{log4r}`. @@ -242,7 +242,7 @@ with `{log4r}`.[^1-4-monitor-log-1] that's current and have the older ones numbered. So today's log would be `my-log.log`, yesterday's would be `my-log.log.1`, the day before `my-log.log.2`, etc. This second pattern works particularly - well if you're using `logrotate` with `log4r`, because then `log4r` + well if you're using `logrotate` with `{log4r}`, because then `{log4r}` doesn't need to know anything about the log rotation. It's just always writing to `my-log.log`. @@ -300,7 +300,7 @@ official Prometheus client in Python and the `{openmetrics}` package in R makes registering metrics from a Plumber \index{API}API or \index{Shiny}Shiny app easy. -There's a great Get Started with Grafana and Prometheus doc on the +There's a great [*Get Started with Grafana and Prometheus*](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/) doc on the Grafana Labs website if you want to try it out. ## Comprehension questions @@ -308,7 +308,7 @@ Grafana Labs website if you want to try it out. 1. What is the difference between monitoring and logging? What are the two halves of the monitoring and logging process? 2. Logging is generally good, but what are some things you should be - careful *not to log*? + careful not to log? 3. At what level would you log each of the following events: 1. Someone clicks on a particular tab in your \index{Shiny}Shiny app. diff --git a/chapters/sec1/1-5-deployments.qmd b/chapters/sec1/1-5-deployments.qmd index ea238261..bb9ad4bf 100644 --- a/chapters/sec1/1-5-deployments.qmd +++ b/chapters/sec1/1-5-deployments.qmd @@ -120,7 +120,7 @@ One way to help alleviate concerns about using real data is to create a Working with your IT/Admin team to get these things isn't always easy. They might not want to give you real data in dev. One point to emphasize -is that creating this environment makes things more secure. It gives you +is that creating this environment makes things **more secure**. It gives you a place to do development without fear that you might damage production data or services. @@ -133,7 +133,7 @@ control* is the tool to make your code promotion process real. Version control is software that allows you to keep the prod version of your code safe, gives contributors a copy to work on, and hosts tools to -manage merging changes back together. 
These days, \index{Git}Git is the +manage merging changes back together. These days, \index{Git}*Git* is the industry standard for version control. \index{Git}Git is an open-source system for tracking changes to computer @@ -152,7 +152,7 @@ from this book and learn about \index{Git}Git. People who say learning \index{Git}Git is easy are either lying or have forgotten. I am sorry our industry is standardized on a tool with such -terrible ergonomics. It's worth your time to learn. +terrible ergonomics. It is, unfortunately, worth your time to learn. Whether you're an R or Python user, I'd recommend starting with a resource designed to teach \index{Git}Git to a data science user. My @@ -303,7 +303,7 @@ pipelines built into \index{Git}Git providers are very popular. While there are a number of CI/CD pipeline tools, including Jenkins, Travis, Azure DevOps, and GitLab, \index{GitHub Actions}GitHub Actions immediately rocketed to number one when it was released a few years ago. At this point, many organizations are quickly moving their CI/CD into GitHub Actions if they haven't already done so. -\index{GitHub Actions}GitHub Actions are defined `.yml` files that go in the +\index{GitHub Actions}GitHub Actions are defined in `.yml` files that go in the `.github/workflows` directory of a project. \index{GitHub}GitHub knows to inspect that directory and kick off any prescribed actions when there are changes to the repo. Let's talk about some of the basics of understanding and using \index{GitHub Actions}GitHub Actions. @@ -339,7 +339,7 @@ Windows, and MacOS. You can also add custom runners. Depending on the level of reproducibility you're aiming for, you might want to lock the runner to a particular version of the operating system rather than just running `latest`. -Once the job is kicked off and the runner live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running. +Once the job is kicked off and the runner is live, it's time to actually do something. Because the default runners are all basically bare operating systems, the action needs to include steps to build the environment before you can actually run any code. Depending on what you're doing, that will mean installing OS dependencies, installing Python and/or R, and installing R and Python packages for whatever content you're running. In \index{GitHub Actions}GitHub Actions, the `jobs` section defines the set of `steps` that comprise the action. Most steps use the `uses` command to run an action that someone else wrote. Some actions accept variables with the `with` command. In order to ensure that your Actions can remain flexible and your secrets secret, \index{GitHub Actions}GitHub Actions allows you to pull a value from the \index{GitHub}GitHub GUI and use it in a step with the `${{ }}` syntax. 
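To see how those pieces fit together, here's a sketch of a minimal workflow file. The trigger, runner version, and steps are illustrative, and the secret name is a placeholder:

``` {.yaml filename=".github/workflows/publish.yml"}
name: Render and publish

on:
  push:
    branches: [main]

jobs:
  build:
    # Pin the runner version rather than running `latest`
    runs-on: ubuntu-22.04
    steps:
      # `uses` runs actions other people wrote
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Build the package environment, then do the actual work
      - run: pip install -r requirements.txt
      - run: python render_site.py
        env:
          # Pull a secret from the GitHub GUI with the ${{ }} syntax
          API_KEY: ${{ secrets.API_KEY }}
```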
diff --git a/chapters/sec1/1-6-docker.qmd b/chapters/sec1/1-6-docker.qmd index a173f8ac..f9b8c4fb 100644 --- a/chapters/sec1/1-6-docker.qmd +++ b/chapters/sec1/1-6-docker.qmd @@ -98,9 +98,9 @@ Some organizations run private registries, usually using *registry as a service* offerings from cloud providers.[^1-6-docker-2] [^1-6-docker-2]: The big three \index{container}container registries are - AWS Elastic \index{container}container Registry (ECR), Azure - \index{container}container Registry, and Google - \index{container}container Registry. + AWS Elastic \index{container}Container Registry (ECR), Azure + \index{container}Container Registry, and Google + \index{container}Container Registry. Images are built from *Dockerfiles* -- the code that defines the image. Dockerfiles are usually stored in a \index{Git}Git repository. Building @@ -368,7 +368,7 @@ docker run --rm -d \ 1. This line is necessary because the model lives at `/data/model` on our **host** machine. But the \index{API}API inside the \index{container}container is looking -for `/data/model` **inside the** \index{container}container. We need to make sure that the directory exists and has the model in it. +for `/data/model` **inside the container**. We need to make sure that the directory exists and has the model in it. ### Lab Extensions diff --git a/chapters/sec2/2-1-cloud.qmd b/chapters/sec2/2-1-cloud.qmd index 52f5cecc..256cc2e6 100644 --- a/chapters/sec2/2-1-cloud.qmd +++ b/chapters/sec2/2-1-cloud.qmd @@ -12,8 +12,8 @@ essential cloud services for data science. [^2-1-cloud-1]: Yes, that is a *Sound of Music* reference. -This chapter has two labs. In the first, you'll start with an *AWS -(Amazon Web Services)* server -- getting it stood up and learning how to +This chapter has two labs. In the first, you'll start with an *AWS* +(Amazon Web Services) server -- getting it stood up and learning how to start and stop it. In the second lab, you'll put the model from our penguin mass modeling lab into an \index{S3}S3 bucket (more on that in a bit). @@ -34,7 +34,7 @@ someone and buying a bunch of hardware probably isn't worth it. Enter an online bookstore named Amazon. Around 2000, Amazon started centralizing servers across the company so teams who needed capacity could acquire it from this central pool instead of running their own. -Over the next few years, Amazon execs (correctly) realized that other +Over the next few years, Amazon's leaders (correctly) realized that other companies and organizations would value this ability to rent server capacity. They launched this "rent a server" business as AWS in 2006. @@ -61,8 +61,8 @@ purported benefits are real, while some are not. The most important cloud benefit is flexibility. Moving to the cloud allows you to get a new server or re-scale an existing one in minutes; -you only pay for what you use, often on an hourly basis.[^2-1-cloud-4] -Because you pay as you go, the risk of incorrectly guessing how much +you can pay only for what you use, often on an hourly basis.[^2-1-cloud-4] +Because you can pay as you go, the risk of incorrectly guessing how much capacity you'll need is way lower than in an on-prem environment. [^2-1-cloud-4]: A huge amount of cloud spending is now done via annual @@ -210,15 +210,15 @@ from scratch. 
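For instance, with a configured AWS CLI, renting and releasing a server -- the first item in the list just below -- is a couple of commands. The IDs and names in this sketch are placeholders:

``` {.bash filename="Terminal"}
# Rent a server (image ID, instance type, and key name are placeholders)
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.medium \
  --key-name my-key

# Stop it when you're done so you stop paying for compute
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```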
 Some common IaaS services you're likely to use include:
 
--   Renting a server from AWS with \index{EC2}*EC2 (Elastic Cloud
-    Compute)*, Azure with *Azure VMs*, or GCP with *Google Compute
+-   Renting a server from AWS with \index{EC2}*EC2* (Elastic Compute
+    Cloud), Azure with *Azure VMs*, or GCP with *Google Compute
     Engine Instances*.
 
 -   Attaching storage with AWS's *EBS (Elastic Block Store)*, *Azure
     Managed Disk*, or *Google Persistent Disk*.
 
 -   Creating and managing the networking where your servers sit with
-    AWS's *VPC (Virtual Private Cloud)*, Azure's *Virtual Network*, and
+    AWS's *VPC* (Virtual Private Cloud), Azure's *Virtual Network*, and
     GCP's *Virtual Private Cloud*.
 
 -   Managing DNS records via AWS's *Route 53*, *Azure* *DNS*, and
@@ -238,17 +238,17 @@ to the service.
 
 In the cake-baking world, PaaS would be like buying a pre-made cake and
 some frosting and writing "Happy Birthday!" on the cake yourself.
 
-One PaaS service that already came up in this book is *blob (Binary
-Large Object)* storage. Blob storage allows you to put objects somewhere
+One PaaS service that already came up in this book is *blob* (Binary
+Large Object) storage. Blob storage allows you to put objects somewhere
 and recall them to any other machine that has access to the blob store.
 Many data science artifacts, including machine learning models, are kept
-in blob stores. The major blob stores are AWS's \index{S3}*S3 (Simple
-Storage Service)*, *Azure Blob Storage*, and *Google Cloud Storage*.
+in blob stores. The major blob stores are AWS's \index{S3}*S3* (Simple
+Storage Service), *Azure Blob Storage*, and *Google Cloud Storage*.
 
 You'll also likely use cloud-based database, data lake, and data
-warehouse offerings. I've seen *RDS* or *Redshift* from AWS, *Azure
-Database* or *Azure Datalake*, and *Google Cloud Database* and *Google
-BigQuery* used most frequently*.* This category also includes several
+warehouse offerings. I've seen *RDS* or *Redshift* from AWS, *Azure
+Database* or *Azure Data Lake*, and *Google Cloud Database* and *Google
+BigQuery* used most frequently. This category also includes several
 offerings from outside the big three, most notably *Snowflake* and
 *Databricks*.[^2-1-cloud-8]
@@ -257,17 +257,17 @@ somewhat immaterial.
 
 Depending on your organization, you may also use services that run APIs
-or applications from containers or machine images like AWS's *ECS
-(Elastic* \index{container}container Service), *Elastic Beanstalk*, or
-*Lambda*, Azure's \index{container}container Apps or *Functions*, or
+or applications from containers or machine images like AWS's *ECS*
+(Elastic \index{container}Container Service), *Elastic Beanstalk*, or
+*Lambda*, Azure's \index{container}Container Apps or *Functions*, or
 GCP's *App Engine* or *Cloud Functions*.
 
 Increasingly, organizations are turning to
 \index{Kubernetes}*Kubernetes* to host services. (More on that in
 [Chapter @sec-ent-scale].) Most organizations who do so use a cloud
-provider's \index{Kubernetes}Kubernetes cluster as a service: AWS's *EKS
-(Elastic* \index{Kubernetes}Kubernetes Service) or *Fargate*, Azure's
-*AKS (Azure Kubernetes Service)*, or GCP's *GKE (Google*
+provider's \index{Kubernetes}Kubernetes cluster as a service: AWS's *EKS*
+(Elastic \index{Kubernetes}Kubernetes Service) or *Fargate*, Azure's
+*AKS* (Azure Kubernetes Service), or GCP's *GKE* (Google
 \index{Kubernetes}Kubernetes Engine). 
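As a quick illustration of why PaaS is attractive, here's a sketch of putting a model into the blob storage described above and recalling it, using the `{boto3}` package against S3. The bucket and file names are invented:

``` {.python}
import boto3

s3 = boto3.client("s3")

# Put a model artifact in the blob store...
s3.upload_file("model.pkl", "my-model-bucket", "penguins/model.pkl")

# ...and recall it later from any machine with access to the bucket
s3.download_file("my-model-bucket", "penguins/model.pkl", "model.pkl")
```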
Many organizations are moving to PaaS solutions for hosting applications @@ -318,8 +318,8 @@ you've found it lacking. Regardless of what you're trying to do, if you're working in the cloud, you must ensure that the right people have the correct permissions. To -manage these permissions, AWS has *IAM (Identity and Access -Management)*, GCP has *Identity Access Management*, and Azure has +manage these permissions, AWS has *IAM* (Identity and Access +Management), GCP has *Identity Access Management*, and Azure has *Microsoft Entra ID*, which was called *Azure Active Directory* until the summer of 2023. Your organization might integrate these services with a SaaS identity management solution like *Okta* or *OneLogin*. @@ -432,7 +432,7 @@ recommend you name the server something like `do4ds-lab` in case you stand up others later. If you're doing this at work, there may be tagging policies so the -IT/Admin team can figure out who servers belong to later. +IT/Admin team can figure out who servers belong to. #### Image diff --git a/chapters/sec2/2-2-cmd-line.qmd b/chapters/sec2/2-2-cmd-line.qmd index 1bfc7f1f..573f829b 100644 --- a/chapters/sec2/2-2-cmd-line.qmd +++ b/chapters/sec2/2-2-cmd-line.qmd @@ -219,12 +219,10 @@ you're trying to \index{SSH}SSH into, like a server or \index{Git}Git host, but your private key must be treated as a precious secret. When you use the `ssh` command, your local machine sends a request to -open an \index{SSH}SSH session to the remote and includes the private key with the -request. The remote host verifies the private key with the public key -and opens an encrypted connection. +open an \index{SSH}SSH session to the remote. Your local machine and the remote use the keys to verify each other and open an encrypted connection. ![](images/ssh.png){.lightbox -fig-alt="A diagram of \index{SSH}SSH initialization. The local host sends the private key, the remote checks against the public key, and then opens the session."} +fig-alt="A diagram of \index{SSH}SSH initialization. The local host sends a request signed by the private key, the remote checks against the public key, and then opens the session."} It can be hard to remember how to configure \index{SSH}SSH. So let's detour into *public key cryptography*, the underlying technology. Once you've built @@ -233,7 +231,7 @@ mechanically remembering. Public key cryptography uses mathematical operations that are simple in one direction but hard to reverse to make it easy to check whether a -proffered private key is valid but nearly impossible to fabricate a +particular private key is valid but nearly impossible to fabricate a private key from a public key. ::: callout-tip @@ -354,7 +352,7 @@ One annoyance about \index{SSH}SSH is that it blocks the terminal it's using and connection will break when your computer goes to sleep. Many people like using the *tmux* command line utility to help solve these issues. -*tmux* is a terminal multiplexer, which allows you to manipulate +As a terminal multiplexer, tmux allows you to manipulate terminal sessions from the command line, including putting sessions into the background and making sessions durable through sleeps and other operations. I'm mentioning tmux because many people love it, but I've diff --git a/chapters/sec2/2-3-linux.qmd b/chapters/sec2/2-3-linux.qmd index a68cc814..18f6e591 100644 --- a/chapters/sec2/2-3-linux.qmd +++ b/chapters/sec2/2-3-linux.qmd @@ -23,7 +23,7 @@ on Linux.[^2-3-linux-1] To administer a server, you'll have to learn a little about Linux. 
In this chapter, you'll learn about the history of Linux and how to -navigate and manipulate a server running Linux. +navigate and manipulate a server running Linux. Many of these techniques are also useful on your laptop's command line. ## A brief history of Linux @@ -107,7 +107,7 @@ year and month they were released. Most people use the Long Term Support Ubuntu versions have fun alliterative names, so you'll hear people refer to releases by name or version. As of this writing, most Ubuntu machines -run Focal (20.04, Focal Fossa) or Jammy (22.04, Jammy Jellyfish). Noble (24.04, Noble Numbat) is imminent. +run Focal (20.04, Focal Fossa), Jammy (22.04, Jammy Jellyfish), or Noble (24.04, Noble Numbat). ::: ## Administering Linux with bash @@ -204,7 +204,7 @@ Whenever a program is running, it is running as a particular user identified by their *username*. On any Unix-like system, the `whoami` command returns the username of -the active user. When I run `whoami` on my MacBook, I get: +the active user. For example, running `whoami` might look like: ``` {.bash filename="Terminal"} > whoami @@ -241,8 +241,7 @@ their username. Like a user has a `uid`, a group has a `gid`. User `gid`s start at 100. -The `id` command shows a user's username, `uid`, groups, and `gid`. On -my MacBook, I'm a member of several groups, with the primary group +The `id` command shows a user's username, `uid`, groups, and `gid`. For example, I might be a member of several groups, with the primary group `staff`. ``` {.bash filename="Terminal"} @@ -317,7 +316,7 @@ have the permission. So the permissions in the graphic would be `-rwxr-x--x` for a file and `drwxr-x--x` for a directory. -The best way to get these permissions is to run the `ls -l` command. +The best way to get these permissions is to run the `ls -l` command. For example: ``` {.bash filename="Terminal"} > ls -l diff --git a/chapters/sec2/2-4-app-admin.qmd b/chapters/sec2/2-4-app-admin.qmd index 90d8e7bc..a6b40e1c 100644 --- a/chapters/sec2/2-4-app-admin.qmd +++ b/chapters/sec2/2-4-app-admin.qmd @@ -309,7 +309,7 @@ ensure the \index{container}container you care about comes up whenever However, many \index{Docker}Docker services involve coordinating more than one \index{container}container. If so, you'll want to use a purpose-built system for managing multiple containers. The most popular -are \index{Docker}Docker Compose or \index{Kubernetes}*Kubernetes*. +are \index{Docker}*Docker Compose* or \index{Kubernetes}*Kubernetes*. \index{Docker}Docker Compose is a relatively lightweight system that allows you to write a YAML file describing the containers you need and @@ -602,10 +602,10 @@ First, we haven't daemonized the \index{API}API. Feel free to try Second, neither the \index{API}API nor the \index{Shiny}Shiny app will automatically update when we change them. You might want to set up -a\index{GitHub Actions}GitHub Actions to do so. For \index{Shiny}Shiny Server, you'll +a \index{GitHub Actions}GitHub Action to do so. For \index{Shiny}Shiny Server, you'll need to push the updates to the server and then restart \index{Shiny}Shiny Server. For the \index{API}API, you'd need to -configure a \index{GitHub}GitHub action to rebuild the +configure a \index{GitHub}GitHub Action to rebuild the \index{container}container and push it to a registry. You'd then need to tell \index{Docker}Docker on the server to re-pull and restart the \index{container}container. 
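That last step is only a few commands on the server. Something like this sketch would do it, with the image and container names as placeholders:

``` {.bash filename="Terminal"}
# Pull the rebuilt image from the registry
docker pull myorg/penguin-api:latest

# Replace the running container with one based on the new image
docker stop penguin-api
docker rm penguin-api
docker run -d --name penguin-api -p 8080:8080 myorg/penguin-api:latest
```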
diff --git a/chapters/sec2/2-5-scale.qmd b/chapters/sec2/2-5-scale.qmd
index 861bd3e2..c27e8d11 100644
--- a/chapters/sec2/2-5-scale.qmd
+++ b/chapters/sec2/2-5-scale.qmd
@@ -184,12 +184,11 @@ preceded them.
 
 HDDs consist of spinning magnetic disks with magnetized read/write
 heads that save and read data from the disks. While HDDs spin very fast --
-5,400 and 7,200 RPM are typical speeds -- SSDs with no moving parts are
-much faster.
+5,400 and 7,200 RPM are typical speeds -- SSDs with no moving parts are still much faster.
 
 ## Recommendation 3: Get lots of storage; it's cheap
 
-Get however much storage you think you'll need when you configure storage your server, but don't
+Get however much storage you think you'll need when you configure your server, but don't
 think too hard. Storage is cheap and easy to upgrade. It's almost
 always more cost-effective to buy additional storage than to have a
 highly-paid human figure out how to delete things to free up room.
@@ -233,8 +232,8 @@ size you estimate will be adequate and add more if needed.
 
 ## GPUs are special-purpose compute
 
 All computers have a CPU. Some computers have specialized chips where
-the CPU can offload particular tasks -- the most common being the
-graphical processing unit (GPU). GPUs are architected for tasks like
+the CPU can offload particular tasks -- the most common being the *GPU*
+(graphics processing unit). GPUs are architected for tasks like
 rendering video game graphics, some kinds of machine learning, training
 large language models (LLMs), and, yes, Bitcoin mining.[^2-5-scale-3]
 
diff --git a/chapters/sec3/3-0-sec-intro.qmd b/chapters/sec3/3-0-sec-intro.qmd
index 45974a86..f5a150c1 100644
--- a/chapters/sec3/3-0-sec-intro.qmd
+++ b/chapters/sec3/3-0-sec-intro.qmd
@@ -74,7 +74,7 @@ extreme form, this is someone entirely outside the
 organization (*outsider threat*). But it also could be someone inside
 the organization who is disgruntled or seeking personal gain (*insider
 threat*). And even if data isn't stolen, it's bad if someone hijacks
-your computational resources for nefarious ends like crypto-remining or
+your computational resources for nefarious ends like crypto-mining or
 virtual DDOS attacks on Turkish banks.[^3-0-sec-intro-1]
 
 [^3-0-sec-intro-1]: Yes, both things I've actually seen happen.
diff --git a/chapters/sec3/3-2-auth.qmd b/chapters/sec3/3-2-auth.qmd
index 35af51c8..a2abc07d 100644
--- a/chapters/sec3/3-2-auth.qmd
+++ b/chapters/sec3/3-2-auth.qmd
@@ -189,7 +189,7 @@ OAuth and IAM, to secure access to data sources, including databases,
 APIs, and cloud services. Sometimes, you'll have to manually navigate
 the token exchange process in your Python or R code. For example,
 you've likely acquired and dispatched an OAuth token to access a
-\index{Google Sheets} or a modern data \index{API}API.
+\index{Google Sheets}Google Sheet or a modern data \index{API}API.
 
 Increasingly, IT/Admins want users to have the experience of logging in
 and automatically accessing data sources. This situation is sometimes
diff --git a/chapters/sec3/3-4-ent-pm.qmd b/chapters/sec3/3-4-ent-pm.qmd
index 03e6bb15..ff7b42c4 100644
--- a/chapters/sec3/3-4-ent-pm.qmd
+++ b/chapters/sec3/3-4-ent-pm.qmd
@@ -61,7 +61,7 @@ These CVEs can get into your organization when they are in code, which
 is a component of the software you're using directly. For example, a
 CVE in JavaScript might show up in the version of JavaScript used by
 \index{Jupyter
-Notebook}, \index{RStudio}RStudio, \index{Shiny}Shiny, or \index{Streamlit}Streamlit. 
+Notebook}Jupyter Notebook, \index{RStudio}RStudio, \index{Shiny}Shiny, or \index{Streamlit}Streamlit. Many companies disallow using software with `Critical` CVEs and only temporarily allow software with a few `High` CVEs. @@ -202,11 +202,11 @@ some convincing. Many enterprises run *package repository software* inside their firewall to govern package ingestion and availability. Most package repository products are paid because enterprises primarily need them. Common ones -include Jfrog Artifactory, Sonatype Nexus, \index{Anaconda} Business, +include Jfrog Artifactory, Sonatype Nexus, \index{Anaconda}Anaconda Business, and \index{Posit}Posit Package Manager. Artifactory and Nexus are generalized library and package management -solutions for all sorts of software, while \index{Anaconda} and +solutions for all sorts of software, while \index{Anaconda}Anaconda and \index{Posit}Posit Package Manager are more narrowly tailored for data science use cases. I'd suggest working with your IT/Admins to get data science focused repository software. Often these repositories can run