Second edition: rationale, changes, outline, and feedback #101

jeroenjanssens · 2020-06-10T18:29:32Z

I'm happy to announce that I'll be writing the second edition of Data Science at the Command Line (O'Reilly, 2014). This issue explains why I think a second edition is needed, lists what changes I plan to make, and presents a tentative outline. Finally, I have a few words about the process and giving feedback.

Why a second edition?

While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have: (1) been superseded by newer tools (e.g., csvkit has been replaced by xsv), (2) been abandoned by their developers (e.g., drake), or (3) turned out to be suboptimal choices (e.g., weka). Since the first edition was published in October 2014 I have learned a lot, both through my own experience and through the useful feedback from its readers. Even though the book is quite niche because it lies at the intersection of two subjects, there remains a steady interest from the data science community. I notice this from the many positive messages I receive almost every day. By updating the first edition I hope to keep the book relevant for at least another five years.

Changes with respect to the first edition

These are the general changes I currently have in mind. Please note that this is subject to change.

Throughout the book: Replace csvkit with xsv as much as possible. xsv is a much faster alternative for working with CSV files.
Section 1.6: Replace the data set used with one that is accessible without an API key.
Sections 2.2 and 3.2: Replace the VirtualBox image with a Docker image (this is already done on https://www.datascienceatthecommandline.com). Docker is a faster and more lightweight way of running an isolated environment than VirtualBox.
Section 4.3: Split Python and R into separate sections. Furthermore, explain how to parse command-line options in those languages.
Section 5.4: Split into two sections. Use xmlstarlet for working with XML.
Section 5.5: Move these subsections beneath Section 5.3.
Section 5.6: Use pup instead of scrape to work with HTML. scrape is a Python tool I created myself. pup is much faster, has more features, and is easier to install.
Chapter 6: Replace Drake with Make. Drake is no longer maintained. Make is much more mature and is also very popular with developers.
Sections 7.3.2 and 7.4.x: Replace Rio with littler. Rio is a Bash script I created myself. littler is a much more stable way of using R from the command line and is easier to install.
Chapter 8: Add new sections that discuss how to get a list of running instances from not only AWS but also from two newer cloud providers: GCP and Azure.
Chapter 9: Replace Weka, BigML, and SKLL with Vowpal Wabbit. Weka is old and the way it is used from the command line is clunky. BigML is a commercial API on which I no longer want to rely. SKLL is not truly from the command line. Vowpal Wabbit is a very mature machine learning tool, developed at Yahoo! and now at Microsoft. At some point, there was supposed to be an entire book about Vowpal Wabbit (titled Sequential Learning), but unfortunately it was never finished. These three sections will give Vowpal Wabbit the exposure it deserves and the readers the speed and stability of applying machine learning at the command line they deserve.
Chapter 10: New chapter about integrating the command line into existing workflows, including Python, R, Julia, and Spark. In the first edition I mention that the command line can easily be integrated with existing workflows, but I never go into that. This chapter fixes that. My hope is that with this chapter, readers will be more readily inclined to pick up this book and learn about the advantages of the command line.

Book outline

In the tentative outline below, 🆕 indicates added and ❌ indicates removed chapters and sections with respect to the first edition.

Preface
- What to Expect from This Book
- How to Read This Book
- Who This Book Is For
- Acknowledgments
- Dedication
- About the Author
Chapter 1 Introduction
- 1.1 Overview
- 1.2 Data Science is OSEMN
  - 1.2.1 Obtaining Data
  - 1.2.2 Scrubbing Data
  - 1.2.3 Exploring Data
  - 1.2.4 Modeling Data
  - 1.2.5 Interpreting Data
- 1.3 Intermezzo Chapters
- 1.4 What is the Command Line?
- 1.5 Why Data Science at the Command Line?
  - 1.5.1 The Command Line is Agile
  - 1.5.2 The Command Line is Augmenting
  - 1.5.3 The Command Line is Scalable
  - 1.5.4 The Command Line is Extensible
  - 1.5.5 The Command Line is Ubiquitous
- 1.6 A Real-world Use Case
- 1.7 Further Reading
Chapter 2 Getting Started
- 2.1 Overview
- 2.2 Setting Up Your Data Science Toolbox ❌
- 2.2 Installing the Docker Image 🆕
- 2.3 Essential GNU/Linux Concepts
  - 2.3.1 The Environment
  - 2.3.2 Executing a Command-line Tool
  - 2.3.3 Five Types of Command-line Tools
  - 2.3.4 Combining Command-line Tools
  - 2.3.5 Redirecting Input and Output
  - 2.3.6 Working With Files
  - 2.3.7 Help!
- 2.4 Further Reading
Chapter 3 Obtaining Data
- 3.1 Overview
- 3.2 Copying Local Files to the Data Science Toolbox ❌
  - 3.2.1 Local Version of Data Science Toolbox ❌
  - 3.2.2 Remote Version of Data Science Toolbox ❌
- 3.2 Copying Local Files to the Docker Image 🆕
- 3.3 Decompressing Files
- 3.4 Converting Microsoft Excel Spreadsheets
- 3.5 Querying Relational Databases
- 3.6 Downloading from the Internet
- 3.7 Calling a Web API
- 3.8 Further Reading
Chapter 4 Creating Reusable Command-line Tools
- 4.1 Overview
- 4.2 Converting One-liners into Shell Scripts
  - 4.2.1 Step 1: Copy and Paste
  - 4.2.2 Step 2: Add Permission to Execute
  - 4.2.3 Step 3: Define Shebang
  - 4.2.4 Step 4: Remove Fixed Input
  - 4.2.5 Step 5: Parametrize
  - 4.2.6 Step 6: Extend Your PATH
- 4.3 Creating Command-line Tools with Python and R ❌
  - 4.3.1 Porting The Shell Script ❌
  - 4.3.2 Processing Streaming Data from Standard Input ❌
- 4.3 Creating Command-line Tools with Python 🆕
  - 4.3.1 Porting The Shell Script 🆕
  - 4.3.2 Processing Streaming Data from Standard Input 🆕
  - 4.3.3 Parsing Command-Line Options 🆕
- 4.4 Creating Command-line Tools with R 🆕
  - 4.4.1 Porting The Shell Script 🆕
  - 4.4.2 Processing Streaming Data from Standard Input 🆕
  - 4.4.3 Parsing Command-Line Options 🆕
- 4.5 Further Reading
Chapter 5 Scrubbing Data
- 5.1 Overview
- 5.2 Common Scrub Operations for Plain Text
  - 5.2.1 Filtering Lines
  - 5.2.2 Extracting Values
  - 5.2.3 Replacing and Deleting Values
- 5.3 Working with CSV
  - 5.3.1 Bodies and Headers and Columns, Oh My!
  - 5.3.2 Performing SQL Queries on CSV
  - 5.3.3 Extracting and Reordering Columns 🆕
  - 5.3.4 Filtering Lines 🆕
  - 5.3.5 Merging Columns 🆕
  - 5.3.6 Combining Multiple CSV Files 🆕
- 5.4 Working with XML/HTML and JSON ❌
- 5.5 Common Scrub Operations for CSV ❌
  - 5.5.1 Extracting and Reordering Columns ❌
  - 5.5.2 Filtering Lines ❌
  - 5.5.3 Merging Columns ❌
  - 5.5.4 Combining Multiple CSV Files ❌
- 5.4 Working with JSON 🆕
  - 5.4.1 Introducing jq 🆕
  - 5.4.2 Filtering elements 🆕
  - 5.4.3 Simplifying JSON 🆕
  - 5.4.4 Converting JSON to CSV 🆕
- 5.5 Working with XML 🆕
  - 5.5.1 Introducing xmlstarlet 🆕
  - 5.5.2 Extracting fields using XPath 🆕
  - 5.5.3 Converting XML to CSV 🆕
- 5.6 Working with HTML 🆕
  - 5.6.1 Introducing pup 🆕
  - 5.6.2 Extracting fields using CSS Selectors 🆕
  - 5.6.3 Converting HTML to CSV 🆕
- 5.7 Further Reading
Chapter 6 Managing Your Data Workflow
- 6.1 Overview
- 6.2 Introducing ~~Drake~~ Make 🆕
- 6.3 Installing Drake ❌
- 6.3 One Script to Rule Them All 🆕
- 6.4 Obtain Top E-books from Project Gutenberg
- 6.5 Every Workflow Starts with a Single Step
- 6.6 Well, That Depends
- 6.7 Rebuilding Certain Targets
- 6.8 Discussion
- 6.9 Further Reading
Chapter 7 Exploring Data
- 7.1 Overview
- 7.2 Inspecting Data and its Properties
  - 7.2.1 Header Or Not, Here I Come
  - 7.2.2 Inspect All The Data
  - 7.2.3 Feature Names and Data Types
  - 7.2.4 Unique Identifiers, Continuous Variables, and Factors
- 7.3 Computing Descriptive Statistics
  - 7.3.1 ~~csvstat~~ Using xsv stats 🆕
  - 7.3.2 Using R from the Command Line ~~using Rio~~
- 7.4 Creating Visualizations
  - 7.4.1 Introducing Gnuplot and Feedgnuplot
  - 7.4.2 Introducing ggplot2
  - 7.4.3 Histograms
  - 7.4.4 Bar Plots
  - 7.4.5 Density Plots
  - 7.4.6 Box Plots
  - 7.4.7 Scatter Plots
  - 7.4.8 Line Graphs
  - 7.4.9 Summary
- 7.5 Further Reading
Chapter 8 Parallel Pipelines
- 8.1 Overview
- 8.2 Serial Processing
  - 8.2.1 Looping Over Numbers
  - 8.2.2 Looping Over Lines
  - 8.2.3 Looping Over Files
- 8.3 Parallel Processing
  - 8.3.1 Introducing GNU Parallel
  - 8.3.2 Specifying Input
  - 8.3.3 Controlling the Number of Concurrent Jobs
  - 8.3.4 Logging and Output
  - 8.3.5 Creating Parallel Tools
- 8.4 Distributed Processing
  - 8.4.1 Get List of Running AWS EC2 Instances ❌
  - 8.4.1 Running Commands on Remote Machines
  - 8.4.2 Distributing Local Data among Remote Machines
  - 8.4.3 Processing Files on Remote Machines
  - 8.4.4 Get List of Running EC2 Instances on AWS 🆕
  - 8.4.5 Get List of Running Compute Engine Instances on GCP 🆕
  - 8.4.6 Get List of Running Instances on Azure 🆕
- 8.5 Discussion
- 8.6 Further Reading
Chapter 9 Modeling Data
- 9.1 Overview
- 9.2 More Wine Please!
- 9.3 Dimensionality Reduction with Tapkee
  - 9.3.1 Introducing Tapkee
  - 9.3.2 Installing Tapkee
  - 9.3.3 Linear and Non-linear Mappings
- 9.4 Clustering with Weka ❌
  - 9.4.1 Introducing Weka ❌
  - 9.4.2 Taming Weka on the Command Line ❌
  - 9.4.3 Converting between CSV to ARFF Data Formats ❌
  - 9.4.4 Comparing Three Cluster Algorithms ❌
- 9.4 Clustering with SciKit-Learn 🆕
  - 9.4.1 Using SciKit-Learn from the Command Line 🆕
  - 9.4.2 K-Means Clustering 🆕
  - 9.4.3 Hierarchical Clustering 🆕
  - 9.4.4 Pipelines 🆕
- 9.5 Regression with SciKit-Learn Laboratory ❌
  - 9.5.1 Preparing the Data ❌
  - 9.5.2 Running the Experiment ❌
  - 9.5.3 Parsing the Results ❌
- 9.6 Classification with BigML ❌
  - 9.6.1 Creating Balanced Train and Test Data Sets ❌
  - 9.6.2 Calling the API ❌
  - 9.6.3 Inspecting the Results ❌
  - 9.6.4 Conclusion ❌
- 9.5 Collaborative Filtering with Vowpal Wabbit 🆕
  - 9.5.1 Introducing Vowpal Wabbit 🆕
  - 9.5.2 Input Format 🆕
  - 9.5.3 Matrix Factorization 🆕
  - 9.5.4 Training a Model 🆕
  - 9.5.5 Making Predictions 🆕
  - 9.5.6 Measure Performance 🆕
- 9.6 Regression with Vowpal Wabbit 🆕
  - 9.6.1 Feature Hashing 🆕
  - 9.6.2 Gradient Descent 🆕
  - 9.6.3 Hyper-parameter Optimization 🆕
  - 9.6.4 Inspecting Models 🆕
- 9.7 Classification with Vowpal Wabbit 🆕
  - 9.7.1 Extended Input Format 🆕
  - 9.7.2 Multi-class Classification 🆕
  - 9.7.3 Online Learning 🆕
- 9.8 Further Reading
Chapter 10 Leverage the Unix Command Line Elsewhere 🆕
- 10.1 Jupyter Notebook 🆕
- 10.2 Python Scripts 🆕
- 10.3 RStudio 🆕
- 10.4 R Markdown 🆕
- 10.5 R Scripts 🆕
- 10.6 Julia Scripts 🆕
- 10.7 Spark Pipes 🆕
Chapter ~~10~~ 11 Conclusion
- 11.1 Let’s Recap
- 11.2 Three Pieces of Advice
  - 11.2.1 Be Patient
  - 11.2.2 Be Creative
  - 11.2.3 Be Practical
- 11.3 Where To Go From Here?
  - 11.3.1 APIs
  - 11.3.2 Shell Programming
  - 11.3.3 Python, R, and SQL
  - 11.3.4 Interpreting Data
- 11.4 Getting in Touch
References

Feedback

In the past five years I have received a lot of valuable feedback in the form of emails, tweets, book reviews, errata submitted to O'Reilly, GitHub issues, and even pull requests. I love this. It has only made the book better.

O'Reilly has graciously given me permission to make the source of the second edition available on GitHub and an HTML version available on https://www.datascienceatthecommandline.com under a Creative Commons Attribution-NoDerivatives 4.0 International License from the start. That's fantastic because this way, I'll be able to receive feedback during the entire journey, which will make the book even better.

And feedback is, as always, very much appreciated. This can be anything ranging from a typo to a command-line tool or trick that might be of interest to others. If you have any ideas, suggestions, questions, criticism, or compliments, then I would love to hear from you. You may reply to this particular issue, create a new issue, tweet me at @jeroenhjanssens, or email me; use whichever medium you prefer.

Thank you.

Best wishes,

Jeroen


aborruso · 2020-06-11T06:55:53Z

@jeroenjanssens it's really a great thing, thank you very much.

A note about your "scrape": the only real problem for me was that it did not work on Python 3, and for this reason I built a CLI based on it (so that I don't have to think about the environment).

You are right, pup is faster and easier to install, but you cannot run XPath queries with it. I think that if you need a CLI tool to query HTML pages, it should be able to run both CSS selector and XPath queries, as your GREAT scrape does.

jeroenjanssens · 2020-06-12T08:33:07Z

Thank you @aborruso!

For the second edition I would like to only use tools which can be installed easily through some package manager. So to address your point, I guess we could do two things:

1. Create a separate package for scrape.
2. Extend pup such that it accepts XPath queries.

What do you think?

knbknb · 2020-06-12T08:40:50Z

Off the top of my head:

I remember that one of the reviewers of the first edition of this book on goodreads.com wrote that he very much liked your introduction to GNU Parallel. That was supposedly a highlight of the book.

So maybe split chapter 8 into two chapters: one chapter about parallel processing on localhost, and one chapter about parallelization on cloud platforms.

aborruso · 2020-06-12T08:42:23Z

> So to address your point, I guess we could do two things:
>
> Create a separate package for scrape.
>
> Extend pup such that it accepts XPATH queries.

Dear @jeroenjanssens, both are very good points.

Unfortunately, I'm mainly an end user and not a Python or Go developer. I built the CLI version of scrape using another utility :)
So I can't promise to help you create the package or extend pup :(

If there were a scrape package, it would become a tool that can be installed easily through a package manager.

Once again thank you

iveksl2 · 2020-07-06T06:10:56Z

Hmmm, maybe something about model deployment? I'm not sure how it fits into the command line, but here are some buzzwords to think about: Deep Learning, Optimization, RL.

kwbonds · 2020-09-20T04:53:44Z

Thanks for your book. I suggest you consider switching from echo to printf in the second edition, though. It seems to behave more consistently. I spent a while trying to figure out why `echo 'foo\nbar\nfoo'` would not recognize the newline characters, whereas `printf 'foo\nbar\nfoo'` works correctly.
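
A minimal sketch of the difference described above, assuming a POSIX-style shell. Whether echo expands backslash escapes depends on the shell and its settings (the bash builtin leaves them untouched by default, while some other shells expand them), so only the printf lines have predictable output. The variable names `line_count` and `second_line` are illustrative only.

```shell
#!/bin/sh

# echo: the handling of \n varies between shells, so this line may print
# either one line with literal backslashes or three separate lines.
echo 'foo\nbar\nfoo'

# printf: the format string always turns \n into a newline.
# Unlike echo, printf adds no trailing newline, so it is written explicitly.
printf 'foo\nbar\nfoo\n'

# Passing the data as arguments keeps it out of the format string, which is
# safer when the data itself may contain % signs or backslashes.
printf '%s\n' foo bar foo

# Check that printf really produced three lines, with "bar" as the second.
line_count=$(printf 'foo\nbar\nfoo\n' | wc -l | tr -d ' ')
second_line=$(printf 'foo\nbar\nfoo\n' | sed -n 2p)
echo "printf produced $line_count lines; the second is '$second_line'"
```

In scripts meant to run under more than one shell, printf is the safer choice whenever the text contains backslashes.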

simonw · 2020-10-13T15:26:51Z

I'm here to advocate for more SQLite coverage.

SQLite is a fantastic tool for command-line data science, because it gives you a full relational database without needing to run a PostgreSQL or MySQL server anywhere - each database exists as a single file on disk.

My sqlite-utils tool (brew install sqlite-utils) lets you pipe JSON or CSV data directly into a database, automatically creating an appropriate schema. You can then run queries and pipe the results out as further JSON/CSV ready to be piped to other processes.

While I'm here I'll plug Datasette too (brew install datasette) which gives you an instant localhost web UI for exploring a SQLite database (datasette mydb.db) - and can also export CSV or JSON results of queries back out again.

(Originally discussed on Twitter)

Awannaphasch2016 · 2021-04-02T16:02:19Z

Here is also a list of command-line tools for working with structured text, for further reading:
https://github.com/dbohdan/structured-text-tools

PythonCoderUnicorn · 2021-10-18T01:30:47Z

Thanks for sharing the link. I appreciate your hard work. Hope to learn a lot

jeroenjanssens pinned this issue Jun 10, 2020

jeroenjanssens added the good first issue label Jul 8, 2020
