tweaking parquet snack

fhdsl · Sep 30, 2024 · db84c5a
1 parent 19c8c6c
Showing 8 changed files with 57 additions and 56 deletions.
diff --git a/docs/index.html b/docs/index.html
@@ -269,7 +269,7 @@ <h3 class="no-anchor listing-title">
 </a>
 </div>
 </div>
-<div class="quarto-post image-right" data-index="2" data-categories="R,Tables,databases,bigdata" data-listing-date-sort="1727679600000" data-listing-file-modified-sort="1727717388576" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="3" data-listing-word-count-sort="529">
+<div class="quarto-post image-right" data-index="2" data-categories="R,Tables,databases,bigdata" data-listing-date-sort="1727679600000" data-listing-file-modified-sort="1727717853705" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="3" data-listing-word-count-sort="589">
 <div class="thumbnail">
 <p><a href="./r_snacks/parquet.html" class="no-external"></a></p><a href="./r_snacks/parquet.html" class="no-external">
 <p><img loading="lazy" src="./r_snacks/images/duckdb.svg" class="thumbnail-image"></p>
@@ -297,7 +297,7 @@ <h3 class="no-anchor listing-title">
 </div>
 </div>
 <div class="listing-description">
-<a href="./r_snacks/parquet.html" class="no-external">As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing data.</a>
+<a href="./r_snacks/parquet.html" class="no-external">As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing large data. Let’s add to our toolkit.</a>
 </div>
 </div>
 <div class="metadata">

diff --git a/docs/python_snacks/wordcloud.html b/docs/python_snacks/wordcloud.html
diff --git a/docs/r_snacks/gtsummary.html b/docs/r_snacks/gtsummary.html
diff --git a/docs/r_snacks/naniar.html b/docs/r_snacks/naniar.html
diff --git a/docs/r_snacks/parquet.html b/docs/r_snacks/parquet.html
diff --git a/docs/r_snacks/patchwork.html b/docs/r_snacks/patchwork.html
diff --git a/docs/search.json b/docs/search.json
@@ -277,7 +277,7 @@
     "href": "r_snacks/parquet.html",
     "title": "Large Data Work: Intro to parquet files in R",
     "section": "",
-    "text": "As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing data.\nWhat is a parquet file? It is another way of storing data that is organized in a way that it’s fast to search. In contrast to row-based storage, where the data is stored by row, parquet is column-based: data is stored by column. There are some particular advantages to this storage type:\nThe main drawback is that there is additional technology required to work with parquet files."
+    "text": "As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing large data. Let’s add to our toolkit.\nWhat is a parquet file? It is another way of storing data that is organized in a way that it’s fast to search. In contrast to row-based storage, where the data is stored by row, parquet is column-based: data is stored by column. There are some particular advantages to this storage type:\nThe main drawback is that there is additional technology required to work with parquet files. We’ll use the nanoparquet, duckdb, and duckplyr packages to interact with them using a tidy workflow."
   },
   {
     "objectID": "r_snacks/parquet.html#you-dont-load-parquet-files-into-memory",
@@ -291,14 +291,14 @@
     "href": "r_snacks/parquet.html#a-titanic-parquet-file",
     "title": "Large Data Work: Intro to parquet files in R",
     "section": "A Titanic Parquet file",
-    "text": "A Titanic Parquet file\nWe’ve got a folder called data/, and there is a file called titanic.parquet in there.\n\n\n\n\n\n\n\n\nNow that we know what’s in our data/ folder, we can get some info on this file. We’ll use the {nanoparquet} package to get some info. We can get an idea of the size of the data using parquet_info():\n\n\n\n\n\n\n\n\nLet’s get info about the column types in our file with parquet_column_types\n\n\n\n\n\n\n\n\nIn general, nanoparquet is very useful for reading and writing parquet files. But remember, we are going to be interacting them through a database query engine. Enter DuckDB."
+    "text": "A Titanic Parquet file\nWe’ve got a folder called data/, and there is a file called titanic.parquet (link to data if you want to download) in there. L\n\n\n\n\n\n\n\n\nNow that we know what’s in our data/ folder, we can get some info on this file. We’ll use the {nanoparquet} package to get some info on the file. We can get an idea of the size of the data using parquet_info():\n\n\n\n\n\n\n\n\nLet’s get info about the column types in our file with parquet_column_types\n\n\n\n\n\n\n\n\nIn general, nanoparquet is very useful for reading and writing parquet files. But remember, we are going to be interacting them through a database query engine. Enter DuckDB."
   },
   {
     "objectID": "r_snacks/parquet.html#querying-your-data",
     "href": "r_snacks/parquet.html#querying-your-data",
     "title": "Large Data Work: Intro to parquet files in R",
     "section": "Querying your Data",
-    "text": "Querying your Data\nTo actually interact with the data, we can connect to the parquet file with the DuckDB package. We first need to start what’s called a database connection so that we can connect to the DuckDB software:\n\n\n\n\n\n\n\n\nNow, we’re going to make a “View”, which is a temporary table in DuckDB to do our queries on.\n\n\n\n\n\n\n\n\nNow, we have our connection and our view, we can start to take a look at the data. We can pass our connection and view with the tbl() function. Here we’re calling head(), followed by collect().\n\n\n\n\n\n\n\n\nWhy do we need to call collect()? DuckDB uses what’s called Lazy Execution: it only calculates a value when you tell it to. Lazy Execution allows DuckDB to make an execution plan: it finds the best strategy for executing the entire query rather than calculate it piece by piece."
+    "text": "Querying your Data\nTo actually interact with the data, we can connect to the parquet file with the DuckDB package. We first need to start what’s called a database connection so that we can connect to the DuckDB software:\n\n\n\n\n\n\n\n\nNow, we’re going to make a “View”, which is a temporary table in DuckDB to do our queries on. This also connects our parquet file with the PARQUET_SCAN() function in DuckDB.\nThis is the only SQL we need to write to interact with the data.\n\n\n\n\n\n\n\n\nNow, we have our connection and our view, we can start to take a look at the data. We can pass our connection and view with the tbl() function. Here we’re calling head(), followed by collect().\n\n\n\n\n\n\n\n\nWhy do we need to call collect()? DuckDB uses what’s called Lazy Execution: it only calculates a value when you tell it to. Lazy Execution allows DuckDB to make an execution plan: it finds the best strategy for executing the entire query rather than calculate it piece by piece."
   },
   {
     "objectID": "r_snacks/parquet.html#summarizing-the-passengers",

diff --git a/docs/snack_template.html b/docs/snack_template.html