Commit

tweaking parquet snack
laderast committed Sep 30, 2024
1 parent 19c8c6c commit db84c5a
Showing 8 changed files with 57 additions and 56 deletions.
4 changes: 2 additions & 2 deletions docs/index.html
@@ -269,7 +269,7 @@ <h3 class="no-anchor listing-title">
</a>
</div>
</div>
-<div class="quarto-post image-right" data-index="2" data-categories="R,Tables,databases,bigdata" data-listing-date-sort="1727679600000" data-listing-file-modified-sort="1727717388576" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="3" data-listing-word-count-sort="529">
+<div class="quarto-post image-right" data-index="2" data-categories="R,Tables,databases,bigdata" data-listing-date-sort="1727679600000" data-listing-file-modified-sort="1727717853705" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="3" data-listing-word-count-sort="589">
<div class="thumbnail">
<p><a href="./r_snacks/parquet.html" class="no-external"></a></p><a href="./r_snacks/parquet.html" class="no-external">
<p><img loading="lazy" src="./r_snacks/images/duckdb.svg" class="thumbnail-image"></p>
@@ -297,7 +297,7 @@ <h3 class="no-anchor listing-title">
</div>
</div>
<div class="listing-description">
-<a href="./r_snacks/parquet.html" class="no-external">As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing data.</a>
+<a href="./r_snacks/parquet.html" class="no-external">As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing large data. Let’s add to our toolkit.</a>
</div>
</div>
<div class="metadata">
8 changes: 4 additions & 4 deletions docs/python_snacks/wordcloud.html

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions docs/r_snacks/gtsummary.html

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions docs/r_snacks/naniar.html

Large diffs are not rendered by default.

31 changes: 16 additions & 15 deletions docs/r_snacks/parquet.html

Large diffs are not rendered by default.

26 changes: 13 additions & 13 deletions docs/r_snacks/patchwork.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/search.json
@@ -277,7 +277,7 @@
"href": "r_snacks/parquet.html",
"title": "Large Data Work: Intro to parquet files in R",
"section": "",
-"text": "As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing data.\nWhat is a parquet file? It is another way of storing data that is organized in a way that it’s fast to search. In contrast to row-based storage, where the data is stored by row, parquet is column-based: data is stored by column. There are some particular advantages to this storage type:\nThe main drawback is that there is additional technology required to work with parquet files."
+"text": "As you continue on with your work as a Data Scientist, you will start encountering parquet files as a way of storing large data. Let’s add to our toolkit.\nWhat is a parquet file? It is another way of storing data that is organized in a way that it’s fast to search. In contrast to row-based storage, where the data is stored by row, parquet is column-based: data is stored by column. There are some particular advantages to this storage type:\nThe main drawback is that there is additional technology required to work with parquet files. We’ll use the nanoparquet, duckdb, and duckplyr packages to interact with them using a tidy workflow."
},
{
"objectID": "r_snacks/parquet.html#you-dont-load-parquet-files-into-memory",
@@ -291,14 +291,14 @@
"href": "r_snacks/parquet.html#a-titanic-parquet-file",
"title": "Large Data Work: Intro to parquet files in R",
"section": "A Titanic Parquet file",
-"text": "A Titanic Parquet file\nWe’ve got a folder called data/, and there is a file called titanic.parquet in there.\n\n\n\n\n\n\n\n\nNow that we know what’s in our data/ folder, we can get some info on this file. We’ll use the {nanoparquet} package to get some info. We can get an idea of the size of the data using parquet_info():\n\n\n\n\n\n\n\n\nLet’s get info about the column types in our file with parquet_column_types\n\n\n\n\n\n\n\n\nIn general, nanoparquet is very useful for reading and writing parquet files. But remember, we are going to be interacting them through a database query engine. Enter DuckDB."
+"text": "A Titanic Parquet file\nWe’ve got a folder called data/, and there is a file called titanic.parquet (link to data if you want to download) in there. L\n\n\n\n\n\n\n\n\nNow that we know what’s in our data/ folder, we can get some info on this file. We’ll use the {nanoparquet} package to get some info on the file. We can get an idea of the size of the data using parquet_info():\n\n\n\n\n\n\n\n\nLet’s get info about the column types in our file with parquet_column_types\n\n\n\n\n\n\n\n\nIn general, nanoparquet is very useful for reading and writing parquet files. But remember, we are going to be interacting them through a database query engine. Enter DuckDB."
},
{
"objectID": "r_snacks/parquet.html#querying-your-data",
"href": "r_snacks/parquet.html#querying-your-data",
"title": "Large Data Work: Intro to parquet files in R",
"section": "Querying your Data",
-"text": "Querying your Data\nTo actually interact with the data, we can connect to the parquet file with the DuckDB package. We first need to start what’s called a database connection so that we can connect to the DuckDB software:\n\n\n\n\n\n\n\n\nNow, we’re going to make a “View”, which is a temporary table in DuckDB to do our queries on.\n\n\n\n\n\n\n\n\nNow, we have our connection and our view, we can start to take a look at the data. We can pass our connection and view with the tbl() function. Here we’re calling head(), followed by collect().\n\n\n\n\n\n\n\n\nWhy do we need to call collect()? DuckDB uses what’s called Lazy Execution: it only calculates a value when you tell it to. Lazy Execution allows DuckDB to make an execution plan: it finds the best strategy for executing the entire query rather than calculate it piece by piece."
+"text": "Querying your Data\nTo actually interact with the data, we can connect to the parquet file with the DuckDB package. We first need to start what’s called a database connection so that we can connect to the DuckDB software:\n\n\n\n\n\n\n\n\nNow, we’re going to make a “View”, which is a temporary table in DuckDB to do our queries on. This also connects our parquet file with the PARQUET_SCAN() function in DuckDB.\nThis is the only SQL we need to write to interact with the data.\n\n\n\n\n\n\n\n\nNow, we have our connection and our view, we can start to take a look at the data. We can pass our connection and view with the tbl() function. Here we’re calling head(), followed by collect().\n\n\n\n\n\n\n\n\nWhy do we need to call collect()? DuckDB uses what’s called Lazy Execution: it only calculates a value when you tell it to. Lazy Execution allows DuckDB to make an execution plan: it finds the best strategy for executing the entire query rather than calculate it piece by piece."
},
{
"objectID": "r_snacks/parquet.html#summarizing-the-passengers",
10 changes: 5 additions & 5 deletions docs/snack_template.html

Large diffs are not rendered by default.
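
Note on the search.json change: the updated "Querying your Data" text describes opening a DuckDB connection, creating a view over the parquet file with PARQUET_SCAN(), and then calling tbl(), head(), and collect(), but the code chunks themselves are not part of the indexed text. The following is only a rough sketch of what that workflow might look like in R, not the code from the post. The view name `titanic` and the variable names are assumptions made for illustration; the path `data/titanic.parquet` is taken from the indexed text.

```r
# Sketch only: assumes the DBI, duckdb, dplyr, and dbplyr packages are installed
# and that data/titanic.parquet exists relative to the working directory.
library(DBI)
library(duckdb)
library(dplyr)

parquet_path <- "data/titanic.parquet"

# Start an in-memory DuckDB instance and connect to it
con <- dbConnect(duckdb::duckdb())

# The one piece of SQL: a view that scans the parquet file on demand.
# The view name "titanic" is hypothetical.
dbExecute(
  con,
  paste0("CREATE VIEW titanic AS SELECT * FROM PARQUET_SCAN('", parquet_path, "');")
)

# tbl() returns a lazy reference; nothing is computed yet
titanic_tbl <- tbl(con, "titanic")

# head() adds a row limit to the query plan; collect() executes the query
# and brings the result into R as a tibble
first_rows <- titanic_tbl |>
  head() |>
  collect()

print(first_rows)

# Close the connection and shut down the DuckDB instance
dbDisconnect(con, shutdown = TRUE)
```

Until collect() is called, the dplyr verbs only build up a query, which is the lazy execution behavior the new text describes.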

0 comments on commit db84c5a
