duckdb 1.29.0; self-host extensions (#1734)
* explicit duckdb 1.29.0; self-host core extensions; document

* configure which extensions are self-hosted

(not quite there yet: still need to do hashing, per-extension configuration of the LOAD command, and per page configuration)

* hash extensions

* better docs

* cleaner duckdb manifest — now works in scripts and embeds

* restructure code, extensible manifest

* test, documentation

* much nicer config

* document config

* add support for mvp, clean config & documentation

* parametrized the initial LOAD in DuckDBClient

* tests

* bake-in the extensions manifest

* fix test

* don't activate spatial on the documentation

* refactor: hash individual extensions, include the list of platforms in the config (not configurable yet)

* don't copy extensions twice

* Update src/duckdb.ts

Co-authored-by: Mike Bostock <[email protected]>

* remove DuckDBClientReport utility

* renames

* p for platform

* centralize DUCKDBWASMVERSION and DUCKDBVERSION

* clearer

* better config; manifest.extensions now lists individual extensions once only, with one reference per platform

* validate extension names; centralize DUCKDBBUNDLES

* fix tests

* copy edit

* support loading non-self-hosted extensions

* test duckdb config normalization & defaults

* documentation

* typography

* doc

* use view for <50MB

* docs, shorthand, etc.

* annotate fixes

* disable telemetry on annotate tests, too

* tidier duckdb manifest

* remove todo

* more robust duckdb: scheme

---------

Co-authored-by: Mike Bostock <[email protected]>
Fil and mbostock authored Nov 2, 2024
1 parent 9d6c967 commit 02dd892
Showing 44 changed files with 688 additions and 72 deletions.
39 changes: 39 additions & 0 deletions docs/config.md
@@ -301,6 +301,45 @@ export default {
};
```

## duckdb <a href="https://github.com/observablehq/framework/pull/1734" class="observablehq-version-badge" data-version="prerelease" title="Added in #1734"></a>

The **duckdb** option configures [self-hosting](./lib/duckdb#self-hosting-of-extensions) and loading of [DuckDB extensions](./lib/duckdb#extensions) for use in [SQL code blocks](./sql) and the `sql` and `DuckDBClient` built-ins. For example, a geospatial data app might enable the [`spatial`](https://duckdb.org/docs/extensions/spatial/overview.html) and [`h3`](https://duckdb.org/community_extensions/extensions/h3.html) extensions like so:

```js run=false
export default {
  duckdb: {
    extensions: ["spatial", "h3"]
  }
};
```

The **extensions** option can either be an array of extension names, or an object whose keys are extension names and whose values are configuration options for the given extension, including its **source** repository (defaulting to the keyword _core_ for core extensions, and otherwise _community_; can also be a custom repository URL), whether to **load** it immediately (defaulting to true, except for known extensions that support autoloading), and whether to **install** it (_i.e._ to self-host, defaulting to true). As additional shorthand, you can specify `[name]: true` to install and load the named extension from the default (_core_ or _community_) source repository, or `[name]: string` to install and load the named extension from the given source repository.

The configuration above is equivalent to:

```js run=false
export default {
  duckdb: {
    extensions: {
      spatial: {
        source: "https://extensions.duckdb.org/",
        install: true,
        load: true
      },
      h3: {
        source: "https://community-extensions.duckdb.org/",
        install: true,
        load: true
      }
    }
  }
};
```

The `json` and `parquet` extensions are configured (and therefore self-hosted) by default. To expressly disable self-hosting of an extension, set its **install** property to false, or equivalently pass null as the extension configuration object.
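
The shorthand forms described above can be pictured as a small normalization step. The following is a minimal sketch, not Framework's actual implementation: the function names, the `CORE_EXTENSIONS` set (deliberately incomplete), and the decision to ignore the autoloading exception for **load** are all assumptions made for illustration.

```javascript
// Illustrative sketch only; names and the core-extension list are hypothetical.
const CORE_SOURCE = "https://extensions.duckdb.org/";
const COMMUNITY_SOURCE = "https://community-extensions.duckdb.org/";

// Assumed, incomplete list of core extensions (for illustration).
const CORE_EXTENSIONS = new Set(["json", "parquet", "spatial"]);

// Resolve the "core" and "community" keywords; anything else is treated
// as a custom repository URL.
function resolveSource(name, source) {
  if (source === undefined) source = CORE_EXTENSIONS.has(name) ? "core" : "community";
  if (source === "core") return CORE_SOURCE;
  if (source === "community") return COMMUNITY_SOURCE;
  return source;
}

// Expand a single shorthand value into a full configuration object.
function normalizeExtension(name, value) {
  if (value === true) value = {};
  else if (value === false) value = {load: false}; // self-host, but do not load
  else if (value === null) value = {install: false}; // do not self-host
  else if (typeof value === "string") value = {source: value};
  const {source, install = true, load = true} = value;
  return {source: resolveSource(name, source), install, load};
}

// Accept either an array of names or an object keyed by name.
function normalizeExtensions(extensions) {
  const entries = Array.isArray(extensions)
    ? extensions.map((name) => [name, true])
    : Object.entries(extensions);
  return Object.fromEntries(entries.map(([name, value]) => [name, normalizeExtension(name, value)]));
}
```

Under these assumptions, `normalizeExtensions(["spatial", "h3"])` yields the same object as the expanded configuration shown above.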

For more, see [DuckDB extensions](./lib/duckdb#extensions).

## markdownIt <a href="https://github.com/observablehq/framework/releases/tag/v1.1.0" class="observablehq-version-badge" data-version="^1.1.0" title="Added in v1.1.0"></a>

A hook for registering additional [markdown-it](https://github.com/markdown-it/markdown-it) plugins. For example, to use [markdown-it-footnote](https://github.com/markdown-it/markdown-it-footnote), first install the plugin with either `npm add markdown-it-footnote` or `yarn add markdown-it-footnote`, then register it like so:
95 changes: 94 additions & 1 deletion docs/lib/duckdb.md
@@ -65,7 +65,7 @@ const db2 = await DuckDBClient.of({base: FileAttachment("quakes.db")});
db2.queryRow(`SELECT COUNT() FROM base.events`)
```

For externally-hosted data, you can create an empty `DuckDBClient` and load a table from a SQL query, say using [`read_parquet`](https://duckdb.org/docs/guides/import/parquet_import) or [`read_csv`](https://duckdb.org/docs/guides/import/csv_import). DuckDB offers many affordances to make this easier (in many cases it detects the file format and uses the correct loader automatically).
For externally-hosted data, you can create an empty `DuckDBClient` and load a table from a SQL query, say using [`read_parquet`](https://duckdb.org/docs/guides/import/parquet_import) or [`read_csv`](https://duckdb.org/docs/guides/import/csv_import). DuckDB offers many affordances to make this easier. (In many cases it detects the file format and uses the correct loader automatically.)

```js run=false
const db = await DuckDBClient.of();
@@ -105,3 +105,96 @@ const sql = DuckDBClient.sql({quakes: `https://earthquake.usgs.gov/earthquakes/f
```sql echo
SELECT * FROM quakes ORDER BY updated DESC;
```

## Extensions <a href="https://github.com/observablehq/framework/pull/1734" class="observablehq-version-badge" data-version="prerelease" title="Added in #1734"></a>

[DuckDB extensions](https://duckdb.org/docs/extensions/overview.html) extend DuckDB’s functionality, adding support for additional file formats, new types, and domain-specific functions. For example, the [`json` extension](https://duckdb.org/docs/data/json/overview.html) provides a `read_json` method for reading JSON files:

```sql echo
SELECT bbox FROM read_json('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson');
```

To read a local file (or data loader), use `FileAttachment` and interpolation `${…}`:

```sql echo
SELECT bbox FROM read_json(${FileAttachment("../quakes.json").href});
```

For convenience, Framework configures the `json` and `parquet` extensions by default. Some other [core extensions](https://duckdb.org/docs/extensions/core_extensions.html) also autoload, meaning that you don’t need to explicitly enable them; however, Framework will only [self-host extensions](#self-hosting-of-extensions) if you explicitly configure them, and therefore we recommend that you always use the [**duckdb** config option](../config#duckdb) to configure DuckDB extensions. Any configured extensions will be automatically [installed and loaded](https://duckdb.org/docs/extensions/overview#explicit-install-and-load), making them available in SQL code blocks as well as the `sql` and `DuckDBClient` built-ins.

For example, to configure the [`spatial` extension](https://duckdb.org/docs/extensions/spatial/overview.html):

```js run=false
export default {
  duckdb: {
    extensions: ["spatial"]
  }
};
```

You can then use the `ST_Area` function to compute the area of a polygon:

```sql echo run=false
SELECT ST_Area('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'::GEOMETRY) as area;
```

To tell which extensions have been loaded, you can run the following query:

```sql echo
FROM duckdb_extensions() WHERE loaded;
```

<div class="warning">

If the `duckdb_extensions()` function runs before DuckDB autoloads a core extension (such as `json`), it might not be included in the returned set.

</div>

### Self-hosting of extensions

As with [npm imports](../imports#self-hosting-of-npm-imports), configured DuckDB extensions are self-hosted, improving performance, stability, & security, and allowing you to develop offline. Extensions are downloaded to the DuckDB cache folder, which lives in <code>.observablehq/<wbr>cache/<wbr>\_duckdb</code> within the source root (typically `src`). You can clear the cache and restart the preview server to re-fetch the latest versions of any DuckDB extensions. If you use an [autoloading core extension](https://duckdb.org/docs/extensions/core_extensions.html#list-of-core-extensions) that is not configured, DuckDB-Wasm [will load it](https://duckdb.org/docs/api/wasm/extensions.html#fetching-duckdb-wasm-extensions) from the default extension repository, `extensions.duckdb.org`, at runtime.

## Configuring

The second argument to `DuckDBClient.of` and `DuckDBClient.sql` is a [`DuckDBConfig`](https://shell.duckdb.org/docs/interfaces/index.DuckDBConfig.html) object which configures the behavior of DuckDB-Wasm. By default, Framework sets the `castBigIntToDouble` and `castTimestampToDate` query options to true. To instead use [`BigInt`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt):

```js run=false
const bigdb = DuckDBClient.of({}, {query: {castBigIntToDouble: false}});
```
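
As a rough illustration of how such defaults can combine with a user-supplied config, the sketch below fills in the two query options only when they are left undefined. The helper name `withQueryDefaults` is hypothetical and this is not Framework's code; it only mirrors the behavior described above.

```javascript
// Hypothetical helper: apply the default query options without
// overriding values the caller set explicitly.
function withQueryDefaults(config = {}) {
  const query = {...config.query};
  if (query.castBigIntToDouble === undefined) query.castBigIntToDouble = true;
  if (query.castTimestampToDate === undefined) query.castTimestampToDate = true;
  return {...config, query};
}
```

With this sketch, passing `{query: {castBigIntToDouble: false}}` keeps `BigInt` values while `castTimestampToDate` still defaults to true.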

By default, `DuckDBClient.of` and `DuckDBClient.sql` automatically load all [configured extensions](#extensions). To change the loaded extensions for a particular `DuckDBClient`, use the **extensions** config option. For example, pass an empty array to instantiate a DuckDBClient with no loaded extensions (even if your configuration lists several):

```js echo run=false
const simpledb = DuckDBClient.of({}, {extensions: []});
```

Alternatively, you can configure extensions to be self-hosted but not loaded by default by setting them to false in the **duckdb** config option, which is shorthand for `load: false`:

```js run=false
export default {
  duckdb: {
    extensions: {
      spatial: false,
      h3: false
    }
  }
};
```

You can then selectively load extensions as needed like so:

```js echo run=false
const geosql = DuckDBClient.sql({}, {extensions: ["spatial", "h3"]});
```
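
To make the load decision concrete, here is a small sketch of the rule that `registerExtensions` in this commit applies: every extension in the manifest is installed, and it is loaded either when the manifest says so (no **extensions** option given) or when its name appears in the **extensions** option. The function name `planExtensionStatements`, the example manifest, and its paths are assumptions for illustration; the real code runs the queries against a DuckDB connection rather than returning strings.

```javascript
// Sketch: compute the SQL statements that would be issued for a manifest.
// manifestExtensions is an array of [name, {<platform>: ref, load}] entries.
function planExtensionStatements(manifestExtensions, platform, requested) {
  const statements = [];
  for (const [name, {[platform]: ref, load}] of manifestExtensions) {
    statements.push(`INSTALL "${name}" FROM '${ref}'`);
    // With no explicit list, fall back to the manifest's load flag.
    const shouldLoad = requested === undefined ? load : requested.includes(name);
    if (shouldLoad) statements.push(`LOAD "${name}"`);
  }
  return statements;
}

// Hypothetical manifest entries; the paths are placeholders.
const exampleManifest = [
  ["spatial", {eh: "/_duckdb/spatial/eh", mvp: "/_duckdb/spatial/mvp", load: false}],
  ["json", {eh: "/_duckdb/json/eh", mvp: "/_duckdb/json/mvp", load: true}]
];
```

For example, `planExtensionStatements(exampleManifest, "eh", [])` returns only the two `INSTALL` statements, which corresponds to passing an empty **extensions** array.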

In the future, we’d like to allow DuckDB to be configured globally (beyond just [extensions](#extensions)) via the [**duckdb** config option](../config#duckdb); please upvote [#1791](https://github.com/observablehq/framework/issues/1791) if you are interested in this feature.

## Versioning

Framework currently uses [DuckDB-Wasm 1.29.0](https://github.com/duckdb/duckdb-wasm/releases/tag/v1.29.0), which aligns with [DuckDB 1.1.1](https://github.com/duckdb/duckdb/releases/tag/v1.1.1). You can load a different version of DuckDB-Wasm by importing `npm:@duckdb/duckdb-wasm` directly, for example:

```js run=false
import * as duckdb from "npm:@duckdb/[email protected]";
```

However, you will not be able to change the version of DuckDB-Wasm used by SQL code blocks or the `sql` or `DuckDBClient` built-ins, nor can you use Framework’s support for self-hosting extensions with a different version of DuckDB-Wasm.
2 changes: 1 addition & 1 deletion docs/project-structure.md
@@ -99,7 +99,7 @@ For this site, routes map to files as:
/hello → dist/hello.html → src/hello.md
```

This assumes [“clean URLs”](./config#clean-urls) as supported by most static site servers; `/hello` can also be accessed as `/hello.html`, and `/` can be accessed as `/index` and `/index.html`. (Some static site servers automatically redirect to clean URLs, but we recommend being consistent when linking to your site.)
This assumes [“clean URLs”](./config#preserve-extension) as supported by most static site servers; `/hello` can also be accessed as `/hello.html`, and `/` can be accessed as `/index` and `/index.html`. (Some static site servers automatically redirect to clean URLs, but we recommend being consistent when linking to your site.)

Apps should always have a top-level `index.md` in the source root; this is your app’s home page, and it’s what people visit by default.

2 changes: 1 addition & 1 deletion docs/sql.md
@@ -29,7 +29,7 @@ sql:

<div class="tip">For performance and reliability, we recommend using local files rather than loading data from external servers at runtime. You can use a <a href="./data-loaders">data loader</a> to take a snapshot of remote data during build if needed.</div>

You can also register tables via code (say to have sources that are defined dynamically via user input) by defining the `sql` symbol with [DuckDBClient.sql](./lib/duckdb).
You can also register tables via code (say to have sources that are defined dynamically via user input) by defining the `sql` symbol with [DuckDBClient.sql](./lib/duckdb). To register [DuckDB extensions](./lib/duckdb#extensions), use the [**duckdb** config option](./config#duckdb).

## SQL code blocks

11 changes: 6 additions & 5 deletions package.json
@@ -24,11 +24,12 @@
"docs:deploy": "tsx --no-warnings=ExperimentalWarning ./src/bin/observable.ts deploy",
"build": "rimraf dist && node build.js --outdir=dist --outbase=src \"src/**/*.{ts,js,css}\" --ignore \"**/*.d.ts\"",
"test": "concurrently npm:test:mocha npm:test:tsc npm:test:lint npm:test:prettier",
"test:coverage": "c8 --check-coverage --lines 80 --per-file yarn test:mocha",
"test:build": "rimraf test/build && cross-env npm_package_version=1.0.0-test node build.js --sourcemap --outdir=test/build \"{src,test}/**/*.{ts,js,css}\" --ignore \"test/input/**\" --ignore \"test/output/**\" --ignore \"test/preview/dashboard/**\" --ignore \"**/*.d.ts\" && cp -r templates test/build",
"test:mocha": "yarn test:build && rimraf --glob test/.observablehq/cache test/input/build/*/.observablehq/cache && cross-env OBSERVABLE_TELEMETRY_DISABLE=1 TZ=America/Los_Angeles mocha --timeout 30000 -p \"test/build/test/**/*-test.js\" && yarn test:annotate",
"test:mocha:serial": "yarn test:build && rimraf --glob test/.observablehq/cache test/input/build/*/.observablehq/cache && cross-env OBSERVABLE_TELEMETRY_DISABLE=1 TZ=America/Los_Angeles mocha --timeout 30000 \"test/build/test/**/*-test.js\" && yarn test:annotate",
"test:annotate": "yarn test:build && cross-env OBSERVABLE_ANNOTATE_FILES=true TZ=America/Los_Angeles mocha --timeout 30000 \"test/build/test/**/annotate.js\"",
"test:coverage": "c8 --check-coverage --lines 80 --per-file yarn test:mocha:all",
"test:build": "rimraf test/build && rimraf --glob test/.observablehq/cache test/input/build/*/.observablehq/cache && cross-env npm_package_version=1.0.0-test node build.js --sourcemap --outdir=test/build \"{src,test}/**/*.{ts,js,css}\" --ignore \"test/input/**\" --ignore \"test/output/**\" --ignore \"test/preview/dashboard/**\" --ignore \"**/*.d.ts\" && cp -r templates test/build",
"test:mocha": "yarn test:mocha:serial -p",
"test:mocha:serial": "yarn test:build && cross-env OBSERVABLE_TELEMETRY_DISABLE=1 TZ=America/Los_Angeles mocha --timeout 30000 \"test/build/test/**/*-test.js\"",
"test:mocha:annotate": "yarn test:build && cross-env OBSERVABLE_TELEMETRY_DISABLE=1 OBSERVABLE_ANNOTATE_FILES=true TZ=America/Los_Angeles mocha --timeout 30000 \"test/build/test/**/annotate.js\"",
"test:mocha:all": "yarn test:mocha && cross-env OBSERVABLE_TELEMETRY_DISABLE=1 OBSERVABLE_ANNOTATE_FILES=true TZ=America/Los_Angeles mocha --timeout 30000 \"test/build/test/**/annotate.js\"",
"test:lint": "eslint src test --max-warnings=0",
"test:prettier": "prettier --check src test",
"test:tsc": "tsc --noEmit",
24 changes: 21 additions & 3 deletions src/build.ts
@@ -3,6 +3,7 @@ import {existsSync} from "node:fs";
import {copyFile, readFile, rm, stat, writeFile} from "node:fs/promises";
import {basename, dirname, extname, join} from "node:path/posix";
import type {Config} from "./config.js";
import {getDuckDBManifest} from "./duckdb.js";
import {CliError} from "./error.js";
import {getClientPath, prepareOutput} from "./files.js";
import {findModule, getModuleHash, readJavaScript} from "./javascript/module.js";
@@ -53,7 +54,7 @@ export async function build(
{config}: BuildOptions,
effects: BuildEffects = new FileBuildEffects(config.output, join(config.root, ".observablehq", "cache"))
): Promise<void> {
const {root, loaders} = config;
const {root, loaders, duckdb} = config;
Telemetry.record({event: "build", step: "start"});

// Prepare for build (such as by emptying the existing output root).
@@ -140,6 +141,21 @@ export async function build(
effects.logger.log(cachePath);
}

// Copy over the DuckDB extensions, initializing aliases that are needed to
// construct the DuckDB manifest.
for (const path of globalImports) {
  if (path.startsWith("/_duckdb/")) {
    const sourcePath = join(cacheRoot, path);
    effects.output.write(`${faint("build")} ${path} ${faint("→")} `);
    const contents = await readFile(sourcePath);
    const hash = createHash("sha256").update(contents).digest("hex").slice(0, 8);
    const [, , , version, bundle, name] = path.split("/");
    const alias = join("/_duckdb/", `${basename(name, ".duckdb_extension.wasm")}-${hash}`, version, bundle, name);
    aliases.set(path, alias);
    await effects.writeFile(alias, contents);
  }
}

// Generate the client bundles. These are initially generated into the cache
// because we need to rewrite any npm and node imports to be hashed; this is
// handled generally for all global imports below.
@@ -149,6 +165,7 @@ export async function build(
effects.output.write(`${faint("bundle")} ${path} ${faint("→")} `);
const clientPath = getClientPath(path === "/_observablehq/client.js" ? "index.js" : path.slice("/_observablehq/".length)); // prettier-ignore
const define: {[key: string]: string} = {};
if (path === "/_observablehq/stdlib/duckdb.js") define["DUCKDB_MANIFEST"] = JSON.stringify(await getDuckDBManifest(duckdb, {root, aliases})); // prettier-ignore
const contents = await rollupClient(clientPath, root, path, {minify: true, keepNames: true, define});
await prepareOutput(cachePath);
await writeFile(cachePath, contents);
@@ -202,9 +219,10 @@ export async function build(

// Copy over global assets (e.g., minisearch.json, DuckDB’s WebAssembly).
// Anything in _observablehq also needs a content hash, but anything in _npm
// or _node does not (because they are already necessarily immutable).
// or _node does not (because they are already necessarily immutable). We’re
// skipping DuckDB’s extensions because they were previously copied above.
for (const path of globalImports) {
if (path.endsWith(".js")) continue;
if (path.endsWith(".js") || path.startsWith("/_duckdb/")) continue;
const sourcePath = join(cacheRoot, path);
effects.output.write(`${faint("build")} ${path} ${faint("→")} `);
if (path.startsWith("/_observablehq/")) {
52 changes: 37 additions & 15 deletions src/client/stdlib/duckdb.js
@@ -29,17 +29,25 @@ import * as duckdb from "npm:@duckdb/duckdb-wasm";
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.

const bundle = await duckdb.selectBundle({
  mvp: {
    mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm"),
    mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js")
  },
  eh: {
    mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-eh.wasm"),
    mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js")
  }
});

// Baked-in manifest.
// eslint-disable-next-line no-undef
const manifest = DUCKDB_MANIFEST;
const candidates = {
  ...(manifest.bundles.includes("mvp") && {
    mvp: {
      mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm"),
      mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js")
    }
  }),
  ...(manifest.bundles.includes("eh") && {
    eh: {
      mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-eh.wasm"),
      mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js")
    }
  })
};
const bundle = await duckdb.selectBundle(candidates);
const activePlatform = manifest.bundles.find((key) => bundle.mainModule === candidates[key].mainModule);
const logger = new duckdb.ConsoleLogger(duckdb.LogLevel.WARNING);

let db;
@@ -169,6 +177,7 @@ export class DuckDBClient {
config = {...config, query: {...config.query, castBigIntToDouble: true}};
}
await db.open(config);
await registerExtensions(db, config.extensions);
await Promise.all(Object.entries(sources).map(([name, source]) => insertSource(db, name, source)));
return new DuckDBClient(db);
}
@@ -178,9 +187,22 @@ export class DuckDBClient {
}
}

Object.defineProperty(DuckDBClient.prototype, "dialect", {
value: "duckdb"
});
Object.defineProperty(DuckDBClient.prototype, "dialect", {value: "duckdb"});

async function registerExtensions(db, extensions) {
  const con = await db.connect();
  try {
    await Promise.all(
      manifest.extensions.map(([name, {[activePlatform]: ref, load}]) =>
        con
          .query(`INSTALL "${name}" FROM '${import.meta.resolve(ref)}'`)
          .then(() => (extensions === undefined ? load : extensions.includes(name)) && con.query(`LOAD "${name}"`))
      )
    );
  } finally {
    await con.close();
  }
}
}

async function insertSource(database, name, source) {
source = await source;
@@ -258,7 +280,7 @@ async function insertFile(database, name, file, options) {
});
}
if (/\.parquet$/i.test(file.name)) {
const table = file.size < 10e6 ? "TABLE" : "VIEW"; // for small files, materialize the table
const table = file.size < 50e6 ? "TABLE" : "VIEW"; // for small files, materialize the table
return await connection.query(`CREATE ${table} '${name}' AS SELECT * FROM parquet_scan('${file.name}')`);
}
if (/\.(db|ddb|duckdb)$/i.test(file.name)) {