diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md index 88a2be28815..74f0c65936e 100644 --- a/docs/CONTRIBUTING.md +++ b/docs/CONTRIBUTING.md @@ -28,7 +28,7 @@ git checkout -B new-branch-name ## Local package development -### Environment +### Python Environment Use `pip` to install `quilt` locally (including development dependencies): @@ -42,7 +42,7 @@ install](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs) of `quilt`, allowing you to modify the code and test your changes right away. -### Testing +### Python Testing All new code contributions are expected to have complete unit test coverage, and to pass all preexisting tests. @@ -62,7 +62,7 @@ catalog if you already have a catalog deployed to AWS, because the catalog relies on certain services (namely, AWS Lambda and the AWS Elasticsearch Service) which cannot be run locally. -### Environment +### Catalog Environment Use `npm` to install the catalog (`quilt-navigator`) dependencies locally: @@ -152,7 +152,7 @@ Make sure that any images you check into the repository are [optimized](https://kinsta.com/blog/optimize-images-for-web/) at check-in time. -### Testing +### Catalog Testing To run the catalog unit tests: diff --git a/docs/Catalog/Query.md b/docs/Catalog/Query.md new file mode 100644 index 00000000000..0f2c93c11a0 --- /dev/null +++ b/docs/Catalog/Query.md @@ -0,0 +1,55 @@ + +[Amazon Athena](https://aws.amazon.com/athena/) is an interactive query service +that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is +serverless, so there is no infrastructure to manage, and you pay only for the +queries that you run. + +The Catalog's Queries tab allows you to run Athena queries against your S3 +buckets, and any other data sources your users have access to. There are +prebuilt tables for packages and objects, and you can create your own tables and +views. See, for example, [Tabulator](advanced-features/tabulator.md). 
+ +## Basics + +"Run query" executes the selected query and waits for the result. + +![ui](../imgs/athena-ui.png) + + Individual users will also see their past queries, and easily re-run them. + +![history](../imgs/athena-history.png) + +## Example: query package-level metadata + +Suppose we wish to find all packages produced by algorithm version 1.3 with a +cell index of 5. + +```sql +SELECT * FROM "YOUR-BUCKET_packages-view" +-- extract and query package-level metadata +WHERE json_extract_scalar(meta, + '$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%' +AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5'); +``` + +## Example: query object-level metadata + +Suppose we wish to find all .tiff files produced by algorithm version 1.3 +with a cell index of 5. + +```sql +SELECT * FROM "YOUR-BUCKET_objects-view" +WHERE substr(logical_key, -5) = '.tiff' +-- extract and query object-level metadata +AND json_extract_scalar(meta, + '$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%' +AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5'); +``` + +## Configuration + +Athena queries saved from the AWS Console for a given workgroup will be +available in the Quilt Catalog for all users to run. + +Administrators can hide the "Queries" tab by setting `ui > nav > queries: false` +([learn more](./Preferences.md)). diff --git a/docs/Catalog/SearchQuery.md b/docs/Catalog/Search.md similarity index 55% rename from docs/Catalog/SearchQuery.md rename to docs/Catalog/Search.md index 4b51655b4a4..0007e1e6ee4 100644 --- a/docs/Catalog/SearchQuery.md +++ b/docs/Catalog/Search.md @@ -1,22 +1,19 @@ - -Quilt provides support for queries in the Elasticsearch DSL, as -well as SQL queries in Athena. + + +Each Quilt stack includes an Elasticsearch cluster that indexes objects and +packages as documents. 
The objects in Amazon S3 buckets connected to Quilt are +synchronized to an Elasticsearch cluster, which provides Quilt's search and +package listing features. -## Elasticsearch +## Indexing -The objects in Amazon S3 buckets connected to Quilt are synchronized to -an Elasticsearch cluster, which provides Quilt's search features. - -Quilt uses Elasticsearch 6.7 -([docs](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/index.html)). - -### Indexing Quilt maintains a near-realtime index of the objects in your S3 bucket in Elasticsearch. Each bucket corresponds to one or more Elasticsearch indexes. As objects are mutated in S3, Quilt uses an event-driven system (via SNS and SQS) to update Elasticsearch. There are two types of indexing in Quilt: + * *shallow* indexing includes object metadata (such as the file name and size) * *deep* indexing includes object contents. Quilt supports deep indexing for the following file extensions: @@ -28,24 +25,18 @@ indexing for the following file extensions: * .pptx * .xls, .xlsx -> By default, Quilt indexes a limited number of bytes per document for specified file -formats (100KB). Both the max number of bytes per document and which file formats -to deep index can be customized per Bucket in the Catalog Admin settings. - -![Example of Admin Bucket indexing options](../imgs/elastic-search-indexing-options.png) - ### Search Bar The search bar on every page in the catalog provides a convenient shortcut for searching objects and packages in an Amazon S3 bucket. -> Quilt uses Elasticsearch 6.7 [query string -> syntax](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-query-string-query.html#query-string-syntax). +NOTE: Quilt uses Elasticsearch 6.7 [query string +syntax](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-query-string-query.html#query-string-syntax). 
The following are all valid search parameters: -**Fields** +#### Fields | Syntax | Description | Example | |- | - | - | @@ -65,7 +56,7 @@ The following are all valid search parameters: | `package_stats.total_bytes` | Package total bytes | `package_stats.total_bytes:<100` | | `workflow.id` | Package workflow ID | `workflow.id:verify-metadata` | -**Logical operators and grouping** +#### Logical operators and grouping | Syntax | Description | Example | |- | - | - | @@ -75,7 +66,7 @@ The following are all valid search parameters: | `_exists_` | Matches any non-null value for the given field | `_exists_: content` | | `()` | Group terms | `(a AND b) NOT c` | -**Wildcard and regular expressions** +#### Wildcard and regular expressions | Syntax | Description | Example | |- | - | - | @@ -83,55 +74,27 @@ The following are all valid search parameters: | `?` | Exactly one character | `ext:React.?sx` | | `//` | Regular expression (slows performance) | `content:/lmnb[12]/` | -### QUERIES > ELASTICSEARCH tab +### ELASTICSEARCH tab + +When you click into a specific bucket, you can access the Elasticsearch tab to +run more complex queries. The Elasticsearch tab provides a more powerful search +interface than the search bar, allowing you to specify the Elasticsearch index +and query parameters. 
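As a rough illustration of the kind of request the Elasticsearch tab accepts, the sketch below assembles a query in Python and prints it as JSON. The key names (`index`, `size`, `from`, `_source`, `body`) are the ones this page documents, and `query_string` is the standard Elasticsearch query type that takes the search-bar syntax. The helper `build_es_query`, the index name `YOUR-BUCKET`, and the choice of the `content` field are placeholders for illustration only; they are not part of Quilt's API.

```python
import json


def build_es_query(index, query_string, size=10, offset=0, source_fields=None):
    """Assemble a query dict using the keys described for the Elasticsearch tab.

    Hypothetical helper for illustration only; not part of quilt3.
    """
    query = {
        "index": index,   # comma-separated list of indexes to search
        "size": size,     # limit on the number of hits
        "from": offset,   # starting offset for pagination
        "body": {
            # query_string accepts the same syntax as the catalog search bar
            "query": {"query_string": {"query": query_string}}
        },
    }
    if source_fields is not None:
        # a boolean, or a list of fields to return for each hit
        query["_source"] = source_fields
    return query


# "YOUR-BUCKET" is a placeholder index name
example = build_es_query(
    "YOUR-BUCKET", "_exists_: content", size=5, source_fields=["content"]
)
print(json.dumps(example, indent=2))
```

Keeping pagination (`size`, `from`) outside `body` mirrors the key list on this page; whether a given deployment accepts every combination is something to confirm in the tab itself.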
-![](../imgs/catalog-es-queries-default.png) +![catalog-es-queries-default](../imgs/catalog-es-queries-default.png) Quilt Elasticsearch queries support the following keys: -- `index` — comma-separated list of indexes to search ([learn + +* `index` — comma-separated list of indexes to search ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/multi-index.html)) -- `filter_path` — to reducing response nesting, ([learn +* `filter_path` — reduces response nesting ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/common-options.html#common-options-response-filtering)) -- `_source` — boolean that adds or removes the `_source` field, or +* `_source` — boolean that adds or removes the `_source` field, or a list of fields to return ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-source-filtering.html)) -- `size` — limits the number of hits ([learn +* `size` — limits the number of hits ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-uri-request.html)) -- `from` — starting offset for pagination ([learn +* `from` — starting offset for pagination ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-uri-request.html)) -- `body` — the search query body as a JSON dictionary ([learn +* `body` — the search query body as a JSON dictionary ([learn more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-body.html)) - -#### Saved queries -You can provide pre-canned queries for your users by providing a configuration file -at `s3://YOUR_BUCKET/.quilt/queries/config.yaml`: - -```yaml -version: "1" -queries: - query-1: - name: My first query - description: Optional description - url: s3://BUCKET/.quilt/queries/query-1.json - query-2: - name: Second query - url: s3://BUCKET/.quilt/queries/query-2.json -``` - -The Quilt catalog displays your saved queries in a drop-down for your users to -select, edit, and execute.
- -## Athena - -You can park reusable Athena Queries in the Quilt catalog so that your users can -run them. You must first set up you an Athena workgroup and Saved queries per -[AWS's Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html). - -### Configuration -You can hide the "Queries" tab by setting `ui > nav > queries: false` ([learn more](./Preferences.md)). - -### Basics -"Run query" executes the selected query and waits for the result. - -![](../imgs/athena-ui.png) -![](../imgs/athena-history.png) diff --git a/docs/FAQ.md b/docs/FAQ.md index 2290c51511b..02c20d9c2e7 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -1,4 +1,4 @@ - + ## How do I sync my notebook and all of its data and models to S3 as a package? ```python @@ -6,13 +6,16 @@ p = quilt3.Package() p.set_dir(".", ".") p.push("USR/PKG", message="MSG", registry="s3://BUCKET") ``` -> Use a [.quiltignore file](https://docs.quiltdata.com/advanced-usage/.quiltignore) -for more control over which files `set_dir()` includes. + +> Use a [.quiltignore +file](https://docs.quiltdata.com/advanced-usage/.quiltignore) for more control +over which files `set_dir()` includes. ## How does Quilt versioning relate to S3 object versioning? -Quilt packages are one level of abstraction above S3 object versions. -Object versions track mutations to a single file, -whereas a quilt package references a *collection* files and assigns this collection a unique version. + +Quilt packages are one level of abstraction above S3 object versions. Object +versions track mutations to a single file, whereas a Quilt package references a +*collection* of files and assigns this collection a unique version. It is strongly recommended that you enable object versioning on the S3 buckets that you push Quilt packages to. @@ -20,13 +23,16 @@ Object versioning ensures that mutations to every object are tracked, and provides some protection against deletion. ## Where are the Quilt 2 packages?
+ Visit [legacy.quiltdata.com](https://legacy.quiltdata.com/) and use [`quilt`](https://pypi.org/project/quilt/) on PyPI. ## Does `quilt3` collect anonymous usage statistics? + Yes, to find bugs and prioritize features. You can disable anonymous usage collection with an environment variable: + ```bash export QUILT_DISABLE_USAGE_METRICS=true ``` @@ -37,6 +43,7 @@ to persistently disable anonymous usage statistics. ## Can I turn off TQDM progress bars for log files? Yes: + ```bash export QUILT_MINIMIZE_STDOUT=true ``` @@ -44,15 +51,20 @@ export QUILT_MINIMIZE_STDOUT=true ## Which version of Quilt are you on? ### Python client + ```bash quilt3 --version ``` ### CloudFormation application + 1. Go to CloudFormation > Stacks > YourQuiltStack > Outputs 1. Copy the row labeled TemplateBuildMetadata 1. "git_revision" is your template version +This information is also available in the footer of the main page of the +Catalog. + ## Hashing during `push` takes a long time. Can I speed it up? Yes. Follow these steps: @@ -62,26 +74,34 @@ a local machine or foreign region)—I/O is much faster. 1. Use a larger instance with more vCPUs. -1. Increase [`QUILT_TRANSFER_MAX_CONCURRENCY`](api-reference/cli.md#quilt_transfer_max_concurrency) +1. Increase + +[`QUILT_TRANSFER_MAX_CONCURRENCY`](api-reference/cli.md#quilt_transfer_max_concurrency) above its default to match your available vCPUs. -1. If you are using Quilt Catalog 1.51 (released Feb 2024), you can enable the `ChunkedChecksums` CloudFormation parameter so it will calculate the checksums in parallel, or reuse them if already existing in S3. Parallel checksums are also available by default in `quilt3` v6 or later (pre-released Feb 2024). +1. If you are using Quilt Catalog 1.51 (released Feb 2024), you can enable the + `ChunkedChecksums` CloudFormation parameter so it will calculate the + checksums in parallel, or reuse them if already existing in S3. 
Parallel + checksums are also available by default in `quilt3` v6 or later (pre-released + Feb 2024). ## Does Quilt work with R? -In the scientific computing community, the [R Project](https://www.r-project.org/) -is commonly used as an alternative, or companion, to Python. It is a language and -environment for statistical computing and graphics, and is available as Free Software -under the [GNU General Public License](https://www.r-project.org/COPYING). +In the scientific computing community, the [R +Project](https://www.r-project.org/) is commonly used as an alternative, or +companion, to Python. It is a language and environment for statistical computing +and graphics, and is available as Free Software under the [GNU General Public +License](https://www.r-project.org/COPYING). Currently there are no plans to release a Quilt package for distribution through -the [CRAN package repository](https://cloud.r-project.org/). However, you can still -use Quilt with R, using either: +the [CRAN package repository](https://cloud.r-project.org/). However, you can +still use Quilt with R, using either: 1. The Command Line Interface (CLI) API 1. 
[Reticulate](https://rstudio.github.io/reticulate/) ### Using the Quilt CLI API with R + You can script the Quilt CLI directly from your shell environment and chain it with your R scripts to create a unified workflow: @@ -89,25 +109,30 @@ with your R scripts to create a unified workflow: ```bash quilt3 install my-package # download Quilt data package [Run R commands or scripts] # modify the data in Quilt data package using R -quilt3 push --dir path/to/remote-registry my-package # upload Quilt data package to the remote registry +quilt3 push --dir path/to/remote-registry my-package +# upload Quilt data package to the remote registry ``` ### Using Quilt with Reticulate -The [Reticulate](https://rstudio.github.io/reticulate/) package provides a set of tools -for interoperability between Python and R by embedding a Python session within your R session. + +The [Reticulate](https://rstudio.github.io/reticulate/) package provides a set +of tools for interoperability between Python and R by embedding a Python session +within your R session. ## How do I delete a data package and all of the objects in the data package? You may have a test data package that you wish to delete at some point to ensure -your data repository is clean and organized. *Please do this very carefully!* +your data repository is clean and organized. *Please do this very carefully!* In favor of immutability, Quilt makes deletion a bit tricky. First, note that `quilt3.Package.delete` only deletes the -_package manifest_, not the *underlying objects*. If you wish to delete -the entire package *and* its objects, _delete the objects first_. +*package manifest*, not the *underlying objects*. If you wish to delete +the entire package *and* its objects, *delete the objects first*. -*Warning: the objects you delete will be lost forever. Ditto for the package revision.* +*Warning: the objects you delete will be lost forever. 
Ditto for the package +revision.* -To delete, first browse the package then walk it, deleting its entry objects as follows: +To delete, first browse the package then walk it, deleting its entry objects as +follows: ```python @@ -125,38 +150,46 @@ for (k, e) in p.walk(): s3.delete_object(Bucket=pk.bucket, Key=pk.path, VersionId=pk.version_id) ``` -You can then follow the above with `q3.delete_package(pname, registry=reg, top_hash=p.top_hash)`. +You can then follow the above with `q3.delete_package(pname, registry=reg, +top_hash=p.top_hash)`. + +## Do I have to login via quilt3 to use the Quilt APIs? -## Do I have to login via quilt3 to use the Quilt APIs? How do I push to Quilt from a headless environment like a Docker container? +## How do I push to Quilt from a headless environment like a Docker container? -Configure [AWS CLI credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) + +Configure [AWS CLI credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) and `quilt3` will use the same for its API calls. -> Be sure to run `quilt3 logout` if you've previously logged in. +> Be sure to run `quilt3 logout` if you've previously logged in. Select among multiple profiles in your shell as follows: + ```bash export AWS_PROFILE=your_profile ``` The S3 permissions needed by `quilt3` are similar to + [this bucket policy](https://docs.quiltdata.com/advanced/crossaccount#bucket-policies) but `quilt3` does not need either `s3:GetBucketNotification` or `s3:PutBucketNotification`. ## How complex can my Athena queries be? -Amazon Athena supports a subset of Data Defintion Language (DDL) -and Data Manipulation Language (DML) statements, functions, operators, -and data types, based on [Presto](https://prestodb.io/) and [Trino](https://trino.io/). 
+Amazon Athena supports a subset of Data Definition Language (DDL) and Data +Manipulation Language (DML) statements, functions, operators, and data types, +based on [Presto](https://prestodb.io/) and [Trino](https://trino.io/). -This allows for extremely granular querying of your data package name, metadata, and contents -and includes logical operators, comparison functions, conditional expressions, mathematical functions, -bitwise functions, date and time functions and operators, regular expression functions, and aggregate -functions. Please review the references linked below to learn more. +This allows for extremely granular querying of your data package name, metadata, +and contents and includes logical operators, comparison functions, conditional +expressions, mathematical functions, bitwise functions, date and time functions +and operators, regular expression functions, and aggregate functions. Please +review the references linked below to learn more. ### Helpful examples `regexp_extract_all(string, pattern)` - Return the substring(s) matched by the regular expression `pattern` in `string` + +Return the substring(s) matched by the regular expression `pattern` in `string` ```sql @@ -165,11 +198,16 @@ SELECT regexp_extract_all('1a 2b 14m', '\d+'); ### Considerations and limitations -There are [many considerations and limitations](https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html) +There are [many considerations and + +limitations](https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html) when writing Amazon Athena queries. ### References + + * [SQL reference for Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/ddl-sql-reference.html) + * [Functions in Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html) ## Are there any limitations on characters in Quilt filenames? @@ -183,6 +221,7 @@ that might require special handling, and characters to avoid, please review the official Amazon S3 documentation linked below.
### List of safe characters + * Alphanumeric characters: * 0-9 * a-z @@ -197,8 +236,9 @@ review the official Amazon S3 documentation linked below. * Open parenthesis (`(`) * Close parenthesis (`)`) -### References -* [Creating object key names](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html) +For more details, see [Creating object key +names](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html) +in the Amazon S3 documentation. ## How many IPs does a standard Quilt stack require? @@ -206,11 +246,11 @@ Currently, a full size, multi-Availability Zone deployment (without [Voila](https://docs.quiltdata.com/catalog/visualizationdashboards#voila)) requires at least 256 IPs. This means a minimum CIDR block of `/24`. -Optional additional features (such as automated data packaging) require additional IPs. +Optional additional features (such as automated data packaging) require +additional IPs. ## The "Last Modified" column in the Quilt catalog is empty Amazon S3 is a key-value store with prefixes but no true "folders". In the Quilt Catalog Bucket view, as in AWS Console, only objects have a "Last modified" value, whereas package entries and prefixes do not. 
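The `/24` figure in the IP-requirements answer above follows from simple address arithmetic: a `/24` block leaves 8 host bits, and 2^8 = 256. The sketch below checks this with Python's standard `ipaddress` module. The range `10.0.0.0/24` and the helper `smallest_prefix_for` are placeholders for illustration. Note that AWS reserves five addresses in every subnet, so treat 256 as the block size rather than the usable count.

```python
import ipaddress

# 10.0.0.0/24 is a placeholder; substitute the CIDR block planned for the stack
block = ipaddress.ip_network("10.0.0.0/24")

host_bits = block.max_prefixlen - block.prefixlen
print(host_bits)            # 8
print(block.num_addresses)  # 256


def smallest_prefix_for(required_ips):
    """Return the longest IPv4 prefix length whose block holds required_ips."""
    prefix = 32
    while 2 ** (32 - prefix) < required_ips:
        prefix -= 1
    return prefix


print(smallest_prefix_for(256))  # 24
```

If optional features push the requirement past 256 addresses, the same helper shows that the next step is a `/23` block.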
- diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 5d4ac781c09..d7346ee7151 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -5,6 +5,7 @@ * [Mental Model](MentalModel.md) * [Metadata Management](Catalog/Metadata.md) * [Metadata Workflows](advanced-features/workflows.md) +* [Organizing S3 Buckets](advanced-features/s3-bucket-organization.md) ## Quilt Platform (Catalog) User @@ -12,30 +13,28 @@ * [Bucket Browsing](Catalog/FileBrowser.md) * [Document Previews](Catalog/Preview.md) * [Embeddable iFrames](Catalog/Embed.md) -* [Search & Query](Catalog/SearchQuery.md) +* [Search](Catalog/Search.md) * [Qurator Omni](Catalog/Qurator.md) AI Assistant +* [Query](Catalog/Query.md) * [Visualization & Dashboards](Catalog/VisualizationDashboards.md) -* **Advanced** - * [Athena](advanced-features/athena.md) - * [Elasticsearch](walkthrough/working-with-elasticsearch.md) ## Quilt Platform Administrator * [Admin Settings UI](Catalog/Admin.md) * [Catalog Configuration](Catalog/Preferences.md) * [Cross-Account Access](CrossAccount.md) +* [Elasticsearch](walkthrough/working-with-elasticsearch.md) * [Enterprise Installs](technical-reference.md) -* [quilt3.admin Python API](api-reference/Admin.md) +* [SSO Permissions Mapping](advanced-features/sso-permissions.md) +* [Tabulator](advanced-features/tabulator.md) +* [Troubleshooting](Troubleshooting.md) * **Advanced** + * [quilt3.admin Python API](api-reference/Admin.md) + * [GxP for Security & Compliance](advanced-features/good-practice.md) * [Package Events](advanced-features/package-events.md) * [Private Endpoints](advanced-features/private-endpoint-access.md) * [Restrict Access by Bucket Prefix](advanced-features/s3-prefix-permissions.md) * [S3 Events via EventBridge](EventBridge.md) - * [SSO Permissions Mapping](advanced-features/sso-permissions.md) - * [Tabulator](advanced-features/tabulator.md) -* **Best Practices** - * [GxP for Security & Compliance](advanced-features/good-practice.md) - * [Organizing S3 
Buckets](advanced-features/s3-bucket-organization.md) ## Quilt Ecosystem Integrations @@ -71,4 +70,3 @@ * [Changelog](CHANGELOG.md) * [Contributing](CONTRIBUTING.md) * [Frequently Asked Questions](FAQ.md) - * [Troubleshooting](Troubleshooting.md) diff --git a/docs/Troubleshooting.md b/docs/Troubleshooting.md index 4d91dd142f3..6e57cc1ce9e 100644 --- a/docs/Troubleshooting.md +++ b/docs/Troubleshooting.md @@ -1,52 +1,76 @@ - + ## Catalog Overview stats (objects, packages) seem incorrect or aren't updating + ## Catalog Packages tab doesn't work + ## Catalog packages or stats are missing or are not updating -If you recently added the bucket or upgraded the stack, if search volume is high, -or if read/write volume is high, wait a few minutes and try again. +These are all symptoms of the same underlying issue: the Elasticsearch index is +out of sync. If any of the following are true, please wait a few minutes and try +again: + +- you recently added the bucket or upgraded the stack +- search volume is high, or +- read/write volume is high + +If that doesn't work, try the following steps: ### Re-index the bucket -1. Open the bucket overview in the Quilt catalog and click the gear icon (upper right), -or navigate to Admin settings > Buckets and inspect the settings of the bucket in question. +If you have less than 1 million objects in the bucket, you should re-index the +bucket: + +1. Open the bucket overview in the Quilt catalog and click the gear icon (upper +right), or navigate to Admin settings > Buckets and inspect the settings of the +bucket in question. 1. Under "Indexing and notifications", click "Re-index and Repair". -> Optionally: **if and only if** bucket notifications are not working and you are -> certain that there are no other subscribers to the S3 Events of the bucket in -> question, check "Repair S3 notifications". 
+> Optionally: **if and only if** bucket notifications are not working and you +> are certain that there are no other subscribers to the S3 Events of the bucket +> in question, check "Repair S3 notifications". + +Bucket packages, stats, and the search index will repopulate in the next few +minutes. -Bucket packages, stats, and the search index will repopulate in the next few minutes. -Buckets with more than one million objects will take longer. +However, if you have more than 1 million objects in the bucket, re-indexing will +take much longer and potentially become expensive. In that case, please try the +steps below. If those do not work, please contact [Quilt +support](mailto:support@quiltdata.io). ### Inspect the Elasticsearch domain -1. Determine your Quilt instance's ElasticSearch domain from Amazon Console > OpenSearch -or `aws opensearch list-domain-names`. Note the domain name (hereafter `QUILT_DOMAIN`). +1. Determine your Quilt instance's Elasticsearch domain from Amazon Console > +OpenSearch or `aws opensearch list-domain-names`. Note the domain name +(hereafter `QUILT_DOMAIN`). 1. Run the following command and save the output file: ```sh - aws es describe-elasticsearch-domain --domain-name "$QUILT_DOMAIN" > quilt-es-domain.json + aws es describe-elasticsearch-domain --domain-name "$QUILT_DOMAIN" \ + > quilt-es-domain.json ``` 1. Visit Amazon Console > OpenSearch > `QUILT_DOMAIN` > Cluster health. -1. Set the time range as long as possible to fully overlap with your observed issues. +1. Set the time range as long as possible to fully overlap with your observed + issues. -1. Screenshot the Summary, Overall Health, and Key Performance Indicator sections +1. Screenshot the Summary, Overall Health, and Key Performance Indicator + sections. -1. Send the JSON output file and screenshots to [Quilt support](mailto:support@quiltdata.io). +1. Send the JSON output file and screenshots to [Quilt + support](mailto:support@quiltdata.io).
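Before sending `quilt-es-domain.json` to support, it can help to glance at a few headline fields yourself. The sketch below assumes the file has the shape returned by `aws es describe-elasticsearch-domain`, that is, a top-level `DomainStatus` object. The helper names and the choice of fields are illustrative, the sample values are made up, and any field missing from your output is reported as `None`.

```python
import json


def summarize_domain(description):
    """Pick a few headline fields out of a describe-elasticsearch-domain result."""
    status = description.get("DomainStatus", {})
    cluster = status.get("ElasticsearchClusterConfig", {})
    return {
        "domain": status.get("DomainName"),
        "version": status.get("ElasticsearchVersion"),
        "instance_type": cluster.get("InstanceType"),
        "instance_count": cluster.get("InstanceCount"),
        # True while a configuration change is still being applied
        "processing": status.get("Processing"),
    }


def summarize_file(path="quilt-es-domain.json"):
    """Load the JSON saved by the command above and summarize it."""
    with open(path) as f:
        return summarize_domain(json.load(f))


# Abbreviated, made-up sample in the same shape, for demonstration only
sample = {
    "DomainStatus": {
        "DomainName": "example-domain",
        "ElasticsearchVersion": "6.7",
        "ElasticsearchClusterConfig": {
            "InstanceType": "m5.large.elasticsearch",
            "InstanceCount": 2,
        },
        "Processing": False,
    }
}
print(json.dumps(summarize_domain(sample), indent=2))
```

This only reads the saved file; it does not touch the domain, so it is compatible with the advice not to reconfigure Elasticsearch directly.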
-> As a rule you should not reconfigure your Elasticsearch domain directly as this will -> result in stack drift that will be lost the next time you update your Quilt instance. +> As a rule you should **not** reconfigure your Elasticsearch domain directly as +> this will result in stack drift that will be lost the next time you update +> your Quilt instance. ## Missing metadata when working with Quilt packages via the API -> `Package.set_dir()` on the package root (".") overrides package-level metadata. -> If you do not provide `set_dir(".", foo, meta=baz)` with a value for `meta=`, -> `set_dir` will set package-level metadata to `None`. +> `Package.set_dir()` on the package root (".") overrides package-level +> metadata. If you do not provide `set_dir(".", foo, meta=baz)` with a value for +> `meta=`, `set_dir` will set package-level metadata to `None`. A common pattern is to `Package.browse()` to get the most recent version of a package, and then `Package.push()` updates. @@ -90,49 +114,62 @@ after clicking the `RELOAD` button in the Quilt Catalog. 1. Your Quilt user Role has been corrupted. You will need a Quilt Admin user to reset your Quilt user Role to a default (**and valid**) Role. - ## User creation and log in -Users can either be invited directly or are _just-in-time provisioned (JIP)_ when -they sign in via SSO and receive the "default role." + +Users can either be invited directly or are _just-in-time provisioned (JIP)_ +when they sign in via SSO and receive the "default role." ### Important conditions and pre-requisites -* If an admin (or any user) is created by JIP, or created through CloudFormation -with an SSO Provider set to anything other than Disabled, then setting the password -for that user has no effect and _password login will never succeed_ for that user. -Said another way, users created through SSO can only log in through SSO. 
-* You _must disable SSO_ and enable `PasswordAuth` if you wish to log in as an admin -using a password (as opposed to SSO). + +- If an admin (or any user) is created by JIP, or created through CloudFormation +with an SSO Provider set to anything other than Disabled, then setting the +password for that user has no effect and _password login will never succeed_ for +that user. Said another way, users created through SSO can only log in through +SSO. + +- You _must disable SSO_ and enable `PasswordAuth` if you wish to log in as an +admin using a password (as opposed to SSO). ### Unable to log in -The following are common causes of failed logins. In most cases we recommend that -you check the [network panel of your browser](#browser-network-and-console) for details. +The following are common causes of failed logins. In most cases we recommend +that you check the [network panel of your browser](#browser-network-and-console) +for details. -1. SSO connector misconfigured. See [SSO](technical-reference.md#cnames) for details. -1. SSL errors are often caused by misspelled names, or incomplete Subject Alternate Names. -The ACM certificate for `CertificateArnELB` must cover all three Quilt [CNAMEs](technical-reference.md#cnames) either via a suitable `*` or explicit Subject Alternate Names. +1. SSO connector misconfigured. See [SSO](technical-reference.md#cnames) for + details. +1. SSL errors are often caused by misspelled names, or incomplete Subject +Alternate Names. The ACM certificate for `CertificateArnELB` must cover all +three Quilt [CNAMEs](technical-reference.md#cnames) either via a suitable `*` or +explicit Subject Alternate Names. ### Changing the admin email or password -Changing the admin password is only possible with `PasswordAuth=Enabled` in CloudFormation -and is subject to the following limitations for security reasons: -* Has no effect if SSO is in use, or was in use when the admin was first created. 
-* Has no effect on pre-existing admin username/password pairs. +Changing the admin password is only possible with `PasswordAuth=Enabled` in +CloudFormation and is subject to the following limitations for security reasons: + +- Has no effect if SSO is in use, or was in use when the admin was first + created. +- Has no effect on pre-existing admin username/password pairs. You can click "reset password" on the login page. - -To change the admin email (e.g. you have accidentally broken your admin user) try the following: -1. Change the value of the `AdminEmail` CloudFormation parameter _to a net new email_. +To change the admin email (e.g. you have accidentally broken your admin user) +try the following: + +1. Change the value of the `AdminEmail` CloudFormation parameter _to a net new + email_. 1. Apply the change as a stack _Update_. -1. Once the update is successful, the new admin can log in, set roles, and nominate -other admins as needed. +1. Once the update is successful, the new admin can log in, set roles, and +nominate other admins as needed. ## General stack update failure steps -On rare occasions, Quilt stack deployment updates might fail. This can happen for several -reasons. To expedite resolution of stack deployment issues, it's helpful to -have the following data and output from the following [AWS CLI](https://aws.amazon.com/cli/) -commands when contacting support@quiltdata.io. + +On rare occasions, Quilt stack deployment updates might fail. This can happen +for several reasons. To expedite resolution of stack deployment issues, it's +helpful to have the data and output from the following [AWS +CLI](https://aws.amazon.com/cli/) commands when contacting +support@quiltdata.io. 1. Quilt stack outputs: @@ -143,7 +180,7 @@ commands when contacting support@quiltdata.io. --query 'Stacks[].Outputs' ``` -1. Initiate drift detection: +1.
Initiate drift detection: ```sh aws cloudformation detect-stack-drift \ @@ -178,17 +215,21 @@ Quilt support: Chrome menu select **More tools > Developer tools**. 1. Select the **Network** tab. 1. Ensure the session is recorded: - - Google Chrome: Check the red button in the upper left corner is set to **Record**. + - Google Chrome: Check the red button in the upper left corner is set to + **Record**. 1. Ensure **Preserve Log** is enabled. - 1. Perform the action that triggers the error (e.g. clicking the `Download package` button). + 1. Perform the action that triggers the error (e.g. clicking the `Download + package` button). 1. Export the logs as HAR format. - Google Chrome: **Ctrl + Click** anywhere on the grid of network requests and select **Save all as HAR with content**. 1. Save the HAR-formatted file to your localhost. - ![Save browser Network error logs as HAR content](imgs/troubleshooting-logs-browser.png) + ![Save browser Network error logs as HAR + content](imgs/troubleshooting-logs-browser.png) 1. Select the **Console** tab. - 1. Perform the action that triggers the error (e.g. clicking the `Download package` button). + 1. Perform the action that triggers the error (e.g. clicking the `Download + package` button). 1. Export the logs. - Google Chrome: **Ctrl + Click** anywhere on the grid of network requests and select **Save as...**. @@ -201,6 +242,7 @@ Quilt support: ```sh aws cloudformation list-stacks ``` + 1. Capture Quilt log events for the last 30 minutes as follows: ```sh @@ -232,9 +274,9 @@ aws s3api get-object-tagging --bucket "$BUCKET" --key "$PREFIX" ### Specific logical resources -Sometimes you may wish to find an ID or other information from a logical resource -in a Quilt stack. The following example is for security groups. Modify the commands as needed -for other resource types. +Sometimes you may wish to find an ID or other information from a logical +resource in a Quilt stack. The following example is for security groups. 
Modify +the commands as needed for other resource types. ```sh @@ -263,9 +305,8 @@ aws lambda get-event-source-mapping --uuid \ --query StackResourceDetail.PhysicalResourceId --output text) ``` -## Remediation +### Remediation -### Event source mapping If for some reason the event source mapping is disabled, it can be enabled as follows. diff --git a/docs/advanced-features/athena.md b/docs/advanced-features/athena.md deleted file mode 100644 index 993e8448a58..00000000000 --- a/docs/advanced-features/athena.md +++ /dev/null @@ -1,37 +0,0 @@ - -# Querying package metadata with Athena -Quilt stores package data and metadata in S3. Metadata lives in a per-package manifest file -in a each bucket's `.quilt/` directory. - -You can therefore query package metadata wth SQL engines like AWS Athena. -Users can write SQL queries to select packages (or files from within packages) -using predicates based on package or object-level metadata. - -Packages can be created from the resulting tabular data. -To be able to create a package, -the table must contain the columns `logical_key`, `physical_keys` and `size` as shown below. -(See also [Mental Model](https://docs.quiltdata.com/mentalmodel)) - -## Defining package tables and views in Athena - -> This step is not required for users of Quilt enterprise, since tables and views -are managed by Quilt. Check the value of `UserAthenaDatabaseName` output in your -CloudFormation stack to know the name of the Athena database it created. - -The first step in configuring Athena to query the package contents and metadata -is to define a set of tables and views that represent the metadata fields as columns. -The easiest way to do this is using the pre-built CloudFormation templates -available in the [examples repository](https://github.com/quiltdata/examples/tree/master/athena_cfn/). - -## Example: query object-level metadata - -Suppose we wish to find all .tiff files produced by algorithm version 1.3 -with a cell index of 5. 
- -```sql -SELECT * FROM "YOUR-BUCKET_objects-view" -WHERE substr(logical_key, -5) = '.tiff' --- extract and query object-level metadata -AND json_extract_scalar(meta, '$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%' -AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5'); -``` diff --git a/docs/walkthrough/working-with-elasticsearch.md b/docs/walkthrough/working-with-elasticsearch.md index 2a8d2c4332a..3f3f2ca2703 100644 --- a/docs/walkthrough/working-with-elasticsearch.md +++ b/docs/walkthrough/working-with-elasticsearch.md @@ -1,23 +1,55 @@ - - + Each Quilt stack includes an Elasticsearch cluster that indexes objects and packages as documents. The cluster is deployed in the AWS OpenSearch service. You can connect to your Elasticsearch domain to query documents. -> If your Quilt stack uses private endpoints for Elasticsearch you will need to -> connect to the cluster from a machine in the same VPC as the cluster. +Each Amazon S3 bucket connected to Quilt implies two Elasticsearch index +aliases: -Each Amazon S3 bucket connected to Quilt implies two Elasticsearch index aliases: 1. `YOUR_BUCKET_NAME`: Contains one document per object in the bucket. -2. `YOUR_BUCKET_NAME_packages`: Contains one document per package revision in the bucket. +2. `YOUR_BUCKET_NAME_packages`: Contains one document per package revision in + the bucket. + +## Configuring Saved Queries + +You can provide pre-canned Elasticsearch queries for your users by providing a +configuration file at `s3://YOUR_BUCKET/.quilt/queries/config.yaml`: + +```yaml +version: "1" +queries: + query-1: + name: My first query + description: Optional description + url: s3://BUCKET/.quilt/queries/query-1.json + query-2: + name: Second query + url: s3://BUCKET/.quilt/queries/query-2.json +``` + +The Quilt catalog displays your saved queries in a drop-down for your users to +select, edit, and execute. + +## Managing Elasticsearch -> Quilt uses Amazon Elasticsearch version 6.7. 
+ +Quilt uses Amazon Elasticsearch 6.7 +([docs](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/index.html)). -## Query Elasticsearch with Python +1. If your Quilt stack uses private endpoints for Elasticsearch you will need to + connect to the cluster from a machine in the same VPC as the cluster. +2. By default, Quilt indexes a limited number of bytes per document for +specified file formats (100KB). Both the max number of bytes per document and +which file formats to deep index can be customized per Bucket in the Catalog +Admin settings. -You can use [`elasticsearch -6.3.1`](https://elasticsearch-py.readthedocs.io/en/6.3.1/) as + +![Example of Admin Bucket indexing options](../imgs/elastic-search-indexing-options.png) + +## Querying Elasticsearch with Python + +You can use [`elasticsearch`](https://elasticsearch-py.readthedocs.io/en/) as follows: @@ -72,7 +104,8 @@ will be one result (`Logical ID` value of `Search`). Elasticsearch cluster in the AWS OpenSearch service. 1. Select the "Cluster health" tab. 1. Review the "Summary" section (look for **Green** Status): - - If your cluster Status is **Red** or **Yellow**, notify your Quilt account manager. + - If your cluster Status is **Red** or **Yellow**, notify your Quilt account + manager. 1. In the "Overall health" section, update the "Time range" to `2w` and review all graphs, paying particular attention to: - Total free storage space: if one or more nodes in your cluster @@ -98,6 +131,7 @@ directly. ### References + - [Sizing Amazon OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html) @@ -109,7 +143,7 @@ workload. There is known bug in CloudFormation when deploying and/or upgrading Quilt stacks using t2 or t3 instance types. During stack deployments the following error may be encountered: -``` +```log Autotune is not supported in t2/t3 instance types. Disable autotune or change your instance type. 
(Service: AWSElasticsearch; Status Code: 400; Error Code: ValidationException; @@ -123,7 +157,9 @@ actions and re-run the Quilt CloudFormation deployment: 1. Access the Quilt OpenSearch cluster (see steps 1 - 3 above). 1. Select the "Auto-Tune" tab. -1. Review the "Status" value. If the value is **Turned on**, click the "Edit" button. -1. Select the option to "Turn off" Auto-Tune and click the "Save changes" button: +1. Review the "Status" value. If the value is **Turned on**, click the "Edit" + button. +1. Select the option to "Turn off" Auto-Tune and click the "Save changes" + button: ![Auto-Tune configuration](../imgs/elastic-search-autotune.png)