support distributed search capability #118

tomkralidis · 2021-06-16T18:55:01Z

Some infrastructures have a metadata catalogue architecture of 1..n distributed endpoints that may harvest or aggregate from one another. While harvesting would be put forth as an extension (see #48), the ability to do a run-time search of a federation of catalogues (that have not harvested/aggregated one another) is also a valuable use case.

Options:

- OARec server has a pre-configured set of federated catalogues and advertises them
  - client must pick from that set, or all catalogues are searched if the client wants (i.e. distributedSearch=TRUE a la CSW 2)
- client can specify any catalogue for an OARec server to perform a distributed search against
- should federation imply only an OARec <-> OARec workflow?
- how to assemble search results (e.g. group by catalogue)
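
To make the first option concrete, here is a minimal sketch of how a server might resolve which catalogues a request targets. The catalogue names, URLs, and the `resolve_targets` helper are all hypothetical; nothing here is defined by OARec.

```python
from typing import Dict, Iterable, Optional

# Hypothetical pre-configured federation: short name -> catalogue endpoint.
FEDERATED_CATALOGUES: Dict[str, str] = {
    "cat-a": "https://example.org/cat-a",
    "cat-b": "https://example.org/cat-b",
}


def resolve_targets(requested: Optional[Iterable[str]] = None,
                    search_all: bool = False) -> Dict[str, str]:
    """Return the catalogues a distributed search should be sent to."""
    if search_all:
        # Comparable in spirit to distributedSearch=TRUE: search everything advertised.
        return dict(FEDERATED_CATALOGUES)
    names = list(requested or [])
    unknown = [name for name in names if name not in FEDERATED_CATALOGUES]
    if unknown:
        # The client may only pick from the advertised set in this option.
        raise ValueError(f"catalogues not in the advertised set: {unknown}")
    return {name: FEDERATED_CATALOGUES[name] for name in names}
```

The second option (client names any catalogue) would drop the membership check, which is where the question of trusting arbitrary remote endpoints comes in.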

cc @pvretano @cportele @uvoges


mhogeweg · 2021-06-16T19:37:50Z

I can see it being useful to have some standardization around federated search. We in fact have this in our Geoportal Search Component where we support a vendor-specific parameter to indicate pre-configured sources to search. For example: search ArcGIS Online and Geoportal Server for 'map'.

What needs to be clear in any specification is:

- how to deal with slow/fast responses from the sources?
  - asynchronous, with results pushed through as they become available?
  - synchronous, waiting for the slowest source catalog to respond?
- how to handle result ranking?
  - source catalogs may apply different matching/ranking algorithms
- how to distinguish results from different sources?
  - we basically wrap the CSW response in a JSON and leave it to the client to parse out the responses from the individual sources
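
A rough sketch of the slow/fast trade-off, assuming each source is wrapped in a plain Python callable (the `search_sources` name and the tuple layout are illustrative, not taken from Geoportal Server). Iterating the generator gives the "as they become available" behaviour; wrapping it in `list(...)` gives the "wait for the slowest source" behaviour.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterator, List, Tuple


def search_sources(sources: Dict[str, Callable[[str], List[dict]]],
                   term: str) -> Iterator[Tuple[str, List[dict], str]]:
    """Yield (source name, records, error message) as each source answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit one query per source and remember which future belongs to which source.
        futures = {pool.submit(fn, term): name for name, fn in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                # Results stay tagged by source, so ranking is never mixed implicitly.
                yield name, future.result(), ""
            except Exception as exc:
                # One failing source should not fail the whole federated search.
                yield name, [], str(exc)
```

This sketch has no per-source timeout, so a source that never answers would still block the end of the search; a real implementation would need to bound that.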

pvretano · 2022-11-14T16:42:16Z

14-NOV-2022:

Federated search would definitely be an extension ... although I suppose a client, knowing how to query one OAPIR server, could query across a bunch of OAPIR servers and then aggregate the results.

We need to be clear about what we mean by harvesting. Three types of harvesting were identified in the SWG call:

One type of harvesting is where one catalogue harvests the records of another catalogue on an ongoing basis to support federated queries. This is more of a "sync" type operation where one catalogue syncs itself against one or more other catalogues so that when a query is performed on it, the results include the records from the other catalogues.
Another type of harvesting is related to populating a catalogue. Rather than a client reading a resource (like a Sentinel product) and creating one or more records to describe that resource, the server can do that and all the client has to do is point the server to the resource. The server will either recognize the resource and harvest it or throw an exception saying it does not know how to read that resource type. This type of harvesting is related to federated search since it is a superset of the first type of harvesting mentioned in this list.
The third type of harvesting is harvesting a crawlable catalogue (static records) in order to transform it into a searchable catalogue.

Another wrinkle here is that the search API for records is really the features API which does not currently include a federated (or cross collection, cross deployment) search capability. This might be another case where we define the functionality here in Records but it eventually gets moved over to Features.

kalxas · 2022-11-15T09:57:06Z

The above 3 types of harvesting cover the "offline" mode of distributed search, i.e. the local catalogue has already done queries to the remote catalogue and has stored the results/records in the local database/model.

We also need to define the "online" mode of distributed search (a.k.a. federated search) that was previously defined in CSW 2 and 3, i.e. the local catalogue is doing live queries to the remote catalogue(s) and presents the results/records without storing them locally.
See http://docs.opengeospatial.org/is/12-168r6/12-168r6.html#58 and http://docs.opengeospatial.org/is/12-176r7/12-176r7.html#85

In this case we need to describe things like how the records can be grouped/aggregated, how a list of federated catalogues is retrieved, etc.
For example:

CSW 3 defined the way to query the remote catalogue:
https://schemas.opengis.net/cat/csw/3.0/examples/Clause_7.3.7_DistributedSearch_Example.xml
CSW 3 defined the way to group results based on the remote catalogue name/url:
https://schemas.opengis.net/cat/csw/3.0/examples/Clause_7.3.7_DestributedSearchResponse_Example.xml
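
As an illustration of grouping by remote catalogue in a JSON setting, a minimal sketch follows. The `group_by_catalogue` helper and the `catalogue`/`records` keys are hypothetical and not taken from CSW 3 or OARec; `numberReturned` is borrowed only as a familiar name for a count.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def group_by_catalogue(tagged_records: Iterable[Tuple[str, dict]]) -> List[dict]:
    """Group (catalogue URL, record) pairs into one entry per catalogue."""
    groups = defaultdict(list)
    for catalogue_url, record in tagged_records:
        groups[catalogue_url].append(record)
    # One block per remote catalogue, in order of first appearance.
    return [
        {"catalogue": url, "numberReturned": len(records), "records": records}
        for url, records in groups.items()
    ]
```

Keeping one block per catalogue sidesteps merging rankings from catalogues that score matches differently.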

There is interest in the EO domain for the online/federated search case because EO catalogues include millions of products/records and are less easy to harvest/maintain in "offline" mode.

tomkralidis · 2023-05-27T00:17:25Z

I think this capability should not be part of core, but rather a conformance class or an extension. Thoughts @pvretano @kalxas @mhogeweg ?

pvretano · 2023-05-27T13:36:14Z

@tomkralidis as I mentioned above, federated or distributed search would definitely be an extension, as harvesting would. So, I agree with you.

tomkralidis · 2023-05-27T13:39:50Z

OK. Perhaps we should move the "Extensions" column out of the "Part 1: Core" project?

tomkralidis · 2023-11-01T14:36:25Z

2023-11-01 OGC API code sprint:

- a client can do a simple federated search in code (JavaScript, Python, etc.)
- there could be organizations who want to publicize a single catalogue
- put forth a proposal in proposals/ for discussion to articulate requirements
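
For the first point, a minimal client-side sketch in Python. It assumes each catalogue exposes an items endpoint that accepts `q` and `limit` and returns a GeoJSON-style body with a `features` array; the `federated_search` helper and the `source`/`record` keys are made up for illustration.

```python
import json
from typing import Callable, Iterable, List
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def federated_search(items_urls: Iterable[str], term: str, limit: int = 10,
                     fetch: Callable[[str], dict] = fetch_json) -> List[dict]:
    """Query each catalogue's items endpoint in turn and merge the features."""
    merged = []
    for items_url in items_urls:
        url = f"{items_url}?{urlencode({'q': term, 'limit': limit})}"
        try:
            features = fetch(url).get("features", [])
        except (OSError, ValueError) as exc:
            # Skip an unreachable or malformed catalogue rather than failing the search.
            print(f"skipping {items_url}: {exc}")
            continue
        for feature in features:
            # "source" and "record" are hypothetical keys, not part of any specification.
            merged.append({"source": items_url, "record": feature})
    return merged
```

Usage would look like `federated_search(["https://example.org/collections/metadata/items"], "map")`, where the URL is a placeholder. The catalogues are queried one after another here, and no attempt is made to re-rank or de-duplicate across them.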

tomkralidis added the extension label Jun 16, 2021

tomkralidis mentioned this issue Nov 6, 2023

OARec: add distributed search functionality geopython/pycsw#919

Merged
