support distributed search capability #118

Open
tomkralidis opened this issue Jun 16, 2021 · 7 comments

Comments

@tomkralidis
Contributor

Some infrastructures have a metadata catalogue architecture of 1..n distributed endpoints that may harvest or aggregate from one another. While harvesting would be put forth as an extension (see #48), the ability to do a run-time search of a federation of catalogues (that have not harvested/aggregated one another) is also a valuable use case.

Options:

  • OARec server has a pre-configured set of federated catalogues, advertises them
    • client must pick from that set or all catalogues are searched if client wants (i.e. distributedSearch=TRUE a la CSW 2)
  • client can specify any catalogue for an OARec to perform a distributed search against
  • should federation imply only OARec <-> OARec workflow?
  • how to assemble search results (e.g. group by catalogue)
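
To make the first three options concrete, here is a minimal sketch of how a server might decide which catalogues a request fans out to. Everything in it is an assumption for illustration: the `FEDERATED_CATALOGUES` registry, the `resolve_targets` helper, the `allow_arbitrary` switch and the example URLs are hypothetical, and only the `distributedSearch=TRUE` idea comes from the list above.

```python
from typing import Iterable, List, Optional

# Hypothetical pre-configured federation that the server would advertise.
FEDERATED_CATALOGUES = {
    "cat-a": "https://example.org/cat-a",
    "cat-b": "https://example.org/cat-b",
}


def resolve_targets(requested: Optional[Iterable[str]] = None,
                    distributed_search: bool = False,
                    allow_arbitrary: bool = False) -> List[str]:
    """Return the catalogue URLs a distributed search should be sent to."""
    if requested:
        targets = []
        for item in requested:
            if item in FEDERATED_CATALOGUES:
                # Option 1: client picks from the advertised set.
                targets.append(FEDERATED_CATALOGUES[item])
            elif allow_arbitrary and item.startswith(("http://", "https://")):
                # Option 2: client names any catalogue by URL.
                targets.append(item)
            else:
                raise ValueError(f"unknown catalogue: {item}")
        return targets
    if distributed_search:
        # distributedSearch=TRUE with no selection: search the whole federation.
        return list(FEDERATED_CATALOGUES.values())
    # Default: local search only, no remote targets.
    return []
```

Whether arbitrary client-supplied catalogues should be allowed at all is one of the open questions; the sketch keeps it behind a flag so both behaviours are visible.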

cc @pvretano @cportele @uvoges

@mhogeweg
Contributor

I can see it being useful to have some standardization around federated search. We in fact have this in our Geoportal Search Component where we support a vendor-specific parameter to indicate pre-configured sources to search. For example: search ArcGIS Online and Geoportal Server for 'map'.

What needs to be clear in any specification is:

  • how to deal with slow/fast responses from the source.
    • Asynchronous and results are pushed through as they become available?
    • Synchronous but wait for the slowest source catalog to respond?
  • how to handle result ranking?
    • source catalogs may apply different matching/ranking algorithms
  • how to distinguish results from different sources?
    • we basically wrap the CSW response in a JSON and leave it to the client to parse out the responses from the individual sources
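
As a rough illustration of the synchronous variant with a deadline, the sketch below queries all sources in parallel, waits up to a timeout, and keeps each source's results in its own entry so the client can tell them apart. It is not how Geoportal Search or any other product works; `federated_search`, the per-source callables and the `status` values are all hypothetical. Ranking is deliberately left untouched, since each source may rank differently.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(sources, query, timeout=1.0):
    """Query every source in parallel and group the outcome by source name.

    sources: dict mapping a source name to a callable(query) -> list of records.
    Sources that have not answered within `timeout` seconds are reported as
    "timeout" instead of delaying the whole response.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(list(futures), timeout=timeout)

    results = {}
    for fut in done:
        name = futures[fut]
        try:
            results[name] = {"status": "ok", "records": fut.result()}
        except Exception as exc:  # a failing source must not break the others
            results[name] = {"status": "error", "detail": str(exc), "records": []}
    for fut in not_done:
        results[futures[fut]] = {"status": "timeout", "records": []}

    # Do not block on slow sources; their threads finish in the background.
    pool.shutdown(wait=False)
    return results
```

An asynchronous variant would instead hand results to the client as each future completes (for example with `concurrent.futures.as_completed`), at the cost of a more involved response format.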

@pvretano
Contributor

pvretano commented Nov 14, 2022

14-NOV-2022:

Federated search would definitely be an extension ... although I suppose a client, knowing how to query one OAPIR server could query across a bunch of OAPIR servers and then aggregate the results.

We need to be clear about what we mean by harvesting. Three types of harvesting were identified in the SWG call:

  • One type of harvesting is where one catalogue harvests the records of another catalogue on an ongoing basis to support federated queries. This is more of a "sync" type operation where one catalogue syncs itself against one or more other catalogues so that when a query is performed on it, the results include the records from the other catalogues.
  • Another type of harvesting is related to populating a catalogue. Rather than a client reading a resource (like a Sentinel product) and creating one or more records to describe that resource, the server can do that and all the client has to do is point the server to the resource. The server will either recognize the resource and harvest it or throw an exception saying it does not know how to read that resource type. This type of harvesting is related to federated search since it is a superset of the first type of harvesting mentioned in this list.
  • The third type of harvesting is harvesting a crawlable catalogue (static records) in order to transform it into a searchable catalogue.

Another wrinkle here is that the search API for records is really the features API which does not currently include a federated (or cross collection, cross deployment) search capability. This might be another case where we define the functionality here in Records but it eventually gets moved over to Features.

@kalxas
Member

kalxas commented Nov 15, 2022

The above 3 types of harvesting cover the "offline" mode of distributed search, i.e. the local catalogue has already done queries to the remote catalogue and has stored the results/records in the local database/model.

We also need to define the "online" mode of distributed search (a.k.a. federated search) that was previously defined in CSW 2 and 3, i.e. the local catalogue is doing live queries to the remote catalogue(s) and presents the results/records without storing them locally.
See http://docs.opengeospatial.org/is/12-168r6/12-168r6.html#58 and http://docs.opengeospatial.org/is/12-176r7/12-176r7.html#85

In this case we need to describe things like how the records can be grouped/aggregated, how a list of federated catalogues is retrieved, etc.
For example:

There is interest in the EO domain for the online/federated search case because EO catalogues include millions of products/records and are less easy to harvest/maintain in "offline" mode.

@tomkralidis
Contributor Author

I think this capability should not be part of core, but as a conformance class or an extension. Thoughts @pvretano @kalxas @mhogeweg ?

@pvretano
Contributor

@tomkralidis as I mentioned above, federated or distributed search would definitely be an extension, as would harvesting. So, I agree with you.

@tomkralidis
Contributor Author

OK. Perhaps we should move the "Extensions" column out of the "Part 1: Core" project?

@tomkralidis
Contributor Author

2023-11-01 OGC API code sprint:

  • a client can do a simple federated search in code (JavaScript, Python, etc.)
  • there could be organizations who want to publicize a single catalogue
  • put forth a proposal in proposals/ for discussion to articulate requirements
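
For reference, the "simple federated search in code" from the first bullet could look roughly like the Python sketch below. It assumes each catalogue exposes an OGC API - Records items endpoint that accepts `q` and `limit` and returns a GeoJSON FeatureCollection; the `search_catalogues` and `fetch_json` names, the injectable `fetch` argument and the shape of the returned dict are hypothetical choices for this sketch, not part of any specification.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def search_catalogues(items_urls, q, limit=10, fetch=fetch_json):
    """Run the same query against several items endpoints and merge the results.

    items_urls: URLs of the form .../collections/{collectionId}/items
    Each record is tagged with the endpoint it came from so that results can
    later be grouped by catalogue. Unreachable catalogues are reported in
    "errors" rather than failing the whole search.
    """
    records, errors = [], {}
    for items_url in items_urls:
        url = f"{items_url}?{urlencode({'q': q, 'limit': limit})}"
        try:
            page = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            errors[items_url] = str(exc)
            continue
        for feature in page.get("features", []):
            records.append({"source": items_url, "record": feature})
    return {"records": records, "errors": errors}
```

This queries the catalogues one after another and does no de-duplication or cross-catalogue ranking, which are exactly the points a proposal would need to articulate.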


4 participants