loadFragment() with given hash #371

Open
indus opened this issue Jul 31, 2023 · 15 comments
Labels
improvement (Not a bug) · Pagefind CLI (The CLI responsible for indexing content) · Pagefind Search (The web-facing search js/wasm API)

Comments

@indus

indus commented Jul 31, 2023

Is it possible to directly load a fragment for a hash that was not obtained by the search?
Something like a public version of the loadFragment function: https://github.com/CloudCannon/pagefind/blob/main/pagefind_web_js/lib/coupled_search.ts#L234 ?

@bglw
Contributor

bglw commented Jul 31, 2023

That could definitely be exposed, though Pagefind offers no method for finding a hash you're looking for outside of a search result. What's the use-case you're looking to fill here? (How are you planning to get the hash to pass to this function?) There might be a better way to get there 🙂

@indus
Author

indus commented Jul 31, 2023 via email

@indus
Author

indus commented Jul 31, 2023

I'm not sure how relevant it would be for other applications and what it would mean for the pagefind code, but maybe an attribute like pagefind-hash='<custom_hash>' that allows for a custom user defined hash would make this database like (mis-)usage of the pagefind index even easier and more flexible.
Or as an alternative an option to write a plain JSON file of the index at build time with the hash as key and the metadata as value?!?

@bglw
Contributor

bglw commented Jul 31, 2023

Ah, cool! Nice use-case.

The purpose of the hashes is to eliminate any stale caching issues, so I'd be hesitant to provide a custom hash functionality. The option for Pagefind to write a plain JSON file is totally doable, though, I'll look into that. And no reason the explicit call to load a fragment can't be exposed, so I'll tackle that too.

bglw added the improvement, Pagefind Search, and Pagefind CLI labels Jul 31, 2023
@indus
Author

indus commented Jul 31, 2023 via email

@julbd

julbd commented Sep 10, 2024

@bglw I have this issue with the node library. A quick solution could be to return the hash, alongside other data, when the record is created.

@marcuswhybrow

A quick solution could be to return the hash, alongside other data, when the record is created.

This is exactly what I need too. In my project I have 500 indexed HTML files all of which I display in my web UI. Displaying and visually filtering this many elements forces me to await data() for every result to get the url to identify each file displayed in my UI.

Workarounds

My workaround is to create a reverse lookup table from result url (hidden behind the await data()) to result id by calling pagefind.search(null) in the background on page load. Awaiting 500 calls to data() massively blocks the event loop, which freezes the UI.

To fix the UI freezing, one needs the Scheduler API or a setTimeout hack to break up the task sizes and allow UI updates some execution time.
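A minimal sketch of the workaround described above, assuming a `pagefind` object whose `search(null)` resolves to `{ results }`, where each result has an `id` and an async `data()` that returns a fragment with a `url`, as this thread describes. The name `buildUrlLookup` and the `batchSize` parameter are made up for illustration, and the `setTimeout` pause between batches is the hack mentioned above rather than a Pagefind feature.

```javascript
// Sketch only: builds a url -> result id lookup in small batches.
// `pagefind` is passed in so the function can be tried with a stub.
// `batchSize` is an assumed tuning knob, not a Pagefind option.
async function buildUrlLookup(pagefind, batchSize = 25) {
  const { results } = await pagefind.search(null);
  const idByUrl = new Map();

  for (let i = 0; i < results.length; i += batchSize) {
    const batch = results.slice(i, i + batchSize);

    // Load one batch of fragments in parallel.
    const fragments = await Promise.all(batch.map((result) => result.data()));
    fragments.forEach((fragment, j) => idByUrl.set(fragment.url, batch[j].id));

    // Yield to the event loop so rendering and input handling can run
    // before the next batch starts.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }

  return idByUrl;
}
```

A smaller `batchSize` keeps each task shorter at the cost of a longer total build time for the table; the right value depends on the page.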

I also tried moving await data() and Pagefind searching in general into a service worker (so it's on a separate thread and not blocking the main event loop) but Pagefind complains that window doesn't exist (see #605).

Solution?

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue trying to filter hundreds of results efficiently.

@bglw
Contributor

bglw commented Oct 1, 2024

Hi all 👋

I'll be working on this one soon, along with #715

Both will come via a CLI flag to output a file containing information about the index — filters, fragments, etc. This will be output at the conclusion of the build.

The API will gain a matching function, something like await index.getIndexCatalogue() (name pending 🤷). This would be called between adding the last file to the index and writing content, or possibly be a return value from writing content. TBD. In any case, let me know if that sounds like it will be viable 🙂

If pagefind.addHTMLFile (in the node wrapper) returned the result id, this would solve my main issue

Unfortunately this one isn't possible without some more changes. At present, the IDs aren't allocated until the conclusion of indexing, so they aren't known at the point of responding to any of the add* functions.

@julbd

julbd commented Oct 1, 2024

Hi @bglw. Thank you for listening to our issues :)

In my use case, the best solution would be to have the record hash directly returned by addCustomRecord().

@bglw
Contributor

bglw commented Oct 1, 2024

Hmm, well that needs some more thought 😅

Just to rattle off some thoughts, for context and for myself:

Pagefind uses fairly short page IDs, to reduce the size of the metadata it needs to load up front. The downside of this is that collisions can and do happen, so the IDs are allocated at the end of the indexing, and pages will adjust their hash if it would collide. One goal for this is that both pages should adjust, which means the ID of a page may need to change after it has been allocated.

So the big issue is until all files have been indexed, we don't know how short to make the page ID.

The primary purpose of these hashes is to solve caching issues when the index changes after a build, so I'm hesitant to change the strategy too significantly.

One idea that might work would be to adopt a git-ish concept of short and long IDs, and return the long ID from the add* functions. So your response would come back with a record hash like en_11badb2e36d2246bc6756b4a2f38479d3893692. Ultimately that page will be stored as en_11badb2, or en_11badb2e, or maybe even en_11badb2e3; in any case it'll be a prefix of the full page hash. Pagefind would then allow you to supply a full-length page hash and it'll find the relevant fragment.

With that:

  • Calling something like addCustomRecord() would return:
    {
        uniqueWords: 1234,
        url: "....",
        meta: { /* ... */ },
        long_id: "en_11badb2e36d2246bc6756b4a2f38479d3893692"
    }
  • When finished indexing, getIndexCatalogue would be able to return both the long and short ids for any given page.
  • The loadFragment function would accept either a long id or a short id
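To make the short/long ID idea concrete, here is a small sketch of how a lookup could accept either form. It only illustrates the proposal; it is not Pagefind's implementation, and `resolveFragmentId` and `shortIds` are hypothetical names. It assumes the set of stored short IDs is available from somewhere, for example the proposed index catalogue.

```javascript
// Hypothetical helper: maps a long or short page ID to the short ID
// that is actually stored in the index.
// `shortIds` is a Set of the stored short IDs (assumed input).
function resolveFragmentId(id, shortIds) {
  // A short ID that is already stored resolves to itself.
  if (shortIds.has(id)) return id;

  // Otherwise treat `id` as a long ID and look for a stored prefix,
  // trying the longest candidate first.
  for (let len = id.length - 1; len > 0; len--) {
    const prefix = id.slice(0, len);
    if (shortIds.has(prefix)) return prefix;
  }

  return null; // no stored ID is a prefix of this one
}

// Usage with the example hash from this comment:
const stored = new Set(["en_11badb2e", "en_7f3c9a01"]);
const shortId = resolveFragmentId(
  "en_11badb2e36d2246bc6756b4a2f38479d3893692",
  stored
);
// shortId is "en_11badb2e"
```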

@marcuswhybrow

Hi @bglw

Given the ID de-duplication restrictions you mention, it's fine for my use case to leave addHTMLFile as is in favor of this new index.getIndexCatalogue(). So long as I can, somewhere at compile time, determine which ultimate result id matches which result url it's fine if this happens at the end of indexing.

Cheers.

@julbd

julbd commented Oct 2, 2024

Thank you for your detailed answer. I understand the issue you are facing and why the ID is not already returned on record creation. Your solution would work but I see two issues:

  • If the goal of the short ID is to reduce bandwidth usage (it sounds like micro-optimization, but ok), then you'll have the same bandwidth issue with the long ID (which will be used in a client-side reverse-lookup table).
  • KISS (Keep it stupidly simple).

I would rather suggest, if possible, checking the ID availability (and regenerating it if it is a duplicate) at creation time. However, that's fine, I can use getIndexCatalogue() too!


Edit: I'm thinking of the following solution that would address our use-cases more directly. In my case, I have a map with points. I need to know the location of all the points (with and without filters), but I only need the location. Currently, I am relying on a pre-generated JSON file to retrieve the location from the fragment ID, without having to fetch each fragment individually.

This issue could be resolved with a combination of:

  • Multiple "views", according to the data we need to fetch (all of it or a small portion of it). But this can already be achieved by generating two separate indexes.
  • Fragments "packs". The search client would prefer to load pre-generated packs of fragments (grouped by filter, common words, a user-defined key, etc.) when it detects that the cost of transferring the additional data (do we care, with today's connection speeds?) is lower than the cost of the extra requests (do we care, with HTTP/2?).

@marcuswhybrow

marcuswhybrow commented Oct 2, 2024

As a user story, my initial developer ergonomics expectation was that search(null) would return { results } that each had some id that pointed "back" to the indexed content given. Pagefind calls this the url, I think, but the url is inside the fragment. What { results } contains is an id, which, because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item.

After some digging, I now understand that id points "forward" to the fragment henceforth to be loaded by data().

One possibility, perhaps too piecemeal a change, is including url in { results } to open up "backward" referencing to arbitrary data. Personally, I think getIndexCatalogue() is the better alternative since, although a "fragment id to url lookup data-structure" is being transferred to the client in either case, the getIndexCatalogue() approach is an opt-in cost.

@julbd, getIndexCatalogue() alone solves my page load issue (as well as fast filtering).

@bglw however, once that's solved, arbitrary (build time) fragment splitting could indeed massively reduce my UI's search times. For hundreds of indexed items I only need { id, excerpt } from { sub_results }, but for that I'm loading every word in the index document too, since it's all in the same fragment. Currently I'm amortising this cost by loading fragments for visible results first, then loading the out of viewport stuff afterwards.

I think it's fair to say that loading hundreds of fragments could be considered out of scope for Pagefind. I'm sure 95% of applications are paginating results. Also fragment splitting sounds like a major rewrite of core functionality.

Anyway, just food for thought. And thanks for helping. getIndexCatalogue() is probably enough for me. Cheers.

@bglw
Contributor

bglw commented Oct 2, 2024

👋 @julbd

you'll have the same bandwidth issue with long ID

Correct! That's the limitation. For people loading them all into a client-side bundle, the recommendation would be to use the indexCatalogue to look up the corresponding short hash — but at that point you may as well just rely on the indexCatalogue for everything.

From my side, I'll continue with the indexCatalogue idea and we'll see how it goes, but we can revisit the idea of returning hashes while indexing if it seems crucial!

I would rather suggest, if possible, to check the ID availability (and regenerate it if duplicate) at creation time.

The main blocker here is that:

  • IDs need to be hashes to make subsequent builds stable
  • Indexing should not be order-dependent

Importantly for the second one, playing through a scenario:

  • We have a page A with hash abcfm
  • We have a page B with hash abcrt
  • Page A is indexed first, and takes the ID abc
  • Page B is indexed and tries to take abc, fails, and instead takes abcr
  • Pagefind runs again, but this time indexes Page B first, so it succeeds in getting ID abc
  • Page A now ends up with the ID of abcf

Now if any user has the hash fragment for abc still cached from build 1, but they search using build 2, they'll get the fragment for Page A but it should be for Page B. Hence, the ideal situation is both pages change due to the clash, and take the IDs abcf and abcr.
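The scenario above can be replayed with a short sketch. This is not Pagefind's actual allocation code; `allocateShortIds` and `minLength` are hypothetical, and the quadratic scan is only there to keep the example small. The point it illustrates is that when every page extends its prefix until no other page shares it, the result depends only on the set of hashes and not on the order they were indexed in.

```javascript
// Illustration only: order-independent short ID allocation.
// Each hash gets the shortest prefix (at least `minLength` characters)
// that no OTHER hash starts with, so both sides of a clash grow.
function allocateShortIds(hashes, minLength = 3) {
  const ids = new Map();
  for (const hash of hashes) {
    let len = minLength;
    while (
      len < hash.length &&
      hashes.some((other) => other !== hash && other.startsWith(hash.slice(0, len)))
    ) {
      len++;
    }
    ids.set(hash, hash.slice(0, len));
  }
  return ids;
}

// Page A has hash abcfm, Page B has hash abcrt.
const buildOne = allocateShortIds(["abcfm", "abcrt"]); // A indexed first
const buildTwo = allocateShortIds(["abcrt", "abcfm"]); // B indexed first
// In both builds: abcfm -> abcf and abcrt -> abcr; neither page takes abc.
```

Because no page ever ends up holding `abc`, a fragment cached under one build cannot be mistaken for a different page in the next build.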

We are getting into micro-optimizations here! But these are also all scenarios that have been encountered with Pagefind in practice 😅

Fragments "packs"

This is an interesting idea! I like it 🤔 It feels tangential to this issue, would you mind opening a new one for that? :)

👋 @marcuswhybrow

because it contains the word "unknown", led me to believe I was supposed to supply the id somewhere for each Pagefind indexed item

Ah, the unknown prefix there is actually the language! Normally you would see the ID as en_... or fr_.... In the case you have no language attribute on your HTML element, you get the unknown_ language prefix (and webassembly).
(Side point, I'd recommend setting the language! In the unknown language you'll miss out on some word stemming)

in either case, the getIndexCatalogue() approach is an opt-in cost

Agreed! The URL not being returned is quite intentional, so I'd be resistant to adding it. (Currently all IDs are loaded up front with Pagefind, and loading the URLs at the same time would start getting heavy). I like that the indexCatalogue concept gives an extension to some of these niche use cases where it's needed without impacting the base case for bandwidth.

arbitrary (build time) fragment splitting

Can you elaborate? The two ways I can read this are:

  • Fragments are combined together into larger files and loaded as large masses
  • Many fields from the fragments are removed to make each one smaller

I'm sure 95% of applications are paginating results

Correct! Or my favorites use an IntersectionObserver to load the fragment when the result enters the viewport :)
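A rough sketch of the IntersectionObserver approach, assuming each search result exposes an async `data()` as described in this thread. The function name `lazyLoadResults`, the `{ element, result }` pairing, and the `render` callback are invented for the example. The observer constructor is a parameter only so the sketch can be exercised outside a browser; in a page it defaults to the standard `IntersectionObserver`.

```javascript
// Sketch only: load a result's fragment the first time its element
// becomes visible.
// entries: array of { element, result } pairs (hypothetical shape)
// render:  callback receiving (element, fragment)
function lazyLoadResults(entries, render, Observer = globalThis.IntersectionObserver) {
  const resultByElement = new Map(
    entries.map(({ element, result }) => [element, result])
  );

  const observer = new Observer((observed) => {
    for (const entry of observed) {
      if (!entry.isIntersecting) continue;

      // Stop watching so each fragment is requested only once.
      observer.unobserve(entry.target);

      const result = resultByElement.get(entry.target);
      result.data().then((fragment) => render(entry.target, fragment));
    }
  });

  for (const { element } of entries) observer.observe(element);
  return observer;
}
```

In a page, `entries` would pair each placeholder DOM node with its search result, and `render` would fill the node in once the fragment arrives.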

@marcuswhybrow

@bglw

the unknown prefix there is actually the language

😆 Cheers.

I've sent you PR #719 with two minor additions to the getting started docs re the lang attribute and its relationship to result ids. I think the PR should save other new users from my misreading.

arbitrary (build time) fragment splitting

Can you elaborate?

Option 2 (+ extras): Multiple fragments for each indexed file:

  • Currently (I think) Pagefind loads a singular fragment file for each result's data() call (which can be cached).
  • I believe this fragment contains all the data that Pagefind has for that result.
  • At build-time, the UI code may know that only a subset of that fragment data is useful.
  • What if one could call result.data("subset-name") to load a subset of fragment fields.
  • To achieve this, the node API could expose index.defineFragmentSubset("subset-name", fragment => {}) to generate an arbitrary number of "fragment subset" files for each indexed document during index.writeFiles.
  • For most use cases a single call to data("subset-name") could replace all calls to data(), reducing overall bandwidth.
  • In complex scenarios, one could call data("subset-a") and then later data("subset-b").
  • data() would still be available and perform as it does now, perfect for users who haven't specified any subsets.
  • There's also an opt-in opportunity to never generate full fragments at all, saving hosting space for those who never call data without a subset name.

This opt-in generation of fragment subsets would allow users to make their own trade-offs between the number of HTTP requests required and (even further) reduced bandwidth (reduced search times).
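To show the shape of the idea, here is a purely hypothetical sketch. Neither `defineFragmentSubset` nor `data("subset-name")` exists in Pagefind; `buildFragmentSubsets`, the subset names, and the fragment fields below are all invented for illustration. It only shows how named projection functions could turn one full fragment into several smaller ones at build time.

```javascript
// Hypothetical: derive named "subset" fragments from one full fragment.
// `subsetDefinitions` maps a subset name to a projection function,
// mirroring the proposed defineFragmentSubset(name, fragment => {}).
function buildFragmentSubsets(fragment, subsetDefinitions) {
  const subsets = {};
  for (const [name, project] of Object.entries(subsetDefinitions)) {
    subsets[name] = project(fragment);
  }
  return subsets;
}

// Example fragment with made-up fields.
const fullFragment = {
  url: "/articles/a/",
  content: "the full text of the page ...",
  excerpt: "a short excerpt",
  meta: { title: "Article A" },
};

const subsets = buildFragmentSubsets(fullFragment, {
  // A small subset for list views: no page content at all.
  listing: (f) => ({ url: f.url, title: f.meta.title }),
  // A slightly larger subset that also carries the excerpt.
  preview: (f) => ({ url: f.url, excerpt: f.excerpt }),
});
```

Each entry in `subsets` would be written as its own file, so a client that only needs the listing fields never downloads `content`.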

I'm not overly familiar with the Pagefind code-base itself, so take my idea with a pinch of salt, but that's my conceptualisation of @julbd's idea.

my favorites use an IntersectionObserver

I think I'll give that a go!

marcuswhybrow added a commit to marcuswhybrow/ray-peat-rodeo that referenced this issue Oct 3, 2024
After deciding on Web Components, Go types had to be duplicated in JSDoc
syntax for the JS client code. Although rewriting the static site
generation in Node makes SSG slower for a total build, the developer
experience can actually be faster because reloading Web Components is
fast.

Initially native Web Components were working well, but the value of
reactive state mounted. Lit Elements add reactive state to Web Component
standards, so that, unlike React, it's a small dependency, and requires
no build step.

The client UI itself is now a two sidebar layout, both attached to the
left of the viewport, each scrollable. One for filters, and one for
results, with content being loaded in the page body.

Pagefind performance improves a lot with this commit following the
adoption of an IntersectionObserver for loading data as discussed in
CloudCannon/pagefind#371. Further reductions to initialisation times
to follow with the proposed fixes in that issue.

Client JS now uses modules (as opposed to synchronous plain JS) in the
browser. This approach initially broke support for some older browsers,
specifically iOS 16.3 and its lack of support for import maps. To solve
this Vite is now being used to compile client code in a way that's
compatible with older browsers.

4 participants