epic: dictionary-based word-breakers 🔬 #12142

Draft
wants to merge 38 commits into
base: master

mcdurdin
Member

@mcdurdin mcdurdin commented Aug 9, 2024

No description provided.

jahorton and others added 21 commits August 9, 2024 09:40
Only wordbreaks anything AFTER the last space / ZWNJ.  Doesn't bother with anything before it.
…ordbreakers/dict-breaker-start' into change/common/models/wordbreakers/unit-test-trie-access
…/models/wordbreakers/fuse-dict-unmatched-chars
…ordbreakers/dict-breaker-start' into change/common/models/wordbreakers/unit-test-trie-access
…common/models/wordbreakers/fuse-dict-unmatched-chars
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Aug 9, 2024
mcdurdin and others added 3 commits August 22, 2024 08:19
…-breaker

chore: merge master into dict-breaker
…nto feat/common/models/wordbreakers/dict-breaker-start
…ordbreakers/dict-breaker-start' into change/common/models/wordbreakers/unit-test-trie-access
@jahorton
Contributor

Jotting down notes of lingering ideas before this goes on pause for a while:

Once we build a model for the user-dictionary data, we may want to hash the source data (or compute a similar fingerprint) and cache the compiled model alongside it. The data is subject to change as the user adds new contacts, etc., but if the hash shows that the source data has not changed, there's no need to rebuild the user-dictionary model.
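
A minimal sketch of that hash-and-cache idea, not a proposed implementation. Everything here is hypothetical: the `UserDictEntry` shape, the `UserDictModelCache` class, and the `compile` callback that stands in for the model compiler. The hash is a simple djb2-style string hash, used only for change detection.

```typescript
// Hypothetical shape for one user-dictionary entry (e.g. a contact name).
interface UserDictEntry {
  word: string;
  weight: number;
}

// djb2-style string hash (hash * 33 + code unit), kept within 32 bits.
// Non-cryptographic; it only needs to detect that the source changed.
function djb2(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Sort the serialized entries first so the fingerprint does not depend on
// the order in which the host app happens to report them.
function fingerprint(entries: UserDictEntry[]): string {
  const lines = entries.map((e) => e.word + "\t" + e.weight).sort();
  return djb2(lines.join("\n"));
}

// Rebuilds the model only when the fingerprint of the source data changes.
class UserDictModelCache<Model> {
  private cached: { hash: string; model: Model } | null = null;

  // `compile` is an assumed stand-in for whatever builds the user-dict model.
  constructor(private readonly compile: (entries: UserDictEntry[]) => Model) {}

  get(entries: UserDictEntry[]): Model {
    const hash = fingerprint(entries);
    if (this.cached !== null && this.cached.hash === hash) {
      return this.cached.model; // source unchanged: reuse the cached model
    }
    const model = this.compile(entries);
    this.cached = { hash, model };
    return model;
  }
}
```

A 32-bit hash can collide, so a real version would probably want a stronger digest or a fallback comparison; persisting the hash and model across sessions is also left out here.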

Potential design paths:

  • Rig up the model compiler within its own WebView or a separate worker; have the host app compile the user-dict model, then pass the result in.
    • We don't want the operation to be 'blocking'.
  • Build a JSON object or array that can be passed into the... compiler's thread... in order to build the user-dict model.
    • "compiler's thread": it could be the predictive-text worker, a separate worker, or even the main thread of a separate WebView. We haven't explicitly made a design decision here yet.
    • If it's essentially a JSON encoding of what would be a .tsv file, the JSON parse should be relatively simple and straightforward.
    • Since the model compiler is in TS, compiled down to JS... we do have to solve the issue of data transfer in one manner or other.
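
To make the "JSON encoding of what would be a .tsv file" option concrete, here is a rough sketch. The wire format (an array of `[word, count]` pairs), the two-column layout, the default count of 1, and both function names are assumptions for illustration, not an agreed design.

```typescript
// Hypothetical wire format: one [word, count] pair per row, mirroring a
// two-column tab-separated wordlist.
type WordlistRow = [string, number];

// Host side: turn TSV-style text into a JSON string that can be posted to
// whichever thread ends up running the compiler.
function tsvToJson(tsv: string): string {
  const rows: WordlistRow[] = [];
  for (const line of tsv.split(/\r?\n/)) {
    if (line.trim() === "") continue; // skip blank lines
    const columns = line.split("\t");
    // Assumption: a missing count column defaults to 1.
    const count = columns.length > 1 ? Number(columns[1]) : 1;
    rows.push([columns[0], count]);
  }
  return JSON.stringify(rows);
}

// Compiler side: parse and validate before handing rows to the compiler.
function parseWordlistJson(json: string): WordlistRow[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("expected an array of rows");
  }
  return data.map((row, i): WordlistRow => {
    const ok =
      Array.isArray(row) &&
      typeof row[0] === "string" &&
      typeof row[1] === "number" &&
      isFinite(row[1]);
    if (!ok) {
      throw new Error("row " + i + " is not a [word, count] pair");
    }
    return [row[0], row[1]];
  });
}
```

Because the payload is a plain JSON string, the same encoding should work whether it crosses a worker `postMessage` boundary or is injected into a WebView.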

@jahorton
Contributor

I got to wondering if there are any "relatively simple" ways to avoid spinning up a WebView to run the model-compiler, should we decide to keep the user-dictionary compilation completely separate from the keyboard.

After a bit of searching, I found nodejs-mobile (https://github.com/nodejs-mobile), a library for running Node-oriented JS scripts on mobile devices. That said, it would be a new dependency.

@jahorton
Contributor

Other notable thoughts:

We should probably not associate a language code with user-dictionary data. That is, we collate the data once and use that with any language supporting predictive-text.

My original strategy (as of #11994) was to blend the models into a single, "traversable" model.

  • This would require the standard lexical model for each language to implement the LexiconTraversal interface, which is not strictly required for all custom models.
    • We'd need an alternate strategy to support scenarios where a language-specific custom model lacks this feature.
  • Thinking ahead, we'd want a similar strategy to be in place once we start doing 'learning', which would adjust a model's probability data to better suit the user's actual typing patterns.

@mcdurdin previously suggested instead doing multiple correction-searches and picking the best from among their results after applying relative weighting. This would work, though it would also require support for multiple correction searches that does not yet exist.

  • The searches should likely share a single allotment of total execution time, which would probably require some form of load balancing.
  • We'd likely need the ability to pause and resume whichever search is currently returning 'more likely' paths at the time.
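
A rough sketch of how the multi-search approach might fit together, under stated assumptions. None of these names exist in the codebase: `Correction`, `PausableSearch`, `WeightedSearch`, and `mergeSearches` are hypothetical. It assumes each search returns its results best-first and can be resumed one result at a time, and it uses a shared step count as a stand-in for the shared execution-time allotment.

```typescript
// Hypothetical result shape: a suggested correction and its probability
// under the model that produced it.
interface Correction {
  text: string;
  p: number;
}

// Assumed contract: each call to next() resumes the search and returns its
// next-best correction, or null once the search is exhausted.
interface PausableSearch {
  next(): Correction | null;
}

interface WeightedSearch {
  weight: number; // relative weight applied to this model's probabilities
  search: PausableSearch;
}

// Repeatedly takes the pending result with the highest weighted probability,
// then resumes only the search that produced it. `maxSteps` caps the total
// number of results emitted across all searches.
function mergeSearches(searches: WeightedSearch[], maxSteps: number): Correction[] {
  // Prime each search with its first (best) result.
  const pending = searches.map((s) => s.search.next());
  const merged: Correction[] = [];

  for (let step = 0; step < maxSteps; step++) {
    let bestIndex = -1;
    let bestScore = -Infinity;
    let bestText = "";
    for (let i = 0; i < pending.length; i++) {
      const head = pending[i];
      if (head === null) continue; // this search is exhausted
      const score = searches[i].weight * head.p;
      if (score > bestScore) {
        bestIndex = i;
        bestScore = score;
        bestText = head.text;
      }
    }
    if (bestIndex < 0) break; // every search is exhausted
    merged.push({ text: bestText, p: bestScore });
    // The other searches stay paused until their pending result wins.
    pending[bestIndex] = searches[bestIndex].search.next();
  }
  return merged;
}

// Array-backed search, for illustration only.
function searchFromList(list: Correction[]): PausableSearch {
  let i = 0;
  return { next: () => (i < list.length ? list[i++] : null) };
}
```

In this sketch a search only does more work when its pending result is currently the most likely overall, which is one way to read the pause-and-resume point above; a real version would budget by elapsed time and would need actual pausable correction searches.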

@mcdurdin
Member Author

  • we do have to solve the issue of data transfer in one manner or other.

Data transfer into the WebView could be via a local file: or http: request. This opens up a number of extensibility questions for KeymanWeb itself, and for how we could make the web-based experience consistent with the Keyman Android/iOS app experience.

@mcdurdin mcdurdin modified the milestones: A18S9, A18S19 Aug 27, 2024
jahorton and others added 9 commits August 27, 2024 10:15
…ers/dict-breaker-start

feat(common/models/wordbreakers): begin development of dictionary-based wordbreaking algorithm 🔬
…akers/unit-test-trie-access

change(common/models/wordbreakers): allow wordbreaker tests to access TrieModel implementation 🔬
…ers/fuse-dict-unmatched-chars

feat(common/models/wordbreakers): fuse adjacent unmatched characters when dictionary-breaking 🔬
…-breaker

chore: merge master into dict-breaker 🔬
…-breaker

chore: merge master into dict-breaker 🔬
@github-actions github-actions bot added common/ and removed common/ labels Oct 11, 2024
@github-actions github-actions bot added common/ and removed common/ labels Oct 25, 2024
3 participants