Spellcheck all enabled scripts and improve URL detection #613

codokie · 2024-04-01T14:16:33Z

I made some changes to the spellcheck so that it checks enabled locales other than the active one if they are of different scripts. If they are of different script, there is no concern that the spellcheck would validate typos in written other locales.
If no locale could be determined, then the text consists of non-letters only (numbers, special characters, etc.) so it should not be spell checked.

The result is saved in the cache for quick retrieval since the spellcheck service may be called multiple times for the same word.
This change should also make it seem like the spellchecker can "remember which language a word was typed" (#93) even though the TextInfo has no attribute for the locale.

If the experimental setting "URL detection" is enabled, URLs are checked using a regex pattern Matcher to prevent them from being flagged as typos.
For some reason though, URLs containing '?', '!', or ';' are still truncated before they are sent to the spellchecker, so it marks the text after any of those symbols as a typo if its not a valid dictionary word.

This reverts commit 1c5d955.

codokie

1c5d955: Nevermind.. if mLocale is null then it is assigned a value. My bad.

Helium314 · 2024-04-03T18:49:25Z

Are the changes to URL detection and spell check dependent on each other?
If not, I would much prefer to have the changes split up into separate PRs.

codokie · 2024-04-03T19:29:17Z

@Helium314 the URL detection changes are minimal and were made solely for the spellchecker so that it won't mark URI as typos. There might be a branch conflict if I'll separate them.

Helium314 · 2024-04-03T20:00:18Z

Why actually check all enabled scripts? I remember somewhere you wrote that checking the whole text when adding a new word is already inefficient. Checking it in all scripts seems even less efficient than that.
When it comes to performance, you should also be very careful with regexes.

Btw it might be good to define what is the wanted behavior, because from your description I don't think I understand it.

the URL detection changes are minimal and were made solely for the spellchecker

I did not check the code yet, but it looks like your are changing RichInputConnection and InputLogic which are not related to the spell checker.

codokie · 2024-04-03T20:41:33Z

I know this approach is not exactly efficient, but spellchecking is not inherently a light task, and those who enable it should know it requires more processing power. That's why the regex is checked only for those who enable the experimental URL detection in settings. To make it efficient might require a real overhaul and I admit that I am not knowledgeable enough for that task.

The check for determining the locale from the script of the letters is O(n^2) at the worst case if it's a number or similar.
But assuming most people are only bilingual, this should not be too bad probably, and for actual words the script will be found within 2 iterations.

The wanted behavior is as follows: If I type in English and then switch to any subtype of a different script like Greek (for example), the words written in English should be checked in the English dictionary and not the Greek one.

It's just that the current implementation flags way too many legit words, and it's not that efficient either. It even checks numbers against the dictionary which is really a waste.

codokie · 2024-04-25T07:26:00Z

@Helium314 Are there any plans to include this in the upcoming version?
The current state of the spell checker is almost unusable for those who write in multiple scripts.
I think it has something to do with changes made in Android 14. Even numbers are spellchecked which is really weird..
I can revert the suggestion length limit but the URL detection is vital to prevent flagging URLs as typos.

codokie · 2024-04-26T13:54:14Z

I made another change to the spellchecker, it should now use the cache to avoid checking the dictionary when the word has been checked before.
This should also make it seem like it can "remember which language a word was typed" even though the TextInfo has no attribute for the locale.
@RHJihan could you please check whether this solves your old issue with the spellchecker #93?

RHJihan · 2024-04-28T07:45:42Z

@RHJihan could you please check whether this solves your old issue with the spellchecker #93?

Yes, it solves the problem.

Uranusek · 2024-05-23T09:36:51Z

Could you make changes to spellcheck so that it doesn't check single letters and names after @?

For example, when replying to someone on Discord or GitHub, their nick will be underlined and this is a bit irritating. Single letters are also underlined for some reason. Even capital letters of those that are already in the dictionary, e.g. O or Z.

codokie · 2024-05-23T14:49:29Z

@Uranusek I agree the spellchecker should not underline words starting with @
However I have a feeling there may be more special use cases that it does not cover, if you think of any other sensible rule to add, please let me know as that could really help make this feature useful & accurate. Thanks

codokie · 2024-05-23T15:04:08Z

Single letters are also underlined for some reason. Even capital letters of those that are already in the dictionary, e.g. O or Z.

With the current changes in this PR, words with only 1 letter will not be reported as typos, so this issue should be fixed.

Uranusek · 2024-05-24T06:47:13Z

Oh, that's great, thanks! I think spellchecker shouldn't underline email addresses either.

codokie · 2024-05-26T07:16:34Z

spellchecker shouldn't underline email addresses

That should also not happen anymore when this PR is merged, but only if you enable the "Advanced URL detection" (disabled by default) setting.
I don't know how bad is the performance impact of regex pattern matching on low-end phones so it may be best to associate this feature with an experimental setting.

An alternative solution which should barely impact performance is to whitelist all words containing an at sign '@', but it would mean that this, for example, would not be considered a typo: mail@example,,,org

Helium314 · 2024-07-23T20:26:45Z

I finally had a look at this PR, and was reminded why I had postponed this several times. Sorry for this story again, but it really bothers me.
(you even wrote "URLs, numbers and special characters are checked against the dictionary as well. Unsurprisingly, they are then flagged as typos (red underlined). But that's for another issue..")

This should be at least 4 PRs:

check in all enabled scripts
improve URL detection
change cache usage (avoid underlining previously enabled languages)
at least one more PR for the other changes (like treatment of short words, or separating words on dash, which btw I didn't find mentioned anywhere, it just appears in the code)

Advantages:

less time I need to spend for review, because I only need to understand a single change instead of deciphering which change belongs where
faster review, because I can read a short PR in much less time than a long one (time I need to understand grows more than linear with length)
increased probablility I'm actually looking at PRs instead of thinking "not another one of those things, will check later"
I can relatively simply revert an individual change if it turns out to be breaking stuff / introducing bad behavior

Now for the actual code stuff:
I think it has potential for issues (inconsistencies with cache and when things in other languages are underlined or not)
but overall it's an improvement.

Some comments / questions in the code section.

I don't know how bad is the performance impact of regex pattern matching on low-end phones

Performance is fine.

Helium314 · 2024-07-23T20:27:41Z

app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidWordLevelSpellCheckerSession.java

        }
    }

+    // Finds the enabled locale with a script matching one of the letters in the given text.
+    public Locale findLikelyLocaleOfText(final String text) {


The name is somewhat misleading, better rename to reflect what it actualy does: find a locale with matching script

Right.. maybe it should be moved out of the AndroidWordLevelSpellCheckerSession class too because it is not specifically related to the spellchecker except for the local variable localesToCheck (which can be passed as an argument)

I think it's not necessary to move it, but I definitely agree with your reasoning. Possibly could be moved to ScriptUtils or maybe LocaleUtils.

Helium314 · 2024-07-23T20:31:08Z

app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidWordLevelSpellCheckerSession.java

@@ -47,6 +48,8 @@ public abstract class AndroidWordLevelSpellCheckerSession extends Session {

    public final static String[] EMPTY_STRING_ARRAY = new String[0];

+    public final static int FLAG_UNCHECKABLE = 0;


This goes into SuggestionsInfo as suggestionsAttributes, but in the documentation I can't find anything referring to uncheckable. 0 just seems to be when none of the RESULT_ATTR_[...] applies.

ah I may have overlooked that, although a 0 flag is already used elsewhere in the code:
https://github.com/Helium314/HeliBoard/blame/136b45880e6c3aadb6d9e1b68f90851b1fbb94e7/app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidSpellCheckerSession.java#L72
so it might need to be changed as well.
if we don't know what is the fallback behavior in the absence of flag then it would probably be wrong to name it like that with a constant

The comment above says "Neither RESULT_ATTR_IN_THE_DICTIONARY nor RESULT_ATTR_LOOKS_LIKE_TYPO", and this also fits with documentation, which implies that 0 means none of the flags.
There is no information regarding 0 meaning "uncheckable".

If you had spent more time to understand how the code works, you could have inferred that cached suggestions with FLAG_UNCHECKABLE just return AndroidSpellCheckerService.getNotInDictEmptySuggestions(false), so this a flag of 0 is not returned to/handled by the native code/another class.
Essentially this constant could be named however we want as long as its purpose is clear.
The only change I would make is visibility-related: turn it to a private constant.

Helium314 · 2024-07-23T20:38:00Z

app/src/main/java/helium314/keyboard/latin/spellcheck/AndroidWordLevelSpellCheckerSession.java

@@ -171,58 +205,31 @@ public void onClose() {
    }

    private static final int CHECKABILITY_CHECKABLE = 0;
-    private static final int CHECKABILITY_TOO_MANY_NON_LETTERS = 1;


Why is this removed?

I would like to be notified if I wrote "My phone number is0123456789" <- missed a space after "is"

The problem I see here is that this might not be wanted by some people (it might have been introduced for a reason), and with the current style of PR I can't just simply revert this change if it turns out clearly unpopular.

codokie · 2024-07-24T08:23:58Z

@Helium314 thanks for the review, you've raised some good points.

Admittedly I'm somewhat lazy and don't like fixing merge conflicts so that's why it's all wound up in a single PR.
Even if it's bad practice, I don't think it's uncommon to do a overhaul of a single feature (in this case, the spellchecker) in one PR. Especially if only about 100 lines of code were modified.

As you know from my other PRs I kind of lost interest with the project and have less time currently so I don't know when/if to rework this PR. Sorry about that..

If you would still like to merge this PR (you could also split it as you had suggested), I'll do my best to answer (hopefully not too many) questions here (within reasonable time).

Helium314 · 2024-08-25T15:30:24Z

Even if it's bad practice, I don't think it's uncommon to do a overhaul of a single feature (in this case, the spellchecker) in one PR.

Such an overhaul can be done, but needs to be in a better discussed and documented way. Things like removing CHECKABILITY_TOO_MANY_NON_LETTERS is not in the linked issue, and not in the description of your PR. I wouldn't be surprised if this change is disliked by a bunch of users.
Reviewing and understanding changes to code I have barely read once (which is the case for the spell checker) is quite time consuming, and even more so when you find undocumented extras and you have to decipher whether they are actually undocumented changes, or just related to something you don't yet understand. With my current understanding of the spell checker, I never would have agreed to a general overhaul in a single PR. And even with more knowledge, such things should be first discussed with other community members (personally I don't really care because I don't use the spell checker).

If you would still like to merge this PR

I will merge it with some smaller changes as mentioned above.

codokie · 2024-08-26T09:35:42Z

Things like removing CHECKABILITY_TOO_MANY_NON_LETTERS

What I suggested is also the behavior of some spellcheckers like the one present in Microsoft Word.
No one seems to be upset that currently the spellchecker labels numbers as typos, so I highly doubt you will see someone complain if a mix of letters and numbers will be labeled as a typo.

The reason why your fork of OpenBoard interested me is because you added some neat features to it. It seems that (perhaps due to time constraints) you are no longer focused on renovating the keyboard anymore, only to maintain it. That's understandable, but you did not convince me that my ideas are not desirable by others.

I had spent some considerable time debugging the spellchecker and I think I have more than a basic understanding of how it works. I assure you that the person who originally added this functionality did not get it completely right (not trying to claim that I know much better.)

I won't be participating in this PR anymore because I do not think that your review is open-minded and you will likely heavily change things you do not really understand or able to test.
So feel free to alter my undecipherable code as you see fit, or to close the PR if you feel offended (not exactly my intention)

BlackyHawky · 2024-08-26T09:59:24Z

(My comment below is off-topic so I'm hiding it by default.)

Read my comment

It's so sad to read this comment, even though I can totally understand it. ☹️

@Helium314
I hope you'll find a compromise so that @codokie can work on this application again.
His ideas need to be exploited, and creating a developer group may be a good idea so that you can freely discuss future improvements.
I think 2, 3 or 4 people maximum to manage the application might be a good idea.

I hope this will be taken into consideration.

codokie added 3 commits April 1, 2024 15:43

spellcheck all enabled scripts and improve url detection

bac48f2

don't add null locale to list

1c5d955

Revert "don't add null locale to list"

16cc84a

This reverts commit 1c5d955.

codokie commented Apr 1, 2024

View reviewed changes

codokie and others added 4 commits April 11, 2024 01:13

replace constant of suggestion limit

0f4e1cf

Merge branch 'Helium314:main' into spellcheck-scripts-url

1945723

fix checking words containing a period.

a6906e2

properly check words with word connectors

21e4c1e

codokie and others added 4 commits April 26, 2024 16:45

Return quickly when suggestions are cached

a710b95

Condition 'null == keyboard' is always 'false'

0010191

Revert changes not related to the spellchecker

595e6d1

Merge branch 'Helium314:main' into spellcheck-scripts-url

0520491

code rearrangement

6b569de

change condition to check word with apostrophe correctly

f38bd5d

don't check words that start with @

010196a

Helium314 reviewed Jul 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spellcheck all enabled scripts and improve URL detection #613

Spellcheck all enabled scripts and improve URL detection #613

codokie commented Apr 1, 2024 •

edited

Loading

codokie left a comment •

edited

Loading

Helium314 commented Apr 3, 2024

codokie commented Apr 3, 2024

Helium314 commented Apr 3, 2024

codokie commented Apr 3, 2024

codokie commented Apr 25, 2024

codokie commented Apr 26, 2024

RHJihan commented Apr 28, 2024

Uranusek commented May 23, 2024

codokie commented May 23, 2024

codokie commented May 23, 2024 •

edited

Loading

Uranusek commented May 24, 2024

codokie commented May 26, 2024

Helium314 commented Jul 23, 2024

Helium314 Jul 23, 2024

codokie Jul 24, 2024

Helium314 Aug 25, 2024

Helium314 Jul 23, 2024

codokie Jul 24, 2024

Helium314 Aug 25, 2024

codokie Aug 26, 2024 •

edited

Loading

Helium314 Jul 23, 2024

codokie Jul 24, 2024

Helium314 Aug 25, 2024

codokie commented Jul 24, 2024 •

edited

Loading

Helium314 commented Aug 25, 2024

codokie commented Aug 26, 2024

BlackyHawky commented Aug 26, 2024

		@@ -47,6 +48,8 @@ public abstract class AndroidWordLevelSpellCheckerSession extends Session {

		public final static String[] EMPTY_STRING_ARRAY = new String[0];

		public final static int FLAG_UNCHECKABLE = 0;

Spellcheck all enabled scripts and improve URL detection #613

Are you sure you want to change the base?

Spellcheck all enabled scripts and improve URL detection #613

Conversation

codokie commented Apr 1, 2024 • edited Loading

codokie left a comment • edited Loading

Choose a reason for hiding this comment

Helium314 commented Apr 3, 2024

codokie commented Apr 3, 2024

Helium314 commented Apr 3, 2024

codokie commented Apr 3, 2024

codokie commented Apr 25, 2024

codokie commented Apr 26, 2024

RHJihan commented Apr 28, 2024

Uranusek commented May 23, 2024

codokie commented May 23, 2024

codokie commented May 23, 2024 • edited Loading

Uranusek commented May 24, 2024

codokie commented May 26, 2024

Helium314 commented Jul 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codokie Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codokie commented Jul 24, 2024 • edited Loading

Helium314 commented Aug 25, 2024

codokie commented Aug 26, 2024

BlackyHawky commented Aug 26, 2024

codokie commented Apr 1, 2024 •

edited

Loading

codokie left a comment •

edited

Loading

codokie commented May 23, 2024 •

edited

Loading

codokie Aug 26, 2024 •

edited

Loading

codokie commented Jul 24, 2024 •

edited

Loading