Segmenter results for thai sentence seems incorrect. #3208

riajain0412 · 2023-03-20T07:29:20Z


I ran the code below to check the word breakpoints for a Thai sentence:

#include <stdio.h>
#include <iostream>
#include <string_view>
#include "include/ICU4XDataProvider.hpp"
#include "include/ICU4XWordSegmenter.hpp"

using namespace std;

void print_ruler(size_t str_len) {
    for (size_t i = 0; i < str_len; i++) {
        if (i % 10 == 0) {
            cout << "0";
        } else if (i % 5 == 0) {
            cout << "5";
        } else {
            cout << ".";
        }
    }
    cout << endl;
}

template <typename Iterator>
void iterate_breakpoints(Iterator& iterator) {
    while (true) {
        int32_t breakpoint = iterator.next();
        if (breakpoint == -1) {
            break;
        }
        cout << " here " << breakpoint;
    }
    cout << endl;
}

int main() {
    const auto provider = ICU4XDataProvider::create_test();
    const auto segmenter_auto =
        ICU4XWordSegmenter::create_auto(provider).ok().value();
    const auto segmenter_lstm =
        ICU4XWordSegmenter::create_lstm(provider).ok().value();
    const auto segmenter_dictionary =
        ICU4XWordSegmenter::create_dictionary(provider).ok().value();

    const ICU4XWordSegmenter* segmenters[] = {&segmenter_auto, &segmenter_lstm,
                                              &segmenter_dictionary};

    std::string_view str;

    str = "และความมั่นคงแห่งตัวตน";

    for (const auto* segmenter : segmenters) {
        cout << "Finding word breakpoints in string:" << endl << str << endl;
        print_ruler(str.size());

        cout << "Word breakpoints:";
        auto iterator = segmenter->segment_utf8(str);
        iterate_breakpoints(iterator);
    }
}

It gave this result: 0 9 21 39 51 60 66.

However, the Thai sentence above has only 17-18 characters, so how come the ICU4X segmenter reports 39, 51, 60, etc. as breakpoints?

Is this the expected result? If so, how should I interpret these indices?

sffc · 2023-03-20T08:02:58Z

Maintainer

The breakpoints are UTF-8 indices: byte offsets into the UTF-8 encoded string, not character counts.
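
To make the numbers concrete: every Thai letter and mark in this string takes 3 bytes in UTF-8, so the 66-byte string holds 22 code points, and the offsets 0 9 21 39 51 60 66 correspond to code point indices 0 3 7 13 17 20 22. Four of those code points are combining marks, which is consistent with counting roughly 18 visible characters. The following is a minimal sketch of that conversion using only the standard library. The helper names `code_point_index` and `to_code_point_indices` are made up for illustration and are not part of ICU4X, and the code assumes the input is valid UTF-8 and that each offset falls on a code point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: converts a UTF-8 byte offset into a code point index
// by counting the non-continuation (lead) bytes that precede the offset.
// Assumes `text` is valid UTF-8 and `byte_offset` is on a code point boundary.
std::size_t code_point_index(std::string_view text, std::size_t byte_offset) {
    assert(byte_offset <= text.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        if (!is_continuation_byte(static_cast<unsigned char>(text[i]))) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: maps a list of byte offsets (for example, the values
// returned by the segmenter's iterator) to code point indices.
std::vector<std::size_t> to_code_point_indices(
    std::string_view text, const std::vector<std::size_t>& byte_offsets) {
    std::vector<std::size_t> result;
    result.reserve(byte_offsets.size());
    for (std::size_t offset : byte_offsets) {
        result.push_back(code_point_index(text, offset));
    }
    return result;
}
```

Each call rescans the string from the start, which is fine for short inputs. For long text, a single pass that walks the string and the sorted offsets together would avoid the repeated work.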

8 replies

sffc Mar 20, 2023
Maintainer

Don't use segment_latin1 unless your input is Latin-1, which is not possible if your string contains Thai letters.

Both Rust and C++ use UTF-8 indices. There is nothing unusual about them.

riajain0412 Mar 21, 2023
Author

Okay, got it. Thank you.

I have one more question: does ICU4X provide converter functionality so that I can convert a UnicodeString to UTF-8? I checked your repo and docs but was unable to find any.

Manishearth Mar 21, 2023
Maintainer

UnicodeString is not a standard library type. Whose UnicodeString are you talking about? I would expect that API to expose the relevant conversion functions.

ICU4C's UnicodeString is UTF-16, so you can just use the segment_utf16 API here.

riajain0412 Mar 22, 2023
Author

Yes, but the segment_utf16 function does not accept a UnicodeString as an argument.

sffc Mar 22, 2023
Maintainer

If you're using ICU4C UnicodeString, you can find functions in unistr.h to convert it to UTF-8. Alternatively, you can get the char16_t* representation of your UnicodeString and pass it to ICU4X's segment_utf16 function.
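
For illustration only, here is roughly what such a UTF-16 to UTF-8 conversion does under the hood, written with just the standard library. This is a sketch, not ICU4C or ICU4X code: `utf16_to_utf8` and `append_utf8` are hypothetical names, and `std::u16string_view` stands in for the UTF-16 buffer of a UnicodeString. With ICU4C available, the functions in unistr.h are the simpler route.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: appends the UTF-8 encoding of one code point to `out`.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: converts UTF-16 code units to UTF-8.
// Surrogate pairs are combined into one code point; unpaired surrogates
// are replaced with U+FFFD (the replacement character).
std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    out.reserve(in.size() * 3);
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Note that the two encodings index the same text differently: a Thai letter is one UTF-16 code unit but three UTF-8 bytes, so breakpoints obtained from UTF-8 input cannot be applied directly to a UTF-16 buffer, or the other way around.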

riajain0412 · 2023-03-28T09:10:58Z

Author

Okay. Thank you for your help.
I have another question: does ICU4X offer segmenter functionality with an offset, similar to ICU4C, where we can specify an offset and get the boundary after it?

0 replies
