Some word segmentation results are different than we get in ICU4C #3522

riajain0412 · 2023-06-10T08:21:30Z

I was comparing text segmentation results between ICU4C and ICU4X for Southeast Asian (SEA) languages and found some discrepancies. Here are a few strings that produce different results in ICU4X and ICU4C:

Khmer string មនុស្សទាំងអស់: ICU4X gives a breakpoint at index 13, while ICU4C gives 6
Lao string ຮ່ສົ່ສີ: 5 in ICU4C, 7 in ICU4X
Thai string กระเพรา: 3 in ICU4C, 7 in ICU4X
and many other strings

Are these results expected?

I'm using the full data blob with all keys and locales.

sffc · 2023-06-12T06:44:29Z · Maintainer

Which LSTM constructors are you using? Please verify that you can reproduce these results with the dictionary constructors.

3 replies

riajain0412 Jun 12, 2023
Author

I'm doing this with dictionary constructors.

sffc Jun 12, 2023
Maintainer

OK, dictionary behavior is intended to be the same between ICU4C and ICU4X. @makotokato or @aethanyc can you reproduce?

aethanyc Jun 12, 2023
Maintainer

I couldn't reproduce this in ICU4X. Here are my test results for the word break points:

Testcase 1 Khmer string មនុស្សទាំងអស់, I got [0, 6, 13].
Testcase 2 Lao string ຮ່ສົ່ສີ, I got [0, 3, 4, 6, 7].
Testcase 3 Thai string กระเพรา, I got [0, 3, 7].
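To make break-point lists like these easier to compare by eye, the indices can be turned back into the text pieces they delimit. The following is a minimal sketch using only the Rust standard library. It assumes the indices are UTF-16 code unit offsets (the test quoted in this thread uses segment_utf16, and the last index in each list equals the string's UTF-16 length). The function name split_at_utf16_breaks is hypothetical and not part of ICU4X.

```rust
/// Splits `text` at the given UTF-16 break indices and returns the pieces.
/// Returns `None` if the indices are not ascending, run past the end of the
/// text, or fall inside a surrogate pair.
/// NOTE: hypothetical helper for illustration, not an ICU4X API.
fn split_at_utf16_breaks(text: &str, breaks: &[usize]) -> Option<Vec<String>> {
    let units: Vec<u16> = text.encode_utf16().collect();
    let mut pieces = Vec::new();
    // Each adjacent pair of break points delimits one segment.
    for pair in breaks.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        if start > end || end > units.len() {
            return None;
        }
        // Decoding fails only if a break splits a surrogate pair.
        pieces.push(String::from_utf16(&units[start..end]).ok()?);
    }
    Some(pieces)
}
```

With the break points [0, 6, 13] for the Khmer string, this returns two pieces: the first six code units and the remaining seven. A list that holds only 0 and the string length returns the whole string as a single piece.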

I produced these results by using WordSegmenter::try_new_dictionary_with_buffer_provider and replacing s in

icu4x/components/segmenter/tests/complex_word.rs

Lines 57 to 71 in a8ef673

    
    #[test]
    fn word_break_my() {
        let segmenter =
            WordSegmenter::try_new_auto_with_buffer_provider(&get_segmenter_testdata_provider())
                .expect("Data exists");
        let s = "မြန်မာစာမြန်မာစာမြန်မာစာ";
        let utf16: Vec<u16> = s.encode_utf16().collect();
        let iter = segmenter.segment_utf16(&utf16);
        assert_eq!(
            iter.collect::<Vec<usize>>(),
            vec![0, 8, 16, 22, 24],
            "word segmenter with Burmese"
        );
    }

Then I ran cargo test --all-features word_break_my under components/segmenter in the icu4x repo.

If index 13 in testcase 1 and index 7 in testcase 2 are the only word break points in the results (other than index 0), I guess the dictionary data for Khmer or Lao is probably not loaded.
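One way to check for this symptom is to test whether a result contains any break point other than the start and the end of the string. The sketch below uses only the Rust standard library and assumes the indices are UTF-16 code unit offsets. The function name has_only_trivial_breaks is made up for illustration and is not an ICU4X API.

```rust
/// Returns true when every entry in `breaks` is either 0 or the length of
/// `text` in UTF-16 code units, i.e. no break was found inside the string.
/// An empty `breaks` slice also counts as trivial.
/// NOTE: hypothetical helper for illustration, not an ICU4X API.
fn has_only_trivial_breaks(text: &str, breaks: &[usize]) -> bool {
    let utf16_len = text.encode_utf16().count();
    breaks.iter().all(|&b| b == 0 || b == utf16_len)
}
```

A true result for text that should contain several words is consistent with the guess that the dictionary data was not loaded, though it does not prove it. For example, the Thai string กระเพรา is 7 UTF-16 code units long, so a result of [0, 7] is trivial while [0, 3, 7] is not.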

riajain0412 · 2023-06-13T04:49:37Z · Author

I'm loading the full data blob in my C++ code. How can I confirm whether the dictionaries are loaded?

Also, which keys are needed for the word segmenter? I was trying to create a data blob for a dictionary-based word segmenter for SEA languages only. I'm including only segmenter/word@1 and segmenter/dictionary/wl_ext@1.

19 replies

robertbastian Jun 14, 2023
Maintainer

Ah of course, https://unicode-org.github.io/icu4x/docs/ffi/cpp/logging_ffi.html

riajain0412 Jun 15, 2023
Author

Okay, I'll try this out. Thank you.

riajain0412 Jun 16, 2023
Author

@robertbastian, is the cargo command the only way to create icu_capi_staticlib?

robertbastian Jun 16, 2023
Maintainer

Cargo is the Rust package manager. You don't need it; you can call rustc directly, but then you either need some other package manager or have to compile every dependency manually.

riajain0412 Jun 19, 2023
Author

Oh, okay. Thank you for your help.
