Segmenter using ICU4C #461

Merged
merged 21 commits into from
Jun 4, 2025

Conversation

sven-oly
Collaborator

Also adds some test-case workarounds to the generator so that some LINE segmentation tests pass. However, LINE segmentation still does not pass for Chinese, Japanese, or Korean.

Note that the generator could be implemented with ICU4J to get better expected data.

Also, this depends on PR #458, so it cannot be submitted before that PR.

sven-oly added 10 commits May 14, 2025 17:01
sven-oly added 9 commits May 22, 2025 16:46
@sven-oly sven-oly self-assigned this Jun 3, 2025
@sven-oly sven-oly added the enhancement New feature or request label Jun 3, 2025
Collaborator Author

@sven-oly sven-oly left a comment

This is ready for review. It adds Segmenter tests for ICU4C in 3 ICU versions.

Note that there's a difference between ICU4C and ICU4J results in the line break for Japanese.

Collaborator

@echeran echeran left a comment

LGTM

json_object *options_obj = json_object_object_get(json_in, "options");

json_object *granularity_obj = json_object_object_get(options_obj, "granularity");
string granularity_value = json_object_get_string(granularity_obj);
Collaborator

Observation: granularity_value is an enum, but we're leaving it as a C++ string. We do properly convert enum values into Java enums in the ICU4J executor. Not for this PR but for the future, it would be better to convert enums into C++ enums.
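
A minimal sketch of what that future enum conversion could look like, offered only as an illustration and not as code from this PR. The `Granularity` enum, the `ParseGranularity` helper, and the set of accepted strings ("grapheme", "word", "sentence", "line") are all assumptions made for the example; the executor's real option values may differ.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum for the "granularity" option. The set of values is an
// assumption for illustration, not taken from the executor source.
enum class Granularity { kGrapheme, kWord, kSentence, kLine };

// Hypothetical helper: converts the raw option string into the enum.
// Returns std::nullopt when the pointer is null (for example, when the
// JSON key is missing) or when the value is not recognized.
std::optional<Granularity> ParseGranularity(const char* raw) {
  if (raw == nullptr) {
    return std::nullopt;
  }
  const std::string_view value(raw);
  if (value == "grapheme") return Granularity::kGrapheme;
  if (value == "word") return Granularity::kWord;
  if (value == "sentence") return Granularity::kSentence;
  if (value == "line") return Granularity::kLine;
  return std::nullopt;
}
```

Under these assumptions, the result of `json_object_get_string(granularity_obj)` could be passed straight to `ParseGranularity`, and the caller could report an unsupported option when the result is empty instead of comparing strings later on. Taking a `const char*` rather than a `std::string` also avoids constructing a string from a null pointer when the key is absent.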

Co-authored-by: Elango Cheran <[email protected]>
@sven-oly sven-oly merged commit d8476cc into unicode-org:main Jun 4, 2025
8 checks passed
@sven-oly sven-oly deleted the segmenter_cpp branch July 17, 2025 23:49
3 participants