-
-
Notifications
You must be signed in to change notification settings - Fork 14
Segmenter using ICU4C #461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ready for review. It will add Segmenter tests in ICU4C in 3 ICU versions.
Note that there's a difference between ICU4C and ICU4J results in the line break for Japanese.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
json_object *options_obj = json_object_object_get(json_in, "options"); | ||
|
||
json_object *granularity_obj = json_object_object_get(options_obj, "granularity"); | ||
string granularity_value = json_object_get_string(granularity_obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: granularity_value
is an enum, but we're leaving it as a C++ string
. We do properly convert enum values into Java enums in the ICU4J executor. Not for this PR but for the future, it would be better to convert enums into C++ enums.
Co-authored-by: Elango Cheran <[email protected]>
Also adds some test case hacks to the generator to make some LINE segmentation pass. However, it doesn't pass with Chinese, Japanese, or Korean.
Note that the generator could be implemented with ICU4J to get better expected data.
Also, this depends on PR #458 , so cannot be submitted before that PR.