Add Czech stemmer #22
patrik-simunic-cz
commented
Apr 27, 2024
- Added the Czech stemming algorithm in SBL, copied from https://snowballstem.org
- Compiled Czech SBL into Rust, added support for Czech
- Updated the README file
Update: fixed incorrectly generated |
Is this project abandoned? |
@elvircrn pretty much 😅😅 I needed to add a stemmer here because the full-text search engine Tantivy depends on this package. But since it seems to be abandoned, I took inspiration from this project and created a new package, testuj-to/tantivy-stemmers, that incorporates many more languages and exposes them as Cargo features, so you don't have to compile and bundle any languages you don't need.

While the testuj-to/tantivy-stemmers crate is primarily intended for use with the Tantivy engine, it's not necessarily dependent on it. I have seen that someone has used it without Tantivy: https://github.com/kemingy/tocken/blob/main/src/tokenizer.rs#L11

You can install whatever language (algorithm) you need as a Cargo feature:

    [dependencies]
    tantivy-stemmers = { version = "0.4.0", features = ["default", "norwegian_bokmal"] }

and then you should be able to import and use the stemming function directly from the crate:

    use tantivy_stemmers::algorithms::norwegian_bokmal as stem_norwegian;

    fn main() {
        let input_phrase = "The quick brown fox jumps over the lazy dog";
        for word in input_phrase.split_whitespace() {
            // Input must be a single word in lowercase
            let input = word.to_lowercase();
            /*
             * Pass the input as &str - all the algorithms are of type
             * pub type Algorithm = fn(&str) -> Cow<str>;
             */
            let output = stem_norwegian(input.as_str());
            println!("{}", output);
        }
    }
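For anyone who wants to try the calling convention without pulling in the crate, here is a minimal, dependency-free sketch built around the `Algorithm` signature quoted above. `toy_stem` and `stem_phrase` are hypothetical names made up for illustration; `toy_stem` only strips a trailing "s" and is not one of the Snowball algorithms shipped by either crate.

```rust
use std::borrow::Cow;

// Same shape as the alias quoted in the comment above.
type Algorithm = fn(&str) -> Cow<'_, str>;

// Hypothetical stand-in stemmer (an assumption for this sketch, not a real
// algorithm from any crate): strips one trailing "s" from the word.
fn toy_stem(word: &str) -> Cow<'_, str> {
    match word.strip_suffix('s') {
        Some(stripped) if !stripped.is_empty() => Cow::Borrowed(stripped),
        _ => Cow::Borrowed(word),
    }
}

// Lowercases each whitespace-separated word and runs it through `algorithm`.
fn stem_phrase(phrase: &str, algorithm: Algorithm) -> Vec<String> {
    phrase
        .split_whitespace()
        .map(|word| {
            let lowered = word.to_lowercase();
            algorithm(lowered.as_str()).into_owned()
        })
        .collect()
}

fn main() {
    let stems = stem_phrase("The quick brown fox jumps over the lazy dog", toy_stem);
    println!("{}", stems.join(" "));
}
```

Because every algorithm shares one function-pointer type, swapping `toy_stem` for a real stemming function with the same signature should need no other changes to `stem_phrase`.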
Ah, that Cargo feature is definitely useful. I went ahead and added a missing language to both rust-stemmers and tantivy-stemmers, but thought that tantivy-stemmers had been abandoned in favor of rust-stemmers, since Tantivy depends on it. Seems like it's the other way around. :) |
(thanks for the detailed answer) |