Add Czech stemmer #22

Open
wants to merge 5 commits into
base: master

patrik-simunic-cz

  • Added the Czech stemming algorithm in SBL, copied from https://snowballstem.org
  • Compiled Czech SBL into Rust, added support for Czech
  • Updated the README file

@patrik-simunic-cz
Author

patrik-simunic-cz commented Apr 27, 2024

Update: fixed the incorrectly generated type of the Context fields (switched i32 for usize), as suggested in open PR #20

@elvircrn

elvircrn commented Apr 7, 2025

Is this project abandoned?

@patrik-simunic-cz
Author

@elvircrn pretty much 😅😅

I needed to add a stemmer here because the full-text search engine Tantivy depends on this package. But since it seems to be abandoned, I took inspiration from this project and created a new package, testuj-to/tantivy-stemmers, which incorporates many more languages and exposes them as Cargo features, so you don't have to compile and bundle any languages you don't need.

While the testuj-to/tantivy-stemmers crate is primarily intended for use with the Tantivy engine, it doesn't necessarily depend on it. I have seen someone use it without Tantivy: https://github.com/kemingy/tocken/blob/main/src/tokenizer.rs#L11

You can enable whichever language (algorithm) you need as a Cargo feature:

[dependencies]
tantivy-stemmers = { version = "0.4.0", features = ["default", "norwegian_bokmal"] }

and then you should be able to just import and use the stemming function directly from the ::algorithms namespace:

use tantivy_stemmers::algorithms::norwegian_bokmal as stem_norwegian;

fn main() {
    let input_phrase = "The quick brown fox jumps over the lazy dog";

    for word in input_phrase.split(" ") {
        let input = word.to_lowercase(); // Input must be a single word in lowercase

        /*
         * Pass the input as &str - all the algorithms are of type
         * pub type Algorithm = fn(&str) -> Cow<str>;
         */
        let output = stem_norwegian(input.as_str());

        println!("{}", output);
    }
}
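
Since every algorithm shares the `fn(&str) -> Cow<str>` shape quoted above, calling code can be written against that type rather than a specific language. The following is only a rough sketch of that idea: `toy_stem` and `stem_phrase` are hypothetical names made up for illustration, and `toy_stem` is a stand-in that just strips a trailing "s" so the example compiles without the crate. With the crate installed, one would presumably pass an imported algorithm such as `stem_norwegian` instead.

```rust
use std::borrow::Cow;

// Same shape as the alias quoted in the comment above, redefined locally
// so this sketch does not depend on the crate.
type Algorithm = fn(&str) -> Cow<'_, str>;

// Hypothetical stand-in for a real stemmer (NOT part of any crate):
// strips a single trailing "s" and otherwise returns the word unchanged.
fn toy_stem(word: &str) -> Cow<'_, str> {
    match word.strip_suffix('s') {
        Some(stem) if !stem.is_empty() => Cow::Borrowed(stem),
        _ => Cow::Borrowed(word),
    }
}

// Hypothetical helper: lowercases each whitespace-separated word and
// runs it through whichever algorithm the caller passes in.
fn stem_phrase(phrase: &str, stem: Algorithm) -> Vec<String> {
    phrase
        .split_whitespace()
        .map(|word| stem(&word.to_lowercase()).into_owned())
        .collect()
}

fn main() {
    let stems = stem_phrase("The quick brown foxes", toy_stem);
    println!("{:?}", stems);
}
```

Taking the algorithm as a plain function pointer keeps the helper independent of which language features were enabled at build time.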

@elvircrn

elvircrn commented Apr 8, 2025

Ah, that Cargo feature is definitely useful. I went ahead and added a missing language to both rust-stemmers and tantivy-stemmers, but thought that tantivy-stemmers had been abandoned in favor of rust-stemmers, since Tantivy depends on it. Seems like it's the other way around. :)

@elvircrn

elvircrn commented Apr 8, 2025

(thanks for the detailed answer)
