Add Czech stemmer #22

patrik-simunic-cz · 2024-04-27T20:44:12Z

Added the Czech stemming algorithm in SBL, copied from https://snowballstem.org
Compiled Czech SBL into Rust, added support for Czech
Updated the README file

patrik-simunic-cz · 2024-04-27T21:47:30Z

Update: fixed incorrectly generated Context fields type (switched i32 for usize) as per suggestion in an open PR: #20

elvircrn · 2025-04-07T22:15:00Z

Is this project abandoned?

patrik-simunic-cz · 2025-04-08T12:17:46Z

@elvircrn pretty much 😅😅

I needed to add a stemmer here because the full-text search engine Tantivy depends on this package. But since it seems to be abandoned, I took inspiration from this project and created a new package, testuj-to/tantivy-stemmers, that incorporates many more languages and exposes each of them as a Cargo feature, so you don't have to compile and bundle any languages you don't need.

While the testuj-to/tantivy-stemmers crate is primarily intended for use with the Tantivy engine, it doesn't necessarily depend on it. I have seen someone use it without Tantivy: https://github.com/kemingy/tocken/blob/main/src/tokenizer.rs#L11

You can install whatever language (algorithm) you need as a Cargo feature:

[dependencies]
tantivy-stemmers = { version = "0.4.0", features = ["default", "norwegian_bokmal"] }

and then you should be able to just import and use the stemming function directly from the ::algorithms namespace:

use tantivy_stemmers::algorithms::norwegian_bokmal as stem_norwegian;

fn main() {
    let input_phrase = "The quick brown fox jumps over the lazy dog";

    for word in input_phrase.split(" ") {
        let input = word.to_lowercase(); // Input must be a single word in lowercase

        /*
         * Pass the input as &str - all the algorithms are of type
         * pub type Algorithm = fn(&str) -> Cow<str>;
         */
        let output = stem_norwegian(input.as_str());

        println!("{}", output);
    }
}
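To illustrate the `fn(&str) -> Cow<str>` shape mentioned in the comment above without pulling in the crate, here is a minimal self-contained sketch. Note that `toy_stem` and `stem_phrase` are hypothetical names made up for this example, and the suffix rules are a toy, not any real Snowball algorithm. It only shows why a `Cow` return type is handy: the function can hand back a borrowed slice when it merely trims the word, and allocates only when it has to rewrite the ending.

```rust
use std::borrow::Cow;

// Same shape as the alias quoted above; redefined locally so the sketch
// compiles without the tantivy-stemmers crate.
type Algorithm = fn(&str) -> Cow<'_, str>;

// Hypothetical toy "stemmer" (NOT a real stemming algorithm):
// "-ies" becomes "-y", a trailing "s" is dropped, anything else is unchanged.
fn toy_stem(word: &str) -> Cow<'_, str> {
    if let Some(base) = word.strip_suffix("ies") {
        // Rewriting the ending needs a new allocation.
        Cow::Owned(format!("{base}y"))
    } else if let Some(base) = word.strip_suffix('s') {
        // Trimming can reuse the input buffer.
        Cow::Borrowed(base)
    } else {
        Cow::Borrowed(word)
    }
}

// Hypothetical helper: lowercase each word, then apply any function
// that matches the Algorithm signature.
fn stem_phrase(phrase: &str, stem: Algorithm) -> Vec<String> {
    phrase
        .split_whitespace()
        .map(|word| stem(&word.to_lowercase()).into_owned())
        .collect()
}

fn main() {
    let stems = stem_phrase("Ponies chase lazy dogs", toy_stem);
    println!("{:?}", stems);
}
```

Because `stem_phrase` only depends on the function-pointer type, any function with that signature could be passed in place of `toy_stem`.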

elvircrn · 2025-04-08T12:29:02Z

Ah, that Cargo feature is definitely useful. I went ahead and added a missing language to both rust-stemmers and tantivy-stemmers, but thought that tantivy-stemmers was abandoned in favor of rust-stemmers, since Tantivy depends on it. Seems like it's the other way around. :)

elvircrn · 2025-04-08T12:29:12Z

(thanks for the detailed answer)

patrik-simunic-cz added 4 commits April 27, 2024 20:45

Add Czech snowball stemmer

6ff487d

Add compiled Czech stemmer

9b4a326

Add Czech stemmer support

f7b9101

Update README (czech stemmer added)

13f7415

patrik-simunic-cz mentioned this pull request Apr 27, 2024

Add Czech stemmer #23

Open

Fix incorrect Context fields type

3a15be5
