[TextTransform] Char n-grams are different when using with/without word n-grams. #530

Closed
@zeahmed

Description

System information

Not relevant

Issue

When TextTransform is used to compute n-grams, the character n-grams it produces differ depending on whether the word n-gram option is also enabled. For example, take the following sentence:

This is a cat

The character n-grams produced are as follows, where <STX>, <ETX> and <SP> are the start-of-sentence, end-of-sentence and space control characters, respectively.

Char 3-gram      Char 3-gram + Word 3-gram
<STX>|t|h        <STX>|t|h
t|h|i            t|h|i
h|i|s            h|i|s
i|s|<SP>         i|s|<ETX>
s|<SP>|i         s|<ETX>|<STX>
<SP>|i|s         <ETX>|<STX>|i
s|<SP>|a         <STX>|i|s
<SP>|a|<SP>      <ETX>|<STX>|a
a|<SP>|c         <STX>|a|<ETX>
<SP>|c|a         a|<ETX>|<STX>
c|a|t            <ETX>|<STX>|c
a|t|<ETX>        <STX>|c|a
-                c|a|t
-                a|t|<ETX>

Source code / logs

The cause of the problem is the word tokenizer, which is applied at the following location in the code.

if (tparams.NeedsWordTokenizationTransform)

The NeedsWordTokenizationTransform property is set according to the following criteria:

public bool NeedsWordTokenizationTransform { get { return WordExtractorFactory != null || NeedsRemoveStopwordsTransform || OutputTextTokens; } }

This means that whenever word n-grams are computed, tokenization is performed first, and the character n-gram extractor then computes n-grams on individual words instead of on the whole sentence, i.e.

instead of computing char n-grams on

<STX>This<SP>is<SP>a<SP>cat<ETX>

it computes char n-grams on

<STX>This<ETX>
<STX>is<ETX>
<STX>a<ETX>
<STX>cat<ETX>
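
The difference can be reproduced outside the library with a minimal sketch. It is written in Python for brevity and is not the ML.NET code; the function names are hypothetical. Two things are assumed from the table above rather than from the source: the text is lowercased before extraction, and the per-word sequences are concatenated into one stream, which is where n-grams such as s|<ETX>|<STX> come from. Like the table, it lists each distinct 3-gram once, in order of first appearance.

```python
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"


def char_ngrams(symbols, n=3):
    """Return the distinct n-grams of a symbol sequence, in order of first appearance."""
    seen = []
    for i in range(len(symbols) - n + 1):
        gram = "|".join(symbols[i:i + n])
        if gram not in seen:
            seen.append(gram)
    return seen


def sentence_symbols(text):
    # One <STX>/<ETX> pair around the whole sentence; spaces become <SP>.
    return [STX] + [SP if c == " " else c for c in text.lower()] + [ETX]


def tokenized_symbols(text):
    # One <STX>/<ETX> pair around every word, concatenated into one stream
    # (assumed from the table; the spaces are lost at this point).
    symbols = []
    for word in text.lower().split():
        symbols += [STX, *word, ETX]
    return symbols


sentence = "This is a cat"
without_words = char_ngrams(sentence_symbols(sentence))
with_words = char_ngrams(tokenized_symbols(sentence))

print(len(without_words), len(with_words))
print(sorted(set(with_words) - set(without_words)))
```

Under these assumptions the two lists match the two columns of the table: every n-gram containing <SP> disappears once the text is tokenized, and n-grams spanning an <ETX>|<STX> word boundary appear instead.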

First of all, is this the expected behavior?
In my view it is NOT: computed this way, the character n-grams add noise and lose important information about the sentence, information which in some cases may give superior performance.

Solution

Apply the char n-gram extractor to an IDataView that has not gone through word processing (tokenization) in the code.
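
The idea can be illustrated with a small sketch, again in Python and not the actual TextTransform implementation. The names `featurize` and `ngrams` are hypothetical, and lowercasing stands in for whatever normalization the transform applies. The point is only the data flow: the character branch reads the un-tokenized text, and the word branch tokenizes its own copy.

```python
def ngrams(items, n):
    """Join every run of n consecutive items with '|'."""
    return ["|".join(items[i:i + n]) for i in range(len(items) - n + 1)]


def featurize(text, use_word_ngrams, n=3):
    normalized = text.lower()

    # Char branch: always reads the un-tokenized text, so its output does not
    # depend on whether word n-grams were requested.
    chars = ["<STX>"] + ["<SP>" if c == " " else c for c in normalized] + ["<ETX>"]
    features = {"char": ngrams(chars, n)}

    # Word branch: tokenization happens here only and does not feed the char branch.
    if use_word_ngrams:
        features["word"] = ngrams(normalized.split(), n)
    return features


sentence = "This is a cat"
char_only = featurize(sentence, use_word_ngrams=False)
char_and_word = featurize(sentence, use_word_ngrams=True)
```

With this arrangement `char_only["char"]` and `char_and_word["char"]` are identical, which is the consistency the issue asks for.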

Labels

bug (Something isn't working)
