Description
System information
Not relevant
Issue
When using TextTransform to compute n-grams, the character n-grams produced differ depending on whether the word n-grams option is also enabled. For example, take the following sentence:
This is a cat
The character n-grams produced are as follows, where <STX>, <ETX> and <SP> are the start of sentence, end of sentence and space control characters respectively.
Char 3-gram | Char 3-gram + Word 3-gram |
---|---|
<STX>\|t\|h | <STX>\|t\|h |
t\|h\|i | t\|h\|i |
h\|i\|s | h\|i\|s |
i\|s\|<SP> | i\|s\|<ETX> |
s\|<SP>\|i | s\|<ETX>\|<STX> |
<SP>\|i\|s | <ETX>\|<STX>\|i |
s\|<SP>\|a | <STX>\|i\|s |
<SP>\|a\|<SP> | <ETX>\|<STX>\|a |
a\|<SP>\|c | <STX>\|a\|<ETX> |
<SP>\|c\|a | a\|<ETX>\|<STX> |
c\|a\|t | <ETX>\|<STX>\|c |
a\|t\|<ETX> | <STX>\|c\|a |
- | c\|a\|t |
- | a\|t\|<ETX> |
Source code / logs
The cause of the problem is the word tokenizer, which is applied at the following location in the code:
if (tparams.NeedsWordTokenizationTransform)
The NeedsWordTokenizationTransform property is set according to the following criteria:
public bool NeedsWordTokenizationTransform { get { return WordExtractorFactory != null || NeedsRemoveStopwordsTransform || OutputTextTokens; } }
This means that whenever word n-grams are computed, tokenization is performed first, and the character n-gram extractor then computes n-grams on the individual words instead of on the whole sentence. That is, instead of computing char n-grams on
<STX>This<SP>is<SP>a<SP>cat<ETX>
it computes char n-grams on
<STX>This<ETX>
<STX>is<ETX>
<STX>a<ETX>
<STX>cat<ETX>
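The difference can be reproduced with a minimal plain-Python sketch. This is not ML.NET code: the function names are hypothetical, and three behaviors are assumptions inferred from the table above rather than confirmed from the source (text is lower-cased, the per-word sequences are processed as one concatenated stream so n-grams cross the <ETX>/<STX> boundary, and only distinct n-grams are listed in order of first appearance).

```python
# Illustrative sketch only; not the ML.NET implementation.
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"


def to_symbols(text):
    """Lower-case the text, map spaces to <SP>, and wrap it in <STX>/<ETX>."""
    return [STX] + [SP if ch == " " else ch for ch in text.lower()] + [ETX]


def unique_ngrams(symbols, n=3):
    """Return the distinct n-grams of a symbol list, in order of first appearance."""
    seen, grams = set(), []
    for i in range(len(symbols) - n + 1):
        gram = "|".join(symbols[i:i + n])
        if gram not in seen:
            seen.add(gram)
            grams.append(gram)
    return grams


def char_ngrams_on_sentence(text, n=3):
    """Char n-grams over the whole sentence (word n-grams disabled)."""
    return unique_ngrams(to_symbols(text), n)


def char_ngrams_after_tokenization(text, n=3):
    """Char n-grams when the text is split into words first (assumed behavior)."""
    stream = []
    for token in text.split():
        stream.extend(to_symbols(token))  # each word gets its own <STX>/<ETX>
    return unique_ngrams(stream, n)


sentence_grams = char_ngrams_on_sentence("This is a cat")
tokenized_grams = char_ngrams_after_tokenization("This is a cat")
```

Under these assumptions the two lists correspond to the two columns of the table: the tokenized variant contains no <SP> n-grams at all, and instead contains n-grams made mostly of control characters such as `<ETX>|<STX>|i`.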
First of all, is this the expected behavior?
From my point of view it is NOT, because in this way the character n-grams add noise and lose important sentence-level information that in some cases may give superior performance.
Solution
Apply the char n-gram extractor to an IDataView that was not used for word processing in the code, so that it sees the untokenized text.
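The idea can be sketched in plain Python as two independent branches that start from the same raw text. This only illustrates the data flow, not the ML.NET change itself; `featurize`, `char_trigrams` and `word_ngrams` are hypothetical names, and the lower-casing and whitespace splitting are simplifying assumptions.

```python
# Illustrative sketch only; not the ML.NET implementation.
def char_trigrams(text):
    """Char 3-grams over the whole, untokenized sentence."""
    symbols = ["<STX>"] + ["<SP>" if c == " " else c for c in text.lower()] + ["<ETX>"]
    return ["|".join(symbols[i:i + 3]) for i in range(len(symbols) - 2)]


def word_ngrams(tokens, n):
    """Word n-grams over an already tokenized sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, use_word_ngrams, word_n=3):
    """Compute char n-grams from the raw text; tokenize only on the word branch."""
    features = {"char": char_trigrams(text)}  # never sees the tokenized text
    if use_word_ngrams:
        tokens = text.lower().split()  # tokenization stays local to this branch
        features["word"] = word_ngrams(tokens, word_n)
    return features


with_words = featurize("This is a cat", use_word_ngrams=True)
without_words = featurize("This is a cat", use_word_ngrams=False)
```

With the branches separated this way, the character n-grams are the same whether or not word n-grams are requested, which is the consistency the issue asks for.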