Description
System information
Not relevant
Issue
When using TextTransform to compute n-grams, the character n-grams it produces differ depending on whether the word n-grams option is also enabled. For example, take the following sentence:
This is a cat
The character n-grams produced are as follows, where <STX>, <ETX> and <SP> are the start-of-sentence, end-of-sentence and space control characters respectively.
Char 3-gram | Char 3-gram + Word 3-gram |
---|---|
<STX>\|t\|h | <STX>\|t\|h |
t\|h\|i | t\|h\|i |
h\|i\|s | h\|i\|s |
i\|s\|<SP> | i\|s\|<ETX> |
s\|<SP>\|i | s\|<ETX>\|<STX> |
<SP>\|i\|s | <ETX>\|<STX>\|i |
s\|<SP>\|a | <STX>\|i\|s |
<SP>\|a\|<SP> | <ETX>\|<STX>\|a |
a\|<SP>\|c | <STX>\|a\|<ETX> |
<SP>\|c\|a | a\|<ETX>\|<STX> |
c\|a\|t | <ETX>\|<STX>\|c |
a\|t\|<ETX> | <STX>\|c\|a |
- | c\|a\|t |
- | a\|t\|<ETX> |
Source code / logs
The cause of the problem is the word tokenizer, which is applied at the following location in the code.
The NeedsWordTokenizationTransform property is set according to the following criteria.
This means that whenever word n-grams are computed, tokenization is performed first, and the character n-gram extractor then computes n-grams on individual words instead of on the whole sentence, i.e. instead of computing char n-grams on
<STX>This<SP>is<SP>a<SP>cat<ETX>
it computes char n-grams on
<STX>This<ETX>
<STX>is<ETX>
<STX>a<ETX>
<STX>cat<ETX>
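The difference can be reproduced outside ML.NET with a small plain-Python sketch. This is not the TextTransform implementation: the helper names `to_symbols` and `unique_ngrams` are hypothetical, and the sketch assumes lower-casing, that only distinct 3-grams are listed, and that the per-word sequences are concatenated before the n-gram window slides over them. Under those assumptions it yields the two columns of the table above.

```python
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"


def to_symbols(text):
    """Lower-case `text` and wrap it in start/end markers, one symbol per character."""
    chars = [SP if ch == " " else ch for ch in text.lower()]
    return [STX] + chars + [ETX]


def unique_ngrams(symbols, n=3):
    """Return the distinct n-grams of `symbols` in order of first appearance."""
    seen = []
    for i in range(len(symbols) - n + 1):
        gram = "|".join(symbols[i:i + n])
        if gram not in seen:
            seen.append(gram)
    return seen


sentence = "This is a cat"

# Char 3-grams over the whole sentence (char n-grams only).
sentence_level = unique_ngrams(to_symbols(sentence))

# Char 3-grams after word tokenization: every word is wrapped in its own
# <STX>/<ETX> pair, and the wrapped words are concatenated (an assumption
# inferred from n-grams such as s|<ETX>|<STX> in the table).
token_symbols = [s for word in sentence.split() for s in to_symbols(word)]
token_level = unique_ngrams(token_symbols)

for gram in token_level:
    print(gram)
```

Note that no `<SP>` symbol survives in `token_level`, so the information about which characters were adjacent across a word boundary is replaced by `<ETX>|<STX>` pairs.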
First of all, is this the expected behavior?
In my view it should not be, because computed this way the character n-grams add noise and lose important information about the sentence, and that sentence-level information may in some cases give superior performance.
Solution
Apply the char n-gram extractor to an IDataView that has not been used for word processing in the code, so that it sees the untokenized text.
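The idea can be sketched in plain Python as two branches that read the same normalized text. This is only an illustration of the proposed data flow, not ML.NET code: `featurize` and its parameters are hypothetical names, and the real fix would instead route the appropriate IDataView to each extractor.

```python
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"


def ngrams(items, n):
    """Return all n-grams of `items`, joined with '|'."""
    return ["|".join(items[i:i + n]) for i in range(len(items) - n + 1)]


def featurize(text, char_n=3, word_n=None):
    """Hypothetical featurizer: char n-grams always read the untokenized text."""
    normalized = text.lower()

    # Branch 1: character n-grams over the whole sentence, spaces kept as <SP>.
    symbols = [STX] + [SP if ch == " " else ch for ch in normalized] + [ETX]
    features = {"char": ngrams(symbols, char_n)}

    # Branch 2: word n-grams over the tokenized text, only when requested.
    # Tokenization happens here and never feeds back into branch 1.
    if word_n is not None:
        features["word"] = ngrams(normalized.split(), word_n)
    return features


chars_only = featurize("This is a cat")
chars_and_words = featurize("This is a cat", word_n=3)

# The char features no longer depend on the word n-gram option.
print(chars_only["char"] == chars_and_words["char"])
```

With this layout, enabling word n-grams only adds the `"word"` features and leaves the character n-grams unchanged.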