[TextTransform] Char n-grams differ depending on whether word n-grams are also computed. #530

### System information

`Not relevant`

### Issue
When using TextTransform to compute n-grams, the character n-grams produced differ depending on whether the word n-gram option is also enabled. For example, take the following sentence:
```
This is a cat
```
The character n-grams produced are listed below, where `<STX>`, `<ETX>` and `<SP>` are the start-of-sentence, end-of-sentence and space control characters, respectively.

| Char 3-gram | Char 3-gram + Word 3-gram|
|-|-|
|\<STX\>\|t\|h | \<STX\>\|t\|h
| t\|h\|i | t\|h\|i
| h\|i\|s | h\|i\|s
| i\|s\|\<SP\> | i\|s\|\<ETX\>
| s\|\<SP\>\|i | s\|\<ETX\>\|\<STX\>
| \<SP\>\|i\|s | \<ETX\>\|\<STX\>\|i
| s\|\<SP\>\|a | \<STX\>\|i\|s
| \<SP\>\|a\|\<SP\> | \<ETX\>\|\<STX\>\|a
| a\|\<SP\>\|c | \<STX\>\|a\|\<ETX\>
| \<SP\>\|c\|a | a\|\<ETX\>\|\<STX\>
| c\|a\|t | \<ETX\>\|\<STX\>\|c
| a\|t\|\<ETX\> | \<STX\>\|c\|a
|-| c\|a\|t
|-| a\|t\|\<ETX\>

### Source code / logs
The cause of the problem is the word tokenizer, which is applied at the following location in the code.

https://github.com/dotnet/machinelearning/blob/669f4fad33184c9c558314f8bc758f7928ad62bf/src/Microsoft.ML.Transforms/Text/TextTransform.cs#L266

The `NeedsWordTokenizationTransform` property is set according to the following criteria:
https://github.com/dotnet/machinelearning/blob/669f4fad33184c9c558314f8bc758f7928ad62bf/src/Microsoft.ML.Transforms/Text/TextTransform.cs#L169

This means that whenever word n-grams are computed, tokenization is performed first and the character n-gram extractor computes n-grams on the individual words instead of on the sentence.

Instead of computing char n-grams on
```
<STX>This<SP>is<SP>a<SP>cat<ETX>
```
it computes char n-grams on
```
<STX>This<ETX>
<STX>is<ETX>
<STX>a<ETX>
<STX>cat<ETX>
```
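
The difference between these two inputs can be reproduced outside ML.NET with a minimal Python sketch. This is not the TextTransform implementation: the helper names (`char_ngrams`, `sentence_level`, `word_level`) are hypothetical, lowercasing and de-duplication are assumptions made only to match the table above, and the control characters are represented by the literal tokens `<STX>`, `<ETX>` and `<SP>`.

```python
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"

def char_ngrams(symbols, n=3):
    """Return the unique n-grams of a symbol sequence, in first-seen order."""
    seen = []
    for i in range(len(symbols) - n + 1):
        gram = "|".join(symbols[i:i + n])
        if gram not in seen:
            seen.append(gram)
    return seen

def sentence_level(text):
    """Symbols for the whole sentence: <STX> ... <SP> ... <ETX>."""
    chars = [SP if c == " " else c for c in text.lower()]
    return [STX] + chars + [ETX]

def word_level(text):
    """Symbols after word tokenization: every word is wrapped in <STX>/<ETX>."""
    symbols = []
    for word in text.lower().split():
        symbols += [STX] + list(word) + [ETX]
    return symbols

sentence = "This is a cat"
without_words = char_ngrams(sentence_level(sentence))  # char 3-grams only
with_words = char_ngrams(word_level(sentence))         # char 3-grams after tokenization

print(len(without_words), len(with_words))
```

Under these assumptions the two lists diverge from the fourth entry onwards: every n-gram that spans a `<SP>` is replaced by one that spans an `<ETX>`/`<STX>` pair.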
First of all, is this the expected behavior?
In my view it is `NOT`, because computed this way the character n-grams add noise and lose important information about the sentence, information that in some cases may give superior performance.

### Solution
Apply the char n-gram extractor to the `IDataView` that was not used for word processing in the code, i.e. the one that has not been word-tokenized.
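
The idea can be illustrated with a small Python sketch in which the char and word extractors read from separate branches. It is only a model of the proposal, not ML.NET code: `featurize`, `ngrams` and the `use_word_ngrams` flag are hypothetical names, and lowercasing and de-duplication are assumptions.

```python
STX, ETX, SP = "<STX>", "<ETX>", "<SP>"

def ngrams(symbols, n):
    """All n-grams of a sequence, joined with '|', duplicates removed."""
    grams = ["|".join(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    return list(dict.fromkeys(grams))  # dict preserves first-seen order

def featurize(text, use_word_ngrams, n=3):
    """Hypothetical pipeline: the two extractors read from separate branches."""
    normalized = text.lower()
    features = {}
    # Char branch: always reads the untokenized sentence.
    chars = [STX] + [SP if c == " " else c for c in normalized] + [ETX]
    features["char"] = ngrams(chars, n)
    # Word branch: tokenization happens only here and never feeds the char branch.
    if use_word_ngrams:
        features["word"] = ngrams(normalized.split(), n)
    return features

a = featurize("This is a cat", use_word_ngrams=False)
b = featurize("This is a cat", use_word_ngrams=True)
print(a["char"] == b["char"])
```

With this arrangement the char n-grams are the same whether or not word n-grams are requested, which is the consistency the issue asks for.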
