## Description

### Why do we need to convert Keras inputs to sparse tensors?
In real datasets, feature values may be missing, and we want to ignore the missing values during transformation. So we need to use `tf.SparseTensor` to represent the input. For example:
| education | marital-status |
| --- | --- |
| Master | Divorced |
| | Never-married |
| Bachelor | |
The input tensor from the dataset for `tf.keras.layers.Input` is:

```python
{
    "education": [["Master"], [""], ["Bachelor"]],
    "marital-status": [["Divorced"], ["Never-married"], [""]]
}
```
In this case, we may want to ignore the empty strings and convert the other values to zero-based integer ids for embedding. So we need to convert the inputs to `tf.SparseTensor`s like:
```python
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Master", "Bachelor"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Divorced", "Never-married"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
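A minimal sketch of this conversion, assuming TF 2.x; `to_sparse` is a hypothetical helper that drops the positions holding the ignored value:

```python
import tensorflow as tf

def to_sparse(dense, ignore_value=""):
    # Keep only the positions whose value differs from the ignored value.
    indices = tf.where(tf.not_equal(dense, ignore_value))
    values = tf.gather_nd(dense, indices)
    return tf.SparseTensor(indices, values, tf.shape(dense, out_type=tf.int64))

education = tf.constant([["Master"], [""], ["Bachelor"]])
print(to_sparse(education))
```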
Then we can use the Keras preprocessing layers proposed in the RFC, such as `Lookup` and `Hash`, to convert the values of the `tf.SparseTensor`s to zero-based integer ids, and then feed them into the customized `SparseEmbedding` with a combiner in ElasticDL. For example, using `Lookup` with the education vocabulary `["Master", "Bachelor"]` and the marital-status vocabulary `["Divorced", "Never-married"]`, we get:
```python
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
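The `Lookup` layer here is the one proposed in the RFC, not a released Keras API. A minimal sketch of the same transformation using `tf.lookup.StaticHashTable`, which is available in stock TensorFlow:

```python
import tensorflow as tf

vocab = tf.constant(["Master", "Bachelor"])
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(vocab, tf.range(2, dtype=tf.int64)),
    default_value=-1)

education = tf.SparseTensor(
    indices=[[0, 0], [2, 0]],
    values=tf.constant(["Master", "Bachelor"]),
    dense_shape=[3, 1])
# Map only the values; the indices and dense_shape stay unchanged.
education_ids = tf.SparseTensor(
    education.indices, table.lookup(education.values), education.dense_shape)
print(education_ids)
```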
When the number of categorical features is very large, we may need to concatenate the integer ids of multiple categorical features before embedding, as described in the issue. For example, we add 3 to the marital-status ids to avoid conflicts with the education ids, and then concatenate the two `tf.SparseTensor`s into one:
```python
SparseTensor(
    indices=tf.Tensor([[0, 0], [0, 1], [1, 1], [2, 0]], shape=(4, 2), dtype=int64),
    values=tf.Tensor([0, 3, 4, 1], shape=(4,), dtype=int64),
    dense_shape=tf.Tensor([3, 2], shape=(2,), dtype=int64)
)
```
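A runnable sketch of the offset-and-concatenate step with `tf.sparse.concat`, assuming the two id `SparseTensor`s from the lookup above:

```python
import tensorflow as tf

education_ids = tf.SparseTensor(
    indices=[[0, 0], [2, 0]],
    values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])
marital_ids = tf.SparseTensor(
    indices=[[0, 0], [1, 0]],
    values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])

# Add 3 to the marital-status ids so they do not collide with education ids.
marital_ids = tf.SparseTensor(
    marital_ids.indices, marital_ids.values + 3, marital_ids.dense_shape)

# Concatenate along the feature axis; the result has dense_shape [3, 2].
concat_ids = tf.sparse.concat(axis=1, sp_inputs=[education_ids, marital_ids])
print(concat_ids)
```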
### The problem of filling missing values with a default integer id before embedding
Besides ignoring missing values, we can also fill them with a default integer id during transformation. Suppose the vocabulary of education is `["Master", "Bachelor"]` and the id of a missing value is 2, and the vocabulary of marital-status is `["Divorced", "Never-married"]` with the id of a missing value also 2. After the lookup, we get the transformation result:

```python
{"education": [[0], [2], [1]], "marital-status": [[0], [1], [2]]}
```
Then we may want to concatenate the integer ids as in the issue. Before concatenating, we add 3 to the marital-status ids to avoid conflicts with the education ids. So the concatenation result is:

```python
[[0, 3], [2, 4], [1, 5]]
```
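The dense offset-and-concatenate is a plain `tf.concat`:

```python
import tensorflow as tf

education_ids = tf.constant([[0], [2], [1]])
marital_ids = tf.constant([[0], [1], [2]])
# Offset the marital-status ids by 3, then concatenate along the feature axis.
concat_ids = tf.concat([education_ids, marital_ids + 3], axis=1)
print(concat_ids)  # [[0, 3], [2, 4], [1, 5]]
```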
Then we feed the result into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine the embedding output into a 3x2 dense tensor. The logic of embedding followed by `tf.reduce_sum` is the same as `SparseEmbedding` with the sum combiner. However, it also adds the embedding vectors of the missing-value ids into the result, so the combined result of each sample may be bigger than when using a sparse tensor without missing values. The embedding vectors of the missing values may be noise for training.
For example, suppose the embedding table is a 6x2 matrix:

```python
[[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]]
```

If we feed the above concatenated sparse tensor into `SparseEmbedding` with the sum combiner, the output is:

```python
[[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]
```

If we feed the concatenated dense tensor into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine, the output is:

```python
[[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
```
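A runnable sketch reproducing both outputs, using `tf.nn.embedding_lookup_sparse` with a sum combiner as a stand-in for ElasticDL's `SparseEmbedding`:

```python
import tensorflow as tf

embedding_table = tf.constant(
    [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]])

# Sparse path: the sum combiner only adds the embeddings of present values.
sparse_ids = tf.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 1], [2, 0]],
    values=tf.constant([0, 3, 4, 1], dtype=tf.int64),
    dense_shape=[3, 2])
print(tf.nn.embedding_lookup_sparse(embedding_table, sparse_ids, None, combiner="sum"))
# [[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]

# Dense path: the missing-value ids (2 and 5) are embedded and summed as well.
dense_ids = tf.constant([[0, 3], [2, 4], [1, 5]])
print(tf.reduce_sum(tf.nn.embedding_lookup(embedding_table, dense_ids), axis=1))
# [[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
```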