Support convert categorical features to SparseTensor for embedding in Keras #1844

Closed
workingloong opened this issue Mar 17, 2020 · 0 comments · Fixed by #1860
workingloong commented Mar 17, 2020

Why do we need to convert Keras inputs to sparse tensors?

In real datasets, feature values may be missing, and we want to ignore the missing values during transformation. So we need to use tf.SparseTensor to represent the input. For example:

| education | marital-status |
| --- | --- |
| Master | Divorced |
| | Never-married |
| Bachelor | |

The input tensors from the dataset for tf.keras.layers.Input are:

```
{
    "education": [["Master"], [""], ["Bachelor"]],
    "marital-status": [["Divorced"], ["Never-married"], [""]]
}
```

In this case, we may want to ignore the empty strings and convert the other values to zero-based integer ids for embedding. So we need to convert the inputs to tf.SparseTensor like:

```
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Master", "Bachelor"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Divorced", "Never-married"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
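A minimal sketch of this conversion with plain TensorFlow 2.x ops (the helper name `to_sparse_ignoring_empty` is hypothetical, not an ElasticDL API):

```python
import tensorflow as tf

def to_sparse_ignoring_empty(dense):
    # Hypothetical helper: drop "" entries from a dense string tensor
    # and keep the remaining values in a tf.SparseTensor.
    indices = tf.where(tf.not_equal(dense, ""))  # positions of non-empty values
    values = tf.gather_nd(dense, indices)        # the non-empty strings
    dense_shape = tf.shape(dense, out_type=tf.int64)
    return tf.SparseTensor(indices, values, dense_shape)

education = tf.constant([["Master"], [""], ["Bachelor"]])
print(to_sparse_ignoring_empty(education))
# indices=[[0 0] [2 0]], values=[b'Master' b'Bachelor'], dense_shape=[3 1]
```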

Then we can use the Keras preprocessing layers proposed in the RFC, like Lookup and Hash, to convert the values of the tf.SparseTensor to zero-based integer ids, and feed them into the customized SparseEmbedding layer with a combiner in ElasticDL. For example, using Lookup with the education vocabulary ["Master", "Bachelor"] and the marital-status vocabulary ["Divorced", "Never-married"], we get:

```
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
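For illustration, here is a sketch with `tf.lookup.StaticHashTable` standing in for the proposed Lookup layer (the `lookup_sparse` helper is hypothetical):

```python
import tensorflow as tf

# tf.lookup.StaticHashTable plays the role of the proposed Keras Lookup layer.
keys = tf.constant(["Master", "Bachelor"])
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, tf.range(2, dtype=tf.int64)),
    default_value=-1)

def lookup_sparse(sp, table):
    # Hypothetical helper: map only the present values, keeping the
    # indices and dense_shape of the input SparseTensor unchanged.
    return tf.SparseTensor(sp.indices, table.lookup(sp.values), sp.dense_shape)

education = tf.SparseTensor(
    indices=[[0, 0], [2, 0]],
    values=tf.constant(["Master", "Bachelor"]),
    dense_shape=[3, 1])
print(lookup_sparse(education, table))  # values become [0, 1]
```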

When the number of categorical features is very large, we may need to concatenate the integer ids of multiple categorical features before embedding; we have described this feature in a separate issue. For example, we add 3 to the ids of marital-status to avoid conflicts with the education ids, and then concatenate the two SparseTensors into a single SparseTensor:

```
SparseTensor(
    indices=tf.Tensor([[0, 0], [0, 1], [1, 1], [2, 0]], shape=(4, 2), dtype=int64),
    values=tf.Tensor([0, 3, 4, 1], shape=(4,), dtype=int64),
    dense_shape=tf.Tensor([3, 2], shape=(2,), dtype=int64)
)
```
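A sketch of the offset-and-concatenate step with `tf.sparse.concat`, assuming the two id SparseTensors produced by the lookup above:

```python
import tensorflow as tf

education_ids = tf.SparseTensor(
    indices=[[0, 0], [2, 0]], values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])
marital_ids = tf.SparseTensor(
    indices=[[0, 0], [1, 0]], values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])

# Shift the marital-status ids by 3 so they do not collide with education ids.
shifted = tf.SparseTensor(
    marital_ids.indices, marital_ids.values + 3, marital_ids.dense_shape)

# Concatenate the two features along the column axis into one SparseTensor.
merged = tf.sparse.concat(axis=1, sp_inputs=[education_ids, shifted])
print(merged)  # values=[0, 3, 4, 1], dense_shape=[3, 2]
```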

The problem with filling missing values with a default integer id before embedding

Besides ignoring missing values, we can also fill them with a default integer id during transformation. Suppose the vocabulary of education is ["Master", "Bachelor"] and the id of its missing value is 2, and the vocabulary of marital-status is ["Divorced", "Never-married"] and the id of its missing value is also 2. After lookup, we get the transformation result:

{"education": [[0], [2], [1]], "marital-status": [[0], [1], [2]]}

Then we may want to concatenate the integer ids as in the issue mentioned above. Before concatenating, we add 3 to the ids of marital-status to avoid conflicts with the education ids. So the concatenation result is:

```
[[0, 3], [2, 4], [1, 5]]
```

Then we feed the result into tf.keras.layers.Embedding and use tf.reduce_sum to combine the embedding outputs into a 3x2 dense tensor. The logic of embedding followed by tf.reduce_sum is the same as SparseEmbedding with the sum combiner. However, the embedding vectors of the missing values are also added into the result, so the combined result of each sample may be larger than when using a sparse tensor without missing values. The embedding values of the missing values may be noise for training.

For example, suppose that the embedding table is the following 6x2 matrix:

```
[[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]]
```

If we feed the above concatenated sparse tensor into SparseEmbedding with the sum combiner, the output is:

```
[[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]
```
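SparseEmbedding is ElasticDL's customized layer; for this example, `tf.nn.embedding_lookup_sparse` with `combiner="sum"` reproduces the same arithmetic:

```python
import tensorflow as tf

table = tf.constant([[0.1, 0.1], [0.2, 0.2], [0.3, 0.3],
                     [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]])
sp_ids = tf.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 1], [2, 0]],
    values=tf.constant([0, 3, 4, 1], dtype=tf.int64),
    dense_shape=[3, 2])
out = tf.nn.embedding_lookup_sparse(table, sp_ids, None, combiner="sum")
print(out)  # [[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]
```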

If we feed the concatenated dense tensor into tf.keras.layers.Embedding and use tf.reduce_sum to combine, the output is:

```
[[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
```
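The dense path, sketched with `tf.nn.embedding_lookup` plus `tf.reduce_sum` (tf.keras.layers.Embedding performs the same lookup with a trainable table):

```python
import tensorflow as tf

table = tf.constant([[0.1, 0.1], [0.2, 0.2], [0.3, 0.3],
                     [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]])
ids = tf.constant([[0, 3], [2, 4], [1, 5]])
embedded = tf.nn.embedding_lookup(table, ids)  # shape (3, 2, 2)
combined = tf.reduce_sum(embedded, axis=1)     # shape (3, 2)
print(combined)  # [[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
# Rows 1 and 2 include the embeddings of the missing-value ids 2 and 5.
```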