Why do we need to convert Keras inputs to sparse tensors?
In real datasets, feature values may be missing, and we want to ignore the missing values during transformation. So we need to use `tf.SparseTensor` to represent the input. For example:
| education | marital-status |
|-----------|----------------|
| Master    | Divorced       |
|           | Never-married  |
| Bachelor  |                |
The input tensor from the dataset for `tf.keras.layers.Input` is a dense string tensor in which missing values appear as empty strings. In this case, we may want to ignore the empty strings and convert the other values to zero-based integer ids for embedding, so we need to convert the inputs to `tf.SparseTensor`.
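A minimal sketch of this conversion (the `to_sparse` helper is illustrative, and the sample batch is reconstructed from the table above):

```python
import tensorflow as tf

def to_sparse(dense):
    """Keep only the non-empty strings of a dense tensor as a SparseTensor."""
    indices = tf.where(tf.not_equal(dense, ""))
    values = tf.gather_nd(dense, indices)
    return tf.SparseTensor(indices, values, tf.shape(dense, out_type=tf.int64))

education = tf.constant([["Master"], [""], ["Bachelor"]])
marital_status = tf.constant([["Divorced"], ["Never-married"], [""]])

education_sparse = to_sparse(education)      # values: [b"Master", b"Bachelor"]
marital_sparse = to_sparse(marital_status)   # values: [b"Divorced", b"Never-married"]
```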
Then we can use the Keras preprocessing layers proposed in the RFC, such as `Lookup` and `Hash`, to convert the values of the `tf.SparseTensor` to zero-based integer ids, and feed them into the customized `SparseEmbedding` with a combiner in ElasticDL. For example, we can use `Lookup` with the education vocabulary ["Master", "Bachelor"] and the marital-status vocabulary ["Divorced", "Never-married"] to map each present value to its index in the vocabulary.
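Since the RFC's `Lookup` layer may not be available in a given TensorFlow version, here is a sketch of the same mapping with `tf.lookup` (the `lookup_sparse` helper is illustrative and reuses the sparse tensors from the sketch above):

```python
def lookup_sparse(sp, vocabulary):
    """Map the string values of a SparseTensor to zero-based integer ids."""
    table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(
            keys=tf.constant(vocabulary),
            values=tf.range(len(vocabulary), dtype=tf.int64)),
        default_value=-1)  # out-of-vocabulary values map to -1
    return tf.SparseTensor(sp.indices, table.lookup(sp.values), sp.dense_shape)

education_ids = lookup_sparse(education_sparse, ["Master", "Bachelor"])
# indices: [[0, 0], [2, 0]], values: [0, 1]
marital_ids = lookup_sparse(marital_sparse, ["Divorced", "Never-married"])
# indices: [[0, 0], [1, 0]], values: [0, 1]
```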
When the number of categorical features is very large, we may need to concatenate the integer ids of multiple categorical features before embedding, as described in the issue. For example, we can add 3 to the marital-status ids to avoid conflicts with the education ids and then concatenate the two `SparseTensor`s into one `SparseTensor`.
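A sketch of the offset-and-concatenate step, continuing the example above (`tf.sparse.concat` joins the two features along the column axis):

```python
# Shift the marital-status ids past the education id range so the two
# features occupy disjoint id ranges in the shared embedding table.
marital_shifted = tf.SparseTensor(
    marital_ids.indices, marital_ids.values + 3, marital_ids.dense_shape)

combined_ids = tf.sparse.concat(axis=1, sp_inputs=[education_ids, marital_shifted])
# Per-sample ids: [[0, 3], [4], [1]] -- samples with a missing value
# simply have fewer ids; no placeholder id is introduced.
```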
The problem of filling missing values with a default integer id before embedding
Besides ignoring missing values, we can also fill them with a default integer id during transformation. Suppose:
- The vocabulary of education is ["Master", "Bachelor"], and the id of a missing value is 2.
- The vocabulary of marital-status is ["Divorced", "Never-married"], and the id of a missing value is 2.
After lookup, we get the education ids `[[0], [2], [1]]` and the marital-status ids `[[0], [1], [2]]`.
Then we may want to concatenate the integer ids as in the issue. Before concatenating, we add 3 to the marital-status ids to avoid conflicts with the education ids, so the concatenation result is:
[[0, 3], [2, 4], [1, 5]]
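A sketch of this fill-with-default variant (the `lookup_dense` helper is illustrative and reuses the dense tensors from the first sketch; the default id 2 follows the vocabularies above):

```python
def lookup_dense(dense, vocabulary, default_id):
    """Map dense strings to ids, sending missing values to default_id."""
    table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(
            keys=tf.constant(vocabulary),
            values=tf.range(len(vocabulary), dtype=tf.int64)),
        default_value=default_id)
    return table.lookup(dense)

education_dense_ids = lookup_dense(education, ["Master", "Bachelor"], 2)
# [[0], [2], [1]]
marital_dense_ids = lookup_dense(marital_status, ["Divorced", "Never-married"], 2)
# [[0], [1], [2]]
concatenated = tf.concat([education_dense_ids, marital_dense_ids + 3], axis=1)
# [[0, 3], [2, 4], [1, 5]]
```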
Then we feed the result into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine the embedding output into a 3x2 dense tensor. The logic of embedding followed by `tf.reduce_sum` is the same as `SparseEmbedding` with the sum combiner, except that the embedding values of the missing-value ids are also added into the result. The combined result of each sample may therefore be larger than when using a sparse tensor without missing values, and the embedding values of the missing-value ids may be noise for training.
For example, suppose the embedding table is a 6x2 matrix.
If we feed the above concatenated sparse tensor into `SparseEmbedding` with the sum combiner, each output row is the sum of the embeddings of that sample's real ids only. If we instead feed the concatenated dense tensor into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine, the embeddings of the missing-value ids 2 and 5 are added into the output as well, so the two results differ exactly by the missing-value embeddings.
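A sketch of the comparison with a hypothetical 6x2 table (the values are illustrative, and `SparseEmbedding` is approximated here by `tf.nn.embedding_lookup_sparse`; `combined_ids` and `concatenated` come from the sketches above):

```python
# Hypothetical 6x2 embedding table; rows 2 and 5 are the missing-value ids.
table = tf.constant([[1., 1.],
                     [2., 2.],
                     [100., 100.],   # missing-value id for education
                     [3., 3.],
                     [4., 4.],
                     [100., 100.]])  # missing-value id for marital-status

# SparseEmbedding with the sum combiner: only real ids contribute.
sparse_output = tf.nn.embedding_lookup_sparse(
    table, combined_ids, None, combiner="sum")
# [[4., 4.], [4., 4.], [2., 2.]]

# Embedding + tf.reduce_sum on the filled dense ids: the missing-value
# embeddings are added in as well and act as noise for samples 2 and 3.
dense_output = tf.reduce_sum(tf.nn.embedding_lookup(table, concatenated), axis=1)
# [[4., 4.], [104., 104.], [102., 102.]]
```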