## Description

### Why do we need to convert Keras inputs to sparse tensors?
In real datasets, feature values may be missing, and we want to ignore the missing values during transformation. So we need to use `tf.SparseTensor` to represent the input. For example:
| education | marital-status |
| --- | --- |
| Master | Divorced |
| | Never-married |
| Bachelor | |
The input tensor from the dataset for `tf.keras.layers.Input` is:

```python
{
    "education": [["Master"], [""], ["Bachelor"]],
    "marital-status": [["Divorced"], ["Never-married"], [""]]
}
```
In this case, we may want to ignore the empty strings and convert the other values to zero-based integer ids for embedding. So we need to convert the inputs to `tf.SparseTensor`s like:
```python
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Master", "Bachelor"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor(["Divorced", "Never-married"], shape=(2,), dtype=string),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
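A minimal sketch of this conversion, assuming TF 2.x; `to_sparse` is a hypothetical helper that drops the positions holding the ignored value:

```python
import tensorflow as tf

def to_sparse(dense, ignore_value=""):
    # Keep only the positions whose value differs from the ignored value.
    indices = tf.where(tf.not_equal(dense, ignore_value))
    values = tf.gather_nd(dense, indices)
    return tf.SparseTensor(indices, values, tf.shape(dense, out_type=tf.int64))

education = tf.constant([["Master"], [""], ["Bachelor"]])
print(to_sparse(education))
```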
Then we can use the Keras preprocessing layers proposed in the RFC, such as `Lookup` and `Hash`, to convert the values of the `tf.SparseTensor`s to zero-based integer ids, and then feed them into the customized `SparseEmbedding` with a combiner in ElasticDL. For example, using `Lookup` with the education vocabulary `["Master", "Bachelor"]` and the marital-status vocabulary `["Divorced", "Never-married"]`, we get:
```python
{
    "education": SparseTensor(
        indices=tf.Tensor([[0, 0], [2, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    ),
    "marital-status": SparseTensor(
        indices=tf.Tensor([[0, 0], [1, 0]], shape=(2, 2), dtype=int64),
        values=tf.Tensor([0, 1], shape=(2,), dtype=int64),
        dense_shape=tf.Tensor([3, 1], shape=(2,), dtype=int64)
    )
}
```
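The `Lookup` layer here is the one proposed in the RFC, not a released Keras API. A minimal sketch of the same transformation using `tf.lookup.StaticHashTable`, which is available in stock TensorFlow:

```python
import tensorflow as tf

vocab = tf.constant(["Master", "Bachelor"])
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(vocab, tf.range(2, dtype=tf.int64)),
    default_value=-1)

education = tf.SparseTensor(
    indices=[[0, 0], [2, 0]],
    values=tf.constant(["Master", "Bachelor"]),
    dense_shape=[3, 1])
# Map only the values; the indices and dense_shape stay unchanged.
education_ids = tf.SparseTensor(
    education.indices, table.lookup(education.values), education.dense_shape)
print(education_ids)
```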
When the number of categorical features is very large, we may need to concatenate the integer ids of multiple categorical features before embedding, as described in the issue. For example, we add 3 to the marital-status ids to avoid conflicts with the education ids, and then concatenate the two `tf.SparseTensor`s into one:
```python
SparseTensor(
    indices=tf.Tensor([[0, 0], [0, 1], [1, 1], [2, 0]], shape=(4, 2), dtype=int64),
    values=tf.Tensor([0, 3, 4, 1], shape=(4,), dtype=int64),
    dense_shape=tf.Tensor([3, 2], shape=(2,), dtype=int64)
)
```
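A runnable sketch of the offset-and-concatenate step with `tf.sparse.concat`, assuming the two id `SparseTensor`s from the lookup above:

```python
import tensorflow as tf

education_ids = tf.SparseTensor(
    indices=[[0, 0], [2, 0]],
    values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])
marital_ids = tf.SparseTensor(
    indices=[[0, 0], [1, 0]],
    values=tf.constant([0, 1], dtype=tf.int64),
    dense_shape=[3, 1])

# Add 3 to the marital-status ids so they do not collide with education ids.
marital_ids = tf.SparseTensor(
    marital_ids.indices, marital_ids.values + 3, marital_ids.dense_shape)

# Concatenate along the feature axis; the result has dense_shape [3, 2].
concat_ids = tf.sparse.concat(axis=1, sp_inputs=[education_ids, marital_ids])
print(concat_ids)
```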
### The problem of filling missing values with a default integer id before embedding
Besides ignoring missing values, we can also fill them with a default integer id during transformation. Suppose the vocabulary of education is `["Master", "Bachelor"]` and the id of a missing value is 2, and the vocabulary of marital-status is `["Divorced", "Never-married"]` with the id of a missing value also 2. After the lookup, we get the transformation result:

```python
{"education": [[0], [2], [1]], "marital-status": [[0], [1], [2]]}
```
Then we may want to concatenate the integer ids as in the issue. Before concatenating, we add 3 to the marital-status ids to avoid conflicts with the education ids. So the concatenation result is:

```python
[[0, 3], [2, 4], [1, 5]]
```
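The dense offset-and-concatenate is a plain `tf.concat`:

```python
import tensorflow as tf

education_ids = tf.constant([[0], [2], [1]])
marital_ids = tf.constant([[0], [1], [2]])
# Offset the marital-status ids by 3, then concatenate along the feature axis.
concat_ids = tf.concat([education_ids, marital_ids + 3], axis=1)
print(concat_ids)  # [[0, 3], [2, 4], [1, 5]]
```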
Then we feed the result into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine the embedding output into a 3x2 dense tensor. The logic of embedding followed by `tf.reduce_sum` is the same as `SparseEmbedding` with the sum combiner. However, it also adds the embedding vectors of the missing-value ids into the result, so the combined result of each sample may be bigger than when using a sparse tensor without missing values. The embedding vectors of the missing values may be noise for training.
For example, suppose the embedding table is a 6x2 matrix:

```python
[[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]]
```

If we feed the above concatenated sparse tensor into `SparseEmbedding` with the sum combiner, the output is:

```python
[[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]
```

If we feed the concatenated dense tensor into `tf.keras.layers.Embedding` and use `tf.reduce_sum` to combine, the output is:

```python
[[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
```
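A runnable sketch reproducing both outputs, using `tf.nn.embedding_lookup_sparse` with a sum combiner as a stand-in for ElasticDL's `SparseEmbedding`:

```python
import tensorflow as tf

embedding_table = tf.constant(
    [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5], [0.6, 0.6]])

# Sparse path: the sum combiner only adds the embeddings of present values.
sparse_ids = tf.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 1], [2, 0]],
    values=tf.constant([0, 3, 4, 1], dtype=tf.int64),
    dense_shape=[3, 2])
print(tf.nn.embedding_lookup_sparse(embedding_table, sparse_ids, None, combiner="sum"))
# [[0.5, 0.5], [0.5, 0.5], [0.2, 0.2]]

# Dense path: the missing-value ids (2 and 5) are embedded and summed as well.
dense_ids = tf.constant([[0, 3], [2, 4], [1, 5]])
print(tf.reduce_sum(tf.nn.embedding_lookup(embedding_table, dense_ids), axis=1))
# [[0.5, 0.5], [0.8, 0.8], [0.8, 0.8]]
```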