You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
TomFinley opened this issue
Jan 18, 2019
· 1 comment
Labels
APIIssues pertaining the friendly APIbugSomething isn't workingimageBugs related image datatype tasksP1Priority of the issue for triage purpose: Needs to be fixed soon.
As the name suggests, this "key to vector" transformer this produces a vector valued column out of a key-valued column. That key-valued column can be either a scalar, vector, or unknown size vector. Much of the logic in this method did not seem to conform to my understanding of what the transformer is doing (and, in one case, should be doing but isn't). It may be that we need some more test coverage for this, in addition to fixing the bugs, if I am right about any of these.
/// The 'raw' type of column item: must be a primitive type or a structured type.
/// </summary>
publicreadonlyColumnTypeItemType;
col.ItemType is PrimitiveType is totally redundant given that we already had the much stronger test that it was one of the "known" datakinds. (Which, again, is still completely wrong.)
So, I think practically, we might have a much stronger test if we dropped practically all this matter and just tested col.IsKey. Maybe also verify that the raw type is not ulong or something. But, the rest of this doesn't make a lot of sense to me.
Bug 2: Error message
Now what does that test lead to... it leads to this exception being thrown:
return$"Could not find {columnRole} column '{columnName}'";
Which is of course not right. "Luckily," the test above was so screwed up that the only conceivable way a user could run into it is if they fed in an image type, if I interpret it correctly.
The first batch of metadata should verify that the type is of col.Kind == SchemaShape.Column.VectorKind.Vector. Instead, it only verifies it's not a variable vector. but apparently scalar values are just fine. :)
For the second metadata, this is sort of a bug more on the transformer itself than the estimator: given that the behavior of bagging is only relevant on vectors of keys, CategoricalSlotRanges should be added as metadata on a scalar source column only (or, I guess, ideally on a vector of length 1); if it's not a scalar (or vector of length 1), then it could not be a one-hot vector anyway, and so could not be encoding a categorical value. Might be more a @codemzs problem.
The third metadata looks actually reasonable, at least at first glance.
Bug 4: The column type.
If bagging is off, and the input is a variable sized vector, then the output will be a variable sized vector as well. Yet we see the result is unconditionally a .Vector, which does not cover that case:
As I kept reviewing, I noticed that value-to-key has some very similar problems. Not going to go over that in as much detail, but they're roughly in the same vein.
TomFinley
changed the title
Tests in KeyToVectorMappingEstimator.GetOutputSchema seem wrong
Conditions in KeyToVectorMappingEstimator.GetOutputSchema seem wrong
Jan 27, 2019
APIIssues pertaining the friendly APIbugSomething isn't workingimageBugs related image datatype tasksP1Priority of the issue for triage purpose: Needs to be fixed soon.
While reviewing #2176, I incidentally read what I take to be some bugs in this method, the schema shape propagation for key values:
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Line 762 in bb92c06
As the name suggests, this "key to vector" transformer this produces a vector valued column out of a key-valued column. That key-valued column can be either a scalar, vector, or unknown size vector. Much of the logic in this method did not seem to conform to my understanding of what the transformer is doing (and, in one case, should be doing but isn't). It may be that we need some more test coverage for this, in addition to fixing the bugs, if I am right about any of these.
Bug 1: The Type Test
Let's review the type test:
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Line 770 in bb92c06
I see some problems here:
This is the "key to vector" transform, yet, there is no test to see if it is a key?
Instead first we see that it fails if it's of an unknown
DataKind
(which seems totally besides the point).We also see that we have this totally unnecessary
ItemType.GetItemType()
, which considering the description is not necessary.Related: later on we see
col.ItemType is VectorType
, which according to the documentation here could never be the case anyway:machinelearning/src/Microsoft.ML.Core/Data/IEstimator.cs
Lines 46 to 49 in bb92c06
col.ItemType is PrimitiveType
is totally redundant given that we already had the much stronger test that it was one of the "known" datakinds. (Which, again, is still completely wrong.)So, I think practically, we might have a much stronger test if we dropped practically all this matter and just tested
col.IsKey
. Maybe also verify that the raw type is notulong
or something. But, the rest of this doesn't make a lot of sense to me.Bug 2: Error message
Now what does that test lead to... it leads to this exception being thrown:
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Line 771 in bb92c06
If you trace through to what becomes the exception message, it becomes this:
machinelearning/src/Microsoft.ML.Core/Utilities/Contracts.cs
Line 463 in bb92c06
Which is of course not right. "Luckily," the test above was so screwed up that the only conceivable way a user could run into it is if they fed in an image type, if I interpret it correctly.
Bug 3: The metadata tests.
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 774 to 780 in bb92c06
The first batch of metadata should verify that the type is of
col.Kind == SchemaShape.Column.VectorKind.Vector
. Instead, it only verifies it's not a variable vector. but apparently scalar values are just fine. :)For the second metadata, this is sort of a bug more on the transformer itself than the estimator: given that the behavior of bagging is only relevant on vectors of keys,
CategoricalSlotRanges
should be added as metadata on a scalar source column only (or, I guess, ideally on a vector of length 1); if it's not a scalar (or vector of length 1), then it could not be a one-hot vector anyway, and so could not be encoding a categorical value. Might be more a @codemzs problem.The third metadata looks actually reasonable, at least at first glance.
Bug 4: The column type.
If bagging is off, and the input is a variable sized vector, then the output will be a variable sized vector as well. Yet we see the result is unconditionally a
.Vector
, which does not cover that case:machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Line 782 in bb92c06
/cc @Ivanidzo4ka
The text was updated successfully, but these errors were encountered: