What is "Slot" (PFI documentation suggestions) #5954
So the name "Slot 48416" just appears when there isn't a name for that slot/index in the feature vector column. That can happen for various reasons, like the original column not having a name, but it's also very possible we aren't adding it correctly. I am interested in the fact that you only have about 2000 features, yet the feature column seems to end up with a lot more columns than that. Can you check the schema of that column in your pipeline and let me know what it says? We may be able to use this to track down whether there is a bug or something we are missing when naming the slots. It's also possible it is working completely as intended; we will just need more information to see.

Running time for sure seems to be longer than O(n), though honestly I am not sure what it is. @justinormont may have a better understanding of the time required.

For your other questions at the end I will need to ask a few people. I am not the most familiar with how PFI itself actually works under the hood.
Unnamed slots

Fixing on ML.NET dev side -- This is an issue that should be fixed. Ideally each transform would provide good names for each feature created. Alternatively, instead of fixing individual transforms, a less clean but easier fix is naming all slots in only the [...]

Before a fix is in, you can backtrack the slot's purpose by looking at your [...]

Slow PFI

The Linear model [...] Trees are a bit more complex for runtime; their [...]

Speeding up PFI

To speed up PFI, you can use [...]
Global Feature Index works perfectly, thank you. The features that PFI reports without a name seem to be categorical string values. GFI reports the names in the format "CityCode.HEL", i.e. columnName.Value.
@michaelgsharp It seems this is creating a high count of weights without names: machinelearning/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs Line 279 in 3055403
I commented out the part below and started getting problems with GFI. In debug inspection I noted that the count of weights is much higher than the count of slot names. I am using the sample from justinormont's link above, with slight modifications. machinelearning/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs Lines 258 to 263 in 3055403
lastTransformer.Model.SubModel.GetFeatureWeights(ref weights); gives a very high count of items (in the last dataset, something like 200k). output.Schema["Features"].GetSlotNames(ref slotNames); still gives the expected count, in this case 7000. I have not looked into this further yet, because I first need to complete the main task. So it is possible I am misunderstanding something here.
@torronen One-hot hashing transform has the option of creating slot names: machinelearning/src/Microsoft.ML.Transforms/OneHotHashEncoding.cs Lines 104 to 107 in 0577957
When AutoML creates a one-hot hashing transform, it is not using the machinelearning/src/Microsoft.ML.AutoML/EstimatorExtensions/EstimatorExtensions.cs Line 221 in 3055403
By default, one-hot hashing is used when the cardinality of the column is large; standard one-hot is used for lower cardinalities: machinelearning/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs Lines 258 to 267 in 3055403
The slot names are created as: [...]

Multiple strings can map to the same hash bucket, giving a slot name of [...]

Ideally, any empty slot names would be auto-created lazily (as mentioned above) and filled in. This would require a fix to ML.NET.

Instead of using one-hot hashing, if you use the standard one-hot transform, it will produce a slot name for each slot.
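To make the collision point concrete, here is a small, language-agnostic sketch in Python rather than ML.NET code. It assumes a stand-in hash (`zlib.crc32`), a hypothetical helper name `hashed_slot_names`, and a hypothetical naming scheme (`column.value`, with colliding values joined by `|`); ML.NET's actual hash function and slot-name format may differ.

```python
import zlib
from collections import defaultdict


def hashed_slot_names(column, values, hash_bits):
    """Hash each category value into one of 2**hash_bits slots and name the slots."""
    num_slots = 1 << hash_bits
    buckets = defaultdict(list)
    for value in sorted(set(values)):
        # crc32 is only a deterministic stand-in for the real hash function.
        slot = zlib.crc32(value.encode("utf-8")) % num_slots
        buckets[slot].append(value)

    names = []
    for slot in range(num_slots):
        if slot in buckets:
            # Hypothetical format: colliding values share one slot name.
            names.append(column + "." + "|".join(buckets[slot]))
        else:
            # No value hashed here, so the slot has no name to report.
            names.append("")
    return names


# Five distinct values in four slots: at least two values must collide.
names = hashed_slot_names("CityCode", ["HEL", "TKU", "TMP", "OUL", "JYV"], hash_bits=2)
for slot, name in enumerate(names):
    print(slot, repr(name))
```

With many hash bits and few distinct values, most slots stay empty, which is one way a feature vector can end up far wider than the list of names available for it.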
When you say created lazily, do you mean we would figure out which column the slot originally came from? If not, how would that work, since we won't know what was hashed to get to that slot originally? Right now for PFI (the new APIs), if the slot name isn't known it just fills in "Slot X".
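The "Slot X" fallback, together with the earlier observation that there can be more weights than slot names, can be illustrated with a short Python sketch. This is only an illustration of the behaviour described in this thread, not ML.NET's implementation; `label_weights` and the sample values are hypothetical.

```python
def label_weights(weights, slot_names):
    """Pair each weight with its slot name, falling back to 'Slot <index>'."""
    labeled = []
    for index, weight in enumerate(weights):
        # A name can be missing because the list is shorter than the
        # weight vector, or because the entry itself is empty.
        name = slot_names[index] if index < len(slot_names) else ""
        labeled.append((name or f"Slot {index}", weight))
    return labeled


weights = [0.5, 0.0, -1.25, 0.75]
slot_names = ["CityCode.HEL", "", "Age"]  # one empty name, one missing entirely
labeled = label_weights(weights, slot_names)
for name, weight in labeled:
    print(name, weight)
```

Comparing `len(weights)` with `len(slot_names)` before pairing them is a quick way to see whether unnamed slots come from empty names or from a width mismatch.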
I am running the new PFI API (main branch with #5934) for a FastTreeBinary loaded model created by AutoML API.
Main question:
I receive items like "Slot 48416" from
MLContext.BinaryClassification.PermutationFeatureImportanceNonCalibrated().
I did not find documentation about how to interpret these items. What do they mean? I am stuck with this issue.
As I understand it, this comes from the features vector, for slots without a name.
I am confused about why my features vector has these additional items, and how I can backtrack which original feature they belong to. I have about 2000 features in my dataset.
Side items / suggestion for documentation:
I notice there is some logging code in PFI which seems to report the progress of PFI through a ProgressHeader, but I could not find documentation on how to read the progress.
pch.SetHeader(new ProgressHeader("processed slots"), e => e.SetProgress(0, processedCnt));
There is also another GitHub issue about the recommended values for the permutation count and the number of examples, and about estimating the running time. It seems the running time may grow faster than O(n) in the number of examples, but I have not yet understood the source or the concept of PFI adequately. It would also be useful to know whether increasing the number of examples or the number of permutations gives more accurate results. Do I understand correctly that accuracy increases until the number of permutations reaches the number of features? And is it correct that increasing the number of examples increases the chance that the dataset is adequately represented?
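For readers who are unsure what PFI computes, a minimal Python sketch of the general permutation feature importance idea follows. It is not ML.NET's implementation: the function names, the toy model, and the accuracy metric are all assumptions chosen for illustration. In this sketch each feature column is shuffled `permutation_count` times, and the mean drop in the metric is that feature's importance.

```python
import random
from statistics import mean, pstdev


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, permutation_count, seed=0):
    """Return (mean drop, std of drop) in accuracy for each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    results = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(permutation_count):
            # Shuffle one column only; all other columns stay untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        results.append((mean(drops), pstdev(drops)))
    return results


# Toy data: the label depends only on column 0; column 1 is irrelevant.
rows = [[i % 2, (i * 7) % 5] for i in range(100)]
labels = [row[0] == 1 for row in rows]


def model(row):
    return row[0] == 1


importance = permutation_importance(model, rows, labels, permutation_count=3)
for col, (drop, spread) in enumerate(importance):
    print(f"feature {col}: mean drop {drop:.3f}, std {spread:.3f}")
```

In this sketch the work is one scoring pass over all examples per feature per permutation. The permutation count is a number of repeats per feature, so it mainly steadies the estimate of each mean drop and is not tied to the number of features. More examples make each scoring pass slower but give a less noisy metric.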