Enable Binary Classification Metric Calculation on Huge Datasets #3838
Comments
Interesting... I had assumed our AUC calculation was streaming. Perhaps we should adopt TF's streaming AUC. TF AUC Docs:
The TF calculation requires the values to be between 0 and 1, while ML.NET computes the AUC using the raw scores, to make it more accurate (and since the probabilities are not always available).
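For reference, here is a minimal sketch of a constant-memory, histogram-based AUC in the spirit of the streaming approach discussed above. It is not TF's or ML.NET's actual implementation: the class name `HistogramAuc` and the `num_buckets` parameter are hypothetical, and it assumes every score already lies in [0, 1]. Memory use depends only on the number of buckets, not on the number of examples.

```python
class HistogramAuc:
    """Approximate AUC from per-bucket label counts (hypothetical helper)."""

    def __init__(self, num_buckets=200):
        self.num_buckets = num_buckets
        self.pos = [0] * num_buckets  # positive-label count per bucket
        self.neg = [0] * num_buckets  # negative-label count per bucket

    def add(self, score, label):
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be in [0, 1]")
        # Scores of exactly 1.0 go into the last bucket.
        b = min(int(score * self.num_buckets), self.num_buckets - 1)
        if label:
            self.pos[b] += 1
        else:
            self.neg[b] += 1

    def auc(self):
        total_pos, total_neg = sum(self.pos), sum(self.neg)
        if total_pos == 0 or total_neg == 0:
            raise ValueError("need at least one positive and one negative")
        wins = 0.0
        neg_below = 0  # negatives in strictly lower buckets
        for p, n in zip(self.pos, self.neg):
            # A positive beats every negative in a lower bucket and
            # counts as half a win against negatives in its own bucket.
            wins += p * (neg_below + 0.5 * n)
            neg_below += n
        return wins / (total_pos * total_neg)


h = HistogramAuc(num_buckets=100)
for score, label in [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]:
    h.add(score, label)
print(h.auc())
```

The trade-off is the one raised in this thread: scores that fall into the same bucket are treated as ties, so the result is an approximation whose error shrinks as the number of buckets grows.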
@yaeldekel: If we need the values within 0..1, we can always squash them using a sigmoid or tanh. Both sigmoid and tanh are monotonic, so they preserve the ordering. However, both saturate rather quickly, and float32 doesn't have infinite precision like idealized math, so there could be corner cases where two non-equal scores get mapped to the same output value. Given that AUC is calculated as an approximation using a histogram, this is likely ignorable, though we should check when they saturate.
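A quick way to check the saturation concern is sketched below. It uses only the Python standard library and emulates float32 by rounding a float64 result through `struct`, which is an assumption: real float32 arithmetic may round slightly differently at intermediate steps. The helper names `to_f32` and `first_saturated` are made up for this illustration.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def first_saturated(squash, step=0.25, limit=100.0):
    """Smallest multiple of `step` whose squashed value rounds to 1.0 in float32."""
    x = 0.0
    while x <= limit:
        if to_f32(squash(x)) == 1.0:
            return x
        x += step
    return None


print("sigmoid reaches 1.0 at about", first_saturated(sigmoid))
print("tanh reaches 1.0 at about", first_saturated(math.tanh))

# Two clearly different raw scores that collapse to the same float32 output:
print(to_f32(sigmoid(20.0)) == to_f32(sigmoid(30.0)))
```

Any two raw scores above the printed thresholds become indistinguishable after squashing, so their relative order is lost; scores well below the thresholds keep their order.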
From experimenting, it seems like calculating binary classification metrics does not scale to huge datasets. Taking a heap dump to examine the high memory usage (before the program runs out of memory), I see a list of floats used by UnweightedAucAggregator. It looks like, to calculate AUC, every prediction is kept in memory. It also looks like there is already substantial logic to account for this scenario: there's logic to reservoir sample predictions, and then calculate AUC on the sample. However, it looks like the internal parameter MaxAucExamples, which controls the size of this reservoir sample, is always set to -1 and is not exposed to the end user?
machinelearning/src/Microsoft.ML.Data/Evaluators/BinaryClassifierEvaluator.cs
Line 45 in 610ffcb
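To make the reservoir idea concrete, here is a minimal sketch of the general technique: keep a fixed-size uniform sample of (score, label) pairs and compute AUC on the sample only. This is not the ML.NET code; the `Reservoir` class, the `capacity` argument (standing in for something like MaxAucExamples), and the `auc` helper are all hypothetical, and the sampling is the textbook Algorithm R.

```python
import random


class Reservoir:
    """Fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


def auc(pairs):
    """Exact AUC of (score, label) pairs via rank sums, averaging tied ranks."""
    pos = sum(1 for _, y in pairs if y)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    ordered = sorted(pairs, key=lambda p: p[0])
    rank_sum = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if ordered[k][1])
        i = j
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)


# Stream 100,000 synthetic predictions but hold at most 1,000 in memory.
data_rng = random.Random(42)
sample = Reservoir(capacity=1000)
for _ in range(100_000):
    label = data_rng.random() < 0.5
    score = data_rng.gauss(1.0 if label else -1.0, 1.0)  # raw, unbounded score
    sample.add((score, label))
print(len(sample.items), auc(sample.items))
```

Because the AUC here is rank-based, it works directly on raw scores, so no squashing into 0..1 is needed. The cost is sampling noise, which shrinks as the reservoir grows.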
Perhaps we should somehow expose this parameter to enable binary metric calculation on huge datasets, or set the parameter to some reasonable default.
@justinormont, @vinodshanbhag