Enable Binary Classification Metric Calculation on Huge Datasets #3838

Open

daholste opened this issue Jun 7, 2019 · 4 comments
Labels
enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point.

Comments

@daholste (Contributor) commented Jun 7, 2019

From experimenting, it seems that calculating binary classification metrics does not scale to huge datasets. Taking a heap dump to examine the high memory usage (before the program runs out of memory), I see a large list of floats held by UnweightedAucAggregator. It appears that, to calculate AUC, every prediction is kept in memory. There is already substantial logic to handle this scenario: predictions can be reservoir-sampled, and AUC can then be calculated on the sample. However, the internal parameter MaxAucExamples, which controls the size of this reservoir sample, appears to always be set to -1 and is not exposed to the end user.


Perhaps we should expose this parameter to enable binary classification metric calculation on huge datasets, or set it to some reasonable default.
@justinormont, @vinodshanbhag
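
For illustration, reservoir sampling keeps a uniformly random subset of bounded size from a stream, which is presumably what a positive MaxAucExamples would cap here. A minimal Python sketch of the technique (not the ML.NET implementation):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # incoming item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Computing AUC on such a sample bounds memory at k predictions instead of one float per row of the dataset.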

@daholste (Contributor, Author) commented Jun 7, 2019

[screenshot: heap dump showing over a GB of floats held by UnweightedAucAggregator]

@justinormont (Contributor)

Interesting... I had assumed our AUC calculation was streaming.

Perhaps we should adopt TF's streaming AUC.

TF AUC Docs:
https://www.tensorflow.org/api_docs/python/tf/metrics/auc

TF AUC Code:
https://github.com/tensorflow/tensorflow/blob/93dd14dce2e8751bcaab0a0eb363d55eb0cc5813/tensorflow/python/ops/metrics_impl.py#L628-L891
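
For reference, the TF metric keeps only per-threshold confusion counts, so memory is O(num_thresholds) regardless of dataset size. A rough Python sketch of that threshold-bucket idea (names and defaults here are illustrative, not TF's actual API):

```python
import numpy as np

def streaming_auc(batches, num_thresholds=200):
    """Approximate ROC AUC from (labels, probabilities) batches.

    Accumulates TP/FP counts at fixed thresholds, so memory is
    O(num_thresholds) no matter how many rows are streamed.
    """
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    tp = np.zeros(num_thresholds)
    fp = np.zeros(num_thresholds)
    n_pos = n_neg = 0
    for labels, probs in batches:
        labels = np.asarray(labels, dtype=bool)
        probs = np.asarray(probs, dtype=np.float64)
        n_pos += int(labels.sum())
        n_neg += int((~labels).sum())
        # A prediction counts as positive at threshold t when prob >= t.
        pred = probs[None, :] >= thresholds[:, None]
        tp += (pred & labels[None, :]).sum(axis=1)
        fp += (pred & ~labels[None, :]).sum(axis=1)
    tpr = tp / max(n_pos, 1)
    fpr = fp / max(n_neg, 1)
    # Thresholds ascend, so (fpr, tpr) descend; reverse and integrate
    # the ROC curve with the trapezoid rule.
    x, y = fpr[::-1], tpr[::-1]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

The trade-off is exactly the one raised below: the inputs must already lie in [0, 1], and the result is a histogram approximation rather than the exact rank statistic.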

@yaeldekel

The TF calculation requires the values to be between 0 and 1, while ML.NET computes the AUC using the raw scores, which is more accurate (and the probabilities are not always available).
We could add an option to compute it in a streaming fashion using the probabilities, but that would also require exposing a parameter on the binary classification evaluator, similar to exposing the sample size used by the existing AUC calculation.
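
For context, exact AUC depends only on the ordering of the scores (it is the Mann-Whitney U statistic, rescaled), which is why raw, unbounded scores give an exact result. A minimal Python sketch, assuming all predictions fit in memory -- the very constraint this issue is about:

```python
def exact_auc(labels, scores):
    """Exact ROC AUC via the Mann-Whitney U statistic.

    Uses only the ordering of the scores, so raw (unbounded) scores
    work; no probabilities needed. Holds every prediction in memory.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                 # average ranks over tied scores
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0   # 1-based rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```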

@yaeldekel yaeldekel added the enhancement New feature or request label Jun 11, 2019
@Lynx1820 Lynx1820 added the P2 Priority of the issue for triage purpose: Needs to be fixed at some point. label Jan 10, 2020
@justinormont (Contributor)

> The TF calculation requires the values to be between 0 and 1, while ML.NET computes the AUC using the raw scores, which is more accurate (and the probabilities are not always available).

@yaeldekel: If we need the values within 0..1, we can always squash them using a sigmoid (or a rescaled tanh, since tanh maps to (-1, 1)). Both are monotonic and preserve the ordering -- if a < b, then tanh(a) < tanh(b) -- so I expect the calculated AUC would still be correct.

Sigmoid and tanh both saturate rather quickly, and float32 doesn't have the infinite precision of idealized math, so there could be corner cases where two non-equal scores get mapped to the same output value. Given that AUC would be calculated as an approximation using a histogram, this is likely ignorable, though we should check where they saturate.
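
A quick illustrative check of that saturation concern in single precision (a sketch, not ML.NET code):

```python
import numpy as np

def sigmoid32(x):
    """Sigmoid evaluated entirely in float32."""
    x = np.float32(x)
    return np.float32(1) / (np.float32(1) + np.exp(-x))

# Distinct raw scores collapse once the sigmoid saturates:
print(sigmoid32(10.0) == sigmoid32(11.0))  # False -- still distinct
print(sigmoid32(20.0) == sigmoid32(25.0))  # True  -- both round to 1.0f
```

In float32, 1/(1 + e^-x) rounds to exactly 1.0 once e^-x drops below about half the machine epsilon (around x ≈ 17), so any raw scores above that collapse to the same value.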
