
Clean up our auto-caching #1604

Closed

@Zruty0

Description

Currently, some of our trainers cache the data prior to training, with no way to disable this behavior.

I believe a good incremental step would be to disable all auto-caching and rely on the user to call AppendCacheCheckpoint prior to multi-pass training.

This is not ideal, since the default setup for multi-pass trainers will train more slowly. I still think it is better to have a consistent story about our 'smarts' (that is, we have no auto-normalization, no auto-caching, and no auto-calibration) and to use extensive documentation (and tooling, in the future) to cover these pitfalls.
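
For concreteness, here is a minimal sketch of what the explicit opt-in could look like in user code, written against the later ML.NET 1.x API surface; the InputRow class, the data.csv file, the column names, and the SDCA trainer are placeholders chosen for illustration, not something prescribed by this issue:

    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Placeholder input schema; the file, columns, and trainer below are
    // illustrative only.
    public class InputRow
    {
        [LoadColumn(0)] public float Feature1 { get; set; }
        [LoadColumn(1)] public float Feature2 { get; set; }
        [LoadColumn(2)] public float Label { get; set; }
    }

    public static class Program
    {
        public static void Main()
        {
            var mlContext = new MLContext();

            IDataView trainingData = mlContext.Data.LoadFromTextFile<InputRow>(
                "data.csv", separatorChar: ',', hasHeader: true);

            var pipeline = mlContext.Transforms
                .Concatenate("Features", "Feature1", "Feature2")
                // Explicit opt-in: cache the data once before the multi-pass
                // trainer, instead of the trainer caching it automatically.
                .AppendCacheCheckpoint(mlContext)
                .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Label"));

            var model = pipeline.Fit(trainingData);
        }
    }

Dropping the AppendCacheCheckpoint call keeps the same pipeline but makes each training pass re-read the source data, trading speed for memory.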

cc @GalOshri @TomFinley @eerhardt

Activity

justinormont (Contributor) commented on Nov 13, 2018

I would prefer that we produce a good model by default for the user, with auto-normalization & auto-calibration.

Auto-caching is simply a matter of speed rather than final model quality, but it is closely tied to overall user happiness.

Zruty0 (Contributor, Author) commented on Nov 13, 2018

I think this is not the first time we have had this argument.
Again, the reason we don't want to make these auto-smarts part of the core API is that they sometimes make mistakes, and sometimes costly ones:

  • For auto-caching, we may blow up the machine's memory by caching the training data in cases where non-cached training would have succeeded (if more slowly).
  • Auto-caching assumes that the original training data is slow to access; if it's a memory-backed dataset (or another cache), this is not true, so auto-caching may make training slower AND consume more memory.
  • Auto-calibration happens on the training set. We assume that most of the time this is OK, given that we only learn 2 parameters, but we have already seen cases where model quality degrades because of it.
  • Auto-normalization may normalize data that is otherwise fairly regular, potentially making the model larger and training slower.
  • Auto-normalization has the potential for user confusion: users think they merely trained a linear model, when in reality they trained a pair of models.

Because of the above, we don't want any of these smarts to be part of the core ML.NET API. We need to expose APIs to normalize, cache, and calibrate at the user's request. Our existing smarts can be converted into tooling (VS code analyzer warnings, etc.).

added the API (Issues pertaining the friendly API) and usability (Smoothing user interaction or experience) labels on Nov 13, 2018
TomFinley (Contributor) commented on Nov 14, 2018

This feeds back into a general principle (not just for ML.NET) that APIs are best explicit, not implicit. Tools can get away with implicit behavior; APIs should not. Programming against an unfamiliar API is hard enough without having to worry about the API essentially rewriting your program for you and doing things you didn't ask for because it "knows better." Like @Zruty0, I'm actually a little surprised we are still having this argument.

sfilipi (Member) commented on Nov 21, 2018

Checking the implementation mechanism: will we have configs for normalization/calibration/caching (or do we already have them through TrainContext/TrainInfo), with normalization, calibration, and caching defaulting to off?

Zruty0 (Contributor, Author) commented on Nov 21, 2018

No, we just remove all auto-caching, calibration and normalization, period.

Users will be responsible for normalizing the data if needed (via mlContext.Normalize), caching in memory if desired (via mlContext.Data.Cache or pipeline.AppendCacheCheckpoint), and calibrating if desired (after #1622 is done).
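
As a rough illustration of those opt-in calls (again using the later ML.NET 1.x API shape and the same placeholder schema as the sketch in the issue description; NormalizeMinMax stands in for the mlContext.Normalize call mentioned here, and calibration is omitted since #1622 was still open):

    // Rough sketch of the explicit calls (InputRow is the placeholder schema
    // from the sketch in the issue description).
    var mlContext = new MLContext();

    IDataView data = mlContext.Data.LoadFromTextFile<InputRow>(
        "data.csv", separatorChar: ',', hasHeader: true);

    // Caching happens only because the user asked for it; this is an
    // alternative to AppendCacheCheckpoint on the estimator chain.
    IDataView cachedData = mlContext.Data.Cache(data);

    // Normalization happens only because the user asked for it; NormalizeMinMax
    // is used here in place of the mlContext.Normalize call mentioned above.
    var pipeline = mlContext.Transforms
        .Concatenate("Features", "Feature1", "Feature2")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Label"));

    var model = pipeline.Fit(cachedData);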

GalOshri (Contributor) commented on Nov 21, 2018

Would it be feasible to provide some documentation/hints on when normalization/caching/calibration are important? For example, if a learner today is configured to add normalization, should we update the docs for that learner to suggest that normalization is important? Or perhaps the docs on normalization could just explain in which situations it might be important.

/cc @JRAlexander

Zruty0 (Contributor, Author) commented on Nov 21, 2018

> For example, if a learner today is configured to add normalization,

We already changed that long ago; no learners add normalization, or calibration. This work item is to also remove auto-caching; everything else is already gone.

The cookbook has a section on normalization.

self-assigned this on Nov 28, 2018
wschin (Member) commented on Dec 3, 2018

It looks like neither mlContext.Data.Cache nor pipeline.AppendCacheCheckpoint works with the dynamic pipeline (the only examples I can find are in CachingTests.cs). Do we have any caching mechanism for the static world?

Locked as resolved and conversation limited to collaborators on Mar 26, 2022